Prosecution Insights
Last updated: May 29, 2026
Application No. 16/383,429

EMBEDDING MULTIMODAL CONTENT IN A COMMON NON-EUCLIDEAN GEOMETRIC SPACE

Final Rejection: §103, §112
Filed: Apr 12, 2019
Priority: Apr 20, 2018 (provisional 62/660,863)
Examiner: LO, ANN J
Art Unit: 2100
Tech Center: 2100 (Computer Architecture & Software)
Assignee: Sri International
OA Round: 8 (Final)
Grant Probability: 45% (Moderate)
OA Rounds: 9-10
Est. Remaining: 0m
With Interview: 71%

Examiner Intelligence

Career Allowance Rate: 45% (100 granted / 224 resolved; -10.4% vs TC avg)
Interview Lift: +26.7% (resolved cases with interview)
Avg Prosecution: 4y 7m (3 currently pending)
Total Applications: 238 (across all art units)

Statute-Specific Performance

§101: 1.2% (-38.8% vs TC avg)
§103: 73.7% (+33.7% vs TC avg)
§102: 23.6% (-16.4% vs TC avg)
§112: 1.1% (-38.9% vs TC avg)
Based on career data from 224 resolved cases

Office Action

§103 §112
DETAILED ACTION

Notice of Pre-AIA or AIA Status

The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Response to Arguments

Applicant’s arguments with respect to rejections under 35 USC 103 have been considered; see the response below. In response to applicant’s arguments that the asserted combination does not teach “training the semantic embedding space by using a loss function of a deep convolutional neural network (DCNN) that includes at least a reconstruction loss for at least one of the first modality or the second modality implemented as a fully connected layer of the DCNN and a user-only ranking loss”, examiner disagrees. First, the specification is silent as to the newly claimed amendments “reconstruction loss for at least one of the first modality or the second modality and a user-only ranking loss”. Since there is no support found for the claimed limitation, even in paragraphs [0050-0052] as stated by the applicants, examiner has interpreted the claim to be “training the semantic embedding space by using a loss function of a deep convolutional neural network (DCNN) that includes at least a reconstruction loss for at least one of the first modality or the second modality implemented as a fully connected layer of the DCNN and a user-only ranking loss”. Sun, Page 6 Section 3.3 discloses “Although we present the domain-specific encoders and decoders as regular (fully connected) neural networks in the description of our deep fusion framework, there is no technical constraint that limits the type of the neural networks being used. One can use convolutional networks or other types of neural networks. Our deep fusion is an unified framework for multimodal embedding that can incorporate heterogeneous domain encoders (decoders). The capability of combining different types of encoders (decoders) is particularly pertinent when we fuse image domain with other domains.
Many studies have shown that convolutional neural networks render much better performance with images than regular fully connected networks [28,43]. As a result, it is more suitable to first apply a convolutional network that extracts domain-specific features from the images, and then use these features in multimodal fusion, where they will be combined with features from other domains to generate high-level features that involve all the domains. Since it is a general understanding that deep models require large datasets to train, a deep convolutional network can be a good choice for the domain-specific encoder (decoder) if there is a large amount of data.” Sun, Page 7 Section 3.4: “We use a training process that minimizes a combination of reconstruction loss and rating loss to obtain the values for the parameters ... The rating loss for a single user-item pair (i, j) is given by” Examiner notes that here, Sun discloses the use of a fully connected DCNN that is trained with a reconstruction loss and a user loss (“rating loss for a single user-item pair”) Specification The specification is objected to as failing to provide proper antecedent basis for the claimed subject matter. See 37 CFR 1.75(d)(1) and MPEP § 608.01(o). Correction of the following is required: reconstruction loss for at least one of the first modality or the second modality and a user-only ranking loss. Claim Rejections - 35 USC § 112 Claims 1-8, 10, 12-22 rejected under 35 U.S.C. 112(a) or 35 U.S.C. 112 (pre-AIA ), first paragraph, as failing to comply with the written description requirement. The claim(s) contains subject matter which was not described in the specification in such a way as to reasonably convey to one skilled in the relevant art that the inventor or a joint inventor, or for applications subject to pre-AIA 35 U.S.C. 112, the inventor(s), at the time the application was filed, had possession of the claimed invention. 
Claims 1, 13, and 17 recite “reconstruction loss for at least one of the first modality or the second modality and a user-only ranking loss.” Support has not been found in the specification for this limitation. In light of the specification, examiner has interpreted “reconstruction loss for at least one of the first modality or the second modality” as “reconstruction loss” and “user-only ranking loss” is interpreted as “user loss” Dependent claims are also rejected because they inherit the deficiencies of the base claims. The following is a quotation of 35 U.S.C. 112(b): (b) CONCLUSION.—The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the inventor or a joint inventor regards as the invention. The following is a quotation of 35 U.S.C. 112 (pre-AIA ), second paragraph: The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the applicant regards as his invention. Claims 1-8, 10, 12-22 are rejected under 35 U.S.C. 112(b) or 35 U.S.C. 112 (pre-AIA ), second paragraph, as failing to set forth the subject matter which the inventor or a joint inventor, or for applications subject to pre-AIA 35 U.S.C. 112, the applicant regards as the invention. Claims 1, 13, and 17 recite “reconstruction loss for at least one of the first modality or the second modality and a user-only ranking loss.” Since support has not been found in the specification for this limitation, it is unclear what is meant by this limitation. For purposes of examination, examiner interprets “reconstruction loss for at least one of the first modality or the second modality” as “reconstruction loss” and “user-only ranking loss” is interpreted as “user loss” Dependent claims are also rejected because they inherit the deficiencies of the base claims. Claim Rejections - 35 USC § 103 The following is a quotation of 35 U.S.C. 
103 which forms the basis for all obviousness rejections set forth in this Office action: A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made. Claims 1, 6-7, and 10 are rejected under 35 U.S.C. 103 as being unpatentable over Sun et al. (“A Multi-Modality Deep Network for Cold-Start Recommendation”; hereinafter “Sun”) in view of Amir et al. (“Modelling Context with User Embeddings for Sarcasm Detection in Social Media”; hereinafter “Amir”), and further in view of Yu et. al. (“User Embedding for Scholarly Microblog Recommendation”; hereinafter “Yu”). As per Claim 1, Sun teaches for each of a plurality of content of the multimodal content, creating a respective, first modality feature vector representative of content of the multimodal content having a first modality; for each of a plurality of content of the multimodal content, creating a respective, second modality feature vector representative of content of the multimodal content having a second modality (Sun, Page 4 Section 3.2: “We propose a general deep fusion framework for multimodal embedding (feature extraction). (Note that we use embedding and features/feature extraction interchangeably since they all refer to finding a representation for the data). In multimodal embedding, data from multiple domains are available to describe an object. We seek an embedding vector (feature vector) that combines information from different domains to represent the item and achieve better performance than single-view learning by using this representation. 
In this sense, our embedding is also task related because different tasks may involve different aspects of the item. In this paper, we target our multimodal embedding for rating prediction … As shown in Figure 1, for each domain, our framework has a sub-network that extracts domain-specific features (e.g., those corresponding to zp or zs) from the domain input. These features are then fed to (the first half of) the central fusion network that combines the features from multiple domains and further extracts fused high-level features (e.g., those corresponding to z).” Examiner notes that here, Sun discloses, for first and second modalities (“different domains”), creating corresponding first and second feature vectors (“extracts domain-specific features”, “zp or zs”)). embedding the feature vectors of the first modality and the second modality and the [user] feature vectors in the semantic embedding space (Sun, Page 4 Section 3.2: “These features are then fed to (the first half of) the central fusion network that combines the features from multiple domains and further extracts fused high-level features (e.g., those corresponding to z).” Examiner notes that here, Sun discloses that the feature vectors of the first and second modality are jointly embedded in the semantic embedding space by being fused together into a fused vector. Examiner also further notes that Sun does not limit the quantity of modalities to 2 (“multiple domains”), and so there may be a third fused feature vector; however, Sun does not explicitly disclose that one of the features may be a user source. This will be taught by another reference. 
training the semantic embedding space by using a loss function of a deep convolutional neural network (DCNN) that includes at least a reconstruction loss for at least one of the first modality or the second modality implemented as a fully connected layer of the DCNN and a user-only ranking loss (Sun, Page 6 Section 3.3: “Although we present the domain-specific encoders and decoders as regular (fully connected) neural networks in the description of our deep fusion framework, there is no technical constraint that limits the type of the neural networks being used. One can use convolutional networks or other types of neural networks. Our deep fusion is an unified framework for multimodal embedding that can incorporate heterogeneous domain encoders (decoders). The capability of combining different types of encoders (decoders) is particularly pertinent when we fuse image domain with other domains. Many studies have shown that convolutional neural networks render much better performance with images than regular fully connected networks [28,43]. As a result, it is more suitable to first apply a convolutional network that extracts domain-specific features from the images, and then use these features in multimodal fusion, where they will be combined with features from other domains to generate high-level features that involve all the domains. Since it is a general understanding that deep models require large datasets to train, a deep convolutional network can be a good choice for the domain-specific encoder (decoder) if there is a large amount of data.” Sun, Page 7 Section 3.4: “We use a training process that minimizes a combination of reconstruction loss and rating loss to obtain the values for the parameters ... The rating loss for a single user-item pair (i, j) is given by” Examiner notes that here, Sun discloses the use of a fully connected DCNN that is trained with a reconstruction loss and a user loss (“rating loss for a single user-item pair”)). 
and which enables the DCNN to jointly learn an embedding location of the feature vectors of at least one of the first modality or the second modality[, and the user source] in the semantic embedding space and to minimize a distance between related embedded feature vectors of the first modality, the second modality, [and the user] feature vectors in the semantic embedding space (Sun, Bottom of Page 5: “During the training, the model is given corrupted domain inputs, and is trained to predict the original inputs as a result of decoding … In an unsupervised setting, one may minimize the reconstruction loss [equation image: media_image1.png] to train the model and use the trained model to obtain multimodal embedding for the data. However, as we have discussed earlier, embedding is task related. Therefore, we consider a semi-supervised model where the training involves both the reconstruction loss and a task-specific loss.” Here, Examiner notes that this enables the DNN to jointly learn an embedding in the embedding space (“obtain multimodal embedding for the data”) and this minimizes a distance between related feature vectors of the first and second modalities (these are fused vectors that represent both modalities) in the embedding space (Sun shows a reconstruction loss between an original and a corrupted fused feature vector (of the first and second modalities), which are related, wherein the reconstruction loss is a Euclidean distance)).
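As a rough illustration of the kind of objective the quoted Sun passages describe, the sketch below combines a denoising-style reconstruction loss with a rating loss. It is not Sun's implementation: the masking-noise corruption, the squared-error form of both terms, the dot-product rating prediction, and the `weight` parameter are all assumptions made for illustration, and every function name is hypothetical.

```python
import random


def corrupt(x, drop_prob, rng):
    """Masking noise (assumed): zero each component with probability drop_prob."""
    return [0.0 if rng.random() < drop_prob else v for v in x]


def reconstruction_loss(original, reconstructed):
    """Squared Euclidean distance between the clean input and its reconstruction."""
    return sum((o - r) ** 2 for o, r in zip(original, reconstructed))


def rating_loss(rating, user_vec, item_embedding):
    """Squared error between an observed rating and a dot-product prediction (assumed form)."""
    predicted = sum(u * z for u, z in zip(user_vec, item_embedding))
    return (rating - predicted) ** 2


def combined_loss(original, reconstructed, rating, user_vec, item_embedding, weight=1.0):
    """Weighted sum of the two terms; the weighting scheme is an assumption."""
    return (reconstruction_loss(original, reconstructed)
            + weight * rating_loss(rating, user_vec, item_embedding))
```

In a training loop, the corrupted input would be passed through the encoder and decoder, and `combined_loss` would be evaluated against the uncorrupted original, so the reconstruction term measures the Euclidean gap the Examiner refers to.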
However, Sun does not explicitly teach for each of the plurality of content of the multimodal content having the first modality and the second modality, creating a respective user feature vector representative of an identity of a user source of respective multimodal content; minimize a distance between related embedded feature vectors of the first modality, the second modality, and the user feature vectors in the semantic embedding space. Amir teaches for each of the plurality of content of the multimodal content having the first modality and the second modality, creating a respective user feature vector representative of an identity of a user source of respective multimodal content (Sun as shown above discloses multimodal content having two modalities. Amir, Page 9 Section 7 discloses joint learning of contents and users: “Our model jointly learns and exploits embeddings for the content and users, thus integrating information about the speaker and what he or she has said.” Amir also discloses a loss function in Page 3 Right Column Para 2: “To learn meaningful user embeddings, we seek representations that are predictive of individual word-usage patterns. In light of this motivation, we approximate P(w_i | u_j) via the following hinge-loss objective which we aim to minimize.” Amir, Page 3 Para 3, discloses: “Given a sentence S = {w_1, ..., w_N} where w_i denotes a word drawn from a vocabulary V, we aim to maximize the following probability … Where C(w_i) denotes the set of words in a prespecified window around word w_i, e_k ∈ R^d and u_j ∈ R^d denote the embeddings of word k and user j, respectively. This objective function encodes the notion that the occurrence of a word w, depends both on the author of S and it’s neighbouring words.” Here, Amir discloses jointly embedding the modality vector (“embeddings of word k”) and the user vector (“embeddings of … user j”) in the same geometric space (both vectors have dimension d: “e_k ∈ R^d and u_j ∈ R^d”)).
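The hinge-loss objective that Amir is quoted as using is not reproduced in this action. As a hedged sketch only, a generic margin-based hinge loss over word and user embeddings of the same dimension d could look like the following; the margin value, the use of sampled negative words, and all names are assumptions rather than Amir's exact formulation.

```python
def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def user_hinge_loss(user_vec, observed_word_vec, negative_word_vecs, margin=1.0):
    """Generic margin hinge loss (assumed form).

    Penalizes each negative word whose score against the user embedding
    comes within `margin` of the score of the word the user actually wrote.
    """
    positive_score = dot(observed_word_vec, user_vec)
    return sum(
        max(0.0, margin - positive_score + dot(neg, user_vec))
        for neg in negative_word_vecs
    )
```

Because the word vectors e_k and the user vector u_j share dimension d, the dot products are well defined, which is the property the Examiner relies on when characterizing them as lying in the same geometric space.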
Amir is analogous art because it is in the field of endeavor of leveraging vector embeddings for analysis of content. It would have been obvious before the effective filing date of the claimed invention to combine the multimodal embedding of Sun (who does not limit the modalities to two, but states “multiple domains”), with the user embedding of Amir, which can be a third “domain” of Sun. One of ordinary skill in the art would be motivated to do so in order to better place related multimodal content in the embedding space by fully capturing the intention of the multimedia content (Amir, Page 9 End of Section 6.2: “In Figure 5, we show these examples along with the predicted probabilities of being a sarcastic post, when no user information is considered and when the author is taken into account. We can see that the predictions drastically change when contextual information is available and that two of the authors trigger similar responses on both examples. This example provides evidence that our model captures the intuition that the same utterance can be interpreted as sarcastic or not, depending on the speaker.”) However, the combination of Sun and Amir does not teach minimize a distance between related embedded feature vectors of the first modality, the second modality, and the user feature vectors in the semantic embedding space Yu teaches minimize a distance between related embedded feature vectors of the first modality, the second modality, and the user feature vectors in the semantic embedding space (Sun as shown above discloses minimizing a distance between vectors representing the first and second modalities, and using a loss function. 
Yu, Page 451 Section 3.6, discloses: “When recommending microblogs, given a microblog dj and a user uk, we compute the cosine distance between their vector representations, and use the cosine distance to determine whether dj should be recommended to uk or not.”) Yu is analogous art because it is in the field of endeavor of leveraging vector embeddings for analysis of content. Amir and Yu also both take similar approaches to learning the user vector (Amir, Page 3 Above Eq 1: “we aim to maximize the following probability” and Yu, Page 451 Section 3.5: “In this framework, the average log probability we want to maximize”). It would have been obvious before the effective filing date of the claimed invention to combine the multimodal and user embedding of Sun and Amir with the cosine distance between users and content of Yu. The combination would result in a vector for a first modality, a vector for a second modality, and a vector for a user all being embedded in a common geometric space, with cosine distance used to determine how closely any set of these vectors is related, and this would allow one to make recommendations of multimedia content to users. One of ordinary skill in the art would be motivated to use Yu’s user-content recommendation method in order to increase accuracy over previous methods of simple average embedding methods (Yu, Page 9 End of Section 6.2: “As we can see, the two proposed joint learning methods outperform the simple average embedding method and the two other baselines”).
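The cosine-distance comparison attributed to Yu can be sketched as follows. This is an illustration under assumptions: the quoted passage does not say how the distance is turned into a recommendation decision, so the `max_distance` threshold and the function names are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def cosine_distance(a, b):
    """One minus cosine similarity, so identical directions give distance 0."""
    return 1.0 - cosine_similarity(a, b)


def should_recommend(doc_vec, user_vec, max_distance=0.5):
    """Recommend when the document and user vectors are close (threshold is assumed)."""
    return cosine_distance(doc_vec, user_vec) <= max_distance
```

The comparison is only meaningful when the document and user vectors have the same dimension, which is why the rejection stresses that all vectors sit in a common space.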
Examiner note: Examiner points out that Amir themselves in their following paper (“Quantifying Mental Health from Social Media with Neural User Embeddings”) point out that they and Yu “use essentially the same approach”: “Recently proposed methods to learn user representations use essentially the same approach, associating users with parameter vectors, and optimizing these to accurately predict observable attributes or the words used in previous posts written by said user (Amir et al., 2016; Yu et al., 2016). User embeddings induced by Amir et al. (2016) using only the previous posts from a user were shown to capture latent individual attributes (e.g. political leanings) and a soft notion of ‘homophily’ — i.e., similar users were generally associated with relatively nearby vectors. Further, the embeddings improved a downstream model for sarcasm detection in tweets. Similarly, Yu et al. (2016) improved a microblog recommendation system by including user representations.” As per Claim 6, the combination of Sun, Amir, and Yu teaches the method of Claim 1. 
Sun teaches embedded, combined multimodal feature vector (Sun, Page 4 Section 3.2: “These features are then fed to (the first half of) the central fusion network that combines the features from multiple domains and further extracts fused high-level features (e.g., those corresponding to z).”) However, Sun does not teach appending content-related information, including at least one of user information and user grouping information, to at least one embedded, first modality feature vector, one embedded, second modality feature vector and one embedded, combined multimodal feature vector. Amir teaches appending content-related information, including at least one of user information and user grouping information, to at least one embedded, first modality feature vector, one embedded, second modality feature vector and one embedded, combined multimodal feature vector (Amir, Top Right Page 4, discloses: “where ⊕ denotes concatenation” and “The output of all the filters is then combined to form the final representation c = [f_1 ⊕ f_2 ⊕ f_3]. We will denote this feature vector of a specific sentence S by c_S.” Then, above Eq. 6, Amir states: “Letting U_u denote the user embedding of author u, we formulate our sarcasm detection model as follows: [equation image: media_image2.png]” Above, Amir discloses concatenation of user vector with a content vector. This is also shown on Page 5 Figure 2: [figure image: media_image3.png] As disclosed above, Sun teaches that a content vector may be any of a first modality vector, second modality vector, or a combined multimodal vector. Thus, the combination of Sun and Amir suggests appending a user vector to at least one of each of these types of vectors.) It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the teachings of Amir with Sun for at least the reasons recited in the rejection to Claim 1.
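The concatenation the Examiner points to in Amir can be illustrated with a minimal sketch. The ordering of the content vector and the user embedding, and the function names, are assumptions; Amir's actual model equation appears only as an image in this action and is not reproduced here.

```python
def concat(*vectors):
    """Concatenate vectors end to end (the role of the ⊕ operator in the quoted text)."""
    joined = []
    for vec in vectors:
        joined.extend(vec)
    return joined


def build_classifier_input(filter_outputs, user_embedding):
    """Form c_S from the filter outputs, then append the user embedding U_u.

    The content-then-user ordering is an assumption made for illustration.
    """
    content_vec = concat(*filter_outputs)
    return concat(content_vec, user_embedding)
```

For example, three filter outputs of length 1 and a user embedding of length 2 yield a single input vector of length 5.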
As per Claim 7, the combination of Sun, Amir, and Yu teaches the method of Claim 1. Sun teaches combined multimodal feature vector (Sun, Page 4 Section 3.2: “These features are then fed to (the first half of) the central fusion network that combines the features from multiple domains and further extracts fused high-level features (e.g., those corresponding to z).” However, Sun does not teach wherein content-related information comprises at least one of agent information or agent grouping information for at least one embedded, first modality feature vector, one embedded, second modality feature vector and one embedded, combined multimodal feature vector. Amir teaches wherein content-related information comprises at least one of agent information or agent grouping information for at least one embedded, first modality feature vector, one embedded, second modality feature vector and one embedded, combined multimodal feature vector. (Amir teaches agent information, as Amir, Page 3 Section 3, Para 2, discloses a training set including content and author: “To induce the user embeddings, we adopt an approach similar to that described in the preliminary work of Li et al. (2015). In particular, we capture relations between users and the content they produce by optimizing the conditional probability of texts, given their authors (or, more precisely, given the vector representations of their authors). As shown above, Sun teaches that a content vector may be any of a first modality vector, second modality vector, or a combined multimodal vector. Thus, the combination of Sun and Amir suggests agent information for at least one of each of these types of vectors.) It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the teachings of Amir with Sun for at least the reasons recited in the rejection to Claim 1. As per Claim 10, the combination of Sun, Amir, and Yu teaches the method of Claim 1. 
Sun teaches multimodal content (Sun, Page 4 Section 3.2: “These features are then fed to (the first half of) the central fusion network that combines the features from multiple domains and further extracts fused high-level features (e.g., those corresponding to z).” However, Sun does not teach posted by an agent on a social media network, wherein the agent comprises at least one of a computer, robot, a person with a social media account, and a participant in a social media network. Amir teaches posted by an agent on a social media network, wherein the agent comprises at least one of a computer, robot, a person with a social media account, and a participant in a social media network Amir, Page 9 Section 7, begins: “We have introduced CUE-CNN, a novel, deep neural network for automatically recognizing sarcastic utterances on social media.”) It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the teachings of Amir with Sun for at least the reasons recited in the rejection to Claim 1. Claims 2-3 and 5 are rejected under 35 U.S.C. 103 as being unpatentable over the combination of Sun, Amir, and Yu and further in view of Vukotic et. al. (US 2017/0061250 A1; hereinafter “Vukotic”) As per Claim 2, the combination of Sun, Amir, and Yu teaches the method of Claim 1. Sun teaches for each of a plurality of first modality feature vector and second modality feature vector content pairs, forming a combined multimodal feature vector from the first modality feature vector and the second modality feature vector (Sun, Page 4 Section 3.2: “These features are then fed to (the first half of) the central fusion network that combines the features from multiple domains and further extracts fused high-level features (e.g., those corresponding to z).” Examiner notes that this is done a plurality of times, and thus for a plurality of content pairs, as Sun recites a training process with reconstruction loss.) 
However, Sun does not teach semantically embedding the respective, combined multimodal feature vectors in the semantic embedding space to capture relationships between at least two of the embedded, first modality feature vectors, the embedded, second modality feature vectors and the embedded combined multimodal feature vectors. To further explain, Sun, as shown in Figure 1, does not use the semantic embedding space to capture relationships between multimodal vectors. Instead, Sun multiplies the resulting multimodal feature vector by a transposed vector representing a user, resulting in a scalar rating. To teach the matter of semantically embedding a plurality of multimodal content, and keeping it in the semantic embedding space in order to capture relationships between them, another reference is needed. Secondly, Sun does not necessarily teach that the individual feature vectors and combined feature vectors have the same dimension (while Sun does not disclose any specific dimensions, the dimensions appear different in Sun Figure 1), and therefore relationships cannot be captured between them in the semantic embedding space. For these reasons, another reference is needed. Vukotic teaches semantically embedding the respective, combined multimodal feature vectors in the semantic embedding space to capture relationships between at least two of the embedded, first modality feature vectors, the embedded, second modality feature vectors and the embedded combined multimodal feature vectors (Vukotic Figure 2: [figure image: media_image4.png] Vukotic, End of Page 40, discloses capturing relationships between multimodal vectors: “Finally, segments are then compared as illustrated in Figure 3: for each video segment, the two modalities are taken (embedded automatic transcripts with either embedded visual concepts or embedded CNN representations) and a multimodal embedding is created with a bidirectional deep neural network.
The two multimodal embeddings are then simply compared with a cosine distance to obtain a similarity measure.” [figure image: media_image5.png] Vukotic, in addition to comparing two multimodal feature vectors, also discloses comparing first or second individual modality vectors with a multimodal feature vector. Vukotic, Page 40 Right Column Second Dash, discloses: “When one modality is available and the other is not (either only transcripts or only visual information), the available modality is presented to its respective input of the network and the activations are propagated. The central layer is then used to generate an embedding by being duplicated, thus still generating an embedding of the same size while allowing to transparently compare video segments regardless of modality availability (either with only one or both modalities).”) Vukotic is analogous art because it is in the field of endeavor of leveraging vector embeddings for analysis of content. It would have been obvious before the effective filing date of the claimed invention to combine the multimodal embedding of Sun with the same-dimension multimodal embedding and individual modality embeddings of Vukotic in the same multimodal embedding space. One of ordinary skill in the art would have been motivated to do so in order to be able to identify relevant multimodal media (Vukotic Page 41 Section 3.1: “Anchors represent segments of interest within videos that a user would like to know more about. Targets represent potential segments of interests that might or might not be related with a specific anchor.
The goal is to hyperlink relevant targets for each anchor by using multimodal approaches.”) and to be able to do this even when an item has only one modality (Vukotic Page 40 Right Column: “When one modality is available and the other is not (either only transcripts or only visual information), the available modality is presented to its respective input of the network and the activations are propagated. The central layer is then used to generate an embedding by being duplicated, thus still generating an embedding of the same size while allowing to transparently compare video segments regardless of modality availability (either with only one or both modalities).”) As per Claim 3, the combination of Sun, Amir, Yu, and Vukotic teaches the method of Claim 2. Amir teaches semantically embedding content-related information, including user grouping information in the semantic embedding space (Amir discloses that user grouping information is captured in the semantic embedding space at the top of Page 9: “Moreover, the embeddings seem to uncover a notion of homophily, i.e. similar users tend to occupy neighbouring regions of the embedding space.” Amir Page 8 Figure 4 shows user grouping information: [figure image: media_image6.png]) It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the teachings of Amir with Sun for at least the reasons recited in the rejection to Claim 1.
However, Amir does not teach semantically embedding content-related information in the semantic embedding space based upon a relationship between the content-related information and at least one embedded, first modality feature vector, one embedded, second modality feature vector and one embedded combined multimodal feature vector (Vukotic, End of Page 40, discloses capturing relationships between multimodal vectors: “Finally, segments are then compared as illustrated in Figure 3: for each video segment, the two modalities are taken (embedded automatic transcripts with either embedded visual concepts or embedded CNN representations) and a multimodal embedding is created with a bidirectional deep neural network. The two multimodal embeddings are then simply compared with a cosine distance to obtain a similarity measure.” Here, Vukotic takes a comparison content and embeds it in the common geometric space based on its relationship with the multimodal vector of existing content, which is also based on the existing content’s two individual modality vectors.) It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the teachings of Vukotic with Sun, Amir, and Yu for at least the reasons recited in the rejection to Claim 1. As per Claim 5, the combination of Sun, Amir, and Yu teaches the method of Claim 1. 
However, the combination does not teach wherein a second modality feature vector representative of content of the multimodal content having a second modality is created using information relating to respective content having a first modality Vukotic teaches wherein a second modality feature vector representative of content of the multimodal content having a second modality is created using information relating to respective content having a first modality (Vukotic, Page 40 Section 2.2.3, discloses: “In bidirectional deep neural networks, learning is performed in both directions: one modality is presented as an input and the other as the expected output while at the same time the second one is presented as input and the first one as expected output.” Here, the second modality feature vector is produced from multimodal content also having a first modality.) Vukotic is analogous art because it is in the field of endeavor of leveraging vector embeddings for analysis of content. It would have been obvious before the effective filing date of the claimed invention to combine the multimodal embedding of Sun with the same-dimension multimodal embedding and individual modality embeddings of Vukotic in the same multimodal embedding space. One of ordinary skill in the art would have been motivated to do so in order to be able to identify relevant multimodal media (Vukotic Page 41 Section 3.1: “Anchors represent segments of interest within videos that a user would like to know more about. Targets represent potential segments of interests that might or might not be related with a specific anchor. 
The goal is to hyperlink relevant targets for each anchor by using multimodal approaches.”) and to be able to do this even when an item has only one modality (Vukotic Page 40 Right Column: “When one modality is available and the other is not (either only transcripts or only visual information), the available modality is presented to its respective input of the network and the activations are propagated. The central layer is then used to generate an embedding by being duplicated, thus still generating an embedding of the same size while allowing to transparently compare video segments regardless of modality availability (either with only one or both modalities).”) Claims 4, 8, and 12 are rejected under 35 U.S.C. 103 as being unpatentable over the combination of Sun, Amir, and Yu and further in view of Nickel et. al. (“Poincaré Embeddings for Learning Hierarchical Representations”; hereinafter “Nickel”). As per Claim 4, the combination of Sun, Amir, and Yu teaches the method of Claim 1. However, the combination does not teach projecting at least one of content, content-related information, and an event into the semantic embedding space; and determining at least one embedded feature vector in the semantic embedding space close to the projection as being related to the projected at least one of the content, the content-related information, and the event Nickel teaches projecting at least one of content, content-related information, and an event into the semantic embedding space; and determining at least one embedded feature vector in the semantic embedding space close to the projection as being related to the projected at least one of the content, the content-related information, and the event (Nickel, Page 6: “We then learn embeddings of all symbols in D such that related objects are close in the embedding space.”) Nickel is analogous art because it is in the field of endeavor of content embeddings. 
It would have been obvious to one of ordinary skill in the art before the effective filing date to combine the multimodal and user feature semantic embedding of the combination of Sun, Amir, and Yu, with the embedding in hyperbolic space of Nickel. The modification would have been obvious because one of ordinary skill in the art would be motivated to capture hierarchical properties and outperform Euclidean embeddings (Nickel, Abstract: “Representation learning has become an invaluable approach for learning from symbolic data such as text and graphs. However, while complex symbolic datasets often exhibit a latent hierarchical structure, state-of-the-art methods typically learn embeddings in Euclidean vector spaces, which do not account for this property. For this purpose, we introduce a new approach for learning hierarchical representations of symbolic data by embedding them into hyperbolic space — or more precisely into an n-dimensional Poincaré ball. Due to the underlying hyperbolic geometry, this allows us to learn parsimonious representations of symbolic data by simultaneously capturing hierarchy and similarity. We introduce an efficient algorithm to learn the embeddings based on Riemannian optimization and show experimentally that Poincaré embeddings outperform Euclidean embeddings significantly on data with latent hierarchies, both in terms of representation capacity and in terms of generalization ability.”) As per Claim 8, the combination of Sun, Amir, and Yu teaches the method of Claim 1. However, the combination does not teach wherein the common geometric space comprises a non-Euclidean space including at least one of a hyperbolic, a Lorentzian, and a Poincare ball. Nickel teaches wherein the common geometric space comprises a non-Euclidean space including at least one of a hyperbolic, a Lorentzian, and a Poincare ball. 
(Nickel, Abstract, discloses: “Representation learning has become an invaluable approach for learning from symbolic data such as text and graphs. However, while complex symbolic datasets often exhibit a latent hierarchical structure, state-of-the-art methods typically learn embeddings in Euclidean vector spaces, which do not account for this property. For this purpose, we introduce a new approach for learning hierarchical representations of symbolic data by embedding them into hyperbolic space — or more precisely into an n-dimensional Poincaré ball.” Here, Nickel discloses an alternative to Euclidean space, “hyperbolic space — or more precisely into an n-dimensional Poincaré ball”). Nickel is analogous art because it is in the field of endeavor of content embeddings. It would have been obvious to one of ordinary skill in the art before the effective filing date to combine the multimodal and user feature semantic embedding of the combination of Sun, Amir, and Yu, with the embedding in hyperbolic space of Nickel. The modification would have been obvious because one of ordinary skill in the art would be motivated to capture hierarchical properties and outperform Euclidean embeddings (Nickel, Abstract: “Representation learning has become an invaluable approach for learning from symbolic data such as text and graphs. However, while complex symbolic datasets often exhibit a latent hierarchical structure, state-of-the-art methods typically learn embeddings in Euclidean vector spaces, which do not account for this property. For this purpose, we introduce a new approach for learning hierarchical representations of symbolic data by embedding them into hyperbolic space — or more precisely into an n-dimensional Poincaré ball. Due to the underlying hyperbolic geometry, this allows us to learn parsimonious representations of symbolic data by simultaneously capturing hierarchy and similarity. 
We introduce an efficient algorithm to learn the embeddings based on Riemannian optimization and show experimentally that Poincaré embeddings outperform Euclidean embeddings significantly on data with latent hierarchies, both in terms of representation capacity and in terms of generalization ability.”) As per Claim 12, the combination of Sun, Amir, and Yu teaches the method of Claim 1. However, the combination does not teach inferring information for feature vectors embedded in the semantic embedding space based on a proximity of the feature vectors to at least one other feature vector embedded in the semantic embedding space Nickel teaches inferring information for feature vectors embedded in the semantic embedding space based on a proximity of the feature vectors to at least one other feature vector embedded in the semantic embedding space (Nickel, Page 6: “We then learn embeddings of all symbols in D such that related objects are close in the embedding space.”) Nickel is analogous art because it is in the field of endeavor of content embeddings. It would have been obvious to one of ordinary skill in the art before the effective filing date to combine the multimodal and user feature semantic embedding of the combination of Sun, Amir, and Yu, with the embedding in hyperbolic space of Nickel. The modification would have been obvious because one of ordinary skill in the art would be motivated to capture hierarchical properties and outperform Euclidean embeddings (Nickel, Abstract: “Representation learning has become an invaluable approach for learning from symbolic data such as text and graphs. However, while complex symbolic datasets often exhibit a latent hierarchical structure, state-of-the-art methods typically learn embeddings in Euclidean vector spaces, which do not account for this property. 
For this purpose, we introduce a new approach for learning hierarchical representations of symbolic data by embedding them into hyperbolic space — or more precisely into an n-dimensional Poincaré ball. Due to the underlying hyperbolic geometry, this allows us to learn parsimonious representations of symbolic data by simultaneously capturing hierarchy and similarity. We introduce an efficient algorithm to learn the embeddings based on Riemannian optimization and show experimentally that Poincaré embeddings outperform Euclidean embeddings significantly on data with latent hierarchies, both in terms of representation capacity and in terms of generalization ability.”) Claims 13 and 17 are rejected under 35 U.S.C. 103 as being unpatentable over the combination of Sun, Amir, and Yu in view of Gao et. al. (US 2017/0061250 A1; hereinafter “Gao”). As per Claim 13, this is an apparatus claim corresponding to method Claim 1. The difference is it recites a processor and a memory. The combination of Sun, Amir, and Yu does not explicitly teach a processor and a memory. Gao teaches a processor and a memory (Gao, Para [0036], discloses “Alternately, some or all of the above-referenced data and/or instructions can be stored on separate memories 214 on board one or more processing unit(s) 202 such as a memory on board a CPU-type processor, a GPU-type processor, an FPGA-type accelerator, a DSP-type accelerator, and/or another accelerator.”) For the remaining limitations, Claim 13 is rejected for the same reasons as Claim 1. Gao is analogous art because it is in the field of endeavor of multimodal content embedding. It would have been obvious before the effective filing date of the claimed invention to combine the computer of Gao with the multimodal embedding of Sun, Amir, and Yu. 
One of ordinary skill in the art would be motivated to do so in order to “accelerate” the process (Gao [0036]: “an FPGA-type accelerator, a DSP-type accelerator, and/or another accelerator.”) As per Claim 17, this is a non-transitory computer-readable medium claim corresponding to method Claim 1. The difference is it recites a processor and a non-transitory computer-readable medium. The combination of Sun, Amir, and Yu does not explicitly teach a processor and a non-transitory computer-readable medium. Gao teaches a processor and a non-transitory computer-readable medium (Gao, Para [0097], discloses “A device comprising: a processor; and a computer-readable medium”). For the remaining limitations, Claim 17 is rejected for the same reasons as Claim 1. Gao is analogous art because it is in the field of endeavor of multimodal content embedding. It would have been obvious before the effective filing date of the claimed invention to combine the computer of Gao with the multimodal embedding of Sun, Amir, and Yu. One of ordinary skill in the art would be motivated to do so in order to gain the efficiency of using a computer over doing impractical hand calculations, to thereby “accelerate” the process (Gao [0036]: “an FPGA-type accelerator, a DSP-type accelerator, and/or another accelerator.”) Claims 14-15 and 18-19 are rejected under 35 U.S.C. 103 as being unpatentable over the combination of Sun, Amir, Yu, and Vukotic in view of Gao et. al. (US 2017/0061250 A1; hereinafter “Gao”). As per Claim 14, this is an apparatus claim corresponding to method Claim 2. The difference is it recites a processor and a memory. The combination of Sun, Amir, Yu, and Vukotic does not explicitly teach a processor and a memory. 
Gao teaches a processor and a memory (Gao, Para [0036], discloses “Alternately, some or all of the above-referenced data and/or instructions can be stored on separate memories 214 on board one or more processing unit(s) 202 such as a memory on board a CPU-type processor, a GPU-type processor, an FPGA-type accelerator, a DSP-type accelerator, and/or another accelerator.”) For the remaining limitations, Claim 14 is rejected for the same reasons as Claim 2. Gao is analogous art because it is in the field of endeavor of multimodal content embedding. It would have been obvious before the effective filing date of the claimed invention to combine the computer of Gao with the multimodal embedding of Sun, Amir, Yu, and Vukotic. One of ordinary skill in the art would be motivated to do so in order to “accelerate” the process (Gao [0036]: “an FPGA-type accelerator, a DSP-type accelerator, and/or another accelerator.”) As per Claim 15, this is an apparatus claim corresponding to method Claim 3. The difference is it recites a processor and a memory. The combination of Sun, Amir, Yu, and Vukotic does not explicitly teach a processor and a memory. Gao teaches a processor and a memory (Gao, Para [0036], discloses “Alternately, some or all of the above-referenced data and/or instructions can be stored on separate memories 214 on board one or more processing unit(s) 202 such as a memory on board a CPU-type processor, a GPU-type processor, an FPGA-type accelerator, a DSP-type accelerator, and/or another accelerator.”) For the remaining limitations, Claim 15 is rejected for the same reasons as Claim 3. Gao is analogous art because it is in the field of endeavor of multimodal content embedding. It would have been obvious before the effective filing date of the claimed invention to combine the computer of Gao with the multimodal embedding of Sun, Amir, Yu, and Vukotic. 
One of ordinary skill in the art would be motivated to do so in order to “accelerate” the process (Gao [0036]: “an FPGA-type accelerator, a DSP-type accelerator, and/or another accelerator.”) As per Claim 18, this is a non-transitory computer-readable medium claim corresponding to method Claim 2. The difference is it recites a processor and a non-transitory computer-readable medium. The combination of Sun, Amir, Yu, and Vukotic does not explicitly teach a processor and a non-transitory computer-readable medium. Gao teaches a processor and a non-transitory computer-readable medium (Gao, Para [0097], discloses “A device comprising: a processor; and a computer-readable medium”). For the remaining limitations, Claim 18 is rejected for the same reasons as Claim 2. Gao is analogous art because it is in the field of endeavor of multimodal content embedding. It would have been obvious before the effective filing date of the claimed invention to combine the computer of Gao with the multimodal embedding of Sun, Amir, Yu, and Vukotic. One of ordinary skill in the art would be motivated to do so in order to gain the efficiency of using a computer over doing impractical hand calculations, to thereby “accelerate” the process (Gao [0036]: “an FPGA-type accelerator, a DSP-type accelerator, and/or another accelerator.”) As per Claim 19, this is a non-transitory computer-readable medium claim corresponding to method Claim 3. The difference is it recites a processor and a non-transitory computer-readable medium. The combination of Sun, Amir, Yu, and Vukotic does not explicitly teach a processor and a non-transitory computer-readable medium. Gao teaches a processor and a non-transitory computer-readable medium (Gao, Para [0097], discloses “A device comprising: a processor; and a computer-readable medium”). For the remaining limitations, Claim 19 is rejected for the same reasons as Claim 3. Gao is analogous art because it is in the field of endeavor of multimodal content embedding. 
It would have been obvious before the effective filing date of the claimed invention to combine the computer of Gao with the multimodal embedding of Sun, Amir, Yu, and Vukotic. One of ordinary skill in the art would be motivated to do so in order to gain the efficiency of using a computer over doing impractical hand calculations, to thereby “accelerate” the process (Gao [0036]: “an FPGA-type accelerator, a DSP-type accelerator, and/or another accelerator.”) Claims 16 and 20 are rejected under 35 U.S.C. 103 as being unpatentable over the combination of Sun, Amir, Yu, and Nickel in view of Gao et. al. (US 2017/0061250 A1; hereinafter “Gao”). As per Claim 16, this is an apparatus claim corresponding to method Claim 4. The difference is it recites a processor and a memory. The combination of Sun, Amir, Yu, and Nickel does not explicitly teach a processor and a memory. Gao teaches a processor and a memory (Gao, Para [0036], discloses “Alternately, some or all of the above-referenced data and/or instructions can be stored on separate memories 214 on board one or more processing unit(s) 202 such as a memory on board a CPU-type processor, a GPU-type processor, an FPGA-type accelerator, a DSP-type accelerator, and/or another accelerator.”) For the remaining limitations, Claim 16 is rejected for the same reasons as Claim 4. Gao is analogous art because it is in the field of endeavor of multimodal content embedding. It would have been obvious before the effective filing date of the claimed invention to combine the computer of Gao with the multimodal embedding of Sun, Amir, Yu, and Nickel. One of ordinary skill in the art would be motivated to do so in order to “accelerate” the process (Gao [0036]: “an FPGA-type accelerator, a DSP-type accelerator, and/or another accelerator.”) As per Claim 20, this is a non-transitory computer-readable medium claim corresponding to method Claim 4. The difference is it recites a processor and a non-transitory computer-readable medium. 
The combination of Sun, Amir, Yu, and Nickel does not explicitly teach a processor and a non-transitory computer-readable medium. Gao teaches a processor and a non-transitory computer-readable medium (Gao, Para [0097], discloses “A device comprising: a processor; and a computer-readable medium”). For the remaining limitations, Claim 20 is rejected for the same reasons as Claim 4. Gao is analogous art because it is in the field of endeavor of multimodal content embedding. It would have been obvious before the effective filing date of the claimed invention to combine the computer of Gao with the multimodal embedding of Sun, Amir, Yu, and Nickel. One of ordinary skill in the art would be motivated to do so in order to gain the efficiency of using a computer over doing impractical hand calculations, to thereby “accelerate” the process (Gao [0036]: “an FPGA-type accelerator, a DSP-type accelerator, and/or another accelerator.”) Claim 21 is rejected under 35 U.S.C. 103 as being unpatentable over the combination of Sun, Amir, and Yu in view of Frome et. al. (“DeViSE: A Deep Visual-Semantic Embedding Model”; hereinafter “Frome”). As per Claim 21, the combination of Sun, Amir, and Yu teaches the method of claim 1. Amir teaches a loss function for a user feature vector (Amir, Page 9 Section 7 disclose joint learning of contents and users: “Our model jointly learns and exploits embeddings for the content and users, thus integrating information about the speaker and what he or she has said.” Amir also discloses a loss function in Page 3 Right Column Para 2: “To learn meaningful user embeddings, we seek representations that are predictive of individual word-usage patterns. 
In light of this motivation, we approximate P(w_i | u_j) via the following hinge-loss objective which we aim to minimize.”) It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the teachings of Amir with Sun for at least the reasons recited in the rejection to Claim 1. However, the combination of Sun, Amir, and Yu does not teach wherein the loss function comprises a ranking loss term for at least one user feature vector. Frome teaches wherein the loss function comprises a ranking loss term for at least one user feature vector (Recall above that Amir teaches a loss function for a user feature vector. Also recall that Amir’s loss function was a “hinge loss”. Frome, Top of Page 4, discloses: “The choice of loss function proved to be important. We used a combination of dot-product similarity and hinge rank loss (similar to [20]) such that the model was trained to produce a higher dot-product similarity between the visual model output and the vector representation of the correct label than between the visual output and other randomly chosen text terms. We defined the per training example hinge rank loss.” Here, Frome teaches a ranking loss term (“hinge rank loss”)). Frome is analogous art because it is in the field of endeavor of multimodal content embeddings. It would have been obvious before the effective filing date of the claimed invention to combine the multimodal embeddings of Sun, Amir, and Yu with the ranking loss of Frome. More specifically, it would have been obvious to replace Amir’s “hinge loss” for a user vector with Frome’s “hinge rank loss” for the user feature vector. One of ordinary skill in the art would be motivated to do so in order to achieve increased accuracy of the multimodal embedding (Frome, Page 4 End of Para 2: “We also experimented with an L2 loss between visual and label embeddings, as suggested by Socher et al. 
[18], but that consistently yielded about half the accuracy of the rank loss model. We believe this is because the nearest neighbor evaluation is fundamentally a ranking problem and is best solved with a ranking loss, whereas the L2 loss only aims to make the vectors close to one another but remains agnostic to incorrect labels that are closer to the target image.”) Claim 22 is rejected under 35 U.S.C. 103 as being unpatentable over the combination of Sun, Amir, Yu, and Frome in view of Gao. As per Claim 22, this is an apparatus claim corresponding to method Claim 21. The difference is it recites a processor and a memory. The combination of Sun, Amir, Yu, and Frome does not explicitly teach a processor and a memory. Gao teaches a processor and a memory (Gao, Para [0036], discloses “Alternately, some or all of the above-referenced data and/or instructions can be stored on separate memories 214 on board one or more processing unit(s) 202 such as a memory on board a CPU-type processor, a GPU-type processor, an FPGA-type accelerator, a DSP-type accelerator, and/or another accelerator.”) For the remaining limitations, Claim 22 is rejected for the same reasons as Claim 21. Gao is analogous art because it is in the field of endeavor of multimodal content embedding. It would have been obvious before the effective filing date of the claimed invention to combine the computer of Gao with the multimodal embedding of Sun, Amir, Yu, and Frome. One of ordinary skill in the art would be motivated to do so in order to “accelerate” the process (Gao [0036]: “an FPGA-type accelerator, a DSP-type accelerator, and/or another accelerator.”) Conclusion Applicant's amendment necessitated the new ground(s) of rejection presented in this Office action. Accordingly, THIS ACTION IS MADE FINAL. See MPEP § 706.07(a). Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a). 
A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action. In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any nonprovisional extension fee (37 CFR 1.17(a)) pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action. In no event, however, will the statutory period for reply expire later than SIX MONTHS from the mailing date of this final action. Any inquiry concerning this communication or earlier communications from the examiner should be directed to ANN J LO whose telephone number is (571)272-9767. The examiner can normally be reached Monday-Friday, 9 AM to 5 PM. Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice. If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Cordelia Zecher can be reached at 571-272-7771. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300. Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. 
For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000. /ANN J LO/ Supervisory Patent Examiner, Art Unit 2159
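
For readers less familiar with the Nickel reference relied on for Claims 4, 8, and 12, the following is a minimal sketch of the Poincaré-ball distance defined in that paper, together with a nearest-neighbor lookup by proximity. The function names, the toy coordinates, and the lookup itself are illustrative assumptions; nothing here is taken from the application or from the examiner's claim mapping.

```python
import math


def poincare_distance(u, v):
    """Distance between two points strictly inside the unit (Poincare) ball.

    d(u, v) = arcosh(1 + 2 * ||u - v||^2 / ((1 - ||u||^2) * (1 - ||v||^2)))
    """
    sq_norm_u = sum(x * x for x in u)
    sq_norm_v = sum(x * x for x in v)
    if sq_norm_u >= 1.0 or sq_norm_v >= 1.0:
        raise ValueError("points must lie strictly inside the unit ball")
    sq_diff = sum((a - b) ** 2 for a, b in zip(u, v))
    arg = 1.0 + 2.0 * sq_diff / ((1.0 - sq_norm_u) * (1.0 - sq_norm_v))
    return math.acosh(arg)


def nearest(query, embedded):
    """Return the key of the embedded vector closest to `query` (hypothetical helper)."""
    return min(embedded, key=lambda name: poincare_distance(query, embedded[name]))


# Toy 2-D embeddings; the names and coordinates are made up for illustration.
embedded = {
    "root": (0.0, 0.0),
    "child_a": (0.6, 0.0),
    "child_b": (0.0, -0.7),
}
query = (0.5, 0.1)  # a newly projected item
closest = nearest(query, embedded)
print(closest, round(poincare_distance(query, embedded[closest]), 3))
```

Distances grow rapidly as points approach the boundary of the ball, which is the property the Nickel abstract ties to representing hierarchies; the lookup treats the closest embedded vector as the related item.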

Prosecution Timeline

29 earlier events not shown
Apr 10, 2024
Response after Non-Final Action
Apr 10, 2024
Response after Non-Final Action
Jun 04, 2025
Response after Non-Final Action
Jul 31, 2025
Request for Continued Examination
Aug 05, 2025
Response after Non-Final Action
Nov 21, 2025
Non-Final Rejection mailed — §103, §112
Feb 23, 2026
Response Filed
Apr 01, 2026
Final Rejection mailed — §103, §112 (current)
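
The reply periods stated in the Conclusion of the Office action can be turned into calendar dates from the mailing date shown in this timeline. The sketch below is a rough illustration, not docketing advice: `add_months` and `reply_window` are hypothetical helper names, the end-of-month clamping rule is an assumption, and neither weekend or federal-holiday adjustments nor the advisory-action exception is modeled.

```python
import calendar
from datetime import date


def add_months(start: date, months: int) -> date:
    """Same day-of-month `months` later, clamped to the last day of that month (assumed rule)."""
    month_index = start.month - 1 + months
    year = start.year + month_index // 12
    month = month_index % 12 + 1
    last_day = calendar.monthrange(year, month)[1]
    return date(year, month, min(start.day, last_day))


def reply_window(mailing_date: date) -> dict:
    """Key dates named in the Office action's Conclusion, measured from the mailing date."""
    return {
        "two_month_mark": add_months(mailing_date, 2),         # first-reply threshold
        "shortened_period_ends": add_months(mailing_date, 3),  # THREE MONTHS
        "statutory_maximum": add_months(mailing_date, 6),      # SIX MONTHS
    }


# Mailing date taken from the timeline entry "Apr 01, 2026 Final Rejection mailed".
window = reply_window(date(2026, 4, 1))
for label, day in window.items():
    print(f"{label}: {day.isoformat()}")
```

Replies filed after the shortened period but before the six-month maximum generally require an extension of time under 37 CFR 1.136(a), as the Office action notes; any real deadline should be confirmed against the official record.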

Precedent Cases

Applications granted by this same examiner with similar technology

Patent 12614108
METHOD FOR RECOMMENDING INFORMATION, RECOMMENDATION SERVER, AND STORAGE MEDIUM
4y 0m to grant Granted Apr 28, 2026
Patent 12602228
NEUROMORPHIC PROCESSOR AND NEUROMORPHIC PROCESSING METHOD
4y 6m to grant Granted Apr 14, 2026
Patent 12602597
AUTOMATIC DISCOVERY OF MACHINE LEARNING MODEL FEATURES
4y 6m to grant Granted Apr 14, 2026
Patent 12566146
INFORMATION PROCESSING APPARATUS, INFORMATION PROCESSING METHOD, NON-TRANSITORY COMPUTER READABLE MEDIA STORING PROGRAM, AND X-RAY ANALYSIS APPARATUS
3y 3m to grant Granted Mar 03, 2026
Patent 12541710
COMPUTERIZED SYSTEMS AND METHODS FOR USER ACTION PREDICTION
3y 10m to grant Granted Feb 03, 2026
Study what changed to get past this examiner. Based on 5 most recent grants.

Prosecution Projections

9-10
Expected OA Rounds
45%
Grant Probability
71%
With Interview (+26.7%)
4y 7m (~0m remaining)
Median Time to Grant
High
PTA Risk
Based on 224 resolved cases by this examiner. Grant probability derived from career allowance rate.
