DETAILED ACTION
This Office action is in response to the amendments filed on 10/31/2025.
Claims 1, 8, 14, and 19 have been amended. Claims 7 and 20 have been canceled. Claims 21-22 have been added. Claims 1-6, 8-19, and 21-22 are pending.
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Response to Arguments
Objection to the Specification:
In light of applicant’s amendment to the specification (pg. 2), the objection to the specification has been withdrawn.
Objection to the Claims:
In light of applicant’s amendments to the claims (pg. 3-9), the objection to the claims has been withdrawn.
35 U.S.C. 112(b) rejection:
In light of applicant’s amendments to the claims (pg. 3-9), the rejection under 35 U.S.C. 112(b) has been withdrawn. A new rejection under 35 U.S.C. 112(b) has been entered against amended claim 14 due to a lack of antecedent basis for the limitation “the determined inner products” (see rejection below). This appears to be a simple typographical error: the amended limitation directed to “determining…an inner product” in system claim 1 is missing from method claim 14.
35 U.S.C. 101 rejection:
Applicant’s arguments regarding the claim rejections under 35 U.S.C. 101 (pg. 13-22) have been fully considered and are persuasive. The rejection of the pending claims has been withdrawn.
Prior Art Rejections:
Applicant's arguments regarding the prior art rejections (pg. 23-32) have been fully considered but they are not persuasive.
Applicant argues that the amended independent claims 1, 8, and 14 recite features not disclosed by any of the cited references. Specifically, applicant argues that the cited references do not disclose training using contrastive loss based on pairwise co-click signals. Examiner respectfully notes that, as can be seen in the rejection below, these features are taught by the combination of Yu and Yao.
Yu teaches training based on pairwise co-click signals. Yu teaches that “In real-world applications, the input of one tower has a positive interaction with the input of the other for each click” (Yu, pg. 1, section 1) (i.e. positive query-item pairs represent co-click interactions), and that “for the query in each positive query-item pair (label = 1), we randomly sample 𝑆 items from the item corpus to create 𝑆 negative query-item pairs (label = 0) with this query, and add these 𝑆 + 1 pairs to the training dataset” (Yu, pg. 3, section 2.3) (i.e. these query-item pairs are used to train the model). Applicant asserts that these query-item pairs are not co-click signals because a co-click signal tracks the co-occurrence of clicks on two content items (a search result and an instance of recommended content). Applicant points to the definition of a co-click signal set forth in specification paragraph 0015, which reads: “As used herein, a co-click signal may associate an instance of content with which a user has interacted (e.g., by clicking or tapping on the content, zooming in on the content, etc.) with content that is responsive to a search query, thereby indicating that the instance of content is a ‘positive pair’ in relation to the responsive content.” Examiner respectfully notes that under the broadest reasonable interpretation of the claim in light of this definition, Yu’s query-item pairs are co-click signals, as a positive query-item pair associates an item (i.e. an instance of content with which the user has interacted) with a search query, and thus implicitly associates the item with content that is responsive to that search query.
Yao teaches training using contrastive loss. Yao teaches a two tower content recommendation model trained by “adopt[ing] similar contrastive learning algorithms for learning representations of categorical features…us[ing] contrastive loss function to encourage the representations learned for the same training example to be similar” (Yao, pg. 3, section 3.1).
Yu and Yao suggest the combination of contrastive loss and pairwise co-click signals. Yu’s two-tower recommendation model is trained using pairwise co-click signals and cross-entropy loss. Replacing Yu’s cross-entropy loss with Yao’s contrastive loss amounts to a simple substitution of known alternatives. Further, while Yao is primarily concerned with using contrastive loss to compare an augmented content item to itself, Yao explicitly acknowledges the viability of using contrastive loss for query-item pairs: “Contrastive loss was also applied in training two-tower DNNs… to make positive item agree with its corresponding queries” (Yao, pg. 3, section 3.1).
In light of the amendments to the claims, the anticipation rejections have been withdrawn and replaced with obviousness rejections. Claims 1-4, 6, 14-17, 19, and 22 are now rejected as being unpatentable over Yu in view of Yao. Claims 5 and 18 are now rejected as being unpatentable over Yu in view of Yao and further in view of Xiao. Claims 8-9, 12-13, and 21 are now rejected as being unpatentable over Zamani in view of Yu and Yao. Claims 10-11 are now rejected as being unpatentable over Zamani in view of Yu and Yao and further in view of Wang.
The prior art rejections have been updated to include the amended limitations and to clarify the reasoning given for the limitations that were not amended.
Claim Rejections - 35 USC § 112
The following is a quotation of 35 U.S.C. 112(b):
(b) CONCLUSION.—The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the inventor or a joint inventor regards as the invention.
The following is a quotation of 35 U.S.C. 112 (pre-AIA), second paragraph:
The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the applicant regards as his invention.
Claims 14-19 and 22 are rejected under 35 U.S.C. 112(b) or 35 U.S.C. 112 (pre-AIA), second paragraph, as being indefinite for failing to particularly point out and distinctly claim the subject matter which the inventor or a joint inventor (or for applications subject to pre-AIA 35 U.S.C. 112, the applicant), regards as the invention.
Claim 14 recites the limitation "the determined inner products" on lines 1-2 of page 8. There is insufficient antecedent basis for this limitation in the claim. For examination purposes, the claim will be interpreted as though it includes the limitation directed to “determining…an inner product” which is present in the similarly amended independent claim 1.
Claims 15-19 and 22 are additionally rejected due to their dependence on claim 14 for the reason outlined above.
Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.
Claims 1-4, 6, 14-17, 19, and 22 are rejected under 35 U.S.C. 103 as being unpatentable over
Yu et al. (hereinafter Yu), “A Dual Augmented Two-tower Model for Online Large-scale Recommendation” in view of
Yao et al. (hereinafter Yao), “Self-supervised Learning for Large-scale Item Recommendations”.
Regarding Claim 1,
Yu teaches A system comprising: at least one processor; and memory storing instructions that, when executed by the at least one processor, causes the system to perform a set of operations, (Examiner notes that this limitation is interpreted as a general-purpose computing environment. Yu, pg. 3, section 3.2: “We implemented these models by distributed TensorFlow…” The use of TensorFlow necessitates implementation of the described system in a computing environment.)
Yu teaches the set of operations comprising:
obtaining a search query; (Pg. 2, section 2.1: “Our objective is to efficiently select possibly thousands of candidate items from the entire item corpus given a certain query.”)
based on the search query, obtaining one or more candidate content items; (Pg. 2, section 2.1: “We consider a recommendation system with a query set {u_i}_{i=1}^N and an item set {v_j}_{j=1}^M, where 𝑁 is the number of users and 𝑀 is the number of items.” Pg. 3, section 3.1: “All models were evaluated on two offline large-scale datasets: a large dataset sampled from the daily logs of online systems on Meituan and a dataset from Amazon [3].” Item set {v_j}_{j=1}^M comprises candidate content items. During evaluation, a search query u is associated with a dataset, which includes a set of items (see pg. 3, table 1) (i.e. the set of candidate items is based on the search query).)
generating a set of features for the search query thereby generating a set of search query features; (Pg. 2, section 2.1-2.2: “We consider a recommendation system with a query set {u_i}_{i=1}^N and an item set {v_j}_{j=1}^M…each feature f𝑖 ∈ R (e.g., an item ID) in u𝑖 and v𝑗 goes through an embedding layer…For a certain query and candidate item, we create two corresponding augmented vectors a𝑢 and a𝑣 by their IDs, and concatenate them with feature embedding vectors to obtain the augmented input vectors z𝑢, z𝑣 of the two towers.” Augmented input vector z𝑢 is the generated set of features for a query u. Figure 1 (pg. 2) shows the architecture of the proposed model, where the left tower is associated with the query, and the ‘embedding’ represents the generated set of search query features.)
generating a set of features for each candidate content item thereby generating candidate content item features for each candidate content item; (See the cited portion of sections 2.1-2.2 above. Augmented input vector z𝑣 is the generated set of features for an item v (i.e. candidate content item). Figure 1 (pg. 2) shows the architecture of the proposed model, where the right tower is associated with the content item, and the ‘embedding’ represents the generated set of candidate content item features.)
generating, by inputting the search query features into a pre-trained search query tower, a first feature vector corresponding to the search query; (Pg. 2, section 2.2: “Next, we feed z𝑢 and z𝑣 into the two towers… to get augmented representations of query p𝑢 and item p𝑣… p𝑢 and p𝑣, the output vectors of the L2 normalization layer, represent the query embedding and item embedding, respectively.” The augmented input vector z𝑢 (i.e. the search query features) is fed into the query tower of the two-tower model to obtain output vector p𝑢 (i.e. a first feature vector corresponding to the search query).)
for each candidate content item, generating, by inputting the candidate content item features into a pre-trained related content tower, a respective second feature vector for each candidate content item; (See the cited portion of section 2.2 above. Each augmented input vector z𝑣 (i.e. the candidate content item features) is fed into the item tower of the two-tower model to obtain output vector p𝑣 (i.e. a respective second feature vector for each candidate content item).)
determining, for each candidate content item, an inner product between the first feature vector and the respective second feature vector for the candidate content item; and (Pg. 3, section 2.2: “Finally, the output of the model is the inner product of the query embedding and item embedding: 𝑠 (u, v) = ⟨p𝑢, p𝑣⟩ where 𝑠 (u, v) denotes the score provided by our retrieval model.”)
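For illustration only, the scoring operation quoted above can be sketched as follows. This sketch is hypothetical (the function names and example vectors are not taken from Yu); it shows only that the score 𝑠 (u, v) = ⟨p𝑢, p𝑣⟩ is the inner product of L2-normalized query and item embeddings:

```python
import numpy as np

def l2_normalize(x):
    # L2 normalization layer: scale the embedding to unit length
    return x / np.linalg.norm(x)

def score(query_emb, item_emb):
    # s(u, v) = <p_u, p_v>: inner product of the normalized
    # query embedding p_u and item embedding p_v
    return float(np.dot(l2_normalize(query_emb), l2_normalize(item_emb)))

# Score one query against each candidate item
query = np.array([0.2, 0.5, 0.1])
items = [np.array([0.2, 0.5, 0.1]),      # same direction as the query
         np.array([-0.2, -0.5, -0.1])]   # opposite direction
scores = [score(query, item) for item in items]
```

Because both embeddings are normalized, the score is bounded in [-1, 1] and directly ranks candidate items by relevance to the query.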
generating, based on the outputs of the trained towers and the determined inner products, (Pg. 2, section 2.1: “Our objective is to efficiently select possibly thousands of candidate items from the entire item corpus given a certain query.” Candidate items (i.e. a set of recommended content items) are selected for the search query based on the model output (i.e. based on the outputs of the trained towers and the determined inner products).)
wherein the search query tower and the related content tower are trained, [using contrastive loss] based on pairwise co-click signals associated with a plurality of search queries, such that the inner product indicates a likelihood that a particular candidate content item is relevant to the search query. (Pg. 1, section 1: “In real-world applications, the input of one tower has a positive interaction with the input of the other for each click.” Pg. 3, section 2.3: “Model Training… Specifically, for the query in each positive query-item pair (label = 1), we randomly sample 𝑆 items from the item corpus to create 𝑆 negative query-item pairs (label = 0) with this query, and add these 𝑆 + 1 pairs to the training dataset. The cross-entropy loss for these pairs is as follows…” Pg. 3, section 2.2: “Finally, the output of the model is the inner product of the query embedding and item embedding: 𝑠 (u, v) = ⟨p𝑢, p𝑣⟩ where 𝑠 (u, v) denotes the score provided by our retrieval model.” A positive query-item pair is a co-click signal, and these query-item pairs make up the training data used to train the two-tower model. The inner product between the query and item embeddings represents the retrieval model’s relevance score.)
Yu does not appear to explicitly disclose using contrastive loss.
However, Yao teaches using contrastive loss (Pg. 3, section 3.1: “Inspired by the SimCLR framework [5] for visual representation learning, we adopt similar contrastive learning algorithms for learning representations of categorical features…use contrastive loss function to encourage the representations learned for the same training example to be similar.”)
It would have been obvious to one of ordinary skill in the art before the effective filing date of the present application to combine Yu and Yao. Yu teaches a two-tower model for query-based online content recommendation using cross-entropy loss. Yao teaches a two-tower model for content recommendation using contrastive learning. One of ordinary skill would have motivation to combine Yu and Yao in order to “leverage self-supervised learning based auxiliary tasks to improve item representations, especially with long-tail distributions and sparse data” (Yao, pg. 2, section 1). According to Yao, contrastive learning can be used for exactly the type of two-tower query-item recommender model presented by Yu: “Contrastive loss was also applied in training two-tower DNNs… to make positive item agree with its corresponding queries” (Yao, pg. 3, section 3.1).
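For illustration only, the simple-substitution rationale above can be sketched as follows. The sketch is hypothetical and not drawn from either reference; it shows that Yu's cross-entropy over one positive pair and 𝑆 sampled negative pairs and a SimCLR-style contrastive loss as in Yao both take the form of a softmax cross-entropy over similarity scores:

```python
import numpy as np

def softmax_ce(scores, positive_index=0):
    # Softmax cross-entropy over similarity scores with one positive pair;
    # this is the shared form of Yu's sampled cross-entropy (1 positive +
    # S negatives) and a SimCLR-style contrastive loss as in Yao
    shifted = scores - np.max(scores)                  # numerical stability
    log_probs = shifted - np.log(np.sum(np.exp(shifted)))
    return float(-log_probs[positive_index])

# One positive query-item pair (index 0) plus S = 3 sampled negatives
pair_scores = np.array([0.9, 0.1, -0.3, 0.2])
loss = softmax_ce(pair_scores)
```

Raising the positive pair's score relative to the negatives lowers the loss, which is the training signal common to both formulations.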
Regarding Claim 2, Yu and Yao teach The system of claim 1, as shown above.
Yu also teaches wherein training the search query tower and related content tower further comprises utilizing a cascaded multilayer perceptron model comprised of multiple layers. (Pg. 2, section 2.2: “…the two towers, which are composed of fully connected layers…” Figure 1 (pg. 2) shows the architecture of the proposed model, where each of the query tower and item tower include multiple cascaded layers labeled ‘Linear + ReLU’ representing fully connected neural network layers (i.e. multilayer perceptron layers).)
Regarding Claim 3, Yu and Yao teach The system of claim 2, as shown above.
Yu also teaches wherein the cascaded multilayer perceptron model layers comprises one or more of an expand layer or a bottleneck layer, and between the multilayer perceptron layers data scaling is performed. (Pg. 3, section 3.2: “the number of FC layers in each tower was fixed to 3, with dimensions 256, 128 and 32, respectively.” Reducing dimensions between fully connected layers (e.g. 256 → 128) equates to a bottleneck layer. Pg. 2, section 2.2: “the two towers, which are composed of fully connected layers with the ReLU activation function…” ReLU activation, which occurs between the layers of the model, is a form of data scaling.)
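For illustration only, the tower structure described above (three fully connected layers with dimensions 256, 128, and 32, with ReLU between layers) can be sketched as follows; the weights and input dimension are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(x):
    return np.maximum(x, 0.0)

def make_tower(dims):
    # Cascaded MLP: each pair of successive dims defines one fully
    # connected layer; shrinking widths (256 -> 128 -> 32) act as
    # bottleneck layers
    return [rng.standard_normal((d_in, d_out)) * 0.1
            for d_in, d_out in zip(dims[:-1], dims[1:])]

def forward(tower, x):
    # ReLU between layers rescales activations into [0, inf)
    for weights in tower:
        x = relu(x @ weights)
    return x

tower = make_tower([64, 256, 128, 32])   # hypothetical input dim 64
out = forward(tower, rng.standard_normal(64))
```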
Regarding Claim 4, Yu and Yao teach The system of claim 3, as shown above.
Yu also teaches wherein data scaling comprises one or more of data standardization or data normalization including batch normalization, activation, and dropout. (Pg. 2, section 2.2: “the two towers, which are composed of fully connected layers with the ReLU activation function…” ReLU activation, which occurs between the layers of the model, is a form of data scaling.)
Regarding Claim 6, Yu and Yao teach The system of claim 1, as shown above.
Yu also teaches wherein generating the set of features further comprises pre-processing the search query to generate a set of low-dimensional features using at least one of feature scaling, centering, or dimensionality reduction. (Pg. 2, section 2.2.1: “each feature f𝑖 ∈ R (e.g., an item ID) in u𝑖 and v𝑗 goes through an embedding layer and is mapped to a low-dimensional dense vector e𝑖 ∈ R𝐾, where 𝐾 is the embedding dimension. Specifically, we define an embedding matrix E ∈ R𝐾×𝐷 where E is to be learned and 𝐷 is the number of unique features, and the embedding vector e𝑖 is the 𝑖th column of the embedding matrix E.” This is a form of dimensionality reduction.)
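For illustration only, the embedding lookup described above (embedding matrix E ∈ R𝐾×𝐷, with the embedding vector e𝑖 as the 𝑖th column of E) can be sketched as follows; the dimensions and values are hypothetical:

```python
import numpy as np

K, D = 8, 1000    # embedding dimension K, number of unique features D
E = np.random.default_rng(1).standard_normal((K, D))  # learned matrix E

def embed(feature_id):
    # Map a sparse feature (e.g. an item ID, one of D possible values)
    # to a K-dimensional dense vector: the feature_id-th column of E
    return E[:, feature_id]

e42 = embed(42)
```

Because K is much smaller than D, the lookup maps each high-cardinality feature to a low-dimensional dense vector, i.e. a form of dimensionality reduction.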
Claims 14-17 and 19 are method claims, containing substantially the same elements as system claims 1-4 and 6, respectively. Yu and Yao teach the elements of claims 1-4 and 6, as shown above.
Regarding Claim 22, Yu and Yao teach The method of claim 14, as shown above.
Yao also teaches wherein the search query tower and the related content tower are trained using supervised contrastive loss or self-supervised representation learning employing the contrastive loss. (Pg. 3, section 3.1: “Inspired by the SimCLR framework [5] for visual representation learning, we adopt similar contrastive learning algorithms for learning representations of categorical features…use contrastive loss function to encourage the representations learned for the same training example to be similar.”)
Claims 5 and 18 are rejected under 35 U.S.C. 103 as being unpatentable over Yu in view of Yao and further in view of
Xiao et al. (hereinafter Xiao), “UPRec: User-Aware Pre-training for Recommender Systems”.
Regarding Claim 5, Yu and Yao teach The system of claim 2, as shown above.
Yu and Yao do not appear to explicitly disclose wherein a skip connection is utilized between an initial input layer and another multilayer perceptron layer of the cascaded multilayer perceptron model.
However, Xiao teaches wherein a skip connection is utilized between an initial input layer and another multilayer perceptron layer of the cascaded multilayer perceptron model. (Pg. 4, section 3.2.3: “In addition to multi-head self-attention layer, each transformer layer also contains a fully connected feed-forward layer… Then a residual connection [48] and layer normalization operation [48] are employed…” Figure 2 (pg. 4) shows the architecture of the transformer layer of the model, where the arrow connecting the input sequence directly to the ‘Add & Layer Normalization’ block represents a residual connection (skip connection) between the input layer and another fully connected (MLP) layer.)
It would have been obvious to one of ordinary skill in the art before the effective filing date of the present application to combine Yu, Yao, and Xiao. Yu teaches a two-tower model for query-based online content recommendation using cross-entropy loss. Yao teaches a two-tower model for content recommendation using contrastive learning. Xiao teaches enhancements to content recommendation systems using user-aware pretraining. One of ordinary skill would have motivation to combine Yu, Yao, and Xiao because residual connections “stabilize and accelerate the network training process” (Xiao, pg. 4, section 3.2.3).
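For illustration only, the residual (skip) connection with layer normalization described in the cited portion of Xiao can be sketched as follows; the dimensions and weights are hypothetical:

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def layer_norm(x, eps=1e-5):
    # Normalize one example's activations to zero mean, unit variance
    return (x - x.mean()) / np.sqrt(x.var() + eps)

def residual_block(x, weights):
    # Feed-forward sublayer plus a residual (skip) connection from the
    # block input, followed by layer normalization ("Add & Layer Norm")
    return layer_norm(x + relu(x @ weights))

rng = np.random.default_rng(2)
x = rng.standard_normal(16)
out = residual_block(x, rng.standard_normal((16, 16)) * 0.1)
```

The addition of the unmodified input x to the sublayer output is the skip connection between the input and a subsequent fully connected layer.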
Claim 18 is a method claim, containing substantially the same elements as system claim 5. Yu, Yao, and Xiao teach the elements of claim 5, as shown above.
Claims 8-9, 12-13, and 21 are rejected under 35 U.S.C. 103 as being unpatentable over
Zamani et al. (hereinafter Zamani), “Joint Modeling and Optimization of Search and Recommendation” in view of Yu and Yao.
Regarding Claim 8,
Zamani teaches A method comprising:
receiving a search request including a search query for content; (Pg. 2, section 2.2: “JSR [joint search and recommendation] is a general framework for jointly modeling search and recommendation and consists of two major components: a retrieval component and a recommendation component. The retrieval component computes the retrieval score for an item i given a query q and a query context cq.” Query q is a search query, and item i is content.)
generating, using a related content model including a two-tower cascaded multilayer perceptron model, a set of recommended content associated with both the search query and an instance of content that is responsive to the search query, [the related content model being trained using contrastive loss based on pairwise co-click signals associated with a plurality of search queries]; and (Pg. 3, section 2.3: “we simply use fully-connected feed-forward networks to implement the components of the JSR [joint search and recommendation] framework.” Fully-connected feed-forward networks are multilayer perceptron models. Figure 2 (pg. 3) shows the architecture of the framework, where these networks are arranged in a cascading structure, with a separate retrieval model and recommendation model which share a loss function (i.e. a two-tower model). Figure 1 (pg. 1) shows an overview of the system, which includes inputting a ‘search query’ and outputting both a ‘Recommendation List’ and a ‘Search Result List’.)
providing, in response to the search request, the generated set of recommended content in association with the instance of content that is responsive to the search query. (Figure 1 (pg. 1) shows an overview of the system, including inputting a ‘search query’ and outputting both a ‘Recommendation List’ and a ‘Search Result List’ which are then provided for model evaluation. Pg. 5, section 3.3: “To evaluate the retrieval model, we use mean average precision (MAP) of the top 100 retrieved items and normalized discounted cumulative gain (NDCG) of the top 10 retrieved items (NDCG@10). To evaluate the recommendation performance, we use NDCG, hit ratio (Hit), and recall. The cut-off for all recommendation metrics is 10. Hit ratio is defined as the ratio of users that are recommended at least one relevant item.” The returned item lists are compared to actual relevant items for the search query/user to which they are responsive.)
Zamani does not appear to explicitly disclose the related content model being trained using contrastive loss based on pairwise co-click signals associated with a plurality of search queries.
However, Yu teaches the related content model being trained [using contrastive loss] based on pairwise co-click signals associated with a plurality of search queries (Pg. 1, section 1: “In real-world applications, the input of one tower has a positive interaction with the input of the other for each click.” Pg. 3, section 2.3: “Model Training… Specifically, for the query in each positive query-item pair (label = 1), we randomly sample 𝑆 items from the item corpus to create 𝑆 negative query-item pairs (label = 0) with this query, and add these 𝑆 + 1 pairs to the training dataset. The cross-entropy loss for these pairs is as follows…” A positive query-item pair is a co-click signal, and these query-item pairs make up the training data used to train the two-tower recommender model.)
It would have been obvious to one of ordinary skill in the art before the effective filing date of the present application to combine Zamani and Yu. Zamani teaches a two-tower joint framework for search and recommendation based on a search query and an item set. Yu teaches a two-tower model for query-based online content recommendation trained using co-click signals. One of ordinary skill would have motivation to combine Zamani and Yu in order to “provide[] deeper insights into the information interaction of two-tower models in the retrieval task” (Yu, pg. 1-2, section 1).
Zamani and Yu do not appear to explicitly disclose using contrastive loss.
However, Yao teaches using contrastive loss (Pg. 3, section 3.1: “Inspired by the SimCLR framework [5] for visual representation learning, we adopt similar contrastive learning algorithms for learning representations of categorical features…use contrastive loss function to encourage the representations learned for the same training example to be similar.”)
It would have been obvious to one of ordinary skill in the art before the effective filing date of the present application to combine Zamani, Yu, and Yao. Zamani teaches a two-tower joint framework for search and recommendation based on a search query and an item set. Yu teaches a two-tower model for query-based online content recommendation using co-click signals and cross-entropy loss. Yao teaches a two-tower model for content recommendation using contrastive learning. One of ordinary skill would have motivation to combine Zamani, Yu, and Yao in order to “leverage self-supervised learning based auxiliary tasks to improve item representations, especially with long-tail distributions and sparse data” (Yao, pg. 2, section 1). According to Yao, contrastive learning can be used for exactly the type of two-tower query-item recommender model presented by Zamani and Yu: “Contrastive loss was also applied in training two-tower DNNs… to make positive item agree with its corresponding queries” (Yao, pg. 3, section 3.1).
Regarding Claim 9, Zamani, Yu, and Yao teach The method of claim 8, as shown above.
Zamani also teaches wherein: the instance of content is a first instance of content; and the related content model is trained using a set of co-click signals that includes: a positive pair between a second instance of content and a first instance of recommended content; and a negative pair between the second instance of content and a second instance of recommended content. (Pg. 2, section 2.1: “Also, let DRS = {(u1, I1), (u2, I2), ··· , (um, Im)} be a set of recommendation data where Ii ⊆ I denotes the set of items favored (e.g., purchased) by the user… assume that DRS is split to two disjoint subsets DRStrain and DRStest…” Pg. 3, section 2.2: “we train the JSR framework by minimizing a joint loss function L that is equal to the sum of retrieval loss and recommendation loss… The recommendation loss is also defined similarly; for each user uj, we draw a positive sample ij from the user’s favorite items (i.e., Ij in DRStrain), and a random negative sample īj from I.” Items purchased by a user represent a co-click signal between the user and the item. A positive sample drawn from DRStrain (i.e. first instance of recommended content) forms a positive pair with the user (i.e. the second instance of content). A negative sample drawn from I (i.e. second instance of recommended content) forms a negative pair with the user (i.e. the second instance of content). The framework is trained using these pairs.)
Regarding Claim 12, Zamani, Yu, and Yao teach The method of claim 8, as shown above.
Zamani also teaches wherein the search request further includes an indication of the instance of content that is responsive to the search query. (Pg. 4, section 3.1: “The Amazon product data does not contain search queries… Van Gysel et al. [23] proposed to automatically generate queries based on the product categories. To be exact, for each item in a category c, a query q is generated based on the terms in the category hierarchy of c. Then, all the items within that category are marked as relevant for the query q.” This creates labeled training data, where items in category c are indicated as being responsive content to query q.)
Regarding Claim 13, Zamani, Yu, and Yao teach The method of claim 8, as shown above.
Zamani also teaches further comprising identifying the instance of content that is responsive to the search query. (Figure 1 (pg. 1) shows an overview of the proposed system, which includes outputting a ‘Search Result List’ (i.e. content responsive to the search query) which is distinct from the ‘Recommendation List’ (i.e. clearly identified).)
Regarding Claim 21, Zamani, Yu, and Yao teach The method of claim 8, as shown above.
Yao also teaches wherein the related content model is trained using supervised contrastive loss or self-supervised representation learning employing the contrastive loss. (Pg. 3, section 3.1: “Inspired by the SimCLR framework [5] for visual representation learning, we adopt similar contrastive learning algorithms for learning representations of categorical features…use contrastive loss function to encourage the representations learned for the same training example to be similar.”)
Claims 10-11 are rejected under 35 U.S.C. 103 as being unpatentable over Zamani in view of Yu and Yao and further in view of
Wang et al. (hereinafter Wang), “Learning Two-Branch Neural Networks for Image-Text Matching Tasks”.
Regarding Claim 10, Zamani, Yu, and Yao teach The method of claim 8, as shown above.
Zamani, Yu, and Yao do not appear to explicitly disclose wherein a first tower of the two-tower cascaded multilayer perceptron model is associated with a first content type and a second tower of the two-tower cascaded multilayer perceptron model is associated with a second content type.
However, Wang teaches wherein a first tower of the two-tower cascaded multilayer perceptron model is associated with a first content type and a second tower of the two-tower cascaded multilayer perceptron model is associated with a second content type. (Pg. 1-2, section 1: “As suggested by the above discussion, the network architecture for these tasks should consist of two branches that take in image and text features respectively, pass them through one or more layers of transformations, fuse them, and eventually output a learned similarity score.” A branch is equivalent to a tower. As can be seen in figure 1 (pg. 2), the right branch takes input Y representing a text query (first content type), the left branch takes input X representing an image (second content type), and each branch consists of fully connected layers (i.e. MLPs).)
It would have been obvious to one of ordinary skill in the art before the effective filing date of the present application to combine Zamani, Yu, Yao, and Wang. Zamani teaches a two-tower joint framework for search and recommendation based on a search query and an item set. Yu teaches a two-tower model for query-based online content recommendation using co-click signals and cross-entropy loss. Yao teaches a two-tower model for content recommendation using contrastive learning. Wang teaches a two-branch multimodal retrieval model for sentence-to-image and image-to-sentence search. One of ordinary skill would have motivation to combine Zamani, Yu, Yao, and Wang because, according to Zamani, “any service that provides both search and recommendation functionalities can benefit from such joint modeling and optimization. This includes…media sharing services, such as YouTube…” (Zamani, pg. 2, section 1). Wang’s model obtains “near state-of-the-art accuracies on bi-directional image-sentence retrieval on Flickr30K [14] and MSCOCO [17] datasets” (Wang, pg. 3, section 1), and thus the combination would allow the extension of the capabilities of Zamani’s search and recommendation framework into the realm of multimodal search on media sharing services such as Flickr.
Regarding Claim 11, Zamani, Yu, Yao, and Wang teach The method of claim 10, as shown above.
Wang also teaches wherein the first content type is a text content type associated with the search query and the second content type is an image content type. (Pg. 4, section 3.1: “Our second task, bi-directional image-sentence retrieval, refers both to image-to-sentence and sentence-to-image search. The definitions of the two scenarios are straightforward: given an input image (resp. sentence), the goal is to find the best matching sentences (resp. images) from a database.” In image-to-sentence search, text content associated with the search query is used to retrieve image content. Again, figure 1 (pg. 2) shows that the right branch (tower) takes text input, and the left branch (tower) takes image input.)
Conclusion
Claims 1-6, 8-19, and 21-22 are rejected.
Applicant's amendment necessitated the new ground(s) of rejection presented in this Office action. Accordingly, THIS ACTION IS MADE FINAL. See MPEP § 706.07(a). Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a).
A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action. In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any nonprovisional extension fee (37 CFR 1.17(a)) pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action. In no event, however, will the statutory period for reply expire later than SIX MONTHS from the mailing date of this final action.
Any inquiry concerning this communication or earlier communications from the examiner should be directed to BENJAMIN M ROHD whose telephone number is (571)272-6445. The examiner can normally be reached Mon-Thurs 8:00-6:00 EST.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Viker Lamardo can be reached at (571) 270-5871. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.
/B.M.R./Examiner, Art Unit 2147 /ERIC NILSSON/Primary Examiner, Art Unit 2151