Prosecution Insights
Last updated: April 19, 2026
Application No. 17/827,448

SELF-SUPERVISED CONTRASTIVE LEARNING USING RANDOM FEATURE CORRUPTION

Status: Non-Final Office Action (§103)
Filed: May 27, 2022
Examiner: THAI, JASMINE THANH
Art Unit: 2129
Tech Center: 2100 — Computer Architecture & Software
Assignee: Google LLC
OA Round: 3 (Non-Final)

Grant Probability: 25% (At Risk)
Projected OA Rounds: 3-4
Projected Time to Grant: 4y 0m
Grant Probability with Interview: 81%

Examiner Intelligence

Career Allow Rate: 25% (6 granted / 24 resolved; -30.0% vs TC avg) — grants only 25% of cases
Interview Lift: +56.3% across resolved cases with interview (strong)
Typical Timeline: 4y 0m average prosecution; 30 applications currently pending
Career History: 54 total applications across all art units

Statute-Specific Performance

§101: 23.6% (-16.4% vs TC avg)
§103: 37.2% (-2.8% vs TC avg)
§102: 14.6% (-25.4% vs TC avg)
§112: 21.8% (-18.2% vs TC avg)

Tech Center average is an estimate • Based on career data from 24 resolved cases

Office Action

§103
DETAILED ACTION

Notice of Pre-AIA or AIA Status

The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Continued Examination Under 37 CFR 1.114

A request for continued examination under 37 CFR 1.114, including the fee set forth in 37 CFR 1.17(e), was filed in this application after final rejection. Since this application is eligible for continued examination under 37 CFR 1.114, and the fee set forth in 37 CFR 1.17(e) has been timely paid, the finality of the previous Office action has been withdrawn pursuant to 37 CFR 1.114. Applicant's submission filed on 10/30/2025 has been entered.

Response to Arguments

Applicant's arguments filed 10/30/2025 have been fully considered but they are not persuasive.

Regarding applicant's remarks directed to the rejection of claims under 35 U.S.C. § 103, applicant argues that Yoon and Yao do not teach the amended portion "wherein processing each unlabeled training input comprises processing an unchanged version of each unlabeled training input" and provides an annotated Figure 4A: [annotated Figure 4A image]

Examiner respectfully points out that the amended claim recites "processing an unchanged version of each unlabeled training input." Examiner notes that there are two reasonable interpretations of this limitation:

(1) "processing an unchanged version of each unlabeled training input," wherein examiner notes that "processing" is not explicitly defined and thus the processing can be taught by the masking and corruption of Yoon or the augmentation of Yao. Further, "an unchanged version of each unlabeled training input" can be interpreted to simply mean "each unlabeled training input," as the processing is performed on the original data/version (i.e., an unchanged version of each unlabeled training input).

(2) "processing an unchanged version of each unlabeled training input," wherein examiner notes that, in this interpretation, the processing is performed on a version of each unlabeled training input, wherein the version can be obtained from the masking and corruption of Yoon or the augmentation of Yao; as no further changes are made after the version is obtained, it is thus "unchanged."

In other words, in response to applicant's argument that the references fail to show certain features of the invention, it is noted that the features upon which applicant relies (i.e., an unchanged input as a first embedding, as interpreted from annotated Figure 4A) are not recited in the rejected claim(s). Although the claims are interpreted in light of the specification, limitations from the specification are not read into the claims. See In re Van Geuns, 988 F.2d 1181, 26 USPQ2d 1057 (Fed. Cir. 1993). Thus, applicant's arguments are considered unpersuasive. For examination purposes, examiner considers the first interpretation (1) as the broadest reasonable interpretation. The examiner further refers to the rejection under 35 U.S.C. § 103 in the current Office action for more details.

Claim Rejections - 35 USC § 103

In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis (i.e., changing from AIA to pre-AIA) for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.

The following is a quotation of 35 U.S.C.
103 which forms the basis for all obviousness rejections set forth in this Office action:

A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

The factual inquiries for establishing a background for determining obviousness under 35 U.S.C. 103 are summarized as follows:
1. Determining the scope and contents of the prior art.
2. Ascertaining the differences between the prior art and the claims at issue.
3. Resolving the level of ordinary skill in the pertinent art.
4. Considering objective evidence present in the application indicating obviousness or nonobviousness.

Claim(s) 1, 9-12, 17, and 20 is/are rejected under 35 U.S.C. 103 as being unpatentable over Yoon, Jinsung, et al., "VIME: Extending the success of self- and semi-supervised learning to tabular domain" ("Yoon"), in view of Yao, Tiansheng, et al., "Self-supervised Learning for Large-scale Item Recommendations," arXiv preprint arXiv:2007.12865 (2021) ("Yao").

In regards to claim 1, Yoon teaches A computer-implemented method of training a neural network having a plurality of network parameters, the method comprising: obtaining [a set of] unlabeled training inputs from an unlabeled training dataset, the [set of] unlabeled training inputs each having a respective feature in each of a plurality of feature dimensions; (Yoon, Figure 1, Section 3, "In this section, we introduce the general formulation of self- and semi-supervised learning.
Suppose we have a small labeled dataset [equation] and a large unlabeled dataset [equation] [unlabeled training inputs from an unlabeled training dataset], where [equation] and [equation].")

Yoon teaches for each unlabeled training input [in the set] of unlabeled training inputs, generating a corrupted version of the unlabeled training input, comprising determining a proper subset of the feature dimensions (Yoon, Figure 1, Section 3.2, "Semi-supervised learning optimizes the predictive model f by minimizing the supervised loss function jointly with some unsupervised loss function defined over the output space Y. Formally, semi-supervised learning is formulated as an optimization problem as follows, [equation] where [equation] is an unsupervised loss function, and a hyperparameter [equation] is introduced to control the trade-off between the supervised and unsupervised losses. x' is a perturbed version of x [generating a corrupted version of the unlabeled training input] assumed to be drawn from a conditional distribution ~p_X(x'|x) [comprising determining a proper subset of the feature dimensions].")

Yoon teaches and, for each feature dimension that is in the proper subset of feature dimensions, applying a corruption to the respective feature in the feature dimension using one or more feature values sampled from a marginal distribution of the feature dimension as specified in the unlabeled training dataset; (Yoon, Figure 1, Section 4.1, "We introduce two pretext tasks: feature vector estimation and mask vector estimation. Our goal is to optimize a pretext model to recover an input sample (a feature vector) from its corrupted variant, at the same time as estimating the mask vector that has been applied to the sample. In our framework, the two pretext tasks share a single pretext distribution p_{Xs,Ys}. First, a mask vector generator outputs a binary mask vector [equation] where mj is randomly sampled from a Bernoulli distribution with probability [equation]. Then a pretext generator [equation] takes a sample x [for each feature dimension that is in the proper subset of feature dimensions] from Du and a mask vector m as input, and generates a masked sample ~x. The generating process of ~x is given by [equation] where the j-th feature of [equation] [using one or more feature values sampled from a … distribution of the feature dimension as specified in the set of unlabeled training data] is sampled from the empirical distribution [equation], where xi,j is the j-th feature of the i-th sample in Du (i.e. the empirical marginal [marginal] distribution of each feature) - see Figure 3 in the Supplementary Materials for further details.
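The mask-and-corrupt procedure quoted above (a Bernoulli(p_m) mask m, with masked features resampled from each column's empirical marginal distribution) can be illustrated with a short sketch. This is an editorial illustration only, not part of the prosecution record; the function name `corrupt` and the use of a per-column shuffle as the draw from the empirical marginal are the editor's assumptions.

```python
import numpy as np

def corrupt(X, p_m=0.3, seed=0):
    """Mask-and-corrupt a batch in the style of the quoted pretext task:
    draw a Bernoulli(p_m) mask m, then build x_tilde = (1 - m) * x + m * x_bar,
    where each entry of x_bar is drawn from that feature dimension's
    empirical marginal distribution (here via a per-column shuffle)."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    # Binary mask vector, one Bernoulli draw per feature per sample
    m = rng.binomial(1, p_m, size=(n, d))
    # Shuffling each column independently samples (without replacement)
    # from the empirical marginal distribution of that feature dimension
    X_bar = np.stack([rng.permutation(X[:, j]) for j in range(d)], axis=1)
    # Keep unmasked features, replace masked ones with marginal draws
    X_tilde = (1 - m) * X + m * X_bar
    return X_tilde, m
```

Every corrupted value still comes from the same feature's observed values, which matches the quoted remark that the corrupted sample remains "similar to the samples in Du."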
The generating process in Equation (3) ensures the corrupted sample ~x is not only tabular but also similar to the samples in Du.")

Yoon teaches processing, using the neural network and in accordance with the current values of the plurality of network parameters, the corrupted version of each unlabeled training input [in the set] of unlabeled training inputs to generate a second embedding of the corrupted version of the unlabeled training input; (Yoon, Section 4.2, Figure 1, "Figure 1: Block diagram of the proposed self-supervised learning framework [processing, using the neural network and in accordance with the current values of the plurality of network parameters] on tabular data. (1) Mask generator generates binary mask vector (m) which is combined with an input sample (x) to create a masked and corrupted sample (~x) [the corrupted version of each unlabeled training input in the batch of unlabeled training inputs to generate], (2) Encoder (e) transforms ~x into a latent representation (z) [a second embedding of the corrupted version of the unlabeled training input; wherein the generating can be performed on each input provided from the batch of Yao], (3) Mask vector estimator (sm) is trained by minimizing the cross-entropy loss with m, feature vector estimator (sr) is trained by minimizing the reconstruction loss with x, (4) Encoder (e) is trained by minimizing the weighted sum of both losses.")

However, Yoon does not explicitly teach a set; processing, using the neural network and in accordance with current values of the plurality of network parameters, each unlabeled training input in the set of unlabeled training inputs to generate a first embedding of the unlabeled training input, wherein processing each unlabeled training input comprises processing an unchanged version of each unlabeled training input; and training the neural network based on optimizing a contrastive learning loss function, wherein for each unlabeled training input in the set of unlabeled
training inputs, optimizing the contrastive learning loss function trains the neural network to reduce a difference between (i) the first embedding of the unlabeled training input that is generated by the neural network from processing an unchanged version of the unlabeled training input and (ii) the second embeddings of the corrupted version of the unlabeled training input that is generated by the neural network from processing the corrupted version of the unlabeled training input, and to increase a difference between (i) the first embedding of the unlabeled training input that is generated by the neural network from processing an unchanged version of the unlabeled training input and (ii) a different second embedding of a corrupted version of a different unlabeled training input that is generated by the neural network from processing the corrupted version of the different unlabeled training input in the set of unlabeled training inputs.

Yao teaches a set (Yao, Section 3.1, "We consider a batch of 𝑁 [set] item examples 𝑥1, ..., 𝑥𝑁, where 𝑥𝑖 ∈ X represents a set of features for example 𝑖. In the context of recommenders, an example indicates a query, an item or a query-item pair. Suppose there are a pair of transform functions ℎ, 𝑔 : X → X that augment 𝑥𝑖 to be 𝑦𝑖, 𝑦′𝑖 respectively, [equation]")

Yao teaches processing, using the neural network and in accordance with current values of the plurality of network parameters, each unlabeled training input in the set of unlabeled training inputs to generate a first embedding of the unlabeled training input, wherein processing each unlabeled training input comprises processing an unchanged version of each unlabeled training input; (Yao, Section 3, Fig. 2, "The basic idea is two folds: first, we apply different data augmentation for the same training example to learn representations [processing, using the neural network and in accordance with current values of the plurality of network parameters]; and then use contrastive loss function to encourage the representations learned for the same training example to be similar." [Figure 2 image]) (Yao, Section 3.1, "We consider a batch of 𝑁 item examples 𝑥1, ..., 𝑥𝑁, where 𝑥𝑖 ∈ X represents a set of features for example 𝑖. In the context of recommenders, an example indicates a query, an item or a query-item pair. Suppose there are a pair of transform functions ℎ, 𝑔 : X → X that augment 𝑥𝑖 to be 𝑦𝑖, 𝑦′𝑖 respectively, [each unlabeled training input in the set of unlabeled training inputs to generate a first embedding of the unlabeled training input, wherein processing each unlabeled training input comprises processing an unchanged version of each unlabeled training input; wherein transform function h and encoder H are applied to the original input data to generate the first embedding; examiner further notes that transform function g and encoder G can be taught by the augmentation (corruption) and encoder of Yoon] [equation]")

Yao teaches and training the neural network based on optimizing a contrastive learning loss function, wherein for each unlabeled training input in the set of unlabeled training inputs, optimizing the contrastive learning loss function trains the neural network to reduce a difference between (i) the first embedding of the unlabeled training input that is generated by the neural network from processing an unchanged version of the unlabeled training input and (ii) the second embeddings of the corrupted version of the unlabeled training input that is generated by the neural network from processing the corrupted version of the unlabeled training input, and to increase a difference between (i) the first embedding of the unlabeled training input that is generated by the neural network from processing an unchanged version of the unlabeled training input and (ii) a different second embedding of a corrupted version of a different unlabeled training input that is generated by the neural network from processing the corrupted version of the different unlabeled training input in the set of unlabeled training inputs. (Yao, Section 3.1, annotated Fig. 2, "Given the same input of example 𝑖, we want to learn different representations 𝑦𝑖, 𝑦′𝑖 after augmentation to make sure the model still recognizes that both 𝑦𝑖 and 𝑦′𝑖 represent the same input 𝑖. In other words, the contrastive loss learns to minimize the difference between 𝑦𝑖, 𝑦′𝑖 [reduce a difference between the first and second embeddings; wherein the second embedding is provided from the aforementioned methods of Yoon]. In the mean time, for different examples 𝑖 and 𝑗, the contrastive loss maximizes the difference between the representations learned 𝑦𝑖, 𝑦′𝑗 after different data augmentations. Let z𝑖, z′𝑖 denote the embeddings of 𝑦𝑖, 𝑦′𝑖 after encoded by two neural networks H, G : X → R^𝑑, that is [equation] We treat (z𝑖, z′𝑖) as positive pairs, and (z𝑖, z′𝑗) as negative pairs for 𝑖 ≠ 𝑗. Let 𝑠(z𝑖, z′𝑗) = ⟨z𝑖, z′𝑗⟩/(∥z𝑖∥ · ∥z′𝑗∥). To encourage the above properties, we define the SSL loss for a batch of 𝑁 examples {𝑥𝑖} as: [equation] where 𝜏 is a tunable hyper-parameter for the softmax temperature. The above loss function learns a robust embedding space such that similar items are close to each other after data augmentation, and random examples are pushed farther away [increase a difference between the first embedding and a different second embedding; wherein the second embedding is provided from the aforementioned methods of Yoon].
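The contrastive loss described in the passage above (cosine similarity 𝑠(z𝑖, z′𝑗), softmax with temperature 𝜏, diagonal pairs positive) can be sketched numerically. This is an editorial illustration only, not part of the record; the function name `ssl_loss`, the NumPy batch layout, and the one-directional form of the loss are the editor's assumptions.

```python
import numpy as np

def ssl_loss(z, z_prime, tau=0.1):
    """Batch contrastive loss in the style of the quoted SSL loss:
    (z_i, z'_i) are positive pairs and (z_i, z'_j), i != j, are negatives;
    s(., .) is cosine similarity, scaled by softmax temperature tau."""
    # Normalize rows so the dot product is cosine similarity
    z = z / np.linalg.norm(z, axis=1, keepdims=True)
    zp = z_prime / np.linalg.norm(z_prime, axis=1, keepdims=True)
    sim = (z @ zp.T) / tau                              # N x N similarities
    # Row-wise log-softmax; the positive pair sits on the diagonal
    log_prob = sim - np.log(np.exp(sim).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_prob)))
```

Matched pairs (the same embedding for both views) give a near-zero loss, while mismatched pairings give a larger loss, mirroring the quoted pull-together/push-apart behavior.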
The overall framework is illustrated in Figure 2. [Figure 2 image]")

Yoon and Yao are both considered to be analogous to the claimed invention because they are in the same field of self-supervised learning. Therefore, it would have been obvious to someone of ordinary skill in the art before the effective filing date of the claimed invention to have modified Yoon to incorporate the teachings of Yao in order to provide a framework to improve item representation learning from a batch of items (Yao, Abstract para. 2, "Inspired by the recent success in self-supervised representation learning research in both computer vision and natural language understanding, we propose a multi-task self-supervised learning (SSL) framework for large-scale item recommendations. The framework is designed to tackle the label sparsity problem by learning better latent relationship of item features. Specifically, SSL improves item representation learning as well as serving as additional regularization to improve generalization. Furthermore, we propose a novel data augmentation method that utilizes feature correlations within the proposed framework.")

In regards to claim 11 and analogous claims 9 and 10, Yoon and Yao teach The method of claim 1. Yoon teaches wherein the features in a first feature dimension are numerical features and the features in a second feature dimension are categorical features. (Yoon, Section 4.1, "Our framework is novel in learning the correlations for tabular data whose correlation structure is less obvious than in images or language. The learned representation that captures the correlation across different parts of the object, regardless of the object type (e.g. language, image or tabular data) [wherein the features in a first feature dimension are numerical features and the features in a second feature dimension are categorical features; wherein the type can be numerical, categorical, etc.], is an informative input for the various downstream tasks.")

In regards to claim 12, Yoon and Yao teach The method of claim 1. Yoon teaches wherein the neural network comprises an encoder sub-neural network having a plurality of encoder network parameters and an embedding generation sub-neural network having a plurality of embedding generation network parameters. [figure image] (Yoon, Section 3.2, Figure 1 Encoder, "(2) Encoder (e) [an encoder sub-neural network having a plurality of encoder network parameters and an embedding generation sub-neural network having a plurality of embedding generation network parameters; wherein the encoder and the embedding generation sub-neural network are interpreted to be the same] transforms ~x into a latent representation (z)")

Claims 9 and 10 are rejected under the same rationale as claim 11 as they are substantially similar. Claims 17 and 20 are rejected on the same grounds under 35 U.S.C. 103 as claim 1 as they are substantially similar.

Claim(s) 2-3 and 18 is/are rejected under 35 U.S.C. 103 as being unpatentable over Yoon in view of Yao, in further view of Oord, Aaron van den, Yazhe Li, and Oriol Vinyals, "Representation learning with contrastive predictive coding" ("Oord").

In regards to claim 2 and analogous claim 18, Yoon and Yao teach The method of claim 1. Oord teaches wherein the contrastive learning loss function comprises a noise contrastive estimation (NCE) loss function.
(Oord, Section 2.3, "Both the encoder and autoregressive model are trained to jointly optimize a loss based on NCE [a noise contrastive estimation (NCE) loss function], which we will call InfoNCE")

Oord is considered to be analogous to the claimed invention because they are in the same field of contrastive learning. Therefore, it would have been obvious to someone of ordinary skill in the art before the effective filing date of the claimed invention to have modified Yoon and Yao to incorporate the teachings of Oord in order to provide a technique that captures information that is maximally useful for prediction (Oord, Abstract, "The key insight of our model is to learn such representations by predicting the future in latent space by using powerful autoregressive models. We use a probabilistic contrastive loss which induces the latent space to capture information that is maximally useful to predict future samples.")

In regards to claim 3, Yoon, Yao, and Oord teach The method of claim 2. Oord teaches wherein the NCE loss function comprises an InfoNCE loss function. (Oord, Section 2.3, "Both the encoder and autoregressive model are trained to jointly optimize a loss based on NCE, which we will call InfoNCE. [an InfoNCE loss function]")

Claim 18 is rejected on the same grounds under 35 U.S.C. 103 as claim 2 as they are substantially similar.

Claim(s) 4-8, 13-16 and 19 is/are rejected under 35 U.S.C. 103 as being unpatentable over Yoon in view of Yao, in further view of U.S. Pat. No. US11687540B2 to Pushak et al. ("Pushak").

In regards to claim 4, Yoon and Yao teach The method of claim 1. Pushak teaches wherein determining the proper subset of feature dimensions comprises sampling the proper subset of feature dimensions from the plurality of feature dimensions with uniform randomness. (Pushak, Col. 6 line 51 - Col. 7 line 1, "Algorithm 1: 1.1 Project out the features to be replaced. First all the data, X, is projected into the subspace of features, fkeep, that are not being replaced, i.e., fkeep = F \ freplace. The data in this projected subspace is denoted by projfkeep(X). 1.2 Find a set of k neighbors in the subspace. Next, any (approximate) nearest neighbor method is used to find a set of data instances, {projfkeep(X)[j1], projfkeep(X)[j2], ..., projfkeep(X)[jk]}, that are similar to projfkeep(X)[i]. 1.3 Randomly sample one of the neighbors. A single random neighbor projfkeep(X)[j] ∈ {projfkeep(X)[j1], projfkeep(X)[j2], ..., projfkeep(X)[jk]} is picked from the set of k neighbors.") (Pushak, Col. 7 lines 31-38, "Steps 1.1 and 1.2 of Algorithm 1 are conditioning steps, and step 1.2 is generally very slow. There are numerous approximate nearest neighbor methods that can be used to find projfkeep(X)[j] to help speed up this portion of Algorithm 1. For example, an exact (or a 1+epsilon-approximate) k-nearest neighbor method can be used to find k approximate nearest neighbors, from which a single approximate nearest neighbor can be sampled uniformly at random [sampling the proper subset of feature dimensions from the plurality of feature dimensions with uniform randomness].")

Pushak is considered to be analogous to the claimed invention because they are in the same field of approximate sampling of features for use in machine learning. Therefore, it would have been obvious to someone of ordinary skill in the art before the effective filing date of the claimed invention to have modified Yoon and Yao to incorporate the teachings of Pushak in order to provide steps of conditional sampling in order to solve deficiencies with marginal sampling (Pushak, Background, "While marginal sampling is a relatively inexpensive and efficient procedure, sampling from the marginal distribution of a dataset can result in data instances with perturbed feature values that represent unrealistic data instances, thus explaining the black-box ML model in regions where predictions of the model cannot be trusted and may never be needed. This can be particularly problematic when the black-box ML model is configured to identify anomalous data, where feature shuffling based on random samples generated using marginal sampling can easily produce perturbed data instances that do not conform to correlations found in the original dataset… Sampling from conditional distributions can be advantageous in many situations, including to help ML explainers identify features that distinguish anomalous data instances in a dataset. To illustrate, consider that instead of utilizing the marginal distribution for the ML explainer example above, the value for the age of the example 150-pound person is instead sampled from a conditional distribution that represents the age feature values of data instances within the highlighted section of dataset 100 centering on 150 pounds (the value of the non-perturbed weight feature of the target data instance). Sampling from the conditional distribution, the correlations occurring in the dataset are preserved, thus creating approximately realistic records that are far less likely to be flagged as anomalous by the machine learning model. Thus, for dataset 100, the importance of the age feature will not be artificially inflated by perturbed data instances that break correlations in the dataset.")

In regards to claim 5, Yoon, Yao, and Pushak teach The method of claim 4, further comprising determining the subset of selected feature dimensions with uniform randomness. Yoon teaches in accordance with a predetermined corruption rate which specifies a total number of feature dimensions to be selected. (Yoon, Section 4.2, Figure 2, "Figure 2: Block diagram of the proposed semi-supervised learning framework on tabular data. For an unlabeled sample x in Du, (1) Mask generator generates K-number [with a predetermined corruption rate which specifies a total number of feature dimensions to be selected] of mask vectors and combines each of them with x to generate the corrupted samples [equation] via pretext generator (gm), (2) Encoder (e) transforms these corrupted samples into latent representations [equation] as K different augmented samples, (3) Predictive model is trained by minimizing the supervised loss on (x, y) in Dl and the consistency loss on the augmented samples [equation] jointly. The block diagram of the proposed self- and semi-supervised learning frameworks on exemplary tabular data can be found in the Supplementary Materials (Figure 2).")

In regards to claim 6 and analogous claim 19, Yoon and Yao teach The method of claim 1. Yoon teaches wherein sampling the one or more feature values from the marginal distribution of the feature dimension as specified in the set of unlabeled training data comprises: sampling the one or more feature values from a [uniform] distribution over the feature values that appear in the feature dimension at least a threshold amount of times across the unlabeled training inputs in the set of unlabeled training data.
(Yoon, Figure 1, Section 4.1, "We introduce two pretext tasks: feature vector estimation and mask vector estimation. Our goal is to optimize a pretext model to recover an input sample (a feature vector) from its corrupted variant, at the same time as estimating the mask vector that has been applied to the sample. In our framework, the two pretext tasks share a single pretext distribution p_{Xs,Ys}. First, a mask vector generator outputs a binary mask vector [equation] where mj is randomly sampled from a Bernoulli distribution with probability [equation]. Then a pretext generator [equation] takes a sample x [sampling the one or more feature values from a [uniform] distribution over the feature values that appear in the feature dimension at least a threshold amount of times across the unlabeled training inputs in the set of unlabeled training data; wherein a sample is taken if it is in the set of unlabeled training data (thus appearing once)] from Du and a mask vector m as input, and generates a masked sample ~x.")

However, Yoon does not teach a uniform distribution. Pushak teaches a uniform distribution (Pushak, Col. 7 lines 31-38, "Steps 1.1 and 1.2 of Algorithm 1 are conditioning steps, and step 1.2 is generally very slow. There are numerous approximate nearest neighbor methods that can be used to find projfkeep(X)[j] to help speed up this portion of Algorithm 1. For example, an exact (or a 1+epsilon-approximate) k-nearest neighbor method can be used to find k approximate nearest neighbors, from which a single approximate nearest neighbor can be sampled uniformly at random [uniform distribution].")

In regards to claim 7, Yoon, Yao, and Pushak teach The method of claim 6. Yoon teaches wherein the threshold is one. (Yoon, Figure 1, Section 4.1, "We introduce two pretext tasks: feature vector estimation and mask vector estimation. Our goal is to optimize a pretext model to recover an input sample (a feature vector) from its corrupted variant, at the same time as estimating the mask vector that has been applied to the sample. In our framework, the two pretext tasks share a single pretext distribution p_{Xs,Ys}. First, a mask vector generator outputs a binary mask vector [equation] where mj is randomly sampled from a Bernoulli distribution with probability [equation]. Then a pretext generator [equation] takes a sample x [wherein the threshold is one; wherein a sample is taken if it is in the set of unlabeled training data (thus appearing once)] from Du and a mask vector m as input, and generates a masked sample ~x.")

In regards to claim 8, Yoon and Yao teach The method of claim 1. Pushak teaches wherein applying the corruption to the feature using the one or more feature values comprises replacing the feature with the one or more feature values. (Pushak, Col. 7 lines 2-7, Algorithm 1, "1.4 Replace feature values with the neighbor's values [replacing the feature with the one or more feature values]. Then, the feature values freplace of X[i] are replaced with those from X[j]. This creates an approximately realistic data instance, X[i]perturbed [applying the corruption to the feature using the one or more feature values], in which the features, freplace, are modified with values that are likely to occur when conditioning on the other features, fkeep.")

In regards to claim 13, Yoon, Yao, and Pushak teach The method of claim 8. Yoon teaches wherein the training further comprises, after training the neural network on the set of unlabeled training data, (Yoon, Section 4.2, Figure 1, "Figure 1: Block diagram of the proposed self-supervised learning framework on tabular data. (1) Mask generator generates binary mask vector (m) which is combined with an input sample (x) to create a masked and corrupted sample (~x), (2) Encoder (e) transforms ~x into a latent representation (z), (3) Mask vector estimator (sm) is trained by minimizing the cross-entropy loss with m, feature vector estimator (sr) is trained by minimizing the reconstruction loss with x, (4) Encoder (e) is trained by minimizing the weighted sum of both losses [training the neural network on the set of unlabeled training data].") (Yoon, Figure 1, Section 3, "In this section, we introduce the general formulation of self- and semi-supervised learning. Suppose we have a small labeled dataset [equation] and a large unlabeled dataset [equation], [unlabeled training data] where [equation] and [equation].")

Yoon teaches adapting the encoder sub-neural network for a specific machine learning task including adjusting learned values of the plurality of encoder network parameters using labeled data comprising labeled training inputs. [figure image] (Yoon, Section 4.2, Figure 2 teaches adapting the encoder sub-neural network for a specific machine learning task, including adjusting learned values of the plurality of encoder network parameters using labeled data comprising labeled training inputs (green arrows; with labeled samples (Dl))) (Yoon, Figure 2 Encoder (e), Section 3, "In this section, we introduce the general formulation of self- and semi-supervised learning. Suppose we have a small labeled dataset [equation] [labeled data comprising labeled training inputs] and a large unlabeled dataset [equation], where [equation] and [equation]. The label yi is a scalar in single-task learning while it can be given as a multi-dimensional vector in multi-task learning [for a specific machine learning task including adjusting learned values of the plurality of encoder network parameters]. We assume every input feature vector xi in Dl and Du is sampled i.i.d. from a feature distribution p_x, and the labeled data pairs (xi, yi) in Dl are drawn from a joint distribution p_{x,y}.")

In regards to claim 14, Yoon, Yao, and Pushak teach The method of claim 13. Yoon teaches wherein adapting the encoder neural network for a specific machine learning task further comprises: processing, using the encoder sub-neural network and in accordance with the learned values of the plurality of encoder network parameters, a labeled training input to generate an embedding of the labeled training input; [figure image] (Yoon, Section 4.2, Figure 2: processing, using the encoder sub-neural network and in accordance with the learned values of the plurality of encoder network parameters, i.e. Encoder (e), a labeled training input, i.e. x, to generate an embedding of the labeled training input, i.e. feature representation z (green arrows))

Yoon teaches processing, using an output sub-neural network and in accordance with current values of a plurality of output network parameters, the embedding to generate a training output; (Yoon, Section 4.2, Figure 2 teaches processing, using an output sub-neural network and in accordance with current values of a plurality of output network parameters, i.e. Predictor f, the embedding, i.e. feature representations, to generate a training output, i.e. predictions y)

Yoon teaches computing a supervised learning loss function evaluating a difference between the training output and a ground truth output associated with the labeled training input; (Yoon, Section 4.2, "The supervised loss Ls is given by [equation] where ls is the standard supervised loss function, e.g. mean squared error for regression or categorical cross-entropy for classification."; wherein fe(x) is the training output and y is the ground truth output associated with the labeled training input)

Yoon teaches and determining, based on computing a gradient of the supervised learning loss function with respect to the plurality of encoder network parameters and to the plurality of output network parameters, an adjustment to the learned values of the plurality of encoder network parameters. (Yoon, Section 4.2, Figure 2, "(3) Predictive model is trained by minimizing the supervised loss on (x, y) in Dl"; Figure 2 teaches training using back-propagation, which is computing a gradient of the supervised loss function, with respect to the plurality of network parameters, that evaluates a difference)

In regards to claim 15, Yoon, Yao, and Pushak teach The method of claim 14. Yoon teaches wherein the specific machine learning task comprises a classification task, and wherein the supervised learning loss function comprises a cross-entropy loss function. (Yoon, Section 4.2, "The supervised loss Ls is given by [equation] where ls is the standard supervised loss function, e.g. mean squared error for regression or categorical cross-entropy for classification.")

In regards to claim 16, Yoon, Yao, and Pushak teach The method of claim 13. Yoon teaches further comprising providing the learned values of the plurality of encoder network parameters for use in performing the specific machine learning task.
(Yoon, Section 5.2, “In this subsection, we evaluate the methods on clinical data, using the UK and US prostate cancer datasets (from Prostate Cancer UK and SEER datasets, respectively). The features consist of patients’ clinical information (e.g. age, grade, stage, Gleason scores) - total 28 features. We predict 2 possible treatments of UK prostate cancer patients (1) Hormone therapy (whether the patients got hormone therapy), (2) Radical therapy (whether the patient got radical therapy) [performing the specific machine learning task]. Both tasks are binary classification. In the UK prostate cancer dataset, we only have around 10,000 labeled patients samples. The US prostate cancer dataset contains more than 200,000 unlabeled patients samples, twenty times bigger than the labeled UK dataset. We use 50% of the UK dataset (as the labeled data) and the entire US dataset (as the unlabeled data) for training, with the remainder of the UK data being used as the testing set. We also test three popular supervised learning models: Logistic Regression, a 2-layer Multi-layer Perceptron and XGBoost. Table 1 shows that VIME [providing the learned values of the plurality of encoder network parameters for use] results in the best prediction performance, outperforming the benchmarks. More importantly, VIME is the only self- or semi-supervised learning framework that significantly outperforms supervised learning models. These results shed light on the unique advantage of using VIME in leveraging a large unlabeled tabular dataset (e.g. the US dataset) to strengthen a model’s predictive power. Here we also demonstrate that VIME can perform well even when there exists a distribution shift between the UK labeled data and the US unlabeled data (see the Supplementary Materials (Section 2) for further details). PNG media_image28.png 269 770 media_image28.png Greyscale ”) Claim 19 is rejected on the same grounds under 35 U.S.C. 103 as claim 6 as they are substantially similar. 
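For orientation, the corruption-and-recovery pipeline that the rejection maps from Yoon (and the neighbor-value replacement it maps from Pushak) can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the applicant's claimed method or either reference's implementation: the encoder e and the estimator heads s_m, s_r are not modeled as networks, and the corruption here draws replacement values from each feature's empirical marginal (a per-column shuffle) as a stand-in for Pushak's conditioned neighbor values.

```python
import numpy as np

def corrupt(X, p_m, rng):
    """VIME-style pretext corruption: sample a binary mask with
    m_j ~ Bernoulli(p_m), then replace each masked feature with that
    feature's value from another randomly chosen row, leaving
    unmasked features unchanged."""
    n, d = X.shape
    m = rng.binomial(1, p_m, size=(n, d))
    # Per-column shuffle: replacement values come from other samples,
    # so each corrupted feature value is realistic for that feature.
    X_bar = np.stack([rng.permutation(X[:, j]) for j in range(d)], axis=1)
    X_tilde = (1 - m) * X + m * X_bar  # masked and corrupted sample ~x
    return X_tilde, m

def pretext_loss(x, m, m_hat, x_hat, alpha=2.0):
    """Weighted sum of the two pretext losses from Yoon's Figure 1:
    cross-entropy between the mask estimate m_hat and m, plus a
    reconstruction (MSE) loss between x_hat and the original x."""
    eps = 1e-7
    l_mask = -np.mean(m * np.log(m_hat + eps)
                      + (1 - m) * np.log(1 - m_hat + eps))
    l_recon = np.mean((x - x_hat) ** 2)
    return l_mask + alpha * l_recon

def supervised_loss(y_onehot, p_hat):
    """Categorical cross-entropy l_s used when fine-tuning the
    predictor on labeled pairs (x, y), as in claims 14-15."""
    return -np.mean(np.sum(y_onehot * np.log(p_hat + 1e-7), axis=1))

rng = np.random.default_rng(0)
X = rng.normal(size=(8, 5))
X_tilde, m = corrupt(X, p_m=0.3, rng=rng)
# Unmasked positions pass through unchanged -- the "unchanged version"
# versus "corrupted version" distinction argued in this round.
assert np.allclose(X_tilde[m == 0], X[m == 0])
```

The weighting alpha plays the role of Yoon's trade-off between the mask-estimation and reconstruction objectives; the final assertion makes explicit that only the masked coordinates of each sample are corrupted.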
Conclusion

The prior art made of record and not relied upon is considered pertinent to applicant's disclosure: Dang, Zhiyuan, et al. "Doubly contrastive deep clustering." arXiv preprint arXiv:2103.05484 (2021).

Any inquiry concerning this communication or earlier communications from the examiner should be directed to JASMINE THAI, whose telephone number is (703) 756-5904. The examiner can normally be reached M-F, 8-4. Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO-supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice. If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Michael Huntley, can be reached at (303) 297-4307. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.

Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.

/J.T.T./ Examiner, Art Unit 2129
/MICHAEL J HUNTLEY/ Supervisory Patent Examiner, Art Unit 2129

Prosecution Timeline

May 27, 2022: Application Filed
Feb 11, 2025: Non-Final Rejection — §103
May 05, 2025: Applicant Interview (Telephonic)
May 12, 2025: Examiner Interview Summary
Jun 20, 2025: Response Filed
Jul 27, 2025: Final Rejection — §103
Oct 22, 2025: Applicant Interview (Telephonic)
Oct 22, 2025: Examiner Interview Summary
Oct 30, 2025: Request for Continued Examination
Nov 05, 2025: Response after Non-Final Action
Jan 25, 2026: Non-Final Rejection — §103 (current)

Precedent Cases

Applications granted by this same examiner with similar technology

Patent 12561603: SYSTEM FOR TIME BASED MONITORING AND IMPROVED INTEGRITY OF MACHINE LEARNING MODEL INPUT DATA (granted Feb 24, 2026; 2y 5m to grant)
Patent 12555000: GENERATION OF CONVERSATIONAL TASK COMPLETION STRUCTURE (granted Feb 17, 2026; 2y 5m to grant)
Patent 12462154: METHOD AND SYSTEM FOR ASPECT-LEVEL SENTIMENT CLASSIFICATION BY MERGING GRAPHS (granted Nov 04, 2025; 2y 5m to grant)
Patent 12395590: REDUCTION AND GEO-SPATIAL DISTRIBUTION OF TRAINING DATA FOR GEOLOCATION PREDICTION USING MACHINE LEARNING (granted Aug 19, 2025; 2y 5m to grant)
Patent 12380361: Federated Machine Learning Management (granted Aug 05, 2025; 2y 5m to grant)
Study what changed to get past this examiner. Based on 5 most recent grants.


Prosecution Projections

Expected OA Rounds: 3-4
Grant Probability: 25%
With Interview: 81% (+56.3%)
Median Time to Grant: 4y 0m
PTA Risk: High
Based on 24 resolved cases by this examiner. Grant probability derived from career allow rate.
