DETAILED ACTION
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Continued Examination Under 37 CFR 1.114
A request for continued examination under 37 CFR 1.114, including the fee set forth in 37 CFR 1.17(e), was filed in this application after final rejection. Since this application is eligible for continued examination under 37 CFR 1.114, and the fee set forth in 37 CFR 1.17(e) has been timely paid, the finality of the previous Office action has been withdrawn pursuant to 37 CFR 1.114. Applicant's submission filed on 10/30/2025 has been entered.
Response to Arguments
Applicant's arguments filed 10/30/2025 have been fully considered but they are not persuasive.
Regarding applicant’s remarks directed to the rejection of claims under 35 USC § 103, Applicant argues that Yoon and Yao do not teach the amended portions “wherein processing each unlabeled training input comprises processing an unchanged version of each unlabeled training input” and provides an annotated Figure 4A:
[Applicant’s annotated Figure 4A]
Examiner respectfully points out that the amended claim recites “processing an unchanged version of each unlabeled training input.” Examiner notes that there are two reasonable interpretations of the limitation:
(1) “processing an unchanged version of each unlabeled training input”
Under this interpretation, examiner notes that “processing” is not explicitly defined; thus, the processing can be taught by the masking and corruption of Yoon or the augmentation of Yao.
Further, “an unchanged version of each unlabeled training input” can be interpreted to simply mean “each unlabeled training input,” as the processing is performed on the original data/version (i.e., an unchanged version of each unlabeled training input).
(2) “processing an unchanged version of each unlabeled training input”
Under this interpretation, examiner notes that the processing is performed on a version of each unlabeled training input, where the version can be obtained from the masking and corruption of Yoon or the augmentation of Yao; as no further changes are made after the version is obtained, it is thus “unchanged.”
In other words, in response to applicant's argument that the references fail to show certain features of the invention, it is noted that the features upon which applicant relies (i.e., an unchanged input as a first embedding (as interpreted from annotated Figure 4A)) are not recited in the rejected claim(s). Although the claims are interpreted in light of the specification, limitations from the specification are not read into the claims. See In re Van Geuns, 988 F.2d 1181, 26 USPQ2d 1057 (Fed. Cir. 1993). Thus, applicant’s arguments are considered unpersuasive.
For examination purposes, examiner considers the first interpretation (1) as the broadest reasonable interpretation. The examiner further refers to the rejection under 35 USC § 103 in the current office action for more details.
Claim Rejections - 35 USC § 103
In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis (i.e., changing from AIA to pre-AIA) for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.
The factual inquiries for establishing a background for determining obviousness under 35 U.S.C. 103 are summarized as follows:
1. Determining the scope and contents of the prior art.
2. Ascertaining the differences between the prior art and the claims at issue.
3. Resolving the level of ordinary skill in the pertinent art.
4. Considering objective evidence present in the application indicating obviousness or nonobviousness.
Claims 1, 9-12, 17, and 20 are rejected under 35 U.S.C. 103 as being unpatentable over Yoon, Jinsung, et al. "VIME: Extending the success of self- and semi-supervised learning to tabular domain." Advances in Neural Information Processing Systems 33 (2020) (“Yoon”) in view of Yao, Tiansheng, et al. "Self-supervised Learning for Large-scale Item Recommendations." arXiv preprint arXiv:2007.12865 (2021) (“Yao”).
In regards to claim 1,
Yoon teaches A computer-implemented method of training a neural network having a plurality of network parameters, the method comprising: obtaining [a set of] unlabeled training inputs from an unlabeled training dataset, the [set of] unlabeled training inputs each having a respective feature in each of a plurality of feature dimensions;
(Yoon, Figure 1, Section 3, “In this section, we introduce the general formulation of self- and semi-supervised learning. Suppose we have a small labeled dataset D_l = {(x_i, y_i)}_{i=1}^{N_l} and a large unlabeled dataset D_u = {x_i}_{i=N_l+1}^{N_l+N_u} [unlabeled training inputs from an unlabeled training dataset], where x_i ∈ X ⊆ R^d and y_i ∈ Y.”)
Yoon teaches for each unlabeled training input [in the set] of unlabeled training inputs, generating a corrupted version of the unlabeled training input, comprising determining a proper subset of the feature dimensions
(Yoon, Figure 1, Section 3.2, “Semi-supervised learning optimizes the predictive model f by minimizing the supervised loss function jointly with some unsupervised loss function defined over the output space Y. Formally, semi-supervised learning is formulated as an optimization problem as follows,

min_f E_{(x,y)~p_{X,Y}}[l_s(f(x), y)] + β · E_{x~p_X, x'~p̃_X(x'|x)}[l_u(f(x), f(x'))]

Where l_u is an unsupervised loss function, and a hyperparameter β is introduced to control the trade-off between the supervised and unsupervised losses. x' is a perturbed version of x [generating a corrupted version of the unlabeled training input] assumed to be drawn from a conditional distribution p̃_X(x'|x) [comprising determining a proper subset of the feature dimensions].”)
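For illustration only, the semi-supervised objective quoted above (a supervised loss plus a β-weighted unsupervised loss, with the expectations approximated by batch means) can be sketched in Python; all function and variable names here are illustrative and not drawn from the record:

```python
import numpy as np

def semi_supervised_objective(supervised_losses, unsupervised_losses, beta):
    """Approximate min_f E[l_s] + beta * E[l_u] with batch means.

    beta controls the trade-off between the supervised and
    unsupervised loss terms, as in the quoted formulation.
    """
    return np.mean(supervised_losses) + beta * np.mean(unsupervised_losses)

# Example: mean(0.4, 0.6) + 0.5 * mean(1.0, 3.0) = 0.5 + 0.5 * 2.0 = 1.5
obj = semi_supervised_objective([0.4, 0.6], [1.0, 3.0], beta=0.5)
```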
Yoon teaches and, for each feature dimension that is in the proper subset of feature dimensions, applying a corruption to the respective feature in the feature dimension using one or more feature values sampled from a marginal distribution of the feature dimension as specified in the unlabeled training dataset;
(Yoon, Figure 1, Section 4.1, “We introduce two pretext tasks: feature vector estimation and mask vector estimation. Our goal is to optimize a pretext model to recover an input sample (a feature vector) from its corrupted variant, at the same time as estimating the mask vector that has been applied to the sample.
In our framework, the two pretext tasks share a single pretext distribution p_{Xs,Ys}. First, a mask vector generator outputs a binary mask vector m = [m_1, …, m_d]^T ∈ {0,1}^d, where m_j is randomly sampled from a Bernoulli distribution with probability p_m. Then a pretext generator g_m: X × {0,1}^d → X takes a sample x [for each feature dimension that is in the proper subset of feature dimensions] from D_u and a mask vector m as input, and generates a masked sample x̃. The generating process of x̃ is given by

x̃ = m ⊙ x̄ + (1 − m) ⊙ x

where the j-th feature of x̄ [using one or more feature values sampled from a … distribution of the feature dimension as specified in the set of unlabeled training data] is sampled from the empirical distribution

p̂_{X_j} = (1/N_u) Σ_{i=N_l+1}^{N_l+N_u} δ(x_j = x_{i,j}),

where x_{i,j} is the j-th feature of the i-th sample in D_u (i.e. the empirical marginal [marginal] distribution of each feature). - see Figure 3 in the Supplementary Materials for further details. The generating process in Equation (3) ensures the corrupted sample x̃ is not only tabular but also similar to the samples in D_u.”)
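For illustration only, the masking-and-corruption process of Yoon quoted above (a Bernoulli mask m, with masked features replaced by values drawn from each feature dimension's empirical marginal distribution) can be sketched as follows; the function and variable names are illustrative and not drawn from the reference:

```python
import numpy as np

def corrupt(x_batch, p_m, rng):
    """Generate corrupted versions of a batch of unlabeled inputs.

    Each feature is independently masked with probability p_m; masked
    features are replaced by values drawn from the empirical marginal
    distribution of that feature dimension (a column-wise shuffle).
    """
    n, d = x_batch.shape
    # Binary mask: m_j ~ Bernoulli(p_m) for each feature dimension
    m = rng.binomial(1, p_m, size=(n, d))
    # x_bar: sample each feature from its empirical marginal distribution
    x_bar = np.stack([rng.permutation(x_batch[:, j]) for j in range(d)], axis=1)
    # x_tilde = m * x_bar + (1 - m) * x
    x_tilde = m * x_bar + (1 - m) * x_batch
    return x_tilde, m

rng = np.random.default_rng(0)
x = rng.normal(size=(8, 5))
x_tilde, m = corrupt(x, p_m=0.3, rng=rng)
```

Entries where m = 0 pass through unchanged, while masked entries take values observed elsewhere in the same feature dimension, so the corrupted sample remains tabular and similar to the samples in D_u.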
Yoon teaches processing, using the neural network and in accordance with the current values of the plurality of network parameters, the corrupted version of each unlabeled training input [in the set] of unlabeled training inputs to generate a second embedding of the corrupted version of the unlabeled training input;
(Yoon, Section 4.2, Figure 1, “Figure 1: Block diagram of the proposed self-supervised learning framework [processing, using the neural network and in accordance with the current values of the plurality of network parameters] on tabular data. (1) Mask generator generates binary mask vector (m) which is combined with an input sample (x) to create a masked and corrupted sample (~x) [the corrupted version of each unlabeled training input in the batch of unlabeled training inputs to generate], (2) Encoder (e) transforms ~x into a latent representation (z) [a second embedding of the corrupted version of the unlabeled training input; wherein the generating can be performed on each input provided from the batch of Yao], (3) Mask vector estimator (sm) is trained by minimizing the cross-entropy loss with m, feature vector estimator (sr) is trained by minimizing the reconstruction loss with x, (4) Encoder (e) is trained by minimizing the weighted sum of both losses.”)
However, Yoon does not explicitly teach a set; processing, using the neural network and in accordance with current values of the plurality of network parameters, each unlabeled training input in the set of unlabeled training inputs to generate a first embedding of the unlabeled training input, wherein processing each unlabeled training input comprises processing an unchanged version of each unlabeled training input; and training the neural network based on optimizing a contrastive learning loss function, wherein for each unlabeled training input in the set of unlabeled training inputs, optimizing the contrastive learning loss function trains the neural network to reduce a difference between (i) the first embedding of the unlabeled training input that is generated by the neural network from processing an unchanged version of the unlabeled training input and (ii) the second embeddings of the corrupted version of the unlabeled training input that is generated by the neural network from processing the corrupted version of the unlabeled training input, and to increase a difference between (i) the first embedding of the unlabeled training input that is generated by the neural network from processing an unchanged version of the unlabeled training input and (ii) a different second embedding of a corrupted version of a different unlabeled training input that is generated by the neural network from processing the corrupted version of the different unlabeled training input in the set of unlabeled training inputs.
Yao teaches a set
(Yao, Section 3.1, “We consider a batch of N [set] item examples x_1, …, x_N, where x_i ∈ X represents a set of features for example i. In the context of recommenders, an example indicates a query, an item or a query-item pair. Suppose there are a pair of transform functions h, g: X → X that augment x_i to be y_i, y'_i respectively,

y_i ← h(x_i),  y'_i ← g(x_i)

”)
Yao teaches processing, using the neural network and in accordance with current values of the plurality of network parameters, each unlabeled training input in the set of unlabeled training inputs to generate a first embedding of the unlabeled training input, wherein processing each unlabeled training input comprises processing an unchanged version of each unlabeled training input;
(Yao, Section 3, Fig. 2, “The basic idea is two folds: first, we apply different data augmentation for the same training example to learn representations [processing, using the neural network and in accordance with current values of the plurality of network parameters]; and then use contrastive loss function to encourage the representations learned for the same training example to be similar.”
[Yao, Figure 2: diagram of the self-supervised learning framework]
)
(Yao, Section 3.1, “We consider a batch of N item examples x_1, …, x_N, where x_i ∈ X represents a set of features for example i. In the context of recommenders, an example indicates a query, an item or a query-item pair. Suppose there are a pair of transform functions h, g: X → X that augment x_i to be y_i, y'_i respectively, [each unlabeled training input in the set of unlabeled training inputs to generate a first embedding of the unlabeled training input, wherein processing each unlabeled training input comprises processing an unchanged version of each unlabeled training input; wherein transform function h and encoder H is applied to the original input data to generate the first embedding; examiner further notes that transform function g and encoder G can be taught by the augmentation (corruption) and encoder of Yoon]

y_i ← h(x_i),  y'_i ← g(x_i)

”)
Yao teaches and training the neural network based on optimizing a contrastive learning loss function, wherein for each unlabeled training input in the set of unlabeled training inputs, optimizing the contrastive learning loss function trains the neural network to reduce a difference between (i) the first embedding of the unlabeled training input that is generated by the neural network from processing an unchanged version of the unlabeled training input and (ii) the second embeddings of the corrupted version of the unlabeled training input that is generated by the neural network from processing the corrupted version of the unlabeled training input, and to increase a difference between (i) the first embedding of the unlabeled training input that is generated by the neural network from processing an unchanged version of the unlabeled training input and (ii) a different second embedding of a corrupted version of a different unlabeled training input that is generated by the neural network from processing the corrupted version of the different unlabeled training input in the set of unlabeled training inputs.
(Yao, Section 3.1, annotated Fig. 2, “Given the same input of example i, we want to learn different representations y_i, y'_i after augmentation to make sure the model still recognizes that both y_i and y'_i represent the same input i. In other words, the contrastive loss learns to minimize the difference between y_i, y'_i [reduce a difference between the first and second embeddings; wherein the second embedding is provided from the aforementioned methods of Yoon]. In the meantime, for different examples i and j, the contrastive loss maximizes the difference between the representations learned y_i, y'_j after different data augmentations. Let z_i, z'_i denote the embeddings of y_i, y'_i after encoded by two neural networks H, G: X → R^d, that is

z_i ← H(y_i),  z'_i ← G(y'_i)

We treat (z_i, z'_i) as positive pairs, and (z_i, z'_j) as negative pairs for i ≠ j. Let s(z_i, z'_j) = ⟨z_i, z'_j⟩/(∥z_i∥ · ∥z'_j∥). To encourage the above properties, we define the SSL loss for a batch of N examples {x_i} as:

L_self({x_i}; H, G) = −(1/N) Σ_{i∈[N]} log( exp(s(z_i, z'_i)/τ) / Σ_{j∈[N]} exp(s(z_i, z'_j)/τ) )

where τ is a tunable hyper-parameter for the softmax temperature. The above loss function learns a robust embedding space such that similar items are close to each other after data augmentation, and random examples are pushed farther away [increase a difference between the first embedding and a different second embedding; wherein the second embedding is provided from the aforementioned methods of Yoon]. The overall framework is illustrated in Figure 2.
[Yao, Figure 2: diagram of the overall SSL framework]
”)
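For illustration only, the contrastive loss of Yao quoted above can be sketched as a batch softmax over temperature-scaled cosine similarities; this NumPy reading is illustrative, and the names are not from the reference:

```python
import numpy as np

def ssl_loss(z, z_prime, tau=0.1):
    """Contrastive SSL loss over a batch (InfoNCE-style form).

    Positive pairs (z_i, z'_i) are pulled together; negative pairs
    (z_i, z'_j), i != j, are pushed apart via a softmax over cosine
    similarities with temperature tau.
    """
    z_n = z / np.linalg.norm(z, axis=1, keepdims=True)
    zp_n = z_prime / np.linalg.norm(z_prime, axis=1, keepdims=True)
    sim = z_n @ zp_n.T / tau                    # s(z_i, z'_j) / tau
    log_softmax = sim - np.log(np.exp(sim).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_softmax))       # -(1/N) sum_i log softmax_ii

rng = np.random.default_rng(0)
z = rng.normal(size=(4, 8))
loss_matched = ssl_loss(z, z)            # positives aligned: small loss
loss_mismatched = ssl_loss(z, z[::-1])   # positives misaligned: larger loss
```

With aligned positives, each row's largest similarity sits on the diagonal, so the loss is small; permuting the pairing moves the large term off-diagonal and the loss grows.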
Yoon and Yao are both considered to be analogous to the claimed invention because they are in the same field of self-supervised learning. Therefore, it would have been obvious to someone of ordinary skill in the art before the effective filing date of the claimed invention to have modified Yoon to incorporate the teachings of Yao in order to provide a framework to improve item representation learning from a batch of items (Yao, Abstract para. 2, “Inspired by the recent success in self-supervised representation learning research in both computer vision and natural language understanding, we propose a multi-task self-supervised learning (SSL) framework for large-scale item recommendations. The framework is designed to tackle the label sparsity problem by learning better latent relationship of item features. Specifically, SSL improves item representation learning as well as serving as additional regularization to improve generalization. Furthermore, we propose a novel data augmentation method that utilizes feature correlations within the proposed framework.”)
In regards to claim 11 and analogous claims 9 and 10,
Yoon and Yao teach The method of claim 1,
Yoon teaches wherein the features in a first feature dimension are numerical features and the features in a second feature dimension are categorical features.
(Yoon, Section 4.1, “Our framework is novel in learning the correlations for tabular data whose correlation structure is less obvious than in images or language. The learned representation that captures the correlation across different parts of the object, regardless of the object type (e.g. language, image or tabular data) [wherein the features in a first feature dimension are numerical features and the features in a second feature dimension are categorical features; wherein the type can be numerical, categorical, etc], is an informative input for the various downstream tasks.”)
In regards to claim 12,
Yoon and Yao teach The method of claim 1,
Yoon teaches wherein the neural network comprises an encoder sub-neural network having a plurality of encoder network parameters and an embedding generation sub-neural network having a plurality of embedding generation network parameters,
[Yoon, Figure 1: block diagram of the self-supervised learning framework]
(Yoon, Section 3.2, Figure 1 Encoder, “(2) Encoder (e) [an encoder sub-neural network having a plurality of encoder network parameters and an embedding generation sub-neural network having a plurality of embedding generation network parameters; wherein the encoder and embedding generation sub-neural network is interpreted to be the same] transforms x̃ into a latent representation (z)”)
Claims 9 and 10 are rejected under the same rationale as claim 11 as they are substantially similar.
Claim 17 and claim 20 are rejected on the same grounds under 35 U.S.C. 103 as claim 1 as they are substantially similar.
Claims 2-3 and 18 are rejected under 35 U.S.C. 103 as being unpatentable over Yoon in view of Yao, and further in view of Oord, Aaron van den, Yazhe Li, and Oriol Vinyals. "Representation learning with contrastive predictive coding." arXiv preprint arXiv:1807.03748 (2018) (“Oord”).
In regards to claim 2 and analogous claim 18,
Yoon and Yao teach The method of claim 1,
Oord teaches wherein the contrastive learning loss function comprises a noise contrastive estimation (NCE) loss function.
(Oord, Section 2.3, “Both the encoder and autoregressive model are trained to jointly optimize a loss based on NCE [a noise contrastive estimation (NCE) loss function], which we will call InfoNCE”)
Oord is considered to be analogous to the claimed invention because they are in the same field of contrastive learning. Therefore, it would have been obvious to someone of ordinary skill in the art before the effective filing date of the claimed invention to have modified Yoon and Yao to incorporate the teachings of Oord in order to provide a technique that captures information that is maximally useful for prediction (Oord, Abstract, “The key insight of our model is to learn such representations by predicting the future in latent space by using powerful autoregressive models. We use a probabilistic contrastive loss which induces the latent space to capture information that is maximally useful to predict future samples.”)
In regards to claim 3,
Yoon, Yao, and Oord teach The method of claim 2,
Oord teaches wherein the NCE loss function comprises an InfoNCE loss function.
(Oord, Section 2.3, “Both the encoder and autoregressive model are trained to jointly optimize a loss based on NCE, which we will call InfoNCE. [an InfoNCE loss function]”)
Claim 18 is rejected on the same grounds under 35 U.S.C. 103 as claim 2 as they are substantially similar.
Claims 4-8, 13-16 and 19 are rejected under 35 U.S.C. 103 as being unpatentable over Yoon in view of Yao, and further in view of U.S. Pat. No. 11,687,540 B2 to Pushak et al. (“Pushak”).
In regards to claim 4,
Yoon and Yao teach The method of claim 1,
Pushak teaches wherein determining the proper subset of feature dimensions comprises sampling the proper subset of feature dimensions from the plurality of feature dimensions with uniform randomness.
(Pushak, Col. 6 line 51 – Col. 7 line 1, “Algorithm 1:
1.1 Project out the features to be replaced. First all the data, X, is projected into the subspace of features, f_keep, that are not being replaced, i.e., f_keep = F \ f_replace. The data in this projected subspace is denoted by proj_{f_keep}(X).
1.2 Find a set of k neighbors in the subspace. Next, any (approximate) nearest neighbor method is used to find a set of data instances, {proj_{f_keep}(X)[j_1], proj_{f_keep}(X)[j_2], …, proj_{f_keep}(X)[j_k]}, that are similar to proj_{f_keep}(X)[i].
1.3 Randomly sample one of the neighbors. A single random neighbor proj_{f_keep}(X)[j] ∈ {proj_{f_keep}(X)[j_1], proj_{f_keep}(X)[j_2], …, proj_{f_keep}(X)[j_k]} is picked from the set of k neighbors.”)
(Pushak, Col. 7 lines 31-38, “Steps 1.1 and 1.2 of Algorithm 1 are conditioning steps, and step 1.2 is generally very slow. There are numerous approximate nearest neighbor methods that can be used to find projfkeep(X)[j] to help speed up this portion of Algorithm 1. For example, an exact (or a 1+epsilon-approximate) k-nearest neighbor method can be used to find k approximate nearest neighbors, from which a single approximate nearest neighbor can be sampled uniformly at random [sampling the proper subset of feature dimensions from the plurality of feature dimensions with uniform randomness].”)
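For illustration only, steps 1.1 through 1.4 of Pushak's Algorithm 1, as quoted above, can be sketched as follows (exact nearest neighbors are used here for brevity, and all names are illustrative, not from the patent):

```python
import numpy as np

def conditional_replace(X, i, f_replace, k, rng):
    """Sketch of conditional feature replacement.

    1.1 Project out the features to be replaced.
    1.2 Find the k nearest neighbors of instance i in that subspace.
    1.3 Sample one neighbor uniformly at random.
    1.4 Replace the selected features of X[i] with the neighbor's values.
    """
    n, d = X.shape
    f_keep = [j for j in range(d) if j not in f_replace]
    proj = X[:, f_keep]                  # step 1.1: projected subspace
    dist = np.linalg.norm(proj - proj[i], axis=1)
    dist[i] = np.inf                     # exclude the instance itself
    neighbors = np.argsort(dist)[:k]     # step 1.2: k nearest neighbors
    j = rng.choice(neighbors)            # step 1.3: uniform random pick
    x_pert = X[i].copy()
    x_pert[f_replace] = X[j, f_replace]  # step 1.4: replace feature values
    return x_pert

rng = np.random.default_rng(0)
X = rng.normal(size=(10, 4))
x_pert = conditional_replace(X, i=0, f_replace=[2, 3], k=3, rng=rng)
```

Because the replacement values come from a neighbor in the kept-feature subspace, the perturbed instance preserves correlations found in the original dataset, as the quoted passage explains.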
Pushak is considered to be analogous to the claimed invention because they are in the same field of approximate sampling of features for use in machine learning. Therefore, it would have been obvious to someone of ordinary skill in the art before the effective filing date of the claimed invention to have modified Yoon and Yao to incorporate the teachings of Pushak in order to provide steps of conditional sampling in order to solve deficiencies with marginal sampling (Pushak, Background, “While marginal sampling is a relatively inexpensive and efficient procedure, sampling from the marginal distribution of a dataset can result in data instances with perturbed feature values that represent unrealistic data instances, thus explaining the black-box ML model in regions where predictions of the model cannot be trusted and may never be needed. This can be particularly problematic when the black-box ML model is configured to identify anomalous data, where feature shuffling based on random samples generated using marginal sampling can easily produce perturbed data instances that do not conform to correlations found in the original dataset…Sampling from conditional distributions can be advantageous in many situations, including to help ML explainers identify features that distinguish anomalous data instances in a dataset. To illustrate, consider that instead of utilizing the marginal distribution for the ML explainer example above, the value for the age of the example 150-pound person is instead sampled from a conditional distribution that represents the age feature values of data instances within the highlighted section of dataset 100 centering on 150 pounds (the value of the non-perturbed weight feature of the target data instance). Sampling from the conditional distribution, the correlations occurring in the dataset are preserved, thus creating approximately realistic records that are far less likely to be flagged as anomalous by the machine learning model. 
Thus, for dataset 100, the importance of the age feature will not be artificially inflated by perturbed data instances that break correlations in the dataset.”)
In regards to claim 5,
Yoon, Yao, and Pushak teach The method of claim 4, further comprising determining the subset of selected feature dimensions with uniform randomness
Yoon teaches in accordance with a predetermined corruption rate which specifies a total number of feature dimensions to be selected.
(Yoon, Section 4.2, Figure 2, “Figure 2: Block diagram of the proposed semi-supervised learning framework on tabular data. For an unlabeled sample x in Du, (1) Mask generator generates K-number [with a predetermined corruption rate which specifies a total number of feature dimensions to be selected] of mask vectors and combine each of them with x to generate the corrupted samples x̃_1, …, x̃_K via pretext generator (gm), (2) Encoder (e) transforms these corrupted samples into latent representations z̃_1, …, z̃_K as K different augmented samples, (3) Predictive model is trained by minimizing the supervised loss on (x, y) in Dl and the consistency loss on the augmented samples ŷ_1, …, ŷ_K jointly. The block diagram of the proposed self- and semi-supervised learning frameworks on exemplary tabular data can be found in the Supplementary Materials (Figure 2).”)
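For illustration only, the K-fold augmentation of Yoon's Figure 2, quoted above, can be sketched as follows (the pretext generator is reduced to its masking step, and all names are illustrative, not from the reference):

```python
import numpy as np

def k_augment(x, X_u, p_m, K, rng):
    """Generate K corrupted/augmented versions of one unlabeled sample x.

    Each version uses an independent random mask; masked features are
    drawn from the empirical marginal distributions of the unlabeled
    dataset X_u, so each augmentation stays tabular.
    """
    d = x.shape[0]
    augmented = []
    for _ in range(K):
        m = rng.binomial(1, p_m, size=d)  # fresh mask per augmentation
        # One marginal sample per feature dimension
        x_bar = np.array([rng.choice(X_u[:, j]) for j in range(d)])
        augmented.append(m * x_bar + (1 - m) * x)
    return np.stack(augmented)

rng = np.random.default_rng(0)
X_u = rng.normal(size=(20, 6))
augs = k_augment(X_u[0], X_u, p_m=0.3, K=5, rng=rng)
```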
In regards to claim 6 and analogous claim 19,
Yoon and Yao teach The method of claim 1,
Yoon teaches wherein sampling the one or more feature values from the marginal distribution of the feature dimension as specified in the set of unlabeled training data comprises: sampling the one or more feature values from a [uniform] distribution over the feature values that appear in the feature dimension at least a threshold amount of times across the unlabeled training inputs in the set of unlabeled training data.
(Yoon, Figure 1, Section 4.1, “We introduce two pretext tasks: feature vector estimation and mask vector estimation. Our goal is to optimize a pretext model to recover an input sample (a feature vector) from its corrupted variant, at the same time as estimating the mask vector that has been applied to the sample.
In our framework, the two pretext tasks share a single pretext distribution p_{Xs,Ys}. First, a mask vector generator outputs a binary mask vector m = [m_1, …, m_d]^T ∈ {0,1}^d, where m_j is randomly sampled from a Bernoulli distribution with probability p_m. Then a pretext generator g_m: X × {0,1}^d → X takes a sample x [sampling the one or more feature values from a [uniform] distribution over the feature values that appear in the feature dimension at least a threshold amount of times across the unlabeled training inputs in the set of unlabeled training data; wherein a sample is taken if it’s in the set of unlabeled training data (thus appearing once)] from D_u and a mask vector m as input, and generates a masked sample x̃.”)
However, Yoon does not explicitly teach a uniform distribution.
Pushak teaches a uniform distribution
(Pushak, Col. 7 lines 31-38, “Steps 1.1 and 1.2 of Algorithm 1 are conditioning steps, and step 1.2 is generally very slow. There are numerous approximate nearest neighbor methods that can be used to find projfkeep(X)[j] to help speed up this portion of Algorithm 1. For example, an exact (or a 1+epsilon-approximate) k-nearest neighbor method can be used to find k approximate nearest neighbors, from which a single approximate nearest neighbor can be sampled uniformly at random [uniform distribution].”)
In regards to claim 7,
Yoon, Yao, and Pushak teach The method of claim 6,
Yoon teaches wherein the threshold is one.
(Yoon, Figure 1, Section 4.1, “We introduce two pretext tasks: feature vector estimation and mask vector estimation. Our goal is to optimize a pretext model to recover an input sample (a feature vector) from its corrupted variant, at the same time as estimating the mask vector that has been applied to the sample.
In our framework, the two pretext tasks share a single pretext distribution p_{Xs,Ys}. First, a mask vector generator outputs a binary mask vector m = [m_1, …, m_d]^T ∈ {0,1}^d, where m_j is randomly sampled from a Bernoulli distribution with probability p_m. Then a pretext generator g_m: X × {0,1}^d → X takes a sample x [wherein the threshold is one; wherein a sample is taken if it’s in the set of unlabeled training data (thus appearing once)] from D_u and a mask vector m as input, and generates a masked sample x̃.”)
In regards to claim 8,
Yoon and Yao teach The method of claim 1,
Pushak teaches wherein applying the corruption to the feature using the one or more feature values comprises replacing the feature with the one or more feature values.
(Pushak, Col. 7 lines 2-7, Algorithm 1, “1.4 Replace feature values with the neighbor's values [replacing the feature with the one or more feature values]. Then, the feature values freplace of X[i] are replaced with those from X[j]. This creates an approximately realistic data instance, X[i]perturbed [applying the corruption to the feature using the one or more feature values comprises], in which the features, freplace, are modified with values that are likely to occur when conditioning on the other features, fkeep.”)
In regards to claim 13,
Yoon, Yao, and Pushak teach The method of claim 8,
Yoon teaches wherein the training further comprises, after training the neural network on the set of unlabeled training data,
(Yoon, Section 4.2, Figure 1, “Figure 1: Block diagram of the proposed self-supervised learning framework on tabular data. (1) Mask generator generates binary mask vector (m) which is combined with an input sample (x) to create a masked and corrupted sample (~x), (2) Encoder (e) transforms ~x into a latent representation (z), (3) Mask vector estimator (sm) is trained by minimizing the cross-entropy loss with m, feature vector estimator (sr) is trained by minimizing the reconstruction loss with x, (4) Encoder (e) is trained by minimizing the weighted sum of both losses [training the neural network on the set of unlabeled training data].”)
(Yoon, Figure 1, Section 3, “In this section, we introduce the general formulation of self- and semi-supervised learning. Suppose we have a small labeled dataset D_l = {(x_i, y_i)}_{i=1}^{N_l} and a large unlabeled dataset D_u = {x_i}_{i=N_l+1}^{N_l+N_u} [unlabeled training data], where x_i ∈ X ⊆ R^d and y_i ∈ Y.”)
Yoon teaches adapting the encoder sub-neural network for a specific machine learning task including adjusting learned values of the plurality of encoder network parameters using labeled data comprising labeled training inputs.
[Yoon, Figure 2 (annotated), media_image25.png]
(Yoon, Section 4.2, Figure 2 teaches adapting the encoder sub-neural network for a specific machine learning task including adjusting learned values of the plurality of encoder network parameters using labeled data comprising labeled training inputs (green arrows; With labeled samples (Dl)))
(Yoon, Figure 2 Encoder (e), Section 3, “In this section, we introduce the general formulation of self- and semi-supervised learning. Suppose we have a small labeled dataset Dl = {(xi, yi)}, i = 1, ..., Nl, [labeled data comprising labeled training inputs] and a large unlabeled dataset Du = {xi}, i = Nl+1, ..., Nl+Nu, where xi ∈ X ⊆ R^d and yi ∈ Y. The label yi is a scalar in single-task learning while it can be given as a multi-dimensional vector in multi-task learning [for a specific machine learning task including adjusting learned values of the plurality of encoder network parameters]. We assume every input feature vector xi in Dl and Du is sampled i.i.d. from a feature distribution p_x, and the labeled data pairs (xi, yi) in Dl are drawn from a joint distribution p_x,y.”)
In regards to claim 14,
Yoon and Yao and Pushak teaches The method of claim 13,
Yoon teaches wherein adapting the encoder neural network for a specific machine learning task further comprises: processing, using the sub-encoder sub-neural network and in accordance with the learned values of the plurality of encoder network parameters, a labeled training input to generate an embedding of the labeled training input;
[Yoon, Figure 2 (annotated), media_image26.png]
(Yoon, Section 4.2, Figure 2: processing, using the sub-encoder sub-neural network and in accordance with the learned values of the plurality of encoder network parameters, i.e. Encoder (e), a labeled training input, i.e. x, to generate an embedding of the labeled training input, i.e. feature representation z (green arrows))
Yoon teaches processing, using an output sub-neural network and in accordance with current values of a plurality of output network parameters, the embedding to generate a training output;
(Yoon, Section 4.2, Figure 2 teaches processing, using an output sub-neural network, i.e. Predictor f, and in accordance with current values of a plurality of output network parameters, the embedding, i.e. feature representations, to generate a training output, i.e. predictions y)
Yoon teaches computing a supervised learning loss function evaluating a difference between the training output and a ground truth output associated with the labeled training input;
(Yoon, Section 4.2, “The supervised loss Ls is given by Ls = E(x,y)~p_x,y [ls(y, fe(x))], where ls is the standard supervised loss function, e.g. mean squared error for regression or categorical cross-entropy for classification.”; wherein fe(x) is the training output and y is the ground truth output associated with the labeled training input)
Yoon teaches and determining, based on computing a gradient of the supervised learning loss function with respect to the plurality of encoder network parameters and to the plurality of output network parameters, an adjustment to the learned values of the plurality of encoder network parameters.
(Yoon, Section 4.2, Figure 2, “(3) Predictive model is trained by minimizing the supervised loss on (x, y) in Dl”; Figure 2 teaches training via back-propagation, which computes a gradient of the supervised loss function with respect to the plurality of network parameters and determines from that gradient an adjustment to their learned values)
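The four claim-14 steps as mapped above (embed, predict, score, back-propagate) can be illustrated with an assumed toy model: a linear encoder, a linear predictor, and squared error. The hand-derived chain-rule gradients below are for this toy model only and stand in for generic back-propagation; none of the names or shapes come from Yoon.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(16, 5))               # labeled training inputs
y = rng.normal(size=(16, 1))               # associated ground truth outputs

W_e = rng.normal(scale=0.1, size=(5, 8))   # encoder parameters (learned in pretraining)
W_f = rng.normal(scale=0.1, size=(8, 1))   # output sub-network parameters

Z = X @ W_e                                # embedding of each labeled training input
y_hat = Z @ W_f                            # training output from the output sub-network
loss = np.mean((y_hat - y) ** 2)           # supervised loss: output vs. ground truth

# Gradient of the loss w.r.t. BOTH encoder and output parameters (chain rule).
g = 2 * (y_hat - y) / len(y)
grad_W_f = Z.T @ g
grad_W_e = X.T @ (g @ W_f.T)

lr = 0.01                                  # small step: adjustment to the learned values
W_e -= lr * grad_W_e
W_f -= lr * grad_W_f
loss_after = np.mean(((X @ W_e) @ W_f - y) ** 2)
```

Because both grad_W_e and grad_W_f are taken from the same supervised loss, a single descent step adjusts the encoder's learned values and the output network's current values together.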
In regards to claim 15,
Yoon and Yao and Pushak teaches The method of claim 14,
Yoon teaches wherein the specific machine learning task comprises a classification task, and wherein the supervised learning loss function comprises a cross-entropy loss function.
(Yoon, Section 4.2, “The supervised loss Ls is given by Ls = E(x,y)~p_x,y [ls(y, fe(x))], where ls is the standard supervised loss function, e.g. mean squared error for regression or categorical cross-entropy for classification.”)
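The categorical cross-entropy named in the quoted passage has the standard form below; this generic sketch (one-hot labels against predicted class probabilities) is an illustration of the loss, not code from Yoon.

```python
import numpy as np

def categorical_cross_entropy(y_true, y_prob, eps=1e-12):
    """Mean cross-entropy between one-hot labels and predicted class probabilities."""
    return -np.mean(np.sum(y_true * np.log(y_prob + eps), axis=1))

# Two samples, three classes; only the probability of the true class contributes.
y_true = np.array([[1, 0, 0],
                   [0, 1, 0]])
y_prob = np.array([[0.7, 0.2, 0.1],
                   [0.1, 0.8, 0.1]])
loss = categorical_cross_entropy(y_true, y_prob)  # -(ln 0.7 + ln 0.8) / 2 ≈ 0.29
```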
In regards to claim 16,
Yoon and Yao and Pushak teaches The method of claim 13,
Yoon teaches further comprising providing the learned values of the plurality of encoder network parameters for use in performing the specific machine learning task.
(Yoon, Section 5.2, “In this subsection, we evaluate the methods on clinical data, using the UK and US prostate cancer datasets (from Prostate Cancer UK and SEER datasets, respectively). The features consist of patients’ clinical information (e.g. age, grade, stage, Gleason scores) - total 28 features. We predict 2 possible treatments of UK prostate cancer patients (1) Hormone therapy (whether the patients got hormone therapy), (2) Radical therapy (whether the patient got radical therapy) [performing the specific machine learning task]. Both tasks are binary classification. In the UK prostate cancer dataset, we only have around 10,000 labeled patients samples. The US prostate cancer dataset contains more than 200,000 unlabeled patients samples, twenty times bigger than the labeled UK dataset. We use 50% of the UK dataset (as the labeled data) and the entire US dataset (as the unlabeled data) for training, with the remainder of the UK data being used as the testing set. We also test three popular supervised learning models: Logistic Regression, a 2-layer Multi-layer Perceptron and XGBoost.
Table 1 shows that VIME [providing the learned values of the plurality of encoder network parameters for use] results in the best prediction performance, outperforming the benchmarks. More importantly, VIME is the only self- or semi-supervised learning framework that significantly outperforms supervised learning models. These results shed light on the unique advantage of using VIME in leveraging a large unlabeled tabular dataset (e.g. the US dataset) to strengthen a model’s predictive power. Here we also demonstrate that VIME can perform well even when there exists a distribution shift between the UK labeled data and the US unlabeled data (see the Supplementary Materials (Section 2) for further details).
[Table 1 of Yoon, media_image28.png]”)
Claim 19 is rejected on the same grounds under 35 U.S.C. 103 as claim 6 as they are substantially similar.
Conclusion
The prior art made of record and not relied upon is considered pertinent to applicant's disclosure.
Dang, Zhiyuan, et al. "Doubly contrastive deep clustering." arXiv preprint arXiv:2103.05484 (2021).
Any inquiry concerning this communication or earlier communications from the examiner should be directed to JASMINE THAI whose telephone number is (703)756-5904. The examiner can normally be reached M-F 8-4.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Michael Huntley can be reached at (303) 297-4307. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.
/J.T.T./Examiner, Art Unit 2129
/MICHAEL J HUNTLEY/Supervisory Patent Examiner, Art Unit 2129