DETAILED ACTION
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Continued Examination Under 37 CFR 1.114
A request for continued examination under 37 CFR 1.114, including the fee set forth in 37 CFR 1.17(e), was filed in this application after final rejection. Since this application is eligible for continued examination under 37 CFR 1.114, and the fee set forth in 37 CFR 1.17(e) has been timely paid, the finality of the previous Office action has been withdrawn pursuant to 37 CFR 1.114. Applicant's submission filed on 10/30/2025 has been entered.
Response to Arguments
Applicant's arguments filed 10/30/2025 have been fully considered but they are not persuasive.
Regarding applicant’s remarks directed to the rejection of claims under 35 USC § 103, Applicant argues that Yoon and Yao do not teach the amended portions “wherein processing each unlabeled training input comprises processing an unchanged version of each unlabeled training input” and provides an annotated Figure 4A:
[Applicant’s annotated Figure 4A]
Examiner respectfully points out that the amended claim recites “processing an unchanged version of each unlabeled training input.” Examiner notes that there are two reasonable interpretations of the limitation:
(1) “processing an unchanged version of each unlabeled training input”
Under this interpretation, examiner notes that “processing” is not explicitly defined; thus, the processing can be taught by the masking and corruption of Yoon or the augmentation of Yao.
Further, “an unchanged version of each unlabeled training input” can be interpreted to simply mean “each unlabeled training input,” as the processing is performed on the original data/version (i.e., an unchanged version of each unlabeled training input).
(2) “processing an unchanged version of each unlabeled training input”
Under this interpretation, examiner notes that the processing is performed on a version of each unlabeled training input, where the version can be obtained from the masking and corruption of Yoon or the augmentation of Yao; as no further changes are made after the version is obtained, it is thus “unchanged.”
In other words, in response to applicant's argument that the references fail to show certain features of the invention, it is noted that the features upon which applicant relies (i.e., an unchanged input as a first embedding (as interpreted from annotated Figure 4A)) are not recited in the rejected claim(s). Although the claims are interpreted in light of the specification, limitations from the specification are not read into the claims. See In re Van Geuns, 988 F.2d 1181, 26 USPQ2d 1057 (Fed. Cir. 1993). Thus, applicant’s arguments are considered unpersuasive.
For examination purposes, examiner considers the first interpretation (1) as the broadest reasonable interpretation. The examiner further refers to the rejection under 35 USC § 103 in the current office action for more details.
Claim Rejections - 35 USC § 103
In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis (i.e., changing from AIA to pre-AIA) for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.
The factual inquiries for establishing a background for determining obviousness under 35 U.S.C. 103 are summarized as follows:
1. Determining the scope and contents of the prior art.
2. Ascertaining the differences between the prior art and the claims at issue.
3. Resolving the level of ordinary skill in the pertinent art.
4. Considering objective evidence present in the application indicating obviousness or nonobviousness.
Claims 1, 9-12, 17, and 20 are rejected under 35 U.S.C. 103 as being unpatentable over Yoon, Jinsung, et al. "VIME: Extending the success of self- and semi-supervised learning to tabular domain." Advances in Neural Information Processing Systems 33 (2020) (“Yoon”) in view of Yao, Tiansheng, et al. "Self-supervised Learning for Large-scale Item Recommendations." arXiv preprint arXiv:2007.12865 (2021) (“Yao”).
In regards to claim 1,
Yoon teaches A computer-implemented method of training a neural network having a plurality of network parameters, the method comprising: obtaining [a set of] unlabeled training inputs from an unlabeled training dataset, the [set of] unlabeled training inputs each having a respective feature in each of a plurality of feature dimensions;
(Yoon, Figure 1, Section 3, “In this section, we introduce the general formulation of self- and semi-supervised learning. Suppose we have a small labeled dataset D_l = {(x_i, y_i)}_{i=1}^{N_l} and a large unlabeled dataset D_u = {x_i}_{i=N_l+1}^{N_l+N_u} [unlabeled training inputs from an unlabeled training dataset], where x_i ∈ X ⊆ R^d and y_i ∈ Y.”)
Yoon teaches for each unlabeled training input [in the set] of unlabeled training inputs, generating a corrupted version of the unlabeled training input, comprising determining a proper subset of the feature dimensions
(Yoon, Figure 1, Section 3.2, “Semi-supervised learning optimizes the predictive model f by minimizing the supervised loss function jointly with some unsupervised loss function defined over the output space Y. Formally, semi-supervised learning is formulated as an optimization problem as follows,

min_f E_{(x,y)~p_{X,Y}}[l_s(f(x), y)] + β · E_{x~p_X, x'~p̃_X(x'|x)}[l_u(f(x), f(x'))]

Where l_u is an unsupervised loss function, and a hyperparameter β is introduced to control the trade-off between the supervised and unsupervised losses. x' is a perturbed version of x [generating a corrupted version of the unlabeled training input] assumed to be drawn from a conditional distribution p̃_X(x'|x) [comprising determining a proper subset of the feature dimensions].”)
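For illustration only, the semi-supervised objective quoted above (a supervised loss plus a β-weighted unsupervised loss, with the expectations approximated by batch means) can be sketched in Python; all function and variable names here are illustrative and not drawn from the record:

```python
import numpy as np

def semi_supervised_objective(supervised_losses, unsupervised_losses, beta):
    """Approximate min_f E[l_s] + beta * E[l_u] with batch means.

    beta controls the trade-off between the supervised and
    unsupervised loss terms, as in the quoted formulation.
    """
    return np.mean(supervised_losses) + beta * np.mean(unsupervised_losses)

# Example: mean(0.4, 0.6) + 0.5 * mean(1.0, 3.0) = 0.5 + 0.5 * 2.0 = 1.5
obj = semi_supervised_objective([0.4, 0.6], [1.0, 3.0], beta=0.5)
```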
Yoon teaches and, for each feature dimension that is in the proper subset of feature dimensions, applying a corruption to the respective feature in the feature dimension using one or more feature values sampled from a marginal distribution of the feature dimension as specified in the unlabeled training dataset;
(Yoon, Figure 1, Section 4.1, “We introduce two pretext tasks: feature vector estimation and mask vector estimation. Our goal is to optimize a pretext model to recover an input sample (a feature vector) from its corrupted variant, at the same time as estimating the mask vector that has been applied to the sample.
In our framework, the two pretext tasks share a single pretext distribution p_{Xs,Ys}. First, a mask vector generator outputs a binary mask vector m = [m_1, …, m_d]^T ∈ {0,1}^d, where m_j is randomly sampled from a Bernoulli distribution with probability p_m. Then a pretext generator g_m: X × {0,1}^d → X takes a sample x [for each feature dimension that is in the proper subset of feature dimensions] from D_u and a mask vector m as input, and generates a masked sample x̃. The generating process of x̃ is given by

x̃ = m ⊙ x̄ + (1 − m) ⊙ x

where the j-th feature of x̄ [using one or more feature values sampled from a … distribution of the feature dimension as specified in the set of unlabeled training data] is sampled from the empirical distribution

p̂_{X_j} = (1/N_u) Σ_{i=N_l+1}^{N_l+N_u} δ(x_j = x_{i,j}),

where x_{i,j} is the j-th feature of the i-th sample in D_u (i.e. the empirical marginal [marginal] distribution of each feature). - see Figure 3 in the Supplementary Materials for further details. The generating process in Equation (3) ensures the corrupted sample x̃ is not only tabular but also similar to the samples in D_u.”)
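For illustration only, the masking-and-corruption process of Yoon quoted above (a Bernoulli mask m, with masked features replaced by values drawn from each feature dimension's empirical marginal distribution) can be sketched as follows; the function and variable names are illustrative and not drawn from the reference:

```python
import numpy as np

def corrupt(x_batch, p_m, rng):
    """Generate corrupted versions of a batch of unlabeled inputs.

    Each feature is independently masked with probability p_m; masked
    features are replaced by values drawn from the empirical marginal
    distribution of that feature dimension (a column-wise shuffle).
    """
    n, d = x_batch.shape
    # Binary mask: m_j ~ Bernoulli(p_m) for each feature dimension
    m = rng.binomial(1, p_m, size=(n, d))
    # x_bar: sample each feature from its empirical marginal distribution
    x_bar = np.stack([rng.permutation(x_batch[:, j]) for j in range(d)], axis=1)
    # x_tilde = m * x_bar + (1 - m) * x
    x_tilde = m * x_bar + (1 - m) * x_batch
    return x_tilde, m

rng = np.random.default_rng(0)
x = rng.normal(size=(8, 5))
x_tilde, m = corrupt(x, p_m=0.3, rng=rng)
```

Entries where m = 0 pass through unchanged, while masked entries take values observed elsewhere in the same feature dimension, so the corrupted sample remains tabular and similar to the samples in D_u.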
Yoon teaches processing, using the neural network and in accordance with the current values of the plurality of network parameters, the corrupted version of each unlabeled training input [in the set] of unlabeled training inputs to generate a second embedding of the corrupted version of the unlabeled training input;
(Yoon, Section 4.2, Figure 1, “Figure 1: Block diagram of the proposed self-supervised learning framework [processing, using the neural network and in accordance with the current values of the plurality of network parameters] on tabular data. (1) Mask generator generates binary mask vector (m) which is combined with an input sample (x) to create a masked and corrupted sample (~x) [the corrupted version of each unlabeled training input in the batch of unlabeled training inputs to generate], (2) Encoder (e) transforms ~x into a latent representation (z) [a second embedding of the corrupted version of the unlabeled training input; wherein the generating can be performed on each input provided from the batch of Yao], (3) Mask vector estimator (sm) is trained by minimizing the cross-entropy loss with m, feature vector estimator (sr) is trained by minimizing the reconstruction loss with x, (4) Encoder (e) is trained by minimizing the weighted sum of both losses.”)
However, Yoon does not explicitly teach a set; processing, using the neural network and in accordance with current values of the plurality of network parameters, each unlabeled training input in the set of unlabeled training inputs to generate a first embedding of the unlabeled training input, wherein processing each unlabeled training input comprises processing an unchanged version of each unlabeled training input; and training the neural network based on optimizing a contrastive learning loss function, wherein for each unlabeled training input in the set of unlabeled training inputs, optimizing the contrastive learning loss function trains the neural network to reduce a difference between (i) the first embedding of the unlabeled training input that is generated by the neural network from processing an unchanged version of the unlabeled training input and (ii) the second embeddings of the corrupted version of the unlabeled training input that is generated by the neural network from processing the corrupted version of the unlabeled training input, and to increase a difference between (i) the first embedding of the unlabeled training input that is generated by the neural network from processing an unchanged version of the unlabeled training input and (ii) a different second embedding of a corrupted version of a different unlabeled training input that is generated by the neural network from processing the corrupted version of the different unlabeled training input in the set of unlabeled training inputs.
Yao teaches a set
(Yao, Section 3.1, “We consider a batch of N [set] item examples x_1, …, x_N, where x_i ∈ X represents a set of features for example i. In the context of recommenders, an example indicates a query, an item or a query-item pair. Suppose there are a pair of transform functions h, g: X → X that augment x_i to be y_i, y'_i respectively,

y_i ← h(x_i),  y'_i ← g(x_i)

”)
Yao teaches processing, using the neural network and in accordance with current values of the plurality of network parameters, each unlabeled training input in the set of unlabeled training inputs to generate a first embedding of the unlabeled training input, wherein processing each unlabeled training input comprises processing an unchanged version of each unlabeled training input;
(Yao, Section 3, Fig. 2, “The basic idea is two folds: first, we apply different data augmentation for the same training example to learn representations [processing, using the neural network and in accordance with current values of the plurality of network parameters]; and then use contrastive loss function to encourage the representations learned for the same training example to be similar.”
[Yao, Figure 2: diagram of the self-supervised learning framework]
)
(Yao, Section 3.1, “We consider a batch of N item examples x_1, …, x_N, where x_i ∈ X represents a set of features for example i. In the context of recommenders, an example indicates a query, an item or a query-item pair. Suppose there are a pair of transform functions h, g: X → X that augment x_i to be y_i, y'_i respectively, [each unlabeled training input in the set of unlabeled training inputs to generate a first embedding of the unlabeled training input, wherein processing each unlabeled training input comprises processing an unchanged version of each unlabeled training input; wherein transform function h and encoder H is applied to the original input data to generate the first embedding; examiner further notes that transform function g and encoder G can be taught by the augmentation (corruption) and encoder of Yoon]

y_i ← h(x_i),  y'_i ← g(x_i)

”)
Yao teaches and training the neural network based on optimizing a contrastive learning loss function, wherein for each unlabeled training input in the set of unlabeled training inputs, optimizing the contrastive learning loss function trains the neural network to reduce a difference between (i) the first embedding of the unlabeled training input that is generated by the neural network from processing an unchanged version of the unlabeled training input and (ii) the second embeddings of the corrupted version of the unlabeled training input that is generated by the neural network from processing the corrupted version of the unlabeled training input, and to increase a difference between (i) the first embedding of the unlabeled training input that is generated by the neural network from processing an unchanged version of the unlabeled training input and (ii) a different second embedding of a corrupted version of a different unlabeled training input that is generated by the neural network from processing the corrupted version of the different unlabeled training input in the set of unlabeled training inputs.
(Yao, Section 3.1, annotated Fig. 2, “Given the same input of example i, we want to learn different representations y_i, y'_i after augmentation to make sure the model still recognizes that both y_i and y'_i represent the same input i. In other words, the contrastive loss learns to minimize the difference between y_i, y'_i [reduce a difference between the first and second embeddings; wherein the second embedding is provided from the aforementioned methods of Yoon]. In the meantime, for different examples i and j, the contrastive loss maximizes the difference between the representations learned y_i, y'_j after different data augmentations. Let z_i, z'_i denote the embeddings of y_i, y'_i after encoded by two neural networks H, G: X → R^d, that is

z_i ← H(y_i),  z'_i ← G(y'_i)

We treat (z_i, z'_i) as positive pairs, and (z_i, z'_j) as negative pairs for i ≠ j. Let s(z_i, z'_j) = ⟨z_i, z'_j⟩/(∥z_i∥ · ∥z'_j∥). To encourage the above properties, we define the SSL loss for a batch of N examples {x_i} as:

L_self({x_i}; H, G) = −(1/N) Σ_{i∈[N]} log( exp(s(z_i, z'_i)/τ) / Σ_{j∈[N]} exp(s(z_i, z'_j)/τ) )

where τ is a tunable hyper-parameter for the softmax temperature. The above loss function learns a robust embedding space such that similar items are close to each other after data augmentation, and random examples are pushed farther away [increase a difference between the first embedding and a different second embedding; wherein the second embedding is provided from the aforementioned methods of Yoon]. The overall framework is illustrated in Figure 2.
[Yao, Figure 2: diagram of the overall SSL framework]
”)
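For illustration only, the contrastive loss of Yao quoted above can be sketched as a batch softmax over temperature-scaled cosine similarities; this NumPy reading is illustrative, and the names are not from the reference:

```python
import numpy as np

def ssl_loss(z, z_prime, tau=0.1):
    """Contrastive SSL loss over a batch (InfoNCE-style form).

    Positive pairs (z_i, z'_i) are pulled together; negative pairs
    (z_i, z'_j), i != j, are pushed apart via a softmax over cosine
    similarities with temperature tau.
    """
    z_n = z / np.linalg.norm(z, axis=1, keepdims=True)
    zp_n = z_prime / np.linalg.norm(z_prime, axis=1, keepdims=True)
    sim = z_n @ zp_n.T / tau                    # s(z_i, z'_j) / tau
    log_softmax = sim - np.log(np.exp(sim).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_softmax))       # -(1/N) sum_i log softmax_ii

rng = np.random.default_rng(0)
z = rng.normal(size=(4, 8))
loss_matched = ssl_loss(z, z)            # positives aligned: small loss
loss_mismatched = ssl_loss(z, z[::-1])   # positives misaligned: larger loss
```

With aligned positives, each row's largest similarity sits on the diagonal, so the loss is small; permuting the pairing moves the large term off-diagonal and the loss grows.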
Yoon and Yao are both considered to be analogous to the claimed invention because they are in the same field of self-supervised learning. Therefore, it would have been obvious to someone of ordinary skill in the art before the effective filing date of the claimed invention to have modified Yoon to incorporate the teachings of Yao in order to provide a framework to improve item representation learning from a batch of items (Yao, Abstract para. 2, “Inspired by the recent success in self-supervised representation learning research in both computer vision and natural language understanding, we propose a multi-task self-supervised learning (SSL) framework for large-scale item recommendations. The framework is designed to tackle the label sparsity problem by learning better latent relationship of item features. Specifically, SSL improves item representation learning as well as serving as additional regularization to improve generalization. Furthermore, we propose a novel data augmentation method that utilizes feature correlations within the proposed framework.”)
In regards to claim 11 and analogous claims 9 and 10,
Yoon and Yao teach The method of claim 1,
Yoon teaches wherein the features in a first feature dimension are numerical features and the features in a second feature dimension are categorical features.
(Yoon, Section 4.1, “Our framework is novel in learning the correlations for tabular data whose correlation structure is less obvious than in images or language. The learned representation that captures the correlation across different parts of the object, regardless of the object type (e.g. language, image or tabular data) [wherein the features in a first feature dimension are numerical features and the features in a second feature dimension are categorical features; wherein the type can be numerical, categorical, etc], is an informative input for the various downstream tasks.”)
In regards to claim 12,
Yoon and Yao teach The method of claim 1,
Yoon teaches wherein the neural network comprises an encoder sub-neural network having a plurality of encoder network parameters and an embedding generation sub-neural network having a plurality of embedding generation network parameters,
[Yoon, Figure 1: block diagram of the self-supervised learning framework]
(Yoon, Section 3.2, Figure 1 Encoder, “(2) Encoder (e) [an encoder sub-neural network having a plurality of encoder network parameters and an embedding generation sub-neural network having a plurality of embedding generation network parameters; wherein the encoder and embedding generation sub-neural network is interpreted to be the same] transforms x̃ into a latent representation (z)”)
Claims 9 and 10 are rejected under the same rationale as claim 11 as they are substantially similar.
Claim 17 and claim 20 are rejected on the same grounds under 35 U.S.C. 103 as claim 1 as they are substantially similar.
Claims 2-3 and 18 are rejected under 35 U.S.C. 103 as being unpatentable over Yoon in view of Yao, and further in view of Oord, Aaron van den, Yazhe Li, and Oriol Vinyals. "Representation learning with contrastive predictive coding." arXiv preprint arXiv:1807.03748 (2018) (“Oord”).
In regards to claim 2 and analogous claim 18,
Yoon and Yao teach The method of claim 1,
Oord teaches wherein the contrastive learning loss function comprises a noise contrastive estimation (NCE) loss function.
(Oord, Section 2.3, “Both the encoder and autoregressive model are trained to jointly optimize a loss based on NCE [a noise contrastive estimation (NCE) loss function], which we will call InfoNCE”)
Oord is considered to be analogous to the claimed invention because they are in the same field of contrastive learning. Therefore, it would have been obvious to someone of ordinary skill in the art before the effective filing date of the claimed invention to have modified Yoon and Yao to incorporate the teachings of Oord in order to provide a technique that captures information that is maximally useful for prediction (Oord, Abstract, “The key insight of our model is to learn such representations by predicting the future in latent space by using powerful autoregressive models. We use a probabilistic contrastive loss which induces the latent space to capture information that is maximally useful to predict future samples.”)
In regards to claim 3,
Yoon, Yao, and Oord teach The method of claim 2,
Oord teaches wherein the NCE loss function comprises an InfoNCE loss function.
(Oord, Section 2.3, “Both the encoder and autoregressive model are trained to jointly optimize a loss based on NCE, which we will call InfoNCE. [an InfoNCE loss function]”)
Claim 18 is rejected on the same grounds under 35 U.S.C. 103 as claim 2 as they are substantially similar.
Claims 4-8, 13-16 and 19 are rejected under 35 U.S.C. 103 as being unpatentable over Yoon in view of Yao, and further in view of U.S. Pat. No. 11,687,540 B2 to Pushak et al. (“Pushak”).
In regards to claim 4,
Yoon and Yao teach The method of claim 1,
Pushak teaches wherein determining the proper subset of feature dimensions comprises sampling the proper subset of feature dimensions from the plurality of feature dimensions with uniform randomness.
(Pushak, Col. 6 line 51 – Col. 7 line 1, “Algorithm 1:
1.1 Project out the features to be replaced. First all the data, X, is projected into the subspace of features, f_keep, that are not being replaced, i.e., f_keep = F \ f_replace. The data in this projected subspace is denoted by proj_{f_keep}(X).
1.2 Find a set of k neighbors in the subspace. Next, any (approximate) nearest neighbor method is used to find a set of data instances, {proj_{f_keep}(X)[j_1], proj_{f_keep}(X)[j_2], …, proj_{f_keep}(X)[j_k]}, that are similar to proj_{f_keep}(X)[i].
1.3 Randomly sample one of the neighbors. A single random neighbor proj_{f_keep}(X)[j] ∈ {proj_{f_keep}(X)[j_1], proj_{f_keep}(X)[j_2], …, proj_{f_keep}(X)[j_k]} is picked from the set of k neighbors.”)
(Pushak, Col. 7 lines 31-38, “Steps 1.1 and 1.2 of Algorithm 1 are conditioning steps, and step 1.2 is generally very slow. There are numerous approximate nearest neighbor methods that can be used to find projfkeep(X)[j] to help speed up this portion of Algorithm 1. For example, an exact (or a 1+epsilon-approximate) k-nearest neighbor method can be used to find k approximate nearest neighbors, from which a single approximate nearest neighbor can be sampled uniformly at random [sampling the proper subset of feature dimensions from the plurality of feature dimensions with uniform randomness].”)
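For illustration only, steps 1.1 through 1.4 of Pushak's Algorithm 1, as quoted above, can be sketched as follows (exact nearest neighbors are used here for brevity, and all names are illustrative, not from the patent):

```python
import numpy as np

def conditional_replace(X, i, f_replace, k, rng):
    """Sketch of conditional feature replacement.

    1.1 Project out the features to be replaced.
    1.2 Find the k nearest neighbors of instance i in that subspace.
    1.3 Sample one neighbor uniformly at random.
    1.4 Replace the selected features of X[i] with the neighbor's values.
    """
    n, d = X.shape
    f_keep = [j for j in range(d) if j not in f_replace]
    proj = X[:, f_keep]                  # step 1.1: projected subspace
    dist = np.linalg.norm(proj - proj[i], axis=1)
    dist[i] = np.inf                     # exclude the instance itself
    neighbors = np.argsort(dist)[:k]     # step 1.2: k nearest neighbors
    j = rng.choice(neighbors)            # step 1.3: uniform random pick
    x_pert = X[i].copy()
    x_pert[f_replace] = X[j, f_replace]  # step 1.4: replace feature values
    return x_pert

rng = np.random.default_rng(0)
X = rng.normal(size=(10, 4))
x_pert = conditional_replace(X, i=0, f_replace=[2, 3], k=3, rng=rng)
```

Because the replacement values come from a neighbor in the kept-feature subspace, the perturbed instance preserves correlations found in the original dataset, as the quoted passage explains.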
Pushak is considered to be analogous to the claimed invention because they are in the same field of approximate sampling of features for use in machine learning. Therefore, it would have been obvious to someone of ordinary skill in the art before the effective filing date of the claimed invention to have modified Yoon and Yao to incorporate the teachings of Pushak in order to provide steps of conditional sampling in order to solve deficiencies with marginal sampling (Pushak, Background, “While marginal sampling is a relatively inexpensive and efficient procedure, sampling from the marginal distribution of a dataset can result in data instances with perturbed feature values that represent unrealistic data instances, thus explaining the black-box ML model in regions where predictions of the model cannot be trusted and may never be needed. This can be particularly problematic when the black-box ML model is configured to identify anomalous data, where feature shuffling based on random samples generated using marginal sampling can easily produce perturbed data instances that do not conform to correlations found in the original dataset…Sampling from conditional distributions can be advantageous in many situations, including to help ML explainers identify features that distinguish anomalous data instances in a dataset. To illustrate, consider that instead of utilizing the marginal distribution for the ML explainer example above, the value for the age of the example 150-pound person is instead sampled from a conditional distribution that represents the age feature values of data instances within the highlighted section of dataset 100 centering on 150 pounds (the value of the non-perturbed weight feature of the target data instance). Sampling from the conditional distribution, the correlations occurring in the dataset are preserved, thus creating approximately realistic records that are far less likely to be flagged as anomalous by the machine learning model. 
Thus, for dataset 100, the importance of the age feature will not be artificially inflated by perturbed data instances that break correlations in the dataset.”)
In regards to claim 5,
Yoon, Yao, and Pushak teach The method of claim 4, further comprising determining the subset of selected feature dimensions with uniform randomness
Yoon teaches in accordance with a predetermined corruption rate which specifies a total number of feature dimensions to be selected.
(Yoon, Section 4.2, Figure 2, “Figure 2: Block diagram of the proposed semi-supervised learning framework on tabular data. For an unlabeled sample x in Du, (1) Mask generator generates K-number [with a predetermined corruption rate which specifies a total number of feature dimensions to be selected] of mask vectors and combine each of them with x to generate the corrupted samples x̃_1, …, x̃_K via pretext generator (gm), (2) Encoder (e) transforms these corrupted samples into latent representations z̃_1, …, z̃_K as K different augmented samples, (3) Predictive model is trained by minimizing the supervised loss on (x, y) in Dl and the consistency loss on the augmented samples ŷ_1, …, ŷ_K jointly. The block diagram of the proposed self- and semi-supervised learning frameworks on exemplary tabular data can be found in the Supplementary Materials (Figure 2).”)
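For illustration only, the K-fold augmentation of Yoon's Figure 2, quoted above, can be sketched as follows (the pretext generator is reduced to its masking step, and all names are illustrative, not from the reference):

```python
import numpy as np

def k_augment(x, X_u, p_m, K, rng):
    """Generate K corrupted/augmented versions of one unlabeled sample x.

    Each version uses an independent random mask; masked features are
    drawn from the empirical marginal distributions of the unlabeled
    dataset X_u, so each augmentation stays tabular.
    """
    d = x.shape[0]
    augmented = []
    for _ in range(K):
        m = rng.binomial(1, p_m, size=d)  # fresh mask per augmentation
        # One marginal sample per feature dimension
        x_bar = np.array([rng.choice(X_u[:, j]) for j in range(d)])
        augmented.append(m * x_bar + (1 - m) * x)
    return np.stack(augmented)

rng = np.random.default_rng(0)
X_u = rng.normal(size=(20, 6))
augs = k_augment(X_u[0], X_u, p_m=0.3, K=5, rng=rng)
```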
In regards to claim 6 and analogous claim 19,
Yoon and Yao teach The method of claim 1,
Yoon teaches wherein sampling the one or more feature values from the marginal distribution of the feature dimension as specified in the set of unlabeled training data comprises: sampling the one or more feature values from a [uniform] distribution over the feature values that appear in the feature dimension at least a threshold amount of times across the unlabeled training inputs in the set of unlabeled training data.
(Yoon, Figure 1, Section 4.1, “We introduce two pretext tasks: feature vector estimation and mask vector estimation. Our goal is to optimize a pretext model to recover an input sample (a feature vector) from its corrupted variant, at the same time as estimating the mask vector that has been applied to the sample.
In our framework, the two pretext tasks share a single pretext distribution p_{Xs,Ys}. First, a mask vector generator outputs a binary mask vector m = [m_1, …, m_d]^T ∈ {0,1}^d, where m_j is randomly sampled from a Bernoulli distribution with probability p_m. Then a pretext generator g_m: X × {0,1}^d → X takes a sample x [sampling the one or more feature values from a [uniform] distribution over the feature values that appear in the feature dimension at least a threshold amount of times across the unlabeled training inputs in the set of unlabeled training data; wherein a sample is taken if it’s in the set of unlabeled training data (thus appearing once)] from D_u and a mask vector m as input, and generates a masked sample x̃.”)
However, Yoon does not explicitly teach a uniform distribution.
Pushak teaches a uniform distribution
(Pushak, Col. 7 lines 31-38, “Steps 1.1 and 1.2 of Algorithm 1 are conditioning steps, and step 1.2 is generally very slow. There are numerous approximate nearest neighbor methods that can be used to find projfkeep(X)[j] to help speed up this portion of Algorithm 1. For example, an exact (or a 1+epsilon-approximate) k-nearest neighbor method can be used to find k approximate nearest neighbors, from which a single approximate nearest neighbor can be sampled uniformly at random [uniform distribution].”)
In regards to claim 7,
Yoon, Yao, and Pushak teach The method of claim 6,
Yoon teaches wherein the threshold is one.
(Yoon, Figure 1, Section 4.1, “We introduce two pretext tasks: feature vector estimation and mask vector estimation. Our goal is to optimize a pretext model to recover an input sample (a feature vector) from its corrupted variant, at the same time as estimating the mask vector that has been applied to the sample.
In our framework, the two pretext tasks share a single pretext distribution p_{Xs,Ys}. First, a mask vector generator outputs a binary mask vector m = [m_1, …, m_d]^T ∈ {0,1}^d, where m_j is randomly sampled from a Bernoulli distribution with probability p_m. Then a pretext generator g_m: X × {0,1}^d → X takes a sample x [wherein the threshold is one; wherein a sample is taken if it’s in the set of unlabeled training data (thus appearing once)] from D_u and a mask vector m as input, and generates a masked sample x̃.”)
In regards to claim 8,
Yoon and Yao teach The method of claim 1,
Pushak teaches wherein applying the corruption to the feature using the one or more feature values comprises replacing the feature with the one or more feature values.
(Pushak, Col. 7 lines 2-7, Algorithm 1, “1.4 Replace feature values with the neighbor's values [replacing the feature with the one or more feature values]. Then, the feature values freplace of X[i] are replaced with those from X[j]. This creates an approximately realistic data instance, X[i]perturbed [applying the corruption to the feature using the one or more feature values comprises], in which the features, freplace, are modified with values that are likely to occur when conditioning on the other features, fkeep.”)
In regards to claim 13,
Yoon, Yao, and Pushak teach The method of claim 8,
Yoon teaches wherein the training further comprises, after training the neural network on the set of unlabeled training data,
(Yoon, Section 4.2, Figure 1, “Figure 1: Block diagram of the proposed self-supervised learning framework on tabular data. (1) Mask generator generates binary mask vector (m) which is combined with an input sample (x) to create a masked and corrupted sample (~x), (2) Encoder (e) transforms ~x into a latent representation (z), (3) Mask vector estimator (sm) is trained by minimizing the cross-entropy loss with m, feature vector estimator (sr) is trained by minimizing the reconstruction loss with x, (4) Encoder (e) is trained by minimizing the weighted sum of both losses [training the neural network on the set of unlabeled training data].”)
(Yoon, Figure 1, Section 3, “In this section, we introduce the general formulation of self- and semi-supervised learning. Suppose we have a small labeled dataset D_l = {(x_i, y_i)}_{i=1}^{N_l} and a large unlabeled dataset D_u = {x_i}_{i=N_l+1}^{N_l+N_u} [unlabeled training data], where x_i ∈ X ⊆ R^d and y_i ∈ Y.”)
Yoon teaches adapting the encoder sub-neural network for a specific machine learning task including adjusting learned values of the plurality of encoder network parameters using labeled data comprising labeled training inputs.
[Yoon, Figure 2 (annotated), media_image25.png]
(Yoon, Section 4.2, Figure 2 teaches adapting the encoder sub-neural network for a specific machine learning task including adjusting learned values of the plurality of encoder network parameters using labeled data comprising labeled training inputs (green arrows; With labeled samples (Dl)))
(Yoon, Figure 2 Encoder (e), Section 3, “In this section, we introduce the general formulation of self- and semi-supervised learning. Suppose we have a small labeled dataset Dl = {(xi, yi)}, i = 1, ..., Nl, [labeled data comprising labeled training inputs] and a large unlabeled dataset Du = {xi}, i = Nl+1, ..., Nl+Nu, where xi ∈ X ⊆ R^d and yi ∈ Y. The label yi is a scalar in single-task learning while it can be given as a multi-dimensional vector in multi-task learning [for a specific machine learning task including adjusting learned values of the plurality of encoder network parameters]. We assume every input feature vector xi in Dl and Du is sampled i.i.d. from a feature distribution p_x, and the labeled data pairs (xi, yi) in Dl are drawn from a joint distribution p_x,y.”)
In regards to claim 14,
Yoon and Yao and Pushak teaches The method of claim 13,
Yoon teaches wherein adapting the encoder neural network for a specific machine learning task further comprises: processing, using the sub-encoder sub-neural network and in accordance with the learned values of the plurality of encoder network parameters, a labeled training input to generate an embedding of the labeled training input;
[Yoon, Figure 2 (annotated), media_image26.png]
(Yoon, Section 4.2, Figure 2: processing, using the sub-encoder sub-neural network and in accordance with the learned values of the plurality of encoder network parameters, i.e. Encoder (e), a labeled training input, i.e. x, to generate an embedding of the labeled training input, i.e. feature representation z (green arrows))
Yoon teaches processing, using an output sub-neural network and in accordance with current values of a plurality of output network parameters, the embedding to generate a training output;
(Yoon, Section 4.2, Figure 2 teaches processing, using an output sub-neural network, i.e. Predictor f, and in accordance with current values of a plurality of output network parameters, the embedding, i.e. feature representations, to generate a training output, i.e. predictions y)
Yoon teaches computing a supervised learning loss function evaluating a difference between the training output and a ground truth output associated with the labeled training input;
(Yoon, Section 4.2, “The supervised loss Ls is given by Ls = E(x,y)~p_x,y [ls(y, fe(x))], where ls is the standard supervised loss function, e.g. mean squared error for regression or categorical cross-entropy for classification.”; wherein fe(x) is the training output and y is the ground truth output associated with the labeled training input)
Yoon teaches and determining, based on computing a gradient of the supervised learning loss function with respect to the plurality of encoder network parameters and to the plurality of output network parameters, an adjustment to the learned values of the plurality of encoder network parameters.
(Yoon, Section 4.2, Figure 2, “(3) Predictive model is trained by minimizing the supervised loss on (x, y) in Dl”; Figure 2 teaches training via back-propagation, which computes a gradient of the supervised loss function with respect to the plurality of network parameters and determines from that gradient an adjustment to their learned values)
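The four claim-14 steps as mapped above (embed, predict, score, back-propagate) can be illustrated with an assumed toy model: a linear encoder, a linear predictor, and squared error. The hand-derived chain-rule gradients below are for this toy model only and stand in for generic back-propagation; none of the names or shapes come from Yoon.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(16, 5))               # labeled training inputs
y = rng.normal(size=(16, 1))               # associated ground truth outputs

W_e = rng.normal(scale=0.1, size=(5, 8))   # encoder parameters (learned in pretraining)
W_f = rng.normal(scale=0.1, size=(8, 1))   # output sub-network parameters

Z = X @ W_e                                # embedding of each labeled training input
y_hat = Z @ W_f                            # training output from the output sub-network
loss = np.mean((y_hat - y) ** 2)           # supervised loss: output vs. ground truth

# Gradient of the loss w.r.t. BOTH encoder and output parameters (chain rule).
g = 2 * (y_hat - y) / len(y)
grad_W_f = Z.T @ g
grad_W_e = X.T @ (g @ W_f.T)

lr = 0.01                                  # small step: adjustment to the learned values
W_e -= lr * grad_W_e
W_f -= lr * grad_W_f
loss_after = np.mean(((X @ W_e) @ W_f - y) ** 2)
```

Because both grad_W_e and grad_W_f are taken from the same supervised loss, a single descent step adjusts the encoder's learned values and the output network's current values together.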
In regards to claim 15,
Yoon and Yao and Pushak teaches The method of claim 14,
Yoon teaches wherein the specific machine learning task comprises a classification task, and wherein the supervised learning loss function comprises a cross-entropy loss function.
(Yoon, Section 4.2, “The supervised loss Ls is given by Ls = E(x,y)~p_x,y [ls(y, fe(x))], where ls is the standard supervised loss function, e.g. mean squared error for regression or categorical cross-entropy for classification.”)
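The categorical cross-entropy named in the quoted passage has the standard form below; this generic sketch (one-hot labels against predicted class probabilities) is an illustration of the loss, not code from Yoon.

```python
import numpy as np

def categorical_cross_entropy(y_true, y_prob, eps=1e-12):
    """Mean cross-entropy between one-hot labels and predicted class probabilities."""
    return -np.mean(np.sum(y_true * np.log(y_prob + eps), axis=1))

# Two samples, three classes; only the probability of the true class contributes.
y_true = np.array([[1, 0, 0],
                   [0, 1, 0]])
y_prob = np.array([[0.7, 0.2, 0.1],
                   [0.1, 0.8, 0.1]])
loss = categorical_cross_entropy(y_true, y_prob)  # -(ln 0.7 + ln 0.8) / 2 ≈ 0.29
```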
In regards to claim 16,
Yoon and Yao and Pushak teaches The method of claim 13,
Yoon teaches further comprising providing the learned values of the plurality of encoder network parameters for use in performing the specific machine learning task.
(Yoon, Section 5.2, “In this subsection, we evaluate the methods on clinical data, using the UK and US prostate cancer datasets (from Prostate Cancer UK and SEER datasets, respectively). The features consist of patients’ clinical information (e.g. age, grade, stage, Gleason scores) - total 28 features. We predict 2 possible treatments of UK prostate cancer patients (1) Hormone therapy (whether the patients got hormone therapy), (2) Radical therapy (whether the patient got radical therapy) [performing the specific machine learning task]. Both tasks are binary classification. In the UK prostate cancer dataset, we only have around 10,000 labeled patients samples. The US prostate cancer dataset contains more than 200,000 unlabeled patients samples, twenty times bigger than the labeled UK dataset. We use 50% of the UK dataset (as the labeled data) and the entire US dataset (as the unlabeled data) for training, with the remainder of the UK data being used as the testing set. We also test three popular supervised learning models: Logistic Regression, a 2-layer Multi-layer Perceptron and XGBoost.
Table 1 shows that VIME [providing the learned values of the plurality of encoder network parameters for use] results in the best prediction performance, outperforming the benchmarks. More importantly, VIME is the only self- or semi-supervised learning framework that significantly outperforms supervised learning models. These results shed light on the unique advantage of using VIME in leveraging a large unlabeled tabular dataset (e.g. the US dataset) to strengthen a model’s predictive power. Here we also demonstrate that VIME can perform well even when there exists a distribution shift between the UK labeled data and the US unlabeled data (see the Supplementary Materials (Section 2) for further details).
[Table 1 of Yoon, media_image28.png]”)
Claim 19 is rejected on the same grounds under 35 U.S.C. 103 as claim 6 as they are substantially similar.
Conclusion
The prior art made of record and not relied upon is considered pertinent to applicant's disclosure.
Dang, Zhiyuan, et al. "Doubly contrastive deep clustering." arXiv preprint arXiv:2103.05484 (2021).
Any inquiry concerning this communication or earlier communications from the examiner should be directed to JASMINE THAI whose telephone number is (703)756-5904. The examiner can normally be reached M-F 8-4.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Michael Huntley can be reached at (303) 297-4307. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.
/J.T.T./Examiner, Art Unit 2129
/MICHAEL J HUNTLEY/Supervisory Patent Examiner, Art Unit 2129