Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Information Disclosure Statement
The information disclosure statements (IDS) submitted on 10/27/2021 and 04/20/2023 were determined to be in compliance with the provisions of 37 CFR 1.97 in the first Non-Final Office Action. No new IDS has been entered on the record. Accordingly, the information disclosure statements were considered by the examiner.
Status of Claims
The present application is being examined based on the claims filed 10/09/2025. The status of the claims is as follows:
Claims 1-5, 7-9, 11-15, 17-19 are pending;
Claims 1, 7, 11, 17 are amended;
Claims 6, 10, 16, 20 are cancelled.
Response to Amendment
This Office Action is in response to Applicant’s communication filed October 09, 2025 in response to the Office Action mailed April 09, 2025. The Applicant’s remarks and any amendments to the claims or specification have been considered, with the results that follow.
Response to Arguments
Regarding 35 U.S.C. 101
In Remarks page 9, Argument 1
(Examiner summarizes Applicant’s arguments) Applicant argues that the amended claims are patent-eligible and highlights the added limitations - (i) generating a second plurality of samples with distributional characteristics that correspond to the first dataset, (ii) interleaving first/second datasets, and (iii) determining a performance metric on the portion of output corresponding to the first dataset – then argues these provide a specific solution to an Internet-centric problem (protecting/securely sharing a dataset). Applicant analogizes to Weisner v. Google and DDR, asserting the claims recite a “specific way”, not a generic use of the Internet. Applicant cites the specification (e.g., ¶¶[0004], [0005], [0030], [0113]) to frame the technical problem of protecting validation data from capture, and asserts the amended claims implement that protection by distribution-matched synthetic data interleaved with validation data and portion-based metrics. Applicant contends this integrates the exception into a practical application akin to Weisner/DDR.
Examiner’s response to Argument 1:
Examiner disagrees. As amended, the claims still recite data analysis/generation/arrangement and metric computation performed on generic computing components with network I/O. Under Step 2A Prong One, the limitations are mathematical concepts/mental processes. Under Step 2A Prong Two, the additional elements (retrieving, transmitting/receiving, generic control circuitry) are insignificant extra-solution activity and do not improve computer functionality. Under Step 2B, the elements are well-understood, routine, and conventional (WURC) in ML evaluation. Applicant’s Weisner/DDR arguments are distinguishable because the claims do not change how the computer or network operates; they merely specify what data to generate/mix and what metric to compute. The §101 rejection is maintained.
Regarding 35 U.S.C. 103
In Remarks page 11, Argument 2
Applicant argues Ghanta does not teach “generating a second dataset having second distributional characteristics that correspond to the first.” The cited Ghanta passages discuss generating an error dataset from model output, which does not share the training data’s distribution and therefore cannot render obvious the claimed distribution-matched second dataset.
Examiner’s response to Argument 2:
Examiner disagrees. The prior art combination teaches:
Distribution-matched second dataset: synthetic generation that corresponds to source distribution: Walters teaches using synthetic data to recreate a larger-scale dataset from a smaller one (Walters, ¶[0002]); Walters also shows network transmission of models/datasets between components (Walters, ¶[0127]).
Interleaving/combination and traceability: Walters teaches ordered mixing/concatenation and aggregation over windows for combining data streams (¶[0089]); Williams further teaches sending testing data and known classifications to a performance analysis component, supporting label-based traceability of outputs (¶¶[0102]-[0103]).
Metric from output portion: Ghanta teaches comparing model predictions to true labels to calculate an error rate/score (¶[0081]); see also the validation labels and resulting error dataset (¶¶[0078]-[0079]).
Motivation per KSR: predictable use of known elements for privacy-preserving, traceable evaluation, with a reasonable expectation of success. The §103 rejection is maintained on the same combinations, updated to the amended claim text.
In Remarks pages 11-12, Argument 3
Applicant argues the Office Action allegedly equates Ghanta’s error dataset with both (a) the claimed second dataset (an input) and (b) the claimed performance metrics (an output/measure). Applicant argues the same Ghanta “error dataset” cannot be both the input second dataset and the performance metric determined from output; therefore Ghanta cannot render obvious both limitations and the §103 rejection should be withdrawn.
Examiner’s response to Argument 3:
Examiner disagrees. The rejection does not require Ghanta’s “error dataset” to satisfy two different claim roles. In the maintained combination:
The claimed second dataset with distributional characteristics corresponding to the first is taught by Walters (synthetic data used to “create or recreate a realistic, larger-scale dataset from a smaller … dataset,” i.e., distribution-faithful generation). (Walters, ¶[0002])
Interleaving/combination traceability to recover the output portion is taught by Williams (ordered combination/concatenation and delivering known classification alongside test data to a Model Performance Analysis component for evaluation). (Williams, ¶[0089], ¶¶[0102]-[0103]).
The metric computed from the portion of output is taught by Ghanta (compare predictions to labels on validation data to compute accuracy/error, producing an error dataset from validation). (Ghanta ¶¶[0078]-[0079], [0106]-[0108]).
Thus, there is no conflation: Walters supplies the second dataset; Williams supplies ordered mixing/labels for portion identification; and Ghanta supplies subset-based performance metrics. The motivation to combine remains as stated (ML pipelines using synthetic augmentation and labeled evaluation with predictable results).
Regarding Objections and Informalities
Objection to abstract is withdrawn (correction entered).
No drawing objections appear in this record after amendment.
Claim Rejections - 35 USC § 112
The following is a quotation of 35 U.S.C. 112(b):
(b) CONCLUSION.—The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the inventor or a joint inventor regards as the invention.
The following is a quotation of 35 U.S.C. 112 (pre-AIA), second paragraph:
The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the applicant regards as his invention.
Claims 7 and 17 are rejected under 35 U.S.C. 112(b) or 35 U.S.C. 112 (pre-AIA), second paragraph, as being indefinite for failing to particularly point out and distinctly claim the subject matter which the inventor or a joint inventor (or, for applications subject to pre-AIA 35 U.S.C. 112, the applicant) regards as the invention.
In claims 7 and 17, the limitation recites “assigning a first source identifier to each of the first plurality of samples; and assigning a second source identifier to each of the first plurality of samples …” As written, both identifiers are applied to the first plurality of samples, rendering the step unclear and inoperative for distinguishing outputs by source. Under BRI, this internal inconsistency leaves the metes and bounds of the claims not reasonably certain. See MPEP § 2173.02 (claim scope must be clearly delineated).
Claim Rejections - 35 USC § 101
Claims 1-5, 7-9, 11-15, and 17-19 are rejected under 35 U.S.C. 101 as being directed to a judicial exception (i.e., an abstract idea) without significantly more.
35 U.S.C. 101 reads as follows:
Whoever invents or discovers any new and useful process, machine, manufacture, or composition of matter, or any new and useful improvement thereof, may obtain a patent therefor, subject to the conditions and requirements of this title.
The claims fall into two statutory categories:
Claims 1-5, 7-9: method claims
Claims 11-15, 17-19: system claims with functionally similar limitations
Detailed Analysis of 35 U.S.C. 101:
Claim 1 – Step 2A Prong One – Abstract Idea Identification:
Claim 1 recites limitations related to analyzing data distributions, generating synthetic data, dataset combination, and model output segmentation, which fall within the judicial exception of mathematical concepts and mental processes identified in MPEP 2106.04(a)(2).
Abstract idea limitations:
identifying first distributional characteristics of the first plurality of samples of the first dataset;
generating, based on the first distributional characteristics, a second plurality of samples for a second dataset, wherein the second plurality of samples comprise second distributional characteristics that correspond to the first distributional characteristics;
generating a combined dataset based on interleaving samples from the first dataset with samples from the second dataset;
identifying a portion of the output corresponding to the first dataset;
determining a performance metric of the trained machine learning model based on the portion of the output corresponding to the first plurality of samples.
These limitations are directed to data manipulation and model evaluation, which are abstract ideas implemented via generic computing operations.
Therefore, Claim 1 is directed to an abstract idea under Step 2A Prong One.
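For illustration only (this sketch is not part of the record), the data flow recited in the limitations above can be rendered in Python; all names, and the use of seeded shuffling as the interleaving scheme, are hypothetical rather than drawn from the claims:

```python
import random

def evaluate_with_decoys(first_ds, second_ds, model, seed=0):
    """Hypothetical sketch: mix real validation samples with
    distribution-matched synthetic samples, query the model on the
    combined set, then score only the portion from the first dataset."""
    rng = random.Random(seed)
    combined = ([(x, y, "first") for x, y in first_ds]       # labeled real data
                + [(x, None, "second") for x in second_ds])  # synthetic decoys
    rng.shuffle(combined)  # one possible interleaving of the two sources
    results = [(model(x), y, src) for x, y, src in combined]
    # identify the portion of the output corresponding to the first dataset
    scored = [(pred, y) for pred, y, src in results if src == "first"]
    # performance metric (accuracy) computed over that portion only
    return sum(1 for pred, y in scored if pred == y) / len(scored)
```

In this sketch the remote model sees only the combined dataset, while the metric is computed locally on the recoverable first-dataset portion.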
Claim 1 – Step 2A Prong 2 and Step 2B Combined Analysis:
Additional Elements beyond the abstract idea:
retrieving a first dataset that comprises a first plurality of samples;
transmitting, over a network, the combined dataset as an input to a trained machine learning model;
receiving over the network output from the trained machine learning model that was generated based on the combined dataset input;
These are generic data gathering and output steps performed using conventional computing infrastructure (see MPEP § 2106.05(g)). They do not reflect an improvement to the functioning of a computer or another technology, nor do they apply the abstract idea in a meaningful way beyond merely implementing it. In addition, these limitations are considered extra-solution activities or insignificant post-solution activity per MPEP 2106.05(g) and (f).
Accordingly, the additional elements do not integrate the judicial exception into a practical application under Step 2A, Prong Two, nor do they provide an inventive concept under Step 2B. The claim is directed to an abstract idea and fails to recite significantly more.
Conclusion: Claim 1 is ineligible under 35 USC 101.
Claim 2 – Step 2A Prong One – Abstract Idea Identification:
Claim 2 depends on claim 1 and adds limitations related to evaluating whether a dataset includes personal identifiable information (PII), generating pseudo-random PII, and assigning it to another dataset. These steps involve mental processes and mathematical operations, which fall under the judicial exceptions identified in MPEP 2106.04(a)(2).
Abstract idea limitations:
Determining whether the dataset includes personal identifiable information (PII);
Pseudo-randomly generating PII;
Assigning the generated PII to the second dataset;
These limitations involve analysis, decision-making, and substitution operations and are mental steps and mathematical operations – they represent data anonymization and preprocessing logic, which are abstract ideas under MPEP 2106.04(a)(2).
Therefore, Claim 2 is directed to an abstract idea under Step 2A, Prong One.
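A toy sketch of the recited PII handling, for illustration only; the field names and the deliberately naive email-shaped PII check are hypothetical, not taken from the claims:

```python
import random
import re

# naive illustrative pattern: any email-shaped string counts as PII
EMAIL_RE = re.compile(r"[^@\s]+@[^@\s]+\.[A-Za-z]+")

def contains_pii(record):
    """Determine whether a record contains (email-shaped) PII."""
    return any(isinstance(v, str) and EMAIL_RE.search(v)
               for v in record.values())

def assign_pseudo_random_pii(records, seed=0):
    """Assign pseudo-randomly generated PII to each record."""
    rng = random.Random(seed)
    out = []
    for rec in records:
        rec = dict(rec)
        user = "".join(rng.choice("abcdefghijklmnopqrstuvwxyz")
                       for _ in range(8))
        rec["email"] = f"{user}@example.com"
        out.append(rec)
    return out
```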
Claim 2 – Step 2A Prong 2 and Step 2B Combined Analysis:
Additional Elements Beyond the Abstract Ideas:
None beyond those already identified as abstract.
The recited steps do not integrate the abstract idea into a practical application, nor do they represent a technological improvement. They are extra-solution activities related to data preparation and information masking and do not improve computer functionality.
Therefore, Claim 2 fails Step 2A Prong 2 and also lacks an inventive concept under Step 2B.
Conclusion: Claim 2 is ineligible under 35 USC 101.
Claim 3 – Step 2A Prong One – Abstract Idea Identification:
Claim 3 depends on claim 1 and adds limitations related to using a neural network to determine distributional characteristics of the first dataset. These steps involve training a neural network and analyzing assigned weights, which are mathematical concepts under MPEP 2106.04(a)(2).
Abstract idea limitations:
Retrieving a neural network comprising a plurality of nodes;
Training the neural network using at least a subset of the first dataset by assigning weights;
Determining the first distributional characteristics based on the assigned weights.
These limitations involve mathematical operations on data, such as matrix operations, weight assignments, and model parameter interpretation – all abstract mathematical concepts and mental processes.
Therefore, Claim 3 is directed to an abstract idea under Step 2A, Prong One.
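As a loose illustration of reading a distributional characteristic from trained weights, consider a degenerate one-node “network” (a hypothetical toy, not any network in the record): gradient descent on squared error drives its single weight toward the sample mean, so a distributional characteristic can be read off the trained parameter.

```python
def fit_single_node(samples, lr=0.1, epochs=200):
    """Degenerate one-node 'network': a lone trainable weight w, trained
    by gradient descent on squared error against each sample. The trained
    weight settles near the sample mean, a distributional characteristic
    of the dataset."""
    w = 0.0
    for _ in range(epochs):
        for x in samples:
            grad = 2 * (w - x)          # d/dw (w - x)^2
            w -= lr * grad / len(samples)
    return w
```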
Claim 3 – Step 2A Prong 2 and Step 2B Combined Analysis:
Additional Elements Beyond the Abstract Ideas:
None beyond those already identified as abstract.
These elements relate to model training and statistical analysis, which are routine ML steps executed on a generic computing system. There is no improvement to the structure or functioning of the neural network or the computer system. These are considered insignificant post-solution activity and do not provide a practical application or inventive concept.
Therefore, Claim 3 fails Step 2A Prong 2 and also lacks an inventive concept under Step 2B.
Conclusion: Claim 3 is ineligible under 101.
Claim 4 – Step 2A Prong One – Abstract Idea Identification:
Claim 4 depends on claim 1 and adds a limitation regarding dataset size, specifying that the first dataset has fewer samples than the second dataset. This involves quantitative comparison of dataset sizes, which is a mathematical concept under MPEP 2106.04(a)(2).
Abstract idea limitations:
First number of samples in the first dataset is smaller than a second number of samples in the second dataset.
This is a data measurement and comparison operation, which is a mathematical relationship or mental process.
Therefore, Claim 4 is directed to an abstract idea under Step 2A Prong One.
Claim 4 – Step 2A Prong 2 and Step 2B Combined Analysis:
Additional Elements Beyond the Abstract Ideas:
None.
Quantifying or comparing the size of datasets does not improve the operation of a machine or ML model. It is a generic statistical condition that fails to integrate the abstract idea into a practical application. There is also no inventive concept, as such comparisons are commonplace in data preparation workflows.
Conclusion: Claim 4 is ineligible under 101.
Claim 5 – Step 2A Prong One – Abstract Idea Identification:
Claim 5 depends from claim 4 and adds a limitation that specifies a quantitative ratio: the second dataset contains one hundred times more samples than the first. This is a refined quantitative comparison of dataset sizes, which is a mathematical concept under MPEP 2106.04(a)(2).
Abstract idea limitations:
The second number is one hundred times larger than the first number.
Therefore, Claim 5 is directed to an abstract idea under Step 2A Prong One.
Claim 5 – Step 2A Prong 2 and Step 2B Combined Analysis:
Additional Elements Beyond the Abstract Ideas:
None.
Specifying a numerical threshold (e.g., 100x) is a quantitative condition that may guide implementation but does not amount to significantly more than the abstract idea nor does it improve a computer or ML model’s functionality, and thus fails to provide an inventive concept.
Conclusion: Claim 5 is ineligible under 101.
Claim 7 – Step 2A Prong One – Abstract Idea Identification:
Claim 7, as amended, depends from claim 1 and adds that source identifiers are assigned to distinguish data samples. This is a form of metadata labeling, which is a mental process and data classification under MPEP 2106.04(a)(2).
Abstract idea limitations:
Assigning two source identifiers to the same samples;
Identifying corresponding portions of output using those identifiers.
This labeling is informational and does not involve any technical transformation of the data or machine.
Therefore, claim 7 is directed to an abstract idea under Step 2A Prong One.
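The identifier-based traceability discussed here can be sketched as follows, for illustration only; the identifier strings are hypothetical, and the sketch assigns distinct identifiers to the two datasets (the apparent intent of the claims, per the 112(b) analysis above):

```python
def first_portion_outputs(first, second, model):
    """Hypothetical sketch: tag each sample with a source identifier so
    the portion of the model output attributable to the first dataset
    can be identified afterward."""
    tagged = ([("SRC1", x) for x in first]      # first source identifier
              + [("SRC2", x) for x in second])  # second source identifier
    outputs = [(sid, model(x)) for sid, x in tagged]
    return [out for sid, out in outputs if sid == "SRC1"]
```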
Claim 7 – Step 2A Prong 2 and Step 2B Combined Analysis:
Additional Elements Beyond the Abstract Ideas:
None.
Source tagging or identifier assignment is a routine and conventional data annotation operation. The identifiers are used as labels and do not modify system behavior or model architecture. This is extra-solution activity performed using conventional computing.
Conclusion: Claim 7 is ineligible under 101.
Claim 8 – Step 2A Prong One – Abstract Idea Identification:
Claim 8 depends from claim 1 and adds steps for:
Modifying part of the first dataset,
Assigning a known output,
Submitting it to a trained model,
And detecting cheating if the model’s output matches the known result.
These limitations relate to testing model behavior using synthetically manipulated data and analyzing the model’s response, which represent mental processes and mathematical/logical evaluations under MPEP 2106.04(a)(2).
Abstract idea limitations:
Modifying a subset of the first dataset;
Associating modified input with a predetermined output;
Receiving model output for the modified input;
Detecting cheating based on whether output matches expected value.
These steps can be carried out mentally or via a rule-based logic and do not involve a particular improvement in ML model structure or computer functionality.
Therefore, claim 8 is directed to an abstract idea under Step 2A Prong One.
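The rule-based output check described in these limitations can be sketched as follows; the term “canaries,” the names, and the threshold are hypothetical illustrations, not claim language:

```python
def detect_cheating(model, canaries, threshold=0.5):
    """Hedged sketch: 'canaries' are modified samples paired with
    predetermined (deliberately planted) outputs. If the model
    reproduces the planted outputs too often, it likely observed the
    modified data rather than computing results independently."""
    hits = sum(1 for x, planted in canaries if model(x) == planted)
    return hits / len(canaries) > threshold
```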
Claim 8 – Step 2A Prong 2 and Step 2B Combined Analysis:
Additional Elements Beyond the Abstract Ideas:
None.
Although “cheating detection” may appear novel at first glance, the claim does not tie this detection to a specific technical use or improved functioning of the machine learning model. The logic amounts to a rule-based output check – i.e., comparing known vs. predicted results – which is a mental process and does not integrate the abstract idea into a practical application. Furthermore, the claim lacks any non-conventional or technological elements that amount to significantly more under Step 2B.
Conclusion: Claim 8 is ineligible under 101.
Claim 9 – Step 2A Prong One – Abstract Idea Identification:
Claim 9 depends from claim 1 and recites that each sample of the first dataset comprises a plurality of attributes. This relates to the structure of the dataset and is a data characterization step, which falls under mathematical concepts per MPEP 2106.04(a)(2).
Abstract idea limitations:
Dataset entries (samples) include multiple attributes.
This is a generic feature of data representation in machine learning and does not limit the claim in a meaningful way. It reflects informational content, not a technological improvement.
Therefore, claim 9 is directed to an abstract idea under Step 2A Prong One.
Claim 9 – Step 2A Prong 2 and Step 2B Combined Analysis:
Additional Elements Beyond the Abstract Ideas:
None.
Merely stating that each data point includes multiple features is routine in ML datasets and is not tied to any inventive or non-conventional implementation. This does not integrate the abstract idea into a practical application, nor does it add an inventive concept.
Conclusion: Claim 9 is ineligible under 101.
Claim 11 – Step 2A Prong One – Abstract Idea Identification:
Claim 11 recites limitations related to analyzing data distributions, generating synthetic data, dataset combination, and model output segmentation, which fall within the judicial exception of mathematical concepts and mental processes under MPEP 2106.04(a)(2).
Specifically, the following limitations are identified as abstract ideas:
identifying first distributional characteristics of a first dataset;
generating, based on the first distributional characteristics, a second plurality of samples for a second dataset, wherein the second plurality of samples comprise second distributional characteristics that correspond to the first distributional characteristics;
generating a combined dataset based on interleaving samples from the first dataset with samples from the second dataset;
identifying a portion of the output corresponding to the first dataset;
determining a performance metric of the trained machine learning model based on the portion of the output corresponding to the first dataset.
These limitations mirror the data manipulation and processing limitations of Claim 1, except that they are implemented by generic “communication circuitry,” “storage circuitry,” and “control circuitry.” The addition of control circuitry does not remove the claim from being directed to an abstract idea, as it is simply a generic computer component performing abstract data processing steps.
Therefore, Claim 11 is directed to an abstract idea under Step 2A, Prong One.
Claim 11 – Step 2A Prong 2 and Step 2B Combined Analysis:
Additional Elements beyond the abstract idea:
This claim is a system version of claim 1 and recites the same abstract ideas as those presented for claim 1, above (e.g., transmitting and receiving data constitute mere data gathering under MPEP § 2106.05(g); see the rejection of claim 1).
Storage circuitry
Control circuitry
Communications circuitry
The only additional elements beyond those presented in claim 1 are the “communication circuitry,” “storage circuitry,” and “control circuitry.” These additional elements amount to no more than generally linking the use of the judicial exception to a particular technological environment or field of use (see MPEP § 2106.05(h)). Mere instructions to apply an exception using a generic computer component cannot provide an inventive concept.
Thus, the judicial exception is not integrated into a practical application (see MPEP 2106.04(d) I.), failing step 2A Prong 2. The claim does not include additional elements that are sufficient to amount to significantly more than the judicial exception under Step 2B.
Conclusion: Claim 11 is directed to an abstract idea and fails to recite significantly more. Therefore, Claim 11 is ineligible under 101.
Claim 12 – Step 2A Prong One – Abstract Idea Identification:
Claim 12 is a system version of claim 2, and includes the same abstract limitations presented for method claim 2, implemented by generic control circuitry.
Abstract idea limitations:
Determining whether PII is present in a dataset;
Pseudo-randomly generating and assigning PII to another dataset.
These mirror the same abstract ideas as Claim 2.
Claim 12 – Step 2A Prong 2 and Step 2B Combined Analysis:
Additional Elements Beyond the Abstract Ideas:
Performed by control circuitry.
Use of generic hardware does not meaningfully limit the abstract idea. No technological improvement is recited.
Conclusion: Claim 12 is ineligible under 101.
Claim 13 – Step 2A Prong One – Abstract Idea Identification:
Claim 13 is a system version of claim 3 and includes the same abstract limitations presented, now implemented using control circuitry.
Abstract idea limitations:
Retrieving and training a neural network;
Assigning weights and using them to determine distributional characteristics.
These mirror the same abstract ideas as Claim 3 and are implemented using generic system hardware.
Claim 13 – Step 2A Prong 2 and Step 2B Combined Analysis:
Additional Elements Beyond the Abstract Ideas:
Performed by control circuitry.
This is a generic ML implementation using generic control circuitry components to execute model training and parameter analysis; it does not meaningfully limit the abstract idea or improve system functionality, nor does it amount to significantly more.
Conclusion: Claim 13 is ineligible under 101.
Claim 14 – Step 2A Prong One – Abstract Idea Identification:
Claim 14 is a system version of claim 4 and includes the same abstract limitation presented, now implemented using control circuitry.
Abstract idea limitations:
A first number of samples is smaller than a second number of samples.
Mirrors the mathematical relationship in Claim 4 and remains abstract.
Claim 14 – Step 2A Prong 2 and Step 2B Combined Analysis:
Additional Elements Beyond the Abstract Ideas:
Performed by control circuitry.
Using generic control circuitry to perform a dataset size comparison does not integrate the abstract idea into a practical application. Nor does it reflect a technological improvement or inventive concept.
Conclusion: Claim 14 is ineligible under 101.
Claim 15 – Step 2A Prong One – Abstract Idea Identification:
Claim 15 is a system version of claim 5 and includes the same abstract limitation presented, now implemented using control circuitry.
Abstract idea limitations:
Second dataset is 100x larger than the first.
This is a quantitative condition and is the same as Claim 5.
Claim 15 – Step 2A Prong 2 and Step 2B Combined Analysis:
Additional Elements Beyond the Abstract Ideas:
Performed by control circuitry.
As in claim 5, this is purely mathematical and not integrated into a practical application. Using generic control circuitry does not transform the claim into patent eligible subject matter.
Conclusion: Claim 15 is ineligible under 101.
Claim 17 – Step 2A Prong One – Abstract Idea Identification:
Claim 17 is a system version of claim 7 and includes the same abstract limitation presented, now implemented using control circuitry.
Abstract idea limitations:
Assigning source labels (identifiers) to samples and using them to track output.
Same as Claim 7 – abstract data tagging.
Claim 17 – Step 2A Prong 2 and Step 2B Combined Analysis:
Additional Elements Beyond the Abstract Ideas:
Performed by control circuitry.
The use of generic control circuitry to assign metadata does not improve the system or integrate the abstract idea into a practical application. There is no inventive concept added by the implementation details.
Conclusion: Claim 17 is ineligible under 101.
Claim 18 – Step 2A Prong One – Abstract Idea Identification:
Claim 18 is a system version of claim 8 and includes the same abstract limitation presented, now implemented using control circuitry.
Abstract idea limitations:
Same limitations from claim 8
Recites the same abstract logic from claim 8 and mirrors the same mental/logical processing described above.
Claim 18 – Step 2A Prong 2 and Step 2B Combined Analysis
Additional Elements Beyond the Abstract Ideas:
Performed by control circuitry.
Implementing the same abstract logic via generic control circuitry does not transform the claim into patent-eligible subject matter nor is there an improvement to hardware or the model.
Conclusion: Claim 18 is ineligible under 101.
Claim 19 – Step 2A Prong One – Abstract Idea Identification:
Claim 19 is a system version of claim 9 and includes the same abstract limitation presented, now implemented using control circuitry.
Abstract idea limitations:
Samples have multiple attributes.
Remains an abstract data description, not a functional limitation.
Claim 19 – Step 2A Prong 2 and Step 2B Combined Analysis
Additional Elements Beyond the Abstract Ideas:
Performed by control circuitry.
Using control circuitry to operate on multi-attribute data does not meaningfully limit the claim or integrate the abstract idea into a practical application. There is no technological improvement recited.
Conclusion: Claim 19 is ineligible under 101.
Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.
Claims 1-5, 7-9, 11-15, and 17-19 are rejected under 35 U.S.C. § 103 as being unpatentable over Ghanta (US20200034665A1) in view of Walters (US20200012902A1) and in further view of Williams (US20150254555A1).
Regarding claim 1, Ghanta in view of Walters and in further view of Williams teach a method for protecting a dataset, the method comprising:
retrieving a first dataset that comprises a first plurality of samples; - Ghanta teaches this limitation in part. Ghanta teaches validation on a labeled “first dataset” and producing an error dataset:
“The validation data set … includes labels … predictions … compared against the labels … to determine the accuracy of the predictions.” (Ghanta, p. 8, ¶[0078])
“The output of the validation of the first machine learning algorithm / model may include an error data set.” (Ghanta, p. 11, ¶[0104])
“In further embodiments, the primary validation module 304 validates 504 the first machine learning algorithm /model using a validation data set 505a . The output … may include an error data set 505b.” (Ghanta, p. 12, ¶[0106])
Ghanta further describes labeled validation data, comparing predictions to labels, and accuracy/metrics:
“The validation data set … includes labels … predictions … compared against the labels … to determine the accuracy of the predictions.” (Ghanta, p. 8, ¶[0078])
“The primary validation module … compare[s] the predictions … to the true label … to calculate the error rate, score, weight, or other value.” (Ghanta, p. 9, ¶[0081])
Additionally, Ghanta discloses that the “error data set” contains metrics/labels:
“the resulting output … comprises an error data set.” (Ghanta, p. 8, ¶[0079])
Ghanta does not teach:
identifying first distributional characteristics of the first plurality of samples of the first dataset;
generating, based on the first distributional characteristics, a second plurality of samples for a second dataset, wherein the second plurality of samples comprise second distributional characteristics that correspond to the first distributional characteristics;
generating a combined dataset based on interleaving samples from the first dataset with samples from the second dataset;
transmitting, over a network, the combined dataset as an input to a trained machine learning model;
receiving over the network output from the trained machine learning model that was generated based on the combined dataset input;
Walters, however, teaches these limitations:
identifying first distributional characteristics of the first plurality of samples of the first dataset; - Walters teaches this limitation. Walters discloses operations that may include:
“determining respective distribution measures of the data segments.” – (Walters, p. 1, ¶[0007]; FIG. 4, step 408)
generating, based on the first distributional characteristics, a second plurality of samples for a second dataset, wherein the second plurality of samples comprise second distributional characteristics that correspond to the first distributional characteristics; - Walters teaches this limitation. Walters describes distribution-matched synthetic data:
“Synthetic data may be used … to create or recreate a realistic, larger-scale data set from a smaller, compressed dataset” (Walters, p. 1, § Background)
“Synthetic data may be needed … where … confidentiality is required.” (Walters, p. 1, § Background)
generating a combined dataset based on interleaving samples from the first dataset with samples from the second dataset; - Walters teaches this limitation. Walters teaches combining segments to “generate a synthetic dataset”, including append/prepend operations and combining into a synthetic set:
“Segmenter 338 may be configured[:] to generate a synthetic dataset by combining one or more synthetic data segments … to append and/or prepend synthetic data-segments to generate a synthetic dataset … to combine data segments into a multidimensional synthetic dataset.” (Walters, p. 8, ¶[0080])
“Generate, via the Distribution Model, a Series of Synthetic Data-Segments Based on the Series of Synthetic Segment Parameters” (Walters, FIG. 5, step 504)
KSR note: Substituting interleaving (round-robin ordering) for append/prepend/concatenate is an art-recognized equivalent used for the same purpose – forming a single mixed dataset from two sources. See MPEP § 2144.06(II) (substituting equivalents known for the same purpose).
Additionally or in the alternative, selecting interleaving from the finite set of predictable ordering schemes (concatenate, batch, interleave) would have been obvious to try with a reasonable expectation of success. See MPEP § 2144.07; MPEP § 2143 (KSR predictable results).
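For illustration only (not part of the record or of any cited reference), the equivalence of the ordering schemes discussed above can be sketched in Python; the function and dataset names are hypothetical:

```python
from itertools import chain, zip_longest

def concatenate(first, second):
    """Append/prepend-style combination of two sample lists."""
    return list(first) + list(second)

def interleave(first, second):
    """Round-robin ordering: alternate samples from each source,
    draining the longer source once the shorter is exhausted."""
    _SENTINEL = object()
    merged = chain.from_iterable(zip_longest(first, second, fillvalue=_SENTINEL))
    return [s for s in merged if s is not _SENTINEL]

# Both schemes serve the same purpose: one mixed dataset containing
# every sample from both sources; only the ordering differs.
real = ["r1", "r2", "r3"]
synthetic = ["s1", "s2", "s3", "s4"]
combined = interleave(real, synthetic)
# combined == ["r1", "s1", "r2", "s2", "r3", "s3", "s4"]
```

Either function yields a single combined dataset with identical membership, differing only in sample order.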
transmitting, over a network, the combined dataset as an input to a trained machine learning model; - Walters teaches this limitation. Walters discloses network transmission:
“Synthetic-data … may … receive data …, retrieve data …, and/or transmit data to other components of system 100 and/or computing components outside system 100 (e.g., via network 112). Synthetic-data system 102 is disclosed in greater detail below (in reference to FIG. 3).” (Walters, p. 2, ¶[0029])
“Providing a model or dataset may include transmitting the model or dataset to a component of system 100 … (e.g., via network 112)” (Walters, p. 12, ¶ [0128])
receiving over the network output from the trained machine learning model that was generated based on the combined dataset input; - Walters teaches this limitation. Walters discloses network transmission:
“Synthetic-data … may … receive data …, retrieve data …, and/or transmit data to other components of system 100 and/or computing components outside system 100 (e.g., via network 112). Synthetic-data system 102 is disclosed in greater detail below (in reference to FIG. 3).” (Walters, p. 2, ¶[0029])
“Providing a model or dataset may include transmitting the model or dataset to a component of system 100 … (e.g., via network 112)” (Walters, p. 12, ¶ [0128])
Walters does not teach these limitations:
identifying a portion of the output corresponding to the first dataset;
determining a performance metric of the trained machine learning model based on the portion of the output corresponding to the first plurality of samples.
Williams, however, teaches these limitations:
identifying a portion of the output corresponding to the first dataset; - Williams teaches known classifications carried with a test set and traceability of output to the known subset:
“Testing scoring delivers testing data to Scoring Process 522 and delivers those results and the known classifications to the Model Performance Analysis component 524, which is used to calculate and evaluate performance metrics.” (Williams, p. 9, ¶[0102])
determining a performance metric of the trained machine learning model based on the portion of the output corresponding to the first plurality of samples. – Williams explicitly teaches performance-metric computation on the labeled test subset:
“Testing scoring … delivers those results and the known classifications to the Model Performance Analysis component 524, which is used to calculate and evaluate performance metrics.” (Williams, p. 9, ¶[0102])
“Model Performance Analysis 524 consists of an Evaluation Component 526 and a Visualization Component 528. The Model Performance Evaluation component 524 calculates the metrics necessary for a human to evaluate the systems performance … One common method … is to analyze confusion matrices…” (Williams, p. 9, ¶[0103])
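For illustration only (hypothetical names; not drawn from any cited reference), the portion-based metric computation mapped above can be sketched in Python: source identifiers tag each output position, and the metric is computed only over positions attributable to the first dataset:

```python
def subset_accuracy(outputs, labels, sources, target_source):
    """Compute accuracy only over the samples whose source identifier
    matches target_source (e.g., the first/validation dataset within
    a mixed model output)."""
    pairs = [(o, y) for o, y, s in zip(outputs, labels, sources)
             if s == target_source]
    if not pairs:
        raise ValueError("no samples from the requested source")
    correct = sum(1 for o, y in pairs if o == y)
    return correct / len(pairs)

# Mixed output from a combined dataset: "first" marks real/validation
# samples, "second" marks synthetic samples.
outputs = ["cat", "dog", "dog", "cat"]
labels  = ["cat", "cat", "dog", "dog"]
sources = ["first", "second", "first", "second"]
subset_accuracy(outputs, labels, sources, "first")  # 1.0: both "first" samples correct
```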
A POSITA would have combined Ghanta (validation on labeled data; subset metrics) with Walters (distribution-matched synthetic generation; combining segments; network I/O) and Williams (labels/Model Performance Analysis) because: (i) incorporating Walters's synthetic, distribution-faithful data into Ghanta's validation predictably scales coverage while preserving confidentiality (KSR; MPEP § 2143); (ii) Williams's label/traceability teaching lets the practitioner identify the output portion attributable to the first dataset and compute metrics on that subset, which is common sense for mixed inputs (MPEP § 2144); (iii) because Walters expressly teaches combining segments, using interleaving (round-robin) instead of append/concatenate is an art-recognized equivalent for the same purpose (forming one mixed dataset), or, alternatively, a selection from a finite set of predictable ordering schemes (MPEP § 2144.06(II); § 2144.07); and (iv) Walters's network transmit/receive are familiar elements used according to their established function (KSR; MPEP § 2143). The references are analogous, do not teach away, and the combination yields predictable results with a reasonable expectation of success.
Regarding claim 9, Ghanta in view of Walters and in further view of Williams teach the method of claim 1, wherein
the first dataset comprises a plurality of samples, - This limitation is already required by claim 1 (it recites “a first dataset that comprises a first plurality of samples”); therefore, this clause imposes no additional limitation beyond the parent claim. (See rejection of claim 1).
and wherein each sample of the plurality of samples is associated with a plurality of attributes. - Ghanta teaches this limitation. Ghanta explicitly teaches that each data point (sample) is associated with multiple attributes:
“the training data set may include various data points for dogs such as weight , height , gender , breed , etc.” (Ghanta, p. 8, ¶[0077])
Given the adoption of claim 1’s workflow (Ghanta + Walters + Williams), using multi-attribute samples is the known, conventional form of the same data and would have been chosen by a POSITA to support training/evaluation and the subset-metric analysis (Williams/Ghanta), with a reasonable expectation of success. See KSR; MPEP § 2143 (predictable results) and § 2144.04 (design choice).
Regarding claims 11 and 19 (system claims)
Each of claims 11 and 19 is the system analog of the correspondingly numbered method claims 1 and 9, respectively, and is rejected for the same reasons discussed above for those method claims. The recited communications/storage/control circuitry “configured to” perform the claimed operations reads, under BRI, on conventional computing executing the same functional steps: retrieving the first dataset; generating a second plurality with distributional characteristics corresponding to the first (Walters); generating a combined dataset (Walters; interleaving as an art-recognized equivalent ordering; Williams for ordered mixing/traceability); transmitting/receiving over a network (Walters); identifying the output portion corresponding to the first dataset (Williams); and determining a performance metric from that portion (Ghanta). For claim 19 (system analog of claim 9), the additional “plurality of attributes per sample” is taught by Ghanta (multi-field labeled records). The parent limitations are met for the same reasons set forth for claim 11. Accordingly, claims 11 and 19 are rejected under 35 U.S.C. § 103 as being obvious over Ghanta in view of Walters and in further view of Williams.
Regarding claim 2, Ghanta in view of Walters and in further view of Williams teach the method of claim 1, further comprising:
determining whether the first dataset comprises personal identifiable information; – Ghanta does not teach this limitation. Walters, however, teaches this limitation and discloses using synthetic data when confidentiality is required, i.e., when the real dataset includes sensitive identifiers, thus motivating a POSITA to inspect (determine whether) the dataset contains such information as part of building the data profile:
“Synthetic data … may be needed where … confidentiality is required.” (Walters, p. 1, § Background)
[media_image1.png (greyscale): Walters, FIG. 4, steps 402-404]
in response to determining that the first dataset comprises personal identifiable information: pseudo randomly generating a set of personal identifiable information; - Ghanta does not teach this limitation. Walters, however, teaches this. Walters expressly generates synthetic data from seeded random parameters, i.e., pseudo-random generation as part of its standard flow:
“synthetic segment-parameters … may generate a sequence of synthetic segment parameters based on a segment-parameter seed” (Walters, p. 9, ¶[0095])
“Generating a synthetic dataset may be based on a … random seed…” (Walters, p. 11, ¶[0116]; FIG. 8, step 808; FIG. 9, step 912)
assigning the set of pseudo randomly generated personal identifiable information to the second dataset. – Ghanta does not teach this limitation. Walters, however, teaches this and discloses generating a synthetic dataset by combining generated segments, i.e., populating the second dataset with the synthetic (seed-driven) values:
“Generate a synthetic dataset by combining synthetic data-segments … appending and/or prepending … combine data segments into a multidimensional synthetic dataset.” (Walters, p. 10, ¶[0097]; FIG. 5, step 504.)
A POSITA would have combined Ghanta’s validation/evaluation workflow with Walters’s confidentiality-motivated synthetic data generation to address the known risk of exposing sensitive identifiers during model evaluation. Walters expressly teaches using synthetic data when confidentiality is required and generating such data from seeded (pseudo-random) parameters; thus, a practitioner would determine whether the first dataset comprises personal identifiable information to decide whether to trigger Walters’s privacy workflow, and, upon finding PII, pseudo-randomly generate corresponding identifier values and assign them to the second (synthetic) dataset. This is applying a known technique to a known problem in the same field (model evaluation), yielding predictable results (privacy-preserving evaluation without degrading distributional fidelity), with a reasonable expectation of success. See KSR; MPEP § 2143 (predictable results) and § 2144 (common sense implementation).
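For illustration only (hypothetical names and record fields; not drawn from any cited reference), seed-driven pseudo-random generation of synthetic identifiers, as mapped above, can be sketched in Python:

```python
import random
import string

def generate_synthetic_pii(n, seed):
    """Pseudo-randomly generate n synthetic identifier records from a seed,
    so the synthetic (second) dataset carries no real identifiers but is
    reproducible from the seed (seed-driven generation)."""
    rng = random.Random(seed)  # seeded pseudo-random source
    records = []
    for _ in range(n):
        name = "".join(rng.choices(string.ascii_uppercase, k=6))
        ssn = "{:03d}-{:02d}-{:04d}".format(
            rng.randrange(1000), rng.randrange(100), rng.randrange(10000))
        records.append({"name": name, "ssn": ssn})
    return records

# Assign the pseudo-randomly generated identifiers to the second dataset.
second_dataset = generate_synthetic_pii(3, seed=42)
```

The same seed reproduces the same synthetic records, while no value originates from the real (first) dataset.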
Regarding claim 4, Ghanta in view of Walters and in further view of Williams teach the method of claim 1, wherein
a first number of samples in the first dataset is smaller than a second number of samples in the second dataset. – Ghanta does not teach this limitation. Walters, however, teaches this limitation. Walters expressly teaches a larger-scale (second) dataset from a smaller (first) dataset:
“Synthetic data may be used in methods of data compression to create or recreate a realistic, larger-scale data set from a smaller, compressed dataset…” (Walters, p. 1, ¶[0002])
Given Walters’s explicit teaching of generating a larger dataset from a smaller one, a POSITA at the time of the claimed invention would have adopted this relative-size relationship within Ghanta’s validation/evaluation workflow, as it would have been a predictable use of a known technique to meet the claimed condition, with a reasonable expectation of success.
Regarding claim 5, Ghanta in view of Walters and in further view of Williams teach the method of claim 4, wherein
the second number is one hundred times larger than the first number. - Ghanta does not teach this limitation. Walters, however, teaches this limitation. Walters teaches the second dataset being larger than the first:
“Synthetic data may be used … to create or recreate a realistic, larger-scale data set from a smaller, compressed dataset…” (Walters, p. 1, ¶[0002])
Selecting a particular multiple (e.g., 100x) of the known “larger scale” relationship is a routine optimization of a result-effective variable (dataset size) and a design choice yielding predictable results, absent evidence of criticality or unexpected results for the specific multiple. See MPEP § 2144.05 (optimization of result-effective variables) and MPEP § 2144.04 (design choice).
A POSITA implementing the claim-4 workflow (second number larger than the first number) would have selected a specific scale factor for the second number as a routine optimization of a result-effective variable – dataset size. Walters expressly teaches creating a larger-scale synthetic dataset from a smaller source; the practitioner would tune how much larger to achieve predictable benefits such as (i) variance reduction/statistical power in evaluation, (ii) better coverage of rare/long-tail patterns while preserving the first dataset’s distribution, and (iii) stronger confidentiality by relying more heavily on synthetic records. Choosing that factor so the second number is one hundred times larger than the first number is therefore a design choice within ordinary skill, yielding predictable results, absent evidence of criticality or unexpected results tied to “100x”.
Regarding claims 12, 14, and 15 (system claims)
Each of claims 12, 14, and 15 is the system analog of method claims 2, 4, and 5, respectively, and is rejected for the same reasons discussed above for those method claims. Walters teaches confidentiality-motivated synthetic generation using seeded/pseudo-random parameters and populating the synthetic dataset (claim 12), and scaling the synthetic dataset relative to the first dataset (claims 14 and 15); specific size ratios (e.g., ≥100x) are predictable design choices absent criticality. Ghanta provides the surrounding ML validation framework. Accordingly, claims 12, 14, and 15 are rejected under 35 U.S.C. § 103 as being obvious over Ghanta in view of Walters and in further view of Williams.
Claims 3, 7, 8, 13, 17, and 18 are rejected under 35 U.S.C. 103 as being unpatentable over Ghanta (US20200034665A1) in view of Williams (US20150254555A1).
Regarding claim 3, Ghanta in view of Walters and in further view of Williams teach the method of claim 1, wherein identifying the first distributional characteristics of the first dataset comprises:
retrieving a neural network comprising a plurality of nodes, wherein each node is connected to at least one other node; - Ghanta does not teach this limitation. Williams, however, teaches this. Williams discloses a deep learning neural network (DLNN) used in the disclosed pipeline:
“Train Deep Learning Neural Network (DLNN)… ” (Williams, Fig. 6)
“the DLNN may be trained using training data” (Williams, p. 9, ¶[0104])
Williams discloses training a deep learning neural network by assigning weights to connections during training (e.g., ¶¶[0102]-[0104]). A “connection” necessarily joins two units of the network; thus a DLNN that is trained by assigning weights on connections necessarily comprises a plurality of nodes (neurons) in which each node is connected to at least one other node. This structural limitation is therefore inherent in the Williams network under BRI. See MPEP § 2112 (a property/feature may be relied on in § 103 when it is necessarily present in the thing taught, not merely probably present).
training the neural network using at least a subset of the first dataset by assigning weights to connections between the plurality of nodes; - Ghanta does not teach this limitation. Williams, however, teaches this limitation. Williams teaches training the DLNN on domain data, where learning proceeds by assigning (learning) weights on the network’s connections during training:
“the training data is processed through a training algorithm and computes the biases, weights, and transfer functions” (Williams, p. 8, ¶[0097])
“the DLNN may be trained using training data appropriate for the current domain being modeled.” (Williams, p. 9, ¶[0104])
A POSITA implementing Ghanta’s validation framework would use Williams’s DLNN to characterize the first dataset’s distributional characteristics (via the learned weights) because Williams already teaches NN training on the domain data; relying on the assigned weights to reflect the dataset’s distributional characteristics is a predictable use of a known technique with a reasonable expectation of success. (KSR; MPEP § 2143).
Regarding claim 7, Ghanta in view of Walters and in further view of Williams teach the method of claim 1, wherein generating the combined dataset further comprises:
assigning a first source identifier to each of the first plurality of samples; – Ghanta does not teach this limitation. Williams, however, teaches this limitation. Williams discloses that labels (known classifications) are source identifiers that tag each sample with the dataset/source membership needed for later segmentation:
“Both Training Corpus 508 and Testing Corpus 510 may include one or more pre-labeled datasets. A pre-labeled dataset may be one where known classifications are applied as labels.” (Williams, p. 8, ¶[0091])
§112(b) Note / Amenability: as written, the limitation of claim 7, “assigning a second source identifier to each of the first plurality of samples”, applies both identifiers to the first plurality. The Office enters a separate §112(b) rejection (indefinite). In the alternative for §103, under BRI the clause is amenable to construction as assigning the second source identifier to each of the second plurality of samples in the second dataset (consistent with the disclosure and context).
and assigning a second source identifier to each of the first plurality of samples, - Williams teaches applying labels to datasets/corpora for exactly this purpose (distinguishing sources across mixed inputs):
“Both Training Corpus 508 and Testing Corpus 510