Last updated: May 29, 2026
Application No. 18/127,551
ALLOCATING COMPUTING RESOURCES BETWEEN MODEL SIZE AND TRAINING DATA DURING TRAINING OF A MACHINE LEARNING MODEL

Non-Final OA §102§103
Filed
Mar 28, 2023
Priority
Mar 29, 2022 — provisional 63/324,997
Examiner
TAN, DAVID H
Art Unit
2145
Tech Center
2100 — Computer Architecture & Software
Assignee
Deepmind Technologies Limited
OA Round
1 (Non-Final)
Interview Optional

— +17.0% interview lift. Interview already conducted in this application's prosecution history. This examiner has a 31% grant rate with +17.0% interview lift. Since an interview has already been tried, recommend written response with narrowed claims based on precedent claim evolution patterns.
Based on 99 resolved cases, 2023–2026
Examiner Intelligence

TAN, DAVID H View full profile →
Grants only 31% of cases
Career Allowance Rate
31 granted / 99 resolved
-23.7% vs TC avg
Strong +17% interview lift
Without
With
+17.0%
Interview Lift
resolved cases with interview
Typical timeline
3y 11m
Avg Prosecution
26 currently pending
Career history
139
Total Applications
across all art units
Statute-Specific Performance

§103
95.7%
+55.7% vs TC avg
§102
4.1%
-35.9% vs TC avg
§112
0.2%
-39.8% vs TC avg
Black line = Tech Center average estimate • Based on career data from 99 resolved cases
Office Action

§102 §103
Notice of Pre-AIA  or AIA  Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA .

Information Disclosure Statement
The information disclosure statement (IDS) submitted on 10/23/2024 and 12/11/2024 are in compliance with the provisions of 37 CFR 1.97.  Accordingly, the information disclosure statement is being considered by the examiner.

Claim Rejections - 35 USC § 102
In the event the determination of the status of the application as subject to AIA  35 U.S.C. 102 and 103 (or as subject to pre-AIA  35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis (i.e., changing from AIA  to pre-AIA ) for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.  
The following is a quotation of the appropriate paragraphs of 35 U.S.C. 102 that form the basis for the rejections under this section made in this Office action:
A person shall be entitled to a patent unless –

(a)(2) the claimed invention was described in a patent issued under section 151, or in an application for patent published or deemed published under section 122(b), in which the patent or application, as the case may be, names another inventor and was effectively filed before the effective filing date of the claimed invention.

Claim(s) 1-5, 7-8, 10-12, & 15-20 is/are rejected under 35 U.S.C. 102(a)(2) as being anticipated by Kaplan, J., McCandlish, S., Henighan, T., Brown, T. B., Chess, B., Child, R., Gray, S., Radford, A., Wu, J., & Amodei, D. (2020, January 23). Scaling laws for neural language models. arXiv.org. https://arxiv.org/abs/2001.08361, hereinafter “Kaplan”.
Claim 1:
Kaplan teaches a method performed by one or more computers, the method comprising:	 obtaining data defining a compute budget that characterizes an amount of computing resources allocated for training a machine learning model to perform a machine learning task (i.e. [1.1. Summary, pg. 3], “: When working within a fixed compute budget C but without any other restrictions on the model size N or available data D, we attain optimal performance by training very large models and stopping significantly short of convergence”, wherein the BRI for data defining a compute budget encompasses how a LLM model may have a fixed compute budget for a reasoning task);	 processing the data defining the compute budget using an allocation mapping, in accordance with a set of allocation mapping parameters, to generate an allocation tuple defining:	(i) a target model size for the machine learning model (i.e. [6.1 Optimal Performance and Allocation, pg. 16], “Each value of the compute budget Cmin has an associated optimal model size N. Optimal model size grows very rapidly with Cmin, increasing by 5x for each 10x increase in compute”, where it can be seen in Fig. 14 that data defining a Cmin Compute budget is associated with an optimal model size N. Wherein the BRI for an allocation tuple encompasses the input data defining varying parameters associated with a fixed compute, such as in Table 1, which are then tested as calculations that result in a test loss curve for each varying parameter and is used to find an optimal model allocation), and 
(ii) a target amount of training data for training the machine learning model, wherein selecting a model size of the machine learning model as the target model size and training the machine learning model on the target amount of training data is predicted to optimize a performance of the machine learning model on the machine learning task subject to a constraint that an amount of computing resources used for training the machine learning model satisfies a threshold defined by the compute budget (i.e. [3.3 Performance with Dataset Size and Compute, pg. 9-10], “For the trend with D we trained a model with (nlayer, nembd) = (36, 1280) on fixed subsets of the WebText2 dataset. We stopped training once the test loss ceased to decrease. We see that the resulting test losses can be fit with simple power-law 
    PNG
    media_image1.png
    44
    124
    media_image1.png
    Greyscale
 in the dataset size. The data and fit appear in Figure 1. The total amount of non-embedding compute used during training can be estimated as C = 6NBS, where B is the batch size, S is the number of parameter updates, and the factor of 6 accounts for the Forward and backward passes. Thus for a given value of C we can scan over all models with various N to find the model”, wherein the BRI for a target amount of training data encompasses a number of parameters N. Which for a given Compute Value C, a varying number of parameter counts N may be tested to find an optimized loss defined by the Compute value budget. Wherein the BRI for satisfies a threshold defined by the compute budget encompasses how loss performance with varying amounts of parameters constrained by a given Compute Value) 	 instantiating the machine learning model, wherein the machine learning model has the target model size (i.e. [4 Charting the Infinite Data Limit and Overfitting, pg. 10], “Here we will study the performance of a model of size N trained on a dataset with D tokens while varying N and D simultaneously. We will empirically demonstrate that the optimally trained test loss accords with the scaling law of Equation (1.5)”, wherein the model is trained according to the sizes above to find a test loss and determine an optimal model size for a given compute budget);	 obtaining the target amount of training data for training the machine learning model (i.e. [5.2 Results for r L(N, Smin) and Performance with Model Size and Compute, pg. 14,], When we hold either total compute or number of training steps fixed, performance follows L(N, S) from Equation (5.6). Each value of compute budget has an associated optimal model size that maximizes performance); and	 training the machine learning model having the target model size on the target amount of training data (i.e. [6.1, Optimal Performance and Allocations, pg. 15], Given L(Cmin), it is natural to ask for the optimal model size N(Cmin) that provides the minimal loss with a given quantity of training compute. The optimal model size is shown in Figure 14).
Claim 2:
Kaplan teaches the method of claim 1, wherein values of the set of allocation mapping parameters are determined by operations comprising:	 identifying a plurality of trial allocation tuples, wherein each trial allocation tuple defines:	 (i) a trial model size for the machine learning model, and (ii) a trial amount of training data for training the machine learning model (i.e. [4. Charting the Infinite Data Limit and Overfitting, pg. 11], “The early-stopped test loss L(N, D) depends predictably on the dataset size D and model size N according to Equation (1.5). Left: For large D, performance is a straight power law in N. For a smaller fixed D, performance stops improving as N increases and the model begins to overfit”, wherein the BRI for a trial allocation tuple encompasses how varying amounts of model size N are tested for a given compute budget with varying amounts of parameters in order to find an optimal loss and prevent overfitting);	 determining, for each of the plurality of trial allocation tuples, a performance measure characterizing a performance of a trial machine learning model on the machine learning task resulting from selecting a model size of the trial machine learning model as the trial model size and training the trial machine learning model on the trial amount of training data (i.e. [4 Charting the Infinite Data Limit and Overfitting, pg. 10], “Here we will study the performance of a model of size N trained on a dataset with D tokens while varying N and D simultaneously. We will empirically demonstrate that the optimally trained test loss accords with the scaling law of Equation (1.5). This provides guidance on how much data we would need to train models of increasing size while keeping overfitting under control”, wherein the BRI for a performance measure encompasses a test loss for each trial); and	 determining the values of the set of allocation mapping parameters based on the performance measures corresponding to the plurality of trial allocation tuples data (i.e. [4 Charting the Infinite Data Limit and Overfitting, pg. 10], “We will empirically demonstrate that the optimally trained test loss accords with the scaling law of Equation (1.5). This provides guidance on how much data we would need to train models of increasing size while keeping overfitting under control”, where an optimal model size may be found for a compute budget).  

Claim 3:
Kaplan teaches the method of claim 2, determining the values of the set of allocation mapping parameters based on the performance measures corresponding to the plurality of trial allocation tuples comprises:	 determining, for each of a plurality of compute budgets, an optimal model size and an optimal amount of training data corresponding to the compute budget based on the performance measures corresponding to the plurality of trial allocation tuples (i.e. [5.2 Results for L(N, Smin) and Performance with Model Size and Compute, pg. 14], When we hold either total compute or number of training steps fixed, performance follows L(N, S) from Equation (5.6). Each value of compute budget has an associated optimal model size that maximizes performance); and	 determining the values of the set of allocation mapping parameters based on the optimal model size and the optimal amount of training data corresponding to each of the plurality of compute budgets (i.e. 6. Optimal Allocation of the Compute Budget pg. 15], More importantly, we will use the results of Section 5 to determine the optimal allocation of compute between model size N and the quantity of data processed during training).  

Claim 4:
Kaplan teaches the method of claim 3, wherein determining the values of the set of allocation mapping parameters based on the optimal model size and the optimal amount of training data corresponding to each of the plurality of compute budgets comprises:	 fitting the values of the set of allocation mapping parameters based on the optimal model size and the optimal amount of training data corresponding to each of the plurality of compute budgets (i.e. [6.1 Optimal Performance and Allocation, pg. 15], “Given L(Cmin), it is natural to ask for the optimal model size N(Cmin) that provides the minimal loss with a given quantity of training compute. The optimal model size is shown in Figure 14.”, wherein given a Compute budget minimum an optimal mapping may be found using the found power-law equation).  

Claim 5:
Kaplan teaches the method of claim 3, wherein determining, for each of the plurality of compute budgets, the optimal model size and the optimal amount of training data corresponding to the compute budget comprises:	 determining a respective performance curve for each of a plurality of trial model sizes based on the performance measures corresponding to the plurality of trial allocation tuples, wherein a performance curve for a trial model size defines a continuous mapping from possible compute budgets to predicted performance measures, wherein a predicted performance measure corresponding to a possible compute budget defines a predicted performance of a trial machine learning model with the trial model size that is trained using an amount of computing resources that satisfies a threshold defined by the possible compute budget (i.e. [5.2 Results for L(N, Smin) and Performance with Model Size and Compute , pg. 14], “When we hold either total compute or number of training steps fixed, performance follows L(N, S) from Equation (5.6). Each value of compute budget has an associated optimal model size that maximizes performance)”, wherein the power-law equation may display a performance vs compute budget for a number of parameters N found for each trialed combination to find an optimal performance balance);	 and determining the optimal model size and the optimal amount of training data corresponding to each compute budget using the performance curves (i.e. [5.2 Results for L(N, Smin) and Performance with Model Size and Compute , pg. 14], Each value of compute budget has an associated optimal model size that maximizes performance).  

Claim 7:
Kaplan teaches the method of claim 5, wherein determining the optimal model size and the optimal amount of training data corresponding to each compute budget using the performance curves comprises, for each compute budget of the plurality of compute budgets (i.e. 6.1, pg. 16, Each value of the compute budget Cmin has an associated optimal model size N):	 determining an optimal performance curve that achieves an optimal performance measure, from among the performance curves, for the compute budget (i.e. [1.2 Summary of Scaling Laws, pg. 5], “When training within a fixed compute budget C, but with no other constraints, Equation (1.6) leads to the prediction that the optimal model size N, optimal batch size B, optimal number of steps S, and dataset size D should grow as 
    PNG
    media_image2.png
    72
    376
    media_image2.png
    Greyscale
);	 determining the optimal model size as the trial model size corresponding to the optimal performance curve (i.e. , “After an initial transient period, learning curves for all model sizes N can be fit with Equation (1.6), which is parameterized in terms of Smin, the number of steps when training at large batch size”, wherein a fixed compute budged is trailed with a variety of model sizes and each trial has its loss charted in a curve as seen in Fig. 4 Right);	 and determining the optimal amount of training data based on the compute budget and the optimal model size (i.e. [5.2 Results for r L(N, Smin) and Performance with Model Size and Compute, pg. 14,], When we hold either total compute or number of training steps fixed, performance follows L(N, S) from Equation (5.6). Each value of compute budget has an associated optimal model size that maximizes performance).  

Claim 8:
Kaplan teaches the method of claim 3, wherein determining, for each of the plurality of compute budgets, the optimal model size and the optimal amount of training data corresponding to the compute budget comprises:	 determining a respective performance curve for each of the plurality of compute budgets based on the performances measures corresponding to the plurality of trial allocation tuples, wherein a performance curve for a compute budget defines a continuous mapping from possible model sizes to predicted performance measures, wherein a predicted performance measure corresponding to a possible model size defines a predicted performance of a trial machine learning model with the possible model size that is trained using an amount of computing resources that satisfies a threshold defined by the compute budget (i.e. [5.2 Results for L(N, Smin) and Performance with Model Size and Compute, pg. 14], “The data and fits can be visualized in a different and more interesting way, as shown in Figure 11. There we study the test loss as a function of model size while fixing either the total non-embedding compute C used in training”, wherein it is noted that each trial size consisting of a number of parameters used in a fixed compute budged model may have their loss performance measured and mapped in a curve according to the threshold of a fixed compute budget);	 and determining the optimal model size and the optimal amount of training data corresponding to each compute budget using the performance curves (i.e. [5.2 Results for L(N, Smin) and Performance with Model Size and Compute, pg. 14], Each value of compute budget has an associated optimal model size that maximizes performance).  

Claim 10:
Kaplan teaches the method of claim 8, wherein determining the optimal model size and the optimal amount of training data corresponding to each compute budget using the performance curves comprises, for each compute budget of the plurality of compute budgets:	 determining the optimal model size as a model size that optimizes the performance curve corresponding to the compute budget;	 and determining the optimal amount of training data based on the compute budget and the optimal model size (i.e. [5.2 Results for L(N, Smin) and Performance with Model Size and Compute, pg. 14], Each value of compute budget has an associated optimal model size that maximizes performance).  

Claim 11:
Kaplan teaches the method of claim 2, wherein determining the values of the set of allocation mapping parameters based on the performance measures corresponding to the plurality of trial allocation tuples comprises:	 determining a set of parameters of a performance estimation function that is configured to process data defining:	 (i) an input model size, and (ii) an input amount of training data, to generate a predicted performance measure that characterizes a predicted performance of a machine learning model having the input model size, that is trained on the input amount of training data, on the machine learning task (i.e. [4. Charting the Infinite Data Limit and Overfitting, pg. 11], “The early-stopped test loss L(N, D) depends predictably on the dataset size D and model size N according to Equation (1.5). Left: For large D, performance is a straight power law in N. For a smaller fixed D, performance stops improving as N increases and the model begins to overfit”, wherein the BRI for an input model size and training data encompasses how varying amounts of model size N and model parameters are tested for a given compute budget with varying amounts of parameters in order to find an optimal loss and prevent overfitting);, comprising:	 fitting values of the set of parameters of the performance estimation function based on the performance measures corresponding to the plurality of trial allocation tuples (i.e. [3.3 Performance with Dataset Size and Compute, pg. 10], Figure 8, “The figure also includes images of individual learning curves to clarify when individual models are optimal”, wherein a model may be fit with optimal parameters matching the trial tests for loss in order to find optimal values);	 and determining the values of the set of allocation mapping parameters using the performance estimation function (i.e. [4 Charting the Infinite Data Limit and Overfitting, pg. 10], “We will empirically demonstrate that the optimally trained test loss accords with the scaling law of Equation (1.5). This provides guidance on how much data we would need to train models of increasing size while keeping overfitting under control”, wherein using the equations and curves above, certain optimal values of parameters may be found by comparing test losses for trial batches of parameters)

Claim 12:
Kaplan teaches the method of claim 11, wherein determining the values of the set of allocation mapping parameters using the performance estimation function comprises:	 determining the values of the set of allocation mapping parameters to cause each input compute budget to be mapped to a target model size and a target amount of training data that optimize the performance estimation function subject to a constraint that training a machine learning model having the target model size on the target amount of training data uses an amount of computing resources given by the input compute budget (i.e. [.1 Optimal Performance and Allocations, pg. 16], Each value of the compute budget Cmin has an associated optimal model size N”, wherein optimal model parameters may be found by comparing test loss curves for each parameter).  

Claim 15:
Kaplan teaches the method of claim 2, wherein for each of the plurality of trial allocation tuples, determining the performance measure corresponding to the trial allocation tuple comprises:	 training a trial machine learning model having the trial model size on the trial amount of training data using a learning rate schedule that is selected based on the trial amount of training data (i.e. [Results for L(N, Smin) and Performance with Model Size and Compute, pg. 13], “Now we will use Smin defined in Equation (5.4) to obtain a simple and universal fit for the dependence of the loss on model size and training time in the infinite data limit… We include all training steps after the warmup period of the learning rate schedule, and find a fit to the data with the parameters”, wherein each model size and compute trial calculation includes training dating using a general learning rate schedule based on the trial being conducted to prevent overfitting. The examiner notes that the BRI for basing a learning rate on a trial amount of training data would encompass a scenario where a single learning rate is selected based on the presence of any amount of training data in a trial)
Claim 16:
 Kaplan teaches the method of claim 1, wherein the allocation mapping causes the target model size and the target amount of training data to increase at substantially a same rate in response to an increase in the compute budget (i.e. [6.1 Optimal Performance and Allocations, pg. 16], Optimal model size grows very rapidly with Cmin, increasing by 5x for each 10x increase in compute).  

Claim 17:
Kaplan teaches the method of claim 1, wherein the machine learning task comprises a language modeling task (i.e. [1 Introduction, pg. 2], in this work we will empirically investigate the dependence of language modeling loss on all of these factors).  

Claim 18:
Kaplan teaches the method of claim 1, wherein the machine learning model comprises a neural network model (i.e. [1 Introduction, pg. 2], One might expect language modeling performance to depend on model architecture, the size of neural models, the computing power used to train them, and the data available for this training process. In this work we will empirically investigate the dependence of language modeling loss on all of these factors”, wherein it is noted that the language model is a neural network model).  

Claim 19:
	Claim 19 is the media claim reciting similar limitations to Claim 1 and is rejected for similar reasons. 

Claim 20:
	Claim 20 is the system claim reciting similar limitations to Claim 1 and is rejected for similar reasons. 
	Kaplan further teaches
	one or more computers; and one or more storage devices communicatively (i.e. [2.2 Training Procedures, pg. 7], “Unless otherwise noted, we train models with the Adam optimizer [KB14] for a fixed 2.5 × 105 steps with a batch size of 512 sequences of 1024 tokens. Due to memory constraints, our largest models (more than 1B parameters) were trained with Adafactor [SS18]”, wherein it is noted that the deep learning frameworks of the Adam Optimizer and Adafactor require processors (GPUs) coupled to storage (RAM) in order to run)


Claim Rejections - 35 USC § 103
This application currently names joint inventors. In considering patentability of the claims the examiner presumes that the subject matter of the various claims was commonly owned as of the effective filing date of the claimed invention(s) absent any evidence to the contrary.  Applicant is advised of the obligation under 37 CFR 1.56 to point out the inventor and effective filing dates of each claim that was not commonly owned as of the effective filing date of the later invention in order for the examiner to consider the applicability of 35 U.S.C. 102(b)(2)(C) for any potential 35 U.S.C. 102(a)(2) prior art against the later invention.
In the event the determination of the status of the application as subject to AIA  35 U.S.C. 102 and 103 (or as subject to pre-AIA  35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis (i.e., changing from AIA  to pre-AIA ) for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.  
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

Claim(s) 6 & 9 is/are rejected under 35 U.S.C. 103 as being unpatentable over Kaplan, J., McCandlish, S., Henighan, T., Brown, T. B., Chess, B., Child, R., Gray, S., Radford, A., Wu, J., & Amodei, D. (2020, January 23). Scaling laws for neural language models. arXiv.org. https://arxiv.org/abs/2001.08361, hereinafter “Kaplan” and further in light of U.S. Patent Application Publication NO. 20140137117 “Ding”.
Claim 6:
Kaplan teaches the method of claim 5, wherein determining a performance curve for a trial model size comprises:	 determining the performance curve for the trial model size by(i.e. [4 Charting the Infinite Data Limit and Overfitting, pg. 10], “We will empirically demonstrate that the optimally trained test loss accords with the scaling law of Equation (1.5). This provides guidance on how much data we would need to train models of increasing size while keeping overfitting under control”, where performance measures such as loss may be used to find an optimal model size on a specific compute budget).  
	While Kaplan teaches charting a learning curve for a loss performing metric according to test trialed amounts of size parameter data, Kaplan may not explicitly teach that value on a calculated performance curve charted by
	Interpolating.
	However, Ding teaches
Interpolating (i.e. para. [0035], Fig. 1, a target system comprising the same host computer and virtualization technology as illustrated in FIG. 1… Assuming continuity between measured points in the multi-dimensional space, the overhead at points between these measured points can be estimated using multi-dimensional interpolation. FIG. 1 illustrates an interpolation in one dimension, utilization, for determining that the overhead that can be expected on a target system of the same configuration (Dell PE 1850 running five virtual machines under VMware ESX at 1256 MHz with one core per CPU, one thread per core) with a 45% utilization is approximately 8%, as illustrated by the dotted lines in FIG. 1.”, wherein it is noted that the concept of interpolating performance data may be used in a case where continuity between measured performance points can be assumed)
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to add Interpolating, to Kaplan’s performance calculations to find optimal machine learning model allocations, with how specifically multi-dimensional machine performance data may be interpolated, as taught by Ding. One would have been motivated to combine Ding and Kaplan as interpolation of performance data based on the particular set of conditions at the target system may help provide a clearer and more advantageous estimate of performance associated with a given allocation of computer systems.

Claim 9:
 Kaplan teaches the method of claim 8, wherein determining a performance curve for a compute budget comprises:	 determining the performance curve for the compute budget by (i.e. [5.2 Results for L(N, Smin) and Performance with Model Size and Compute, pg. 14], “The data and fits can be visualized in a different and more interesting way, as shown in Figure 11. There we study the test loss as a function of model size while fixing either the total non-embedding compute C used in training”, wherein it is noted that each trial size consisting of a number of parameters used in a fixed compute budged model may have their loss performance measured and mapped in a curve according to the threshold of a fixed compute budget).  
While Kaplan teaches charting a learning curve for a loss performing metric according to test trialed amounts of size parameter data, Kaplan may not explicitly teach that value on a calculated performance curve charted by
	Interpolating.
	However, Ding teaches
Interpolating (i.e. para. [0035], Fig. 1, a target system comprising the same host computer and virtualization technology as illustrated in FIG. 1… Assuming continuity between measured points in the multi-dimensional space, the overhead at points between these measured points can be estimated using multi-dimensional interpolation. FIG. 1 illustrates an interpolation in one dimension, utilization, for determining that the overhead that can be expected on a target system of the same configuration (Dell PE 1850 running five virtual machines under VMware ESX at 1256 MHz with one core per CPU, one thread per core) with a 45% utilization is approximately 8%, as illustrated by the dotted lines in FIG. 1.”, wherein it is noted that the concept of interpolating performance data may be used in a case where continuity between measured performance points can be assumed)
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to add Interpolating, to Kaplan’s performance calculations to find optimal machine learning model allocations, with how specifically multi-dimensional machine performance data may be interpolated, as taught by Ding. One would have been motivated to combine Ding and Kaplan as interpolation of performance data based on the particular set of conditions at the target system may help provide a clearer and more advantageous estimate of performance associated with a given allocation of computer systems.


Claim(s) 13-14 is/are rejected under 35 U.S.C. 103 as being unpatentable over Kaplan, J., McCandlish, S., Henighan, T., Brown, T. B., Chess, B., Child, R., Gray, S., Radford, A., Wu, J., & Amodei, D. (2020, January 23). Scaling laws for neural language models. arXiv.org. https://arxiv.org/abs/2001.08361, hereinafter “Kaplan” and further in light of U.S. Patent Application Publication NO. 20210374954 “Yarlagadda”.

Claim 13:
Kaplan teaches the method of claim 11, wherein fitting the values of the set of parameters of the performance estimation function based on the performance measures corresponding to the plurality of trial allocation tuples comprises:	 fitting the values of the set of parameters of the performance estimation function to minimize, for each trial allocation tuple, a measure of error (i.e. [6 Learning Rate Schedules and Error Analysis, pg. 26], “Run-to-run variation is at the level of 0.05 in the loss, so averaging multiple runs is necessary to validate performance changes smaller than this level”, wherein a general measure of error between runs was found) 
While Kaplan teaches a general measure of error analysis for variation between calculated trial runs, Kaplan may not explicitly teach wherein a measure a measure of error is found between:	 (i) the performance measure corresponding to the trial allocation tuple, and (ii) a predicted performance measure generated by processing the trial model size and the trial amount of training data defined by the trial allocation tuple using the performance estimation function.  
However, Yarlagadda teaches 
a measure of error if found between:	 (i) the performance measure corresponding to the trial allocation tuple (i.e. para. [0048], “in general, the higher the loss metric, the more the output may have deviated from the expected result of the input. Conversely, the lower the loss metric, the lower the output may have deviated from the expected result”, wherein an actual performance measurement may correspond to an actual result), and (ii) a predicted performance measure generated by processing the trial model size and the trial amount of training data defined by the trial allocation tuple using the performance estimation function (i.e. para. [0048], “The loss metric may be determined using many outputs from the site prediction model 235 generated using many examples from the training dataset 245. The loss metric may indicate a degree of deviation of the output (e.g., the predicted primary site 340) from the site prediction model 235 from the expected result (e.g., the primary site label 310) as indicated in the training dataset”, wherein a measure of error may be found between predicted model data and expected data).  
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to add wherein a measure of error is found between a the performance measure and the predicted performance measure, to Kaplan’s performance calculations to find and test estimates of optimal machine learning model allocations, with how a predictive model may have a measure of error calculated between a measured performance measure and an estimated performance measure, as taught by Yarlagadda. One would have been motivated to combine Yarlagadda and Kaplan as measure of predicted and actual measurements may indicate a degree of deviation that helps a user further understand the reliability of their predictive model. 

Claim 14:
Kaplan and Yarlagadda teaches the method of claim 13.
Yarlagadda further teaches wherein the measure of error comprises a Huber loss (i.e. para. [0048], the loss metric may be calculated in accordance with any number of loss functions, such as a Huber loss).  

Conclusion
The prior art made of record and not relied upon is considered pertinent to applicant's disclosure. 
U.S. Patent Application Publication NO. 20210110302 “Nam”, teaches in para. [0035],  The machine learning models may require different constraints, and the hyper-parameter may be optimized under the constraints.
Any inquiry concerning this communication or earlier communications from the examiner should be directed to DAVID H TAN whose telephone number is (571)272-7433. The examiner can normally be reached M-F 7:30-4:30.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Cesar Paula can be reached at (571) 272-4128. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.





/D.T./Examiner, Art Unit 2145                 


/CESAR B PAULA/Supervisory Patent Examiner, Art Unit 2145
Read full office action
Prosecution Timeline

Mar 28, 2023
Application Filed
Jan 20, 2026
Non-Final Rejection mailed — §102, §103
Mar 17, 2026
Applicant Interview (Telephonic)
Mar 17, 2026
Examiner Interview Summary
Precedent Cases

Applications granted by this same examiner with similar technology

17/508,593
Patent 12626184
ELECTRONIC DEVICE FOR UPDATING ARTIFICIAL INTELLIGENCE MODEL AND OPERATING METHOD THEREOF
4y 6m to grant Granted May 12, 2026
17/731,183
Patent 12626097
Ensemble Time Series Model for Forecasting
4y 0m to grant Granted May 12, 2026
15/784,004
Patent 12443336
INTERACTIVE USER INTERFACE FOR DYNAMICALLY UPDATING DATA AND DATA ANALYSIS AND QUERY PROCESSING
8y 0m to grant Granted Oct 14, 2025
17/171,087
Patent 12282863
METHOD AND SYSTEM OF USER IDENTIFICATION BY A SEQUENCE OF OPENED USER INTERFACE WINDOWS
4y 2m to grant Granted Apr 22, 2025
17/558,232
Patent 12182378
METHODS AND SYSTEMS FOR OBJECT SELECTION
3y 0m to grant Granted Dec 31, 2024
Study what changed to get past this examiner. Based on 5 most recent grants.
Strategy Recommendation AI-generated — please review before filing

Get a prosecution strategy drawn from examiner precedents, rejection analysis, and claim mapping.
Typically takes 5-10 seconds — AI-generated, attorney review required before filing
Prosecution Projections

1-2
Expected OA Rounds
31%
Grant Probability
48%
With Interview (+17.0%)
3y 11m (~9m remaining)
Median Time to Grant
Low
PTA Risk
Based on 99 resolved cases by this examiner. Grant probability derived from career allowance rate.