Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Continued Examination Under 37 CFR 1.114
A request for continued examination under 37 CFR 1.114, including the fee set forth in 37 CFR 1.17(e), was filed in this application after final rejection. Since this application is eligible for continued examination under 37 CFR 1.114, and the fee set forth in 37 CFR 1.17(e) has been timely paid, the finality of the previous Office action has been withdrawn pursuant to 37 CFR 1.114. Applicant's submission filed on 02/11/2026 has been entered.
Remarks
This Office Action is responsive to Applicants' Amendment filed on February 11, 2026, in which claims 1-8, 15, and 20 are amended. No claims have been newly added or cancelled. Claims 1-20 are currently pending.
Response to Arguments
With regard to the objections to claims 5 and 15 for minor informalities, claims 5 and 15 have been amended to correct the informalities, and the objections are therefore withdrawn.
With regard to the rejections of claims 1-9, 10-18, and 20 under 35 U.S.C. 103 as being unpatentable over Krishna et al., “Neural Architecture Search with Reinforce and Masked Attention Autoregressive Density Estimators”, in view of Kobayashi et al. (U.S. Patent Application Pub. No. US 2017/0061329 A1), further in view of Roth et al. (U.S. Patent No. 12,462,377 B1), Examiner finds persuasive Applicant’s arguments that the claims as amended overcome the rejections; however, the arguments are moot in view of a new ground of rejection, as presented below.
Claim Objections
Claim 8 is objected to because of the following informality: “wherein execution of the instructions cause the processor to determine to initialize the second set of hyperparameter search operations is based on determining the computational resource limitation is in a first state” should read “wherein execution of the instructions cause the processor to determine to initialize the second set of hyperparameter search operations based on determining the computational resource limitation is in a first state”. Appropriate correction is required.
Claim Rejections - 35 USC § 103
In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis (i.e., changing from AIA to pre-AIA) for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.
Claims 1-9, 11-18, and 20 are rejected under 35 U.S.C. 103 as being unpatentable over Krishna et al. “Neural Architecture Search with Reinforce and Masked Attention Autoregressive Density Estimators”, hereinafter Krishna, in view of Kobayashi et al. (U.S. Patent Application Pub. No. US 2017/0061329 A1), hereinafter Kobayashi, further in view of Wever et al. “ML-Plan for Unlimited-Length Machine Learning Pipelines”, hereinafter Wever, further in view of Roth et al. (U.S. Patent No. 12,462,377 B1), hereinafter Roth. Claims 11-18 are considered first to maintain consistency with the previous Office action.
Regarding claim 11,
Krishna teaches A computer-implemented method comprising:
at a hyperparameter computing device, implementing, in a first hyperparameter configuration state, a first set of hyperparameter search operations, the first set of hyperparameter search operations including: selecting a first set of hyperparameters using a hyperparameter determination model from a hyperparameter dataset [including a plurality of hyperparameter types organized in a hierarchical structure,] ((Krishna Pg. 3) “1. sample a batch of action strings based on the current state of the policy network and fetch the corresponding rewards from the environment”, (Krishna Pg. 3) “sequences of hyperparameter values (henceforth, referred to as strings)”, (Krishna Pg. 7) “Sample B valid hyper-parameter strings…using the policy network”, sampling a batch of sequences of hyperparameter values using a policy network corresponds to selecting a set of hyperparameters using a hyperparameter determination model, a state of a policy network for sampling hyperparameters corresponds to a hyperparameter configuration state, a string of hyperparameter values corresponds to a hyperparameter dataset, Krishna does not teach a hierarchical structure for organizing hyperparameter types)
and each hyperparameter in the hyperparameter dataset has a probability value corresponding to a respective configuration; ((Krishna Pg. 2) “RL methods based on policy gradient (Zoph & Le, 2016; Zoph et al., 2018) specify a policy network, parametrized by θ, to learn a desired probability distribution over values of the hyperparameters, P(a1, a2, …, aN; θ), where ai denotes the value of the ith hyper-parameter”, a probability distribution over the values of each hyperparameter corresponds to each hyperparameter in the set having a probability value corresponding to its value, broadest reasonable interpretation of a respective configuration includes a value of the respective hyperparameter)
and the first set of hyperparameters is selected based on: [the hierarchical structure,] the probability value of each hyperparameter, ((Krishna Pg. 2) “RL methods based on policy gradient (Zoph & Le, 2016; Zoph et al., 2018) specify a policy network, parametrized by θ, to learn a desired probability distribution over values of the hyperparameters, P(a1, a2, ..., aN; θ), where ai denotes the value of the ith hyper-parameter”, Krishna does not teach selecting a set of hyperparameters based on a hierarchical structure) and a machine learning model to be trained using the first set of hyperparameters; (Krishna Pg. 7, procedure REMAADE shows that hyperparameters are used to determine the best machine learning architecture a*, which is to be trained: (Krishna Pg. 7) “These cells are assembled together in a predefined manner to form an overall convolutional neural network architecture that is trained on the CIFAR-10 dataset”)
[media_image1.png — greyscale image (306 × 762)]
the first set of performance data including information indicating a performance of each hyperparameter of the first set of hyperparameters; ((Krishna Pg. 3) “We set up the training regime such that the policy network learns to assign higher probabilities to those sequences of hyperparameter values (henceforth, referred to as strings) that yield a higher accuracy on the cross-validation dataset”)
using a probability determination engine to evaluate the performance data to determine an updated probability value to each hyperparameter of the first set of hyperparameters; ((Krishna Pg. 3) “We set up the training regime such that the policy network learns to assign higher probabilities to those sequences of hyperparameter values (henceforth, referred to as strings) that yield a higher accuracy on the cross-validation dataset”, (Krishna Pg. 4) “We optimize the objective via gradient ascent where the gradient can be estimated using the Reinforce rule in (Williams, 1992). The optimization procedure alternates between two steps until we exhaust the exploration budget: 1. sample a batch of action strings based on the current state of the policy network and fetch the corresponding rewards from the environment 2. update the policy network’s parameters using policy gradient”, the objective of the policy network, which corresponds to the probability determination engine, is to assign the highest probabilities to the hyperparameters which give the best accuracy, therefore optimizing the policy network based on performance data of rewards from the environment to update the policy network’s parameters corresponds to determining updated probabilities for the hyperparameters using a probability determination engine)
and using a hyperparameter determination model update engine, and the updated probability value for each hyperparameter, to change from the first hyperparameter configuration state to a second hyperparameter configuration state for subsequent uses of the hyperparameter determination model ((Krishna Pg. 4) “We optimize the objective via gradient ascent where the gradient can be estimated using the Reinforce rule in (Williams, 1992). The optimization procedure alternates between two steps until we exhaust the exploration budget:…2. update the policy network’s parameters using policy gradient”, the policy network’s parameters correspond to a hyperparameter configuration state as the policy network provides higher probabilities for hyperparameters that achieve greater accuracies, and is updated at step 2 to optimize the objective function, which is optimized to provide the higher probabilities to hyperparameters that are more accurate, so optimizing the parameters creates a configuration that provides more optimal hyperparameters)
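As a purely illustrative aid (not drawn from Krishna or any other cited reference, and not part of the record), the sample-then-update alternation quoted above can be sketched as a minimal Reinforce loop over independent categorical hyperparameter choices. The toy reward environment and all names are hypothetical, and Krishna’s actual policy network is a masked-attention autoregressive density estimator rather than a set of independent categoricals:

```python
import numpy as np

rng = np.random.default_rng(0)

N, K = 3, 4                      # 3 hyperparameters, 4 candidate values each
logits = np.zeros((N, K))        # policy parameters (theta)
target = np.array([1, 3, 0])     # toy "best" hyperparameter string

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def reward(string):
    # toy environment: fraction of values matching the target string
    return np.mean(string == target)

lr, B = 0.1, 16
for _ in range(300):
    probs = softmax(logits)
    # 1. sample a batch of action strings from the current policy state
    batch = np.array([[rng.choice(K, p=probs[i]) for i in range(N)]
                      for _ in range(B)])
    rewards = np.array([reward(s) for s in batch])
    baseline = rewards.mean()    # simple variance-reduction baseline
    # 2. update the policy parameters with the Reinforce rule:
    #    grad log pi(a_i) = onehot(a_i) - softmax(logits_i)
    grad = np.zeros_like(logits)
    for s, r in zip(batch, rewards):
        grad += (r - baseline) * (np.eye(K)[s] - probs)
    logits += lr * grad / B
```

Step 2 mirrors Krishna’s statement that the policy network “learns to assign higher probabilities” to strings yielding higher reward: values sampled in above-baseline strings have their logits increased, so the updated parameters constitute a new hyperparameter configuration state.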
Kobayashi teaches the following further limitations that Krishna does not teach:
using a model training engine and the first set of hyperparameters, training the machine learning model, ((Kobayashi [0221]) “(S75) The step execution unit 138 learns a model m by using the machine learning algorithm ai, the hyperparameter vector θh, and the training data Dt”, the step execution unit trains models and thus corresponds to a model training engine)
wherein training the machine learning model includes: obtaining the machine learning model from machine learning model data, ((Kobayashi [0213]) “The step execution unit 138 receives a specified machine learning algorithm and sample size from the learning control unit 135”)
configuring the machine learning model with the first set of hyperparameters, ((Kobayashi [0209]) “In response to a request from the step execution unit 138, the hyperparameter adjustment unit 137 generates a hyperparameter vector applied to a machine learning algorithm to be executed by the step execution unit 138”)
and training the machine learning model configured with the first set of hyperparameters using training data; ((Kobayashi [0221]) “(S75) The step execution unit 138 learns a model m by using the machine learning algorithm ai, the hyperparameter vector θh, and the training data Dt”)
executing the trained machine learning model ((Kobayashi [0213]) “by using the data stored in the data storage unit 121 and the acquired hyperparameter vector, the step execution unit 138 executes a learning step of the specified machine learning algorithm with the specified sample size. The step execution unit 138 repeats machine learning using a plurality of hyperparameter vectors in a single learning step”, executing a learning step of a machine learning algorithm is executing a machine learning model) to generate output results; ((Kobayashi [0214]) “Next, the step execution unit 138 selects a model that indicates the best prediction performance from a plurality of models that correspond to the plurality of hyperparameter vectors. The step execution unit 138 outputs the selected model, the prediction performance thereof, the hyperparameter vector used to generate the model, and the execution time”, a selected model, its prediction performance, its hyperparameter vector, and its execution time all correspond to output results)
using a model validation engine, comparing the output results to expected results, [to obtain a first set of performance data,] ((Kobayashi [0222]) “(S76) The step execution unit 138 calculates the prediction performance p of the model m by using the learned model m and the test data Ds”, the step execution unit acts as a model validation engine as it validates the model performance; Krishna also teaches obtaining a set of performance data, as cited above; (Kobayashi [0005]) “Next, by using test data indicating a known case different from the training data, the computer compares a result predicted by the model with the known result”, a model-predicted result corresponds to output results, a known result corresponds to expected results, so using test data to calculate prediction performance includes both)
At the time of filing, one of ordinary skill in the art would have been motivated to combine Krishna and Kobayashi by taking the hyperparameter search process taught by Krishna and using it to train at least one machine learning model, which is then executed to gather performance data, as taught by Kobayashi. It is well known in the art to train machine learning models using optimal hyperparameters and to iteratively compare a machine learning model’s results against test results to evaluate its performance, and both yield the predictable benefit of improved accuracy for the machine learning model. Such a combination would be obvious.
Wever teaches the following further limitations that neither Krishna nor Kobayashi teaches:
…a hyperparameter dataset including a plurality of hyperparameter types organized in a hierarchical structure, (Wever Pg. 3, Fig. 2 shows a plurality of hyperparameter types organized in a hierarchical structure, although Wever does not refer to the variables for selection of a preprocessing or learning algorithm as hyperparameters, as Wever states: (Wever Pg. 1) “these approaches, such as auto-sklearn and Auto-WEKA, have one variable for a pre-processing algorithm, one variable for the learning algorithm, and one variable for each parameter of each algorithm”, Applicant’s specification and drawings treat these variables as falling under the definition of hyperparameter, e.g. in [0031] “For example, hyperparameter type a1 202 illustrates exemplary pre-processing hyperparameters that one or more machine learning models may be configured with”)
[media_image2.png — greyscale image (506 × 852)]
wherein: the plurality of hyperparameter types includes: a pre-processing hyperparameter type in a first hierarchical level, an algorithm hyperparameter type in a second hierarchical level that is under the first hierarchical level, and a [kernel] hyperparameter type [and a regularizer hyperparameter type both] in a third hierarchical level that is under the second hierarchical level; (Wever Pg. 3, Fig. 2 shows a “Preprocessor” hyperparameter at the first hierarchical level in the “Machine Learning Pipeline” box, a “Base Classifier” (i.e. an algorithm) hyperparameter in the second hierarchical level in the “Adaboost” box, and a general “Hyper-parameters” type in the third hierarchical level in the “RandomForest” box)
and the first set of hyperparameters is selected based on: the hierarchical structure, … ((Wever Pg. 2) “AutoML seeks to automatically compose and parametrize machine learning algorithms into ML pipelines with the goal to optimize a given metric, e.g., predictive accuracy…In general, complete pipelines can be viewed as a hierarchical composition structure as in the example shown on the right-hand side of Figure 2. Furthermore, machine learning algorithms usually have hyperparameters that need to be chosen specifically for this algorithm. Thus, a hierarchical view of a machine learning pipeline represents its natural structure particularly well”)
At the time of filing, one of ordinary skill in the art would have been motivated to combine Krishna, Kobayashi, and Wever by taking the combined hyperparameter search and model training process taught jointly by Krishna and Kobayashi and adding a hierarchical organization for the types of hyperparameters in the hyperparameter dataset, as taught by Wever, as Wever states: (Wever Pg. 2) “machine learning algorithms usually have hyperparameters that need to be chosen specifically for this algorithm. Thus, a hierarchical view of a machine learning pipeline represents its natural structure particularly well”. Such a combination would be obvious.
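As another illustrative aid (hypothetical structure and names; not taken from Wever, Roth, or Applicant’s disclosure), the claimed hierarchy — a pre-processing type at a first level, an algorithm type at a second level under it, and algorithm-specific types such as kernel and regularizer at a third level under that — can be sketched as a nested search space sampled top-down:

```python
import random

random.seed(0)

# Hypothetical nested search space: preprocessor (level 1) -> algorithm
# (level 2) -> algorithm-specific hyperparameters such as kernel and
# regularizer (level 3).
search_space = {
    "standard_scaler": {                     # first hierarchical level
        "svm": {                             # second level: learning algorithm
            "kernel": ["linear", "rbf"],     # third level
            "regularizer": [0.01, 0.1, 1.0], # third level
        },
        "random_forest": {
            "n_trees": [50, 100],
            "max_depth": [4, 8],
        },
    },
}

def sample_pipeline(space):
    """Walk the hierarchy top-down, fixing one choice per level."""
    pre = random.choice(list(space))                 # level 1 choice
    algo = random.choice(list(space[pre]))           # level 2 choice
    params = {name: random.choice(vals)              # level 3 choices
              for name, vals in space[pre][algo].items()}
    return {"preprocessor": pre, "algorithm": algo, "params": params}

config = sample_pipeline(search_space)
```

Because the level-3 choices are drawn only after the level-2 algorithm is fixed, the sampled hyperparameters are always ones that apply to the selected algorithm, which is the structural point of the Wever citation above.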
Roth teaches the following further limitation that neither Krishna nor Kobayashi teaches and that Wever does not explicitly teach:
and a kernel hyperparameter type and a regularizer hyperparameter type [both in a third hierarchical level that is under the second hierarchical level;] ((Roth Cols. 15-16, lines 64-4) “In at least one embodiment, hyperparameters can be tuned in certain categories, as may include data preprocessing (such as translating words to vectors), CNN architecture definition (for example, filter sizes, number of filters), stochastic gradient descent (SGD) parameters (for example, learning rate), and regularization or refinement (for example, dropout probability), among other such options”, a CNN architecture definition hyperparameter category including filter sizes and number of filters corresponds to a kernel hyperparameter type (kernel and filter are synonyms), a regularization hyperparameter category corresponds to a regularizer hyperparameter type, Wever but not Roth teaches hyperparameters at a third hierarchical level)
At the time of filing, one of ordinary skill in the art would have been motivated to combine Krishna, Kobayashi, Wever, and Roth by taking the combined hyperparameter search and model training process, including use of a hyperparameter dataset with hyperparameter types organized in a hierarchical structure with hyperparameters at a third hierarchical level, taught jointly by Krishna, Kobayashi, and Wever, and including kernel and regularizer hyperparameter types, as taught by Roth. Kernel and regularization hyperparameters are well known in the art, and selecting them after selecting the pre-processing and algorithm types allows a more suitable selection that increases accuracy, as their efficacy depends on attributes of the data after pre-processing and on the selected algorithm. Such a combination would be obvious.
Regarding claim 12,
Krishna, Kobayashi, Wever, and Roth jointly teach The computer-implemented method of claim 11,
Krishna further teaches:
wherein determining an updated probability value to each hyperparameter of the first set of hyperparameters includes: determining a first respective vector value and a second respective vector value for each hyperparameter of the first set of hyperparameters ((Krishna Pg. 4) “Let qi ∈ Rd, ∀i = 1, … , N, be query vectors for each of the N hyper-parameters. We also maintain value vectors for the values that each hyper-parameter can take”)
by mapping each hyperparameter of the first set of hyperparameters and corresponding assigned value ((Krishna Pg. 3) “1. Embedding layer maps hyper-parameters and hyper-parameter values to a d dimensional vector space”)
At the time of filing, one of ordinary skill in the art would have been motivated to combine the teachings of Krishna, Kobayashi, Wever, and Roth for the same reasons set forth for parent claim 11. No new embodiments are introduced, so the rationale for the combination is the same as for the parent claim.
Regarding claim 13,
Krishna, Kobayashi, Wever, and Roth jointly teach The computer-implemented method of claim 12,
Krishna further teaches:
wherein determining an updated probability value to each hyperparameter of the first set of hyperparameters includes: based on the first respective vector value and the second respective vector value for each hyperparameter of the first set of hyperparameters, ((Krishna Pg. 4) “Inspired by XL-net (Yang et al., 2019), we use a two-stream masked attention based architecture comprising query and key vectors to compose Hθ. A notable departure from XL-net is that since we are not predicting probabilities for a position but for a given hyper-parameter, we let the query vector of the target hyper-parameter attend to preceding key vectors”)
determining hyperparameter dependencies for each hyperparameter of the first set of hyperparameters by utilizing a simplified transformer with a two-stream masked attention based architecture ((Krishna Pg. 2) “We present a 2-stream attention based architecture for capturing dependencies between hyper-parameters”, Krishna Pg. 5 Figure 2 shows the use of simplified transformers within the two-stream masked attention based architecture)
[media_image3.png — greyscale image (554 × 516)]
At the time of filing, one of ordinary skill in the art would have been motivated to combine the teachings of Krishna, Kobayashi, Wever, and Roth for the same reasons set forth for parent claim 12. No new embodiments are introduced, so the rationale for the combination is the same as for the parent claim.
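As a simplified illustration of the two-stream masked-attention idea quoted from Krishna (random vectors and all names are hypothetical; this is a single attention step, not Krishna’s full architecture), the query vector of a target hyperparameter can be made to attend only to the key vectors of preceding hyperparameters:

```python
import numpy as np

rng = np.random.default_rng(1)
d, N = 8, 4                  # embedding dimension, number of hyperparameters
q = rng.normal(size=(N, d))  # one query vector per hyperparameter
k = rng.normal(size=(N, d))  # key vectors for hyperparameter values
v = rng.normal(size=(N, d))  # value vectors carrying the content

def masked_context(i):
    """Target hyperparameter i attends only to preceding positions 0..i-1."""
    if i == 0:
        return np.zeros(d)   # nothing precedes the first hyperparameter
    scores = q[i] @ k[:i].T / np.sqrt(d)  # scaled dot-product scores
    w = np.exp(scores - scores.max())
    w /= w.sum()                          # softmax over preceding positions
    return w @ v[:i]         # context used to predict hyperparameter i

ctx = masked_context(2)
```

The mask restricting attention to positions 0..i-1 is what lets the model capture dependencies between hyperparameters while respecting the autoregressive factorization Krishna describes.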
Regarding claim 14,
Krishna, Kobayashi, Wever, and Roth jointly teach The computer-implemented method of claim 13,
Krishna further teaches:
wherein determining an updated probability value to each hyperparameter of the first set of hyperparameters ((Krishna Pg. 3) “The optimization procedure alternates between two steps until we exhaust the exploration budget: 1. sample a batch of action strings based on the current state of the policy network and fetch the corresponding rewards from the environment”, iteratively determining rewards for a hyperparameter string corresponds to updating its probability value) includes: determining the updated probability value for each hyperparameter of the first set of hyperparameters ((Krishna Pg. 3) “3. Density Estimation Layer computes the probability density for a string”, (Krishna Pg. 3) “those sequences of hyperparameter values (henceforth, referred to as strings)”) based on the hyperparameter dependencies for each hyperparameter of the first set of hyperparameters ((Krishna Pg. 3) “2. Context Representation layer models dependencies between hyper-parameters as specified by the auto-regressive factorization”)
At the time of filing, one of ordinary skill in the art would have been motivated to combine the teachings of Krishna, Kobayashi, Wever, and Roth for the same reasons set forth for parent claim 13. No new embodiments are introduced, so the rationale for the combination is the same as for the parent claim.
Regarding claim 15,
Krishna, Kobayashi, Wever, and Roth jointly teach The computer-implemented method of claim 14,
Krishna further teaches:
wherein the probability value and the updated probability value are probability densities ((Krishna Pg. 3) “3. Density Estimation Layer computes the probability density for a string”, (Krishna Pg. 3) “those sequences of hyperparameter values (henceforth, referred to as strings)”)
At the time of filing, one of ordinary skill in the art would have been motivated to combine the teachings of Krishna, Kobayashi, Wever, and Roth for the same reasons set forth for parent claim 14. No new embodiments are introduced, so the rationale for the combination is the same as for the parent claim.
Regarding claim 16,
Krishna, Kobayashi, Wever, and Roth jointly teach The computer-implemented method of claim 12,
Krishna further teaches:
further comprising: based on the assigned value of each hyperparameter of the first set of hyperparameters, changing the first hyperparameter configuration state into a second hyperparameter configuration state; and implementing, in the second hyperparameter configuration state, a second set of hyperparameter search operations ((Krishna Pg. 3) “The optimization procedure alternates between two steps until we exhaust the exploration budget: 1. sample a batch of action strings based on the current state of the policy network and fetch the corresponding rewards from the environment 2. update the policy network’s parameters using policy gradient”, sampling a batch of action strings corresponds to a set of hyperparameter search operations, updating a policy network corresponds to changing a hyperparameter configuration state into another hyperparameter configuration state)
At the time of filing, one of ordinary skill in the art would have been motivated to combine the teachings of Krishna, Kobayashi, Wever, and Roth for the same reasons set forth for parent claim 12. No new embodiments are introduced, so the rationale for the combination is the same as for the parent claim.
Regarding claim 17,
Krishna, Kobayashi, Wever, and Roth jointly teach The computer-implemented method of claim 16,
Krishna further teaches:
further comprising: determining a state of a computational resource limitation ((Krishna Pg. 3) “The exploration budget can be quantified in units of computation such as number of GPU/TPU hours for training architectures or, alternately, as the number of times the policy network can query the environment to fetch the architecture’s score. In case of the latter, it is assumed that all architectures consume identical compute for getting trained”, both units of computation and queries to fetch from the environment are computational resources)
wherein the computational resource limitation is a predetermined number of sampling/resampling cycles; ((Krishna Pg. 8) “To benchmark ReMAADE on NASBench-101, we investigate short term performance (exploration budget of 150 architectures), and medium term performance (exploration budget of 3200 architectures)”, both 150 and 3200 architectures to sample are predetermined numbers of sampling cycles)
and based on the determined state of the computational resource limitation, determining whether to initialize the second set of hyperparameter search operations ((Krishna Pg. 3) “The optimization procedure alternates between two steps until we exhaust the exploration budget”)
At the time of filing, one of ordinary skill in the art would have been motivated to combine the teachings of Krishna, Kobayashi, Wever, and Roth for the same reasons set forth for parent claim 16. No new embodiments are introduced, so the rationale for the combination is the same as for the parent claim.
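As a final illustrative aid (hypothetical names; not drawn from Krishna), the mapping above — a predetermined number of sampling cycles serving as the computational resource limitation, with further search operations initialized only while the budget is not in the exhausted state — can be sketched as:

```python
def hyperparameter_search(budget_cycles, run_cycle):
    """Run sampling cycles until the predetermined exploration budget is exhausted."""
    results = []
    used = 0
    while used < budget_cycles:           # limitation not yet in the exhausted state
        results.append(run_cycle(used))   # one set of search operations
        used += 1                         # each cycle consumes one unit of the budget
    return results

# Toy usage: each cycle returns a dummy score.
scores = hyperparameter_search(5, lambda i: i * 0.1)
```

The `while` condition plays the role of the claimed determination: a new set of search operations is initialized only when the budget check finds the limitation in the not-exhausted state.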
Regarding claim 18,
Krishna, Kobayashi, Wever, and Roth jointly teach The computer-implemented method of claim 17,
Krishna further teaches:
wherein determining to initialize the second set of hyperparameter search operations is based on determining the computational resource limitation is in a first state ((Krishna Pg. 3) “The optimization procedure alternates between two steps until we exhaust the exploration budget”, a state of exhaustion corresponds to a first state)
At the time of filing, one of ordinary skill in the art would have been motivated to combine the teachings of Krishna, Kobayashi, Wever, and Roth for the same reasons set forth for parent claim 17. No new embodiments are introduced, so the rationale for the combination is the same as for the parent claim.
Regarding claim 1,
Claim 1 recites a system that performs the function of the method of claim 11. Specifically, claim 1 recites: A system comprising: a hyperparameter computing device comprising a processor and a non-transitory computer-readable medium storing instructions that, when executed, cause the processor to: [perform the method of claim 11].
Kobayashi states: (Kobayashi [0208]) “Each of the hyperparameter adjustment unit 137 and the step execution unit 138 may be realized by using a program module executed by the CPU, for example” and (Kobayashi [0066]) “The CPU 101 is a processor which includes an arithmetic circuit that executes program instructions. The CPU 101 loads at least a part of programs or data held in the HDD 103 to the RAM 102 and executes the program”.
At the time of filing, one of ordinary skill in the art would have been motivated to combine Krishna, Kobayashi, Wever, and Roth for the same reasons as described in the combination statements for claim 11. All other limitations in claim 1 are substantially the same as those in claim 11; therefore, the same rationale for rejection applies.
Regarding claim 2,
Claim 2 recites a system with a processor for performing the function of the method of claim 12. All other limitations in claim 2 are substantially the same as those in claim 12, therefore the same rationale for rejection applies.
Regarding claim 3,
Claim 3 recites a system with a processor for performing the function of the method of claim 13. All other limitations in claim 3 are substantially the same as those in claim 13, therefore the same rationale for rejection applies.
Regarding claim 4,
Claim 4 recites a system with a processor for performing the function of the method of claim 14. All other limitations in claim 4 are substantially the same as those in claim 14, therefore the same rationale for rejection applies.
Regarding claim 5,
Claim 5 recites a system with a processor for performing the function of the method of claim 15. All other limitations in claim 5 are substantially the same as those in claim 15, therefore the same rationale for rejection applies.
Regarding claim 6,
Krishna, Kobayashi, Wever, and Roth jointly teach The system of claim 1, wherein execution of the instructions further cause the processor to:
Krishna further teaches:
based on the updated probability value of each hyperparameter of the first set of hyperparameters, changing the first hyperparameter configuration state into a second hyperparameter configuration state; and implementing, in the second hyperparameter configuration state, a second set of hyperparameter search operations ((Krishna Pg. 3) “The optimization procedure alternates between two steps until we exhaust the exploration budget: 1. sample a batch of action strings based on the current state of the policy network and fetch the corresponding rewards from the environment 2. update the policy network’s parameters using policy gradient”, sampling a batch of action strings corresponds to a set of hyperparameter search operations, updating a policy network corresponds to changing a hyperparameter configuration state into another hyperparameter configuration state)
At the time of filing, one of ordinary skill in the art would have been motivated to combine the teachings of Krishna, Kobayashi, Wever, and Roth for the same reasons set forth for parent claim 1. No new embodiments are introduced, so the rationale for the combination is the same as for the parent claim.
Regarding claim 7,
Krishna, Kobayashi, Wever, and Roth jointly teach The system of claim 6, wherein execution of the instructions further cause the processor to:
Krishna further teaches:
determine a state of a computational resource limitation ((Krishna Pg. 3) “The exploration budget can be quantified in units of computation such as number of GPU/TPU hours for training architectures or, alternately, as the number of times the policy network can query the environment to fetch the architecture’s score. In case of the latter, it is assumed that all architectures consume identical compute for getting trained”, both units of computation and queries to fetch from the environment are computational resources)
wherein the computational resource limitation is a predetermined number of sampling/resampling cycles; ((Krishna Pg. 8) “To benchmark ReMAADE on NASBench-101, we investigate short term performance (exploration budget of 150 architectures), and medium term performance (exploration budget of 3200 architectures)”, both 150 and 3200 architectures to sample are predetermined numbers of sampling cycles)
and based on the determined state of the computational resource limitation, determining whether to initialize the second set of hyperparameter search operations ((Krishna Pg. 3) “The optimization procedure alternates between two steps until we exhaust the exploration budget”)
At the time of filing, one of ordinary skill in the art would have been motivated to combine the teachings of Krishna, Kobayashi, Wever, and Roth for the same reasons set forth for parent claim 6. No new embodiments are introduced, so the rationale for the combination is the same as for the parent claim.
Regarding claim 8,
Krishna, Kobayashi, Wever, and Roth jointly teach The system of claim 7, wherein execution of the instructions cause the processor
Krishna further teaches:
to determine to initialize the second set of hyperparameter search operations is based on determining the computational resource limitation is in a first state ((Krishna Pg. 3) “The optimization procedure alternates between two steps until we exhaust the exploration budget”, a state of exhaustion corresponds to a first state)
At the time of filing, one of ordinary skill in the art would have been motivated to combine the system jointly taught by Krishna, Kobayashi, Wever, and Roth for claim 7, the parent claim of claim 8. No new embodiments are introduced, so the reason to combine is the same as for the parent claim.
Regarding claim 9,
Krishna, Kobayashi, Wever, and Roth jointly teach The system of claim 1,
Krishna further teaches:
wherein the first set of performance data is generated based on a validation score ((Krishna Pg. 8) “For all NAS algorithms, during a trial, we track the best random validation error achieved after t explorations and the corresponding random test error”, the trial generates performance data)
At the time of filing, one of ordinary skill in the art would have been motivated to combine the system jointly taught by Krishna, Kobayashi, Wever, and Roth for claim 1, the parent claim of claim 9. No new embodiments are introduced, so the reason to combine is the same as for the parent claim.
Regarding claim 20,
Claim 20 recites a computer-readable medium storing instructions for performing the functions of the method of claim 11. Specifically, claim 20 recites: A non-transitory computer readable medium having instructions stored thereon, wherein the instructions, when executed by one or more processors, cause a system to: [perform the method of claim 11].
Kobayashi states: (Kobayashi [0019]) “According to one aspect, there is provided a non-transitory computer-readable recording medium storing a computer program that causes a computer to perform a procedure”.
At the time of filing, one of ordinary skill in the art would have been motivated to combine Krishna, Kobayashi, Wever, and Roth for the same reasons as described in the combination statement for claim 11. All other limitations in claim 20 are substantially the same as those in claim 11; therefore, the same rationale for rejection applies.
Claims 10 and 19 are rejected under 35 U.S.C. 103 as being unpatentable over Krishna, in view of Kobayashi, further in view of Wever, further in view of Roth, further in view of Yu et al. (U.S. Patent Application Pub. No. US 2018/0121814 A1), hereinafter Yu.
Regarding claim 10,
Krishna, Kobayashi, Wever, and Roth jointly teach The system of claim 1,
Yu teaches the following further limitation, which none of Krishna, Kobayashi, Wever, or Roth teaches:
wherein the first set of hyperparameters are selected randomly ((Yu [0024]) “In one example, when the hyperparameter tuning system 120 starts by randomly selecting hyperparameters”)
At the time of filing, one of ordinary skill in the art would have been motivated to combine Krishna, Kobayashi, Wever, Roth, and Yu by applying the random hyperparameter selection technique taught by Yu to the hyperparameter searching system jointly taught by Krishna, Kobayashi, Wever, and Roth. Random initialization is a technique well known in the art that yields the predictable benefit of preventing searches from consistently terminating at a local minimum/maximum while failing to find the global minimum/maximum. Such a combination would therefore have been obvious.
Regarding claim 19,
Claim 19 recites a method that performs the functions of the system of claim 10. All other limitations in claim 19 are substantially the same as those in claim 10; therefore, the same rationale for rejection applies.
Conclusion
The prior art made of record and not relied upon is considered pertinent to applicant's disclosure.
Yang et al. “XLNet: Generalized Autoregressive Pretraining for Language Understanding” also discloses a two-stream masked attention based architecture and transformers, though not explicitly simplified transformers.
Bergstra et al. “Algorithms for Hyper-Parameter Optimization” also discloses hyperparameter searching, including generating performance data and assigning values to hyperparameters based on it, but not assigning vectors to hyperparameters or determining a state of computational resource limitation.
Gallicchio et al. “Randomized Machine Learning Approaches: Recent Developments and Challenges” discloses basic hyperparameter initialization and search techniques.
Any inquiry concerning this communication or earlier communications from the examiner should be directed to VICTOR A NAULT whose telephone number is (703) 756-5745. The examiner can normally be reached M - F, 12 - 8.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Miranda Huang can be reached at (571) 270-7092. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.
/V.A.N./Examiner, Art Unit 2124
/Kevin W Figueroa/Primary Examiner, Art Unit 2124