Last updated: May 29, 2026
Application No. 18/340,457
REINFORCEMENT MACHINE LEARNING WITH HYPERPARAMETER TUNING

Final Rejection §103
Filed
Jun 23, 2023
Examiner
KIM, SEHWAN
Art Unit
2129
Tech Center
2100 — Computer Architecture & Software
Assignee
International Business Machines Corporation
OA Round
2 (Final)
Interview Optional

— +65.9% interview lift. Interview already conducted in this application's prosecution history. This examiner has a 60% grant rate with +65.9% interview lift. Since an interview has already been tried, recommend written response with narrowed claims based on precedent claim evolution patterns.
Based on 146 resolved cases, 2023–2026
Examiner Intelligence

KIM, SEHWAN View full profile →
Grants 60% of resolved cases
Career Allowance Rate
87 granted / 146 resolved
+4.6% vs TC avg
Strong +66% interview lift
Without
With
+65.9%
Interview Lift
resolved cases with interview
Typical timeline
4y 0m
Avg Prosecution
26 currently pending
Career history
180
Total Applications
across all art units
Statute-Specific Performance

§101
5.2%
-34.8% vs TC avg
§103
87.7%
+47.7% vs TC avg
§102
2.2%
-37.8% vs TC avg
§112
4.7%
-35.3% vs TC avg
Black line = Tech Center average estimate • Based on career data from 146 resolved cases
Office Action

§103
Notice of Pre-AIA  or AIA  Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA .

Examiner’s Note
Providing supporting paragraph(s) for each limitation of amended/new claim(s) in Remarks is strongly requested for clear and definite claim interpretations by Examiner (e.g., to avoid rejections under 35 U.S.C § 112(a) “Lack of written description”)
Applicant can schedule an interview at any stage of the prosecution (e.g., Non-Final, Final, and After-Final) to discuss any issues related to, for example, rejections under 35 U.S.C § 101 and § 102/103, for moving toward allowance.

Priority
Acknowledgment is made of applicant's claim for the present application filed on 06/23/2023.

Response to Arguments
Applicant’s arguments regarding 35 USC § 103 with respect to the independent claims have been considered but are moot because the arguments are directed to amended limitation(s) that has/have not been previously examined.

Claim Rejections - 35 USC § 103
In the event the determination of the status of the application as subject to AIA  35 U.S.C. 102 and 103 (or as subject to pre-AIA  35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.  
This application currently names joint inventors. In considering patentability of the claims the examiner presumes that the subject matter of the various claims was commonly owned as of the effective filing date of the claimed invention(s) absent any evidence to the contrary.  Applicant is advised of the obligation under 37 CFR 1.56 to point out the inventor and effective filing dates of each claim that was not commonly owned as of the effective filing date of the later invention in order for the examiner to consider the applicability of 35 U.S.C. 102(b)(2)(C) for any potential 35 U.S.C. 102(a)(2) prior art against the later invention.
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

Claim(s) 1-3, 5, 7-9, 11, 13-16, 18, 20 is/are rejected under 35 U.S.C. 103 as being unpatentable over Jaderberg et al. (Population Based Training of Neural Networks) in view of Chen et al. (Towards Learning Universal Hyperparameter Optimizers with Transformers)

Regarding claim 1
Jaderberg teaches
A method of training a reinforcement learning agent comprising:	
(Jaderberg [sec(s) 4] “In this section we will apply Population Based Training to different learning problems. We describe the specific form of PBT for deep reinforcement learning in Sect. 4.1 when applied to optimising UNREAL (Jaderberg et al., 2016) on DeepMind Lab 3D environment tasks (Beattie et al., 2016), Feudal Networks (Vezhnevets et al., 2017) on the Atari Learning Environment games (Bellemare et al., 2013), and the StarCraft II environment baseline agents (Vinyals et al., 2017).”;)

(Note: Hereinafter, if a limitation has bold brackets (i.e. [·]) around claim languages, the bracketed claim languages indicate that they have not been taught yet by the current prior art reference but they will be taught by another prior art reference afterwards.)

[training], via at least one processor, a machine learning model based on training data to generate a set of hyperparameters for training the reinforcement learning agent, wherein the training data includes encoded information comprising [rollout encodings capturing] information about agent policy and environment dynamics from prior hyperparameter tuning sessions for a plurality of different reinforcement learning environments and reinforcement learning agents;
(Jaderberg [fig(s) 1] “Common paradigms of tuning hyperparameters: sequential optimisation and parallel search, as compared to the method of population based training introduced in this work. … (c) Population based training starts like parallel search, randomly sampling hyperparameters and weight initialisations. However, each training run asynchronously evaluates its performance periodically. If a model in the population is under-performing, it will exploit the rest of the population by replacing itself with a better performing model, and it will explore new hyperparameters by modifying the better model’s hyperparameters, before training is continued. This process allows hyperparameters to be optimised online, and the computational resources to be focused on the hyperparameter and weight space that has most chance of producing good results. The result is a hyperparameter tuning method that while very simple, results in faster learning, lower computational resources, and often better solutions” [algorithm 1] “produce new hyperparameters h” [sec(s) 3] “In order to achieve this, PBT uses two methods called independently on each member of the population (each worker): exploit, which, given performance of the whole population, can decide whether the worker should abandon the current solution and instead focus on a more promising one; and explore, which given the current solution and hyperparameters proposes new ones to better explore the solution space. … The specific form of exploit and explore depends on the application. In this work we focus on optimising neural networks for reinforcement learning, supervised learning, and generative modelling with PBT (Sect. 4). In these cases, step is a step of gradient descent (with e.g. SGD or RMSProp (Tieleman & Hinton, 2012)), eval is the mean episodic return or validation set performance of the metric we aim to optimise, exploit selects another member of the population to copy the weights and hyperparameters from, and explore creates new hyperparameters for the next steps of gradient-based learning by either perturbing the copied hyperparameters or resampling hyperparameters from the originally defined prior distribution. A member of the population is deemed ready to exploit and explore when it has been trained with gradient descent for a number of steps since the last change to the hyperparameters, such that the number of steps is large enough to allow significant gradient-based learning to have occurred.” [sec(s) 4.1] “In this section we apply Population Based Training to the training of neural network agents with reinforcement learning (RL) where we aim to find a policy π to maximise expected episodic return Eπ[R] within an environment. … Secondly, since PBT is copying the weights of good performing agents during the exploitation phase, agents which are lucky in environment exploration are quickly propagated to more workers, meaning that all members of the population benefit from the exploration luck of the remainder of the population. … (b) Truncation selection where we rank all agents in the population by episodic reward. If the current agent is in the bottom 20% of the population, we sample another agent uniformly from the top 20% of the population, and copy its weights and hyperparameters.” [sec(s) 4.2] “In this scenario, the task is to encode a source sequence of words in one language, and output a sequence in a different target language.” [sec(s) A.4] “This model has the same number of parameters as the base configuration that runs on 8 GPUs (6.5 × 107)”; e.g., “exploit” and “explore” along with “Population based training” read(s) on “machine learning model”.)

determining, by the machine learning model, the set of hyperparameters for training the reinforcement learning agent;
(Jaderberg [fig(s) 1] “(c) Population based training starts like parallel search, randomly sampling hyperparameters and weight initialisations. However, each training run asynchronously evaluates its performance periodically. If a model in the population is under-performing, it will exploit the rest of the population by replacing itself with a better performing model, and it will explore new hyperparameters by modifying the better model’s hyperparameters, before training is continued. This process allows hyperparameters to be optimised online, and the computational resources to be focused on the hyperparameter and weight space that has most chance of producing good results. The result is a hyperparameter tuning method that while very simple, results in faster learning, lower computational resources, and often better solutions” [algorithm 1] “produce new hyperparameters h” [sec(s) 3] “In this work we focus on optimising neural networks for reinforcement learning, supervised learning, and generative modelling with PBT (Sect. 4). In these cases, step is a step of gradient descent (with e.g. SGD or RMSProp (Tieleman & Hinton, 2012)), eval is the mean episodic return or validation set performance of the metric we aim to optimise, exploit selects another member of the population to copy the weights and hyperparameters from, and explore creates new hyperparameters for the next steps of gradient-based learning by either perturbing the copied hyperparameters or resampling hyperparameters from the originally defined prior distribution.”;)

training, via the at least one processor, the reinforcement learning agent according to the set of hyperparameters; and
(Jaderberg [fig(s) 1] [algorithm 1] [sec(s) 3] “The specific form of exploit and explore depends on the application. In this work we focus on optimising neural networks for reinforcement learning, supervised learning, and generative modelling with PBT (Sect. 4). In these cases, step is a step of gradient descent (with e.g. SGD or RMSProp (Tieleman & Hinton, 2012)), eval is the mean episodic return or validation set performance of the metric we aim to optimise, exploit selects another member of the population to copy the weights and hyperparameters from, and explore creates new hyperparameters for the next steps of gradient-based learning by either perturbing the copied hyperparameters or resampling hyperparameters from the originally defined prior distribution.” [sec(s) 4] “In this section we will apply Population Based Training to different learning problems. We describe the specific form of PBT for deep reinforcement learning in Sect. 4.1 when applied to optimising UNREAL (Jaderberg et al., 2016) on DeepMind Lab 3D environment tasks (Beattie et al., 2016), Feudal Networks (Vezhnevets et al., 2017) on the Atari Learning Environment games (Bellemare et al., 2013), and the StarCraft II environment baseline agents (Vinyals et al., 2017).” [sec(s) 4.1] “In this section we apply Population Based Training to the training of neural network agents with reinforcement learning (RL) where we aim to find a policy π to maximise expected episodic return Eπ[R] within an environment.”;)

adjusting, by the machine learning model, the set of hyperparameters based on information from testing of the reinforcement learning agent.
(Jaderberg [fig(s) 1] [algorithm 1] “produce new hyperparameters h” [sec(s) 3] “The specific form of exploit and explore depends on the application. In this work we focus on optimising neural networks for reinforcement learning, supervised learning, and generative modelling with PBT (Sect. 4). In these cases, step is a step of gradient descent (with e.g. SGD or RMSProp (Tieleman & Hinton, 2012)), eval is the mean episodic return or validation set performance of the metric we aim to optimise, exploit selects another member of the population to copy the weights and hyperparameters from, and explore creates new hyperparameters for the next steps of gradient-based learning by either perturbing the copied hyperparameters or resampling hyperparameters from the originally defined prior distribution.” [sec(s) 4] “In this section we will apply Population Based Training to different learning problems. We describe the specific form of PBT for deep reinforcement learning in Sect. 4.1 when applied to optimising UNREAL (Jaderberg et al., 2016) on DeepMind Lab 3D environment tasks (Beattie et al., 2016), Feudal Networks (Vezhnevets et al., 2017) on the Atari Learning Environment games (Bellemare et al., 2013), and the StarCraft II environment baseline agents (Vinyals et al., 2017).” [sec(s) 4.1] “In this section we apply Population Based Training to the training of neural network agents with reinforcement learning (RL) where we aim to find a policy π to maximise expected episodic return Eπ[R] within an environment.”;)

However, Jaderberg does not appear to explicitly teach:
[training], via at least one processor, a machine learning model based on training data to generate a set of hyperparameters for training the reinforcement learning agent, wherein the training data includes encoded information comprising [rollout encodings capturing] information about agent policy and environment dynamics from prior hyperparameter tuning sessions for a plurality of different reinforcement learning environments and reinforcement learning agents;

(Note: Hereinafter, if a limitation has one or more bold underlines, the one or more underlined claim languages indicate that they are taught by the current prior art reference, while the one or more non-underlined claim languages indicate that they have been taught already by one or more previous art references.)

Chen teaches
training, via at least one processor, a machine learning model based on training data to generate a set of hyperparameters for training the reinforcement learning agent, wherein the training data includes encoded information comprising rollout encodings capturing information about agent policy and environment dynamics from prior hyperparameter tuning sessions for a plurality of different reinforcement learning environments and reinforcement learning agents;
(Chen [fig(s) 1] “It is trained to predict both hyperparameter suggestions (in green) and response function values (in red).” [sec(s) 1] “We introduce a serialization scheme to convert a combination of any metadata and an optimization trajectory into text, represented as a sequence of tokens, and formulate the HPO task as a sequence modeling problem. We adopt a supervised learning approach, by learning to predict parameters and hyperparameter response functions from offline tuning data (See Fig. 1). In order to further improve optimization performance, we augment the model by utilizing its own function prediction during inference (Section 4.3). Extensive experiments on both public and private datasets demonstrate the OPTFORMER’s competitive tuning and generalization abilities.” [sec(s) 2] “HPO aims to find a set of hyperparameters x from search space X to maximize a model performance metric, y = f(x), often referred to as a response function. Table 1 shows an example of HPO experimental data. Following the HPO nomenclature [2, 28], an experimental study consists of metadata (m) and a history of trials (h). The metadata contains arbitrary unstructured information, including but not limited to descriptions of the problem, optimization algorithm, names, types and value ranges of hyperparameters. The history after t trials, ht = (x1, y1, . . . , xt, yt), contains a sequence of trials, each of which consists of a parameter suggestion x and function value y. The goal of the meta-learning approach for HPO is to learn the shared knowledge among the objective functions f from a dataset of multiple tuning experiments represented as studies and to obtain an optimal HPO algorithm for new hyperparameter tuning tasks from a similar distribution to those in the dataset. … An HPO algorithm π maps the metadata and history to a distribution over hyperparameter suggestions, i.e. π(xt+1|m, ht). Using the terminology of offline RL [29], we refer to the algorithm used to generate the trajectories in a dataset as the behavior policy πb.” [sec(s) 4.1 Study tokenization] “To improve scalability, we compress the textual representation of metadata m by removing redundant phrases and punctuation (e.g., "parameter", quotes) and encoding keywords (e.g., "name", "algorithm") and enumerating types (e.g. "DOUBLE") into single tokens. … See Appendix A.2 for further details on tokenization.” [sec(s) 4] “In this section, we provide a universal interface for modeling HPO studies with mixed textual and numerical information as a sequence of discrete tokens. We train our OPTFORMER as a generative model on a given dataset and explain how to use the OPTFORMER’s parameter and function prediction abilities to implement an HPO policy.” [sec(s) 4.3] “Augmented HPO policies with function prediction: At best, the learned policy πprior can only perform as well as the original policy πb when using behavioral cloning. However, we can take advantage of the model’s simultaneous function prediction ability to improve the policy with model-based planning or offline RL techniques.”; e.g., “trajectories” read(s) on “rollouts”.)

Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the system of Jaderberg with the training of Chen.
One of ordinary skill in the art would have been motived to combine in order to provide competitive optimization performance comparable with the existing, long-tried GP-based baselines.
(Chen [sec(s) 8] “By training on a diverse set of synthetic and real-world tuning trajectories, we demonstrated the capacity of a single Transformer model to imitate 7 fundamentally different HPO policies, learn to make well calibrated few-shot function predictions, and provide competitive optimization performance on unseen test functions comparable with the existing, long-tried GP-based baselines. Many extensions are readily conceivable for future exploration.”)

Regarding claim 2
The combination of Jaderberg, Chen teaches claim 1.

Chen further teaches
wherein the machine learning model includes a decision-transformer.
(Chen [fig(s) 1] “It is trained to predict both hyperparameter suggestions (in green) and response function values (in red).” [sec(s) 1] “We adopt a supervised learning approach, by learning to predict parameters and hyperparameter response functions from offline tuning data (See Fig. 1). In order to further improve optimization performance, we augment the model by utilizing its own function prediction during inference (Section 4.3). Extensive experiments on both public and private datasets demonstrate the OPTFORMER’s competitive tuning and generalization abilities.” [sec(s) 2] “The Transformer model is an efficient attention-based neural network architecture for sequence modeling [17]. We adopt the T5 Transformer encoder-decoder architecture [30].”;)

The combination of Jaderberg, Chen is combinable with Chen for the same rationale as set forth above with respect to claim 1.

Regarding claim 3
The combination of Jaderberg, Chen teaches claim 1.

wherein training the machine learning model further comprises: (See claim 1)

Chen further teaches
producing encodings of a set of rollouts from the prior hyperparameter tuning sessions by a second machine learning model to produce the training data, wherein the set of rollouts indicates policies of corresponding reinforcement learning agents and environment dynamics for the hyperparameter tuning sessions.
(Chen [fig(s) 1] “It is trained to predict both hyperparameter suggestions (in green) and response function values (in red).” [sec(s) 1] “We introduce a serialization scheme to convert a combination of any metadata and an optimization trajectory into text, represented as a sequence of tokens, and formulate the HPO task as a sequence modeling problem. We adopt a supervised learning approach, by learning to predict parameters and hyperparameter response functions from offline tuning data (See Fig. 1). In order to further improve optimization performance, we augment the model by utilizing its own function prediction during inference (Section 4.3). Extensive experiments on both public and private datasets demonstrate the OPTFORMER’s competitive tuning and generalization abilities.” [sec(s) 2] “HPO aims to find a set of hyperparameters x from search space X to maximize a model performance metric, y = f(x), often referred to as a response function. … An HPO algorithm π maps the metadata and history to a distribution over hyperparameter suggestions, i.e. π(xt+1|m, ht). Using the terminology of offline RL [29], we refer to the algorithm used to generate the trajectories in a dataset as the behavior policy πb.” [sec(s) 4.1 Study tokenization] “To improve scalability, we compress the textual representation of metadata m by removing redundant phrases and punctuation (e.g., "parameter", quotes) and encoding keywords (e.g., "name", "algorithm") and enumerating types (e.g. "DOUBLE") into single tokens. … See Appendix A.2 for further details on tokenization.” [sec(s) 4] “We train our OPTFORMER as a generative model on a given dataset and explain how to use the OPTFORMER’s parameter and function prediction abilities to implement an HPO policy. [sec(s) 4.3] “Augmented HPO policies with function prediction: At best, the learned policy πprior can only perform as well as the original policy πb when using behavioral cloning. However, we can take advantage of the model’s simultaneous function prediction ability to improve the policy with model-based planning or offline RL techniques.”; e.g., tokenizer read(s) on “second machine learning model”. In addition, e.g., “trajectories” read(s) on “rollouts”.)

The combination of Jaderberg, Chen is combinable with Chen for the same rationale as set forth above with respect to claim 1.

Regarding claim 5
The combination of Jaderberg, Chen teaches claim 3.

wherein training the machine learning model further comprises: (See claim 1)

Chen further teaches
determining rollouts from the hyperparameter tuning sessions, wherein the determined rollouts indicate policies of the reinforcement learning agents with respect to the environments; and
(Chen [fig(s) 1] “It is trained to predict both hyperparameter suggestions (in green) and response function values (in red).” [sec(s) 1] “We adopt a supervised learning approach, by learning to predict parameters and hyperparameter response functions from offline tuning data (See Fig. 1). In order to further improve optimization performance, we augment the model by utilizing its own function prediction during inference (Section 4.3). Extensive experiments on both public and private datasets demonstrate the OPTFORMER’s competitive tuning and generalization abilities.” [sec(s) 2] “HPO aims to find a set of hyperparameters x from search space X to maximize a model performance metric, y = f(x), often referred to as a response function. … An HPO algorithm π maps the metadata and history to a distribution over hyperparameter suggestions, i.e. π(xt+1|m, ht). Using the terminology of offline RL [29], we refer to the algorithm used to generate the trajectories in a dataset as the behavior policy πb.” [sec(s) 4.1 Study tokenization] “To improve scalability, we compress the textual representation of metadata m by removing redundant phrases and punctuation (e.g., "parameter", quotes) and encoding keywords (e.g., "name", "algorithm") and enumerating types (e.g. "DOUBLE") into single tokens. … See Appendix A.2 for further details on tokenization.” [sec(s) 4] “We train our OPTFORMER as a generative model on a given dataset and explain how to use the OPTFORMER’s parameter and function prediction abilities to implement an HPO policy. [sec(s) 4.3] “Augmented HPO policies with function prediction: At best, the learned policy πprior can only perform as well as the original policy πb when using behavioral cloning. However, we can take advantage of the model’s simultaneous function prediction ability to improve the policy with model-based planning or offline RL techniques.”;)

training the second machine learning model with the determined rollouts to produce the encodings.
(Chen [fig(s) 1] [sec(s) 1] “We adopt a supervised learning approach, by learning to predict parameters and hyperparameter response functions from offline tuning data (See Fig. 1). In order to further improve optimization performance, we augment the model by utilizing its own function prediction during inference (Section 4.3). Extensive experiments on both public and private datasets demonstrate the OPTFORMER’s competitive tuning and generalization abilities.” [sec(s) 4.1 Study tokenization] “To improve scalability, we compress the textual representation of metadata m by removing redundant phrases and punctuation (e.g., "parameter", quotes) and encoding keywords (e.g., "name", "algorithm") and enumerating types (e.g. "DOUBLE") into single tokens. … The shortened text string is then converted to a sequence of tokens via the SentencePiece tokenizer [44] (see Table 2 for an example). … See Appendix A.2 for further details on tokenization.” [sec(s) 4] “We train our OPTFORMER as a generative model on a given dataset and explain how to use the OPTFORMER’s parameter and function prediction abilities to implement an HPO policy.”; For more details on “SentencePiece tokenizer [44]” and its training, please refer to Kudo et al. (Sentencepiece: A simple and language independent subword tokenizer and detokenizer for neural text processing).)

The combination of Jaderberg, Chen is combinable with Chen for the same rationale as set forth above with respect to claim 1.

Regarding claim 7
The combination of Jaderberg, Chen teaches claim 1.

wherein training the machine learning model further comprises: (See claim 1)

generating a series of rollouts for a new environment;
(Chen [fig(s) 1] “It is trained to predict both hyperparameter suggestions (in green) and response function values (in red).” [sec(s) 1] “We adopt a supervised learning approach, by learning to predict parameters and hyperparameter response functions from offline tuning data (See Fig. 1). In order to further improve optimization performance, we augment the model by utilizing its own function prediction during inference (Section 4.3). Extensive experiments on both public and private datasets demonstrate the OPTFORMER’s competitive tuning and generalization abilities.” [sec(s) 2] “HPO aims to find a set of hyperparameters x from search space X to maximize a model performance metric, y = f(x), often referred to as a response function. … An HPO algorithm π maps the metadata and history to a distribution over hyperparameter suggestions, i.e. π(xt+1|m, ht). Using the terminology of offline RL [29], we refer to the algorithm used to generate the trajectories in a dataset as the behavior policy πb.” [sec(s) 4] “We train our OPTFORMER as a generative model on a given dataset and explain how to use the OPTFORMER’s parameter and function prediction abilities to implement an HPO policy.” [sec(s) 6] “The train/test subsets of RealWorldData are split temporally to avoid information leak (see Appendix C for details).” [sec(s) 6.2] “In this section, we assess the OPTFORMER’s ability to learn the conditional distribution of the function value as a few-shot function regressor. Specifically, for every function in each test dataset, we repeatedly sample up to 200 random trials”; e.g., “trajectories” read(s) on “rollouts”.)

determining encodings for a selected set of rollouts by a second machine learning model;
(Chen [fig(s) 1] “It is trained to predict both hyperparameter suggestions (in green) and response function values (in red).” [sec(s) 1] “We introduce a serialization scheme to convert a combination of any metadata and an optimization trajectory into text, represented as a sequence of tokens, and formulate the HPO task as a sequence modeling problem. We adopt a supervised learning approach, by learning to predict parameters and hyperparameter response functions from offline tuning data (See Fig. 1). In order to further improve optimization performance, we augment the model by utilizing its own function prediction during inference (Section 4.3). Extensive experiments on both public and private datasets demonstrate the OPTFORMER’s competitive tuning and generalization abilities.” [sec(s) 4.1 Study tokenization] “To improve scalability, we compress the textual representation of metadata m by removing redundant phrases and punctuation (e.g., "parameter", quotes) and encoding keywords (e.g., "name", "algorithm") and enumerating types (e.g. "DOUBLE") into single tokens. … See Appendix A.2 for further details on tokenization.” [sec(s) 4] “We train our OPTFORMER as a generative model on a given dataset and explain how to use the OPTFORMER’s parameter and function prediction abilities to implement an HPO policy.”; e.g., tokenizer read(s) on “second machine learning model”.)

predicting a set of hyperparameters for a corresponding reinforcement learning agent by the machine learning model based on the encodings; and
(Chen [fig(s) 1] “It is trained to predict both hyperparameter suggestions (in green) and response function values (in red).” [sec(s) 1] “We adopt a supervised learning approach, by learning to predict parameters and hyperparameter response functions from offline tuning data (See Fig. 1).” [sec(s) 2] “HPO aims to find a set of hyperparameters x from search space X to maximize a model performance metric, y = f(x), often referred to as a response function. … An HPO algorithm π maps the metadata and history to a distribution over hyperparameter suggestions, i.e. π(xt+1|m, ht). Using the terminology of offline RL [29], we refer to the algorithm used to generate the trajectories in a dataset as the behavior policy πb.” [sec(s) 4] “We train our OPTFORMER as a generative model on a given dataset and explain how to use the OPTFORMER’s parameter and function prediction abilities to implement an HPO policy.” [sec(s) 4.1 Study tokenization] “To improve scalability, we compress the textual representation of metadata m by removing redundant phrases and punctuation (e.g., "parameter", quotes) and encoding keywords (e.g., "name", "algorithm") and enumerating types (e.g. "DOUBLE") into single tokens. … See Appendix A.2 for further details on tokenization.” [sec(s) 4.3] “However, we can take advantage of the model’s simultaneous function prediction ability to improve the policy with model-based planning or offline RL techniques.”;)

evaluating the machine learning model based on performance of the corresponding reinforcement learning agent after training according to the predicted set of hyperparameters.
(Chen [fig(s) 1] “It is trained to predict both hyperparameter suggestions (in green) and response function values (in red).” [sec(s) 1] “We adopt a supervised learning approach, by learning to predict parameters and hyperparameter response functions from offline tuning data (See Fig. 1).” [sec(s) 2] “HPO aims to find a set of hyperparameters x from search space X to maximize a model performance metric, y = f(x), often referred to as a response function. … An HPO algorithm π maps the metadata and history to a distribution over hyperparameter suggestions, i.e. π(xt+1|m, ht). Using the terminology of offline RL [29], we refer to the algorithm used to generate the trajectories in a dataset as the behavior policy πb.” [sec(s) 4] “We train our OPTFORMER as a generative model on a given dataset and explain how to use the OPTFORMER’s parameter and function prediction abilities to implement an HPO policy.” [sec(s) 4.1 Study tokenization] “To improve scalability, we compress the textual representation of metadata m by removing redundant phrases and punctuation (e.g., "parameter", quotes) and encoding keywords (e.g., "name", "algorithm") and enumerating types (e.g. "DOUBLE") into single tokens. … See Appendix A.2 for further details on tokenization.” [sec(s) 4.3] “However, we can take advantage of the model’s simultaneous function prediction ability to improve the policy with model-based planning or offline RL techniques.” [sec(s) 6] “The train/test subsets of RealWorldData are split temporally to avoid information leak (see Appendix C for details).”;)

The combination of Jaderberg, Chen is combinable with Chen for the same rationale as set forth above with respect to claim 1.

Regarding claim 8
The claim is a system claim corresponding to the method claim 1, and is directed to largely the same subject matter. Thus, it is rejected for the same reasons as given in the rejections of the method claim.

Regarding claim 9
The claim is a system claim corresponding to the method claim 3, and is directed to largely the same subject matter. Thus, it is rejected for the same reasons as given in the rejections of the method claim.

Regarding claim 11
The claim is a system claim corresponding to the method claim 5, and is directed to largely the same subject matter. Thus, it is rejected for the same reasons as given in the rejections of the method claim.

Regarding claim 13
The claim is a system claim corresponding to the method claim 7, and is directed to largely the same subject matter. Thus, it is rejected for the same reasons as given in the rejections of the method claim.

Regarding claim 14
The claim is a computer program product claim corresponding to the method claim 1, and is directed to largely the same subject matter. Thus, it is rejected for the same reasons as given in the rejections of the method claim.

Regarding claim 15
The claim is a computer program product claim corresponding to the method claim 2, and is directed to largely the same subject matter. Thus, it is rejected for the same reasons as given in the rejections of the method claim.

Regarding claim 16
The claim is a computer program product claim corresponding to the method claim 3, and is directed to largely the same subject matter. Thus, it is rejected for the same reasons as given in the rejections of the method claim.

Regarding claim 18
The claim is a computer program product claim corresponding to the method claim 5, and is directed to largely the same subject matter. Thus, it is rejected for the same reasons as given in the rejections of the method claim.

Regarding claim 20
The claim is a computer program product claim corresponding to the method claim 7, and is directed to largely the same subject matter. Thus, it is rejected for the same reasons as given in the rejections of the method claim.

Claim(s) 4, 10, 17 is/are rejected under 35 U.S.C. 103 as being unpatentable over Jaderberg et al. (Population Based Training of Neural Networks) in view of Chen et al. (Towards Learning Universal Hyperparameter Optimizers with Transformers) in view of Micheli et al. (TRANSFORMERS ARE SAMPLE-EFFICIENT WORLD MODELS)

Regarding claim 4
The combination of Jaderberg, Chen teaches claim 3.

However, the combination of Jaderberg, Chen does not appear to explicitly teach:
wherein the second machine learning model includes an autoencoder.

Micheli teaches
wherein the second machine learning model includes an autoencoder.
(Micheli [fig(s) 1] [sec(s) 1] “In the present work, we introduce IRIS (Imagination with auto-Regression over an Inner Speech), an agent trained in the imagination of a world model composed of a discrete autoencoder and an autoregressive Transformer. IRIS learns behaviors by accurately simulating millions of trajectories. Our approach casts dynamics learning as a sequence modeling problem, where an autoencoder builds a language of image tokens and a Transformer composes that language over time. With minimal tuning, IRIS outperforms a line of recent methods (Kaiser et al., 2020; Hessel et al., 2018; Laskin et al., 2020; Yarats et al., 2021; Schwarzer et al., 2021) for sample-efficient RL in the Atari 100k benchmark (Kaiser et al., 2020). After only two hours of real-time experience, it achieves a mean human normalized score of 1.046, and reaches superhuman performance on 10 out of 26 games. We describe IRIS in Section 2 and present our results in Section 3.”;)

Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the system of Jaderberg, Chen with the autoencoder of Micheli.
One of ordinary skill in the art would have been motived to combine in order to provide a rich gameplay experience when training in imagination by illustrating the generative capabilities of the world model.
(Micheli [sec(s) 5] “We showed that its world model acquires a deep understanding of game mechanics, resulting in pixel perfect predictions in some games. We also illustrated the generative capabilities of the world model, providing a rich gameplay experience when training in imagination. Ultimately, with minimal tuning compared to existing battle-hardened agents, IRIS opens a new path towards efficiently solving complex environments.”)

Regarding claim 10
The claim is a system claim corresponding to a combination of the method claims 2 and 4, and is directed to largely the same subject matter. Thus, it is rejected for the same reasons as given in the rejections of the combination of the method claims.

Regarding claim 17
The claim is a computer program product claim corresponding to the method claim 4, and is directed to largely the same subject matter. Thus, it is rejected for the same reasons as given in the rejections of the method claim.

Claim(s) 6, 12, 19 is/are rejected under 35 U.S.C. 103 as being unpatentable over Jaderberg et al. (Population Based Training of Neural Networks) in view of Chen et al. (Towards Learning Universal Hyperparameter Optimizers with Transformers) in view of Memarian et al. (Self-Supervised Online Reward Shaping in Sparse-Reward Environments)

Regarding claim 6
The combination of Jaderberg, Chen teaches claim 3.

wherein training the machine learning model further comprises: (See claim 1)

Chen further teaches
[ranking] the encodings for a hyperparameter tuning session based on [rewards] observed in the set of rollouts for the hyperparameter tuning session; and
(Chen [fig(s) 1] “It is trained to predict both hyperparameter suggestions (in green) and response function values (in red).” [sec(s) 1] [sec(s) 2] “HPO aims to find a set of hyperparameters x from search space X to maximize a model performance metric, y = f(x), often referred to as a response function. … An HPO algorithm π maps the metadata and history to a distribution over hyperparameter suggestions, i.e. π(xt+1|m, ht). Using the terminology of offline RL [29], we refer to the algorithm used to generate the trajectories in a dataset as the behavior policy πb.” [sec(s) 4.1 Study tokenization] “To improve scalability, we compress the textual representation of metadata m by removing redundant phrases and punctuation (e.g., "parameter", quotes) and encoding keywords (e.g., "name", "algorithm") and enumerating types (e.g. "DOUBLE") into single tokens. … See Appendix A.2 for further details on tokenization.” [sec(s) 4] “We train our OPTFORMER as a generative model on a given dataset and explain how to use the OPTFORMER’s parameter and function prediction abilities to implement an HPO policy.”; e.g., “trajectories” read(s) on “rollouts”.)

concatenating encodings of selected rollouts of the hyperparameter tuning session to produce a resulting encoding for the training data for the hyperparameter tuning session.
(Chen [fig(s) 1] [sec(s) 1] “We introduce a serialization scheme to convert a combination of any metadata and an optimization trajectory into text, represented as a sequence of tokens, and formulate the HPO task as a sequence modeling problem. We adopt a supervised learning approach, by learning to predict parameters and hyperparameter response functions from offline tuning data (See Fig. 1).” [sec(s) 2] “HPO aims to find a set of hyperparameters x from search space X to maximize a model performance metric, y = f(x), often referred to as a response function. … An HPO algorithm π maps the metadata and history to a distribution over hyperparameter suggestions, i.e. π(xt+1|m, ht). Using the terminology of offline RL [29], we refer to the algorithm used to generate the trajectories in a dataset as the behavior policy πb.” [sec(s) 4.1 Study tokenization] “To improve scalability, we compress the textual representation of metadata m by removing redundant phrases and punctuation (e.g., "parameter", quotes) and encoding keywords (e.g., "name", "algorithm") and enumerating types (e.g. "DOUBLE") into single tokens. … See Appendix A.2 for further details on tokenization.” [sec(s) 4] “We train our OPTFORMER as a generative model on a given dataset and explain how to use the OPTFORMER’s parameter and function prediction abilities to implement an HPO policy.” [sec(s) 4.2] “After tokenization, the converted historical sequence is as follows: 
    PNG
    media_image1.png
    109
    1568
    media_image1.png
    Greyscale
. (2)”; e.g., eq (2) read(s) on “concatenating”.)

The combination of Jaderberg, Chen is combinable with Chen for the same rationale as set forth above with respect to claim 1.

However, the combination of Jaderberg, Chen does not appear to explicitly teach:
[ranking] the encodings for a hyperparameter tuning session based on [rewards] observed in the set of rollouts for the hyperparameter tuning session; and

Memarian teaches
ranking the encodings for a hyperparameter tuning session based on rewards observed in the set of rollouts for the hyperparameter tuning session; and
(Memarian [sec(s) Abs] “We introduce Self-supervised Online Reward Shaping (SORS), which aims to improve the sample efficiency of any RL algorithm in sparse-reward environments by automatically densifying rewards.” [sec(s) I] “SORS uses the sparse reward as a self-supervised learning signal to rank the trajectories generated by the agent during learning. … Any reward function induces a total order over the trajectory space by means of the discounted return it assigns to trajectories. We then provide a theorem that indicates any two reward functions that induce the same total order over the trajectory space, induce identical sets of optimal policies under mild assumptions on the dynamics of the environment. The objective function that SORS optimizes for reward inference encourages the dense reward function to induce the same total order as the sparse reward over the trajectory space” [sec(s) II.C] “Second, our algorithm does not require a human in the loop—instead we leverage the environment’s sparse feedback to rank the collected trajectories and then use the set of ranked trajectories for inferring a dense reward function to accelerate policy learning” [sec(s) IV] “Although SORS does not use any extra information, we hypothesize that the additional reward shaping module may improves learning since (1) we can leverage a deep neural network in inferring the relevant features that may be infeasible for a human to define when writing down a reward function, and (2) SORS performs credit assignment not only when learning a value function / policy (as in standard RL), but also by inferring a new reward function from the automatically-ranked trajectories that it collects.”;)

Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the system of Jaderberg, Chen with the ranking based on rewards of Memarian.
One of ordinary skill in the art would have been motived to combine in order to significantly improve the sample efficiency of the state-of-the-art baseline algorithm and achieve comparable sample efficiency to a baseline that uses a hand-designed dense reward function.
(Memarian [sec(s) I] “Our empirical results on several sparse reward MuJoCo [14] locomotion tasks show that SORS can significantly improve the sample efficiency of the state-of-the-art baseline algorithm, namely Soft-Actor-Critic (SAC). SORS even achieves comparable sample efficiency to a baseline that uses a hand-designed dense reward function.”)

Regarding claim 12
The claim is a system claim corresponding to the method claim 6, and is directed to largely the same subject matter. Thus, it is rejected for the same reasons as given in the rejections of the method claim.

Regarding claim 19
The claim is a computer program product claim corresponding to the method claim 6, and is directed to largely the same subject matter. Thus, it is rejected for the same reasons as given in the rejections of the method claim.

Prior Art
The prior art made of record and not relied upon is considered pertinent to applicant's disclosure.
Kudo et al. (SentencePiece: A simple and language independent subword tokenizer and detokenizer for Neural Text Processing) teaches tokenization.

Conclusion
Applicant's amendment necessitated the new ground(s) of rejection presented in this Office action. Accordingly, THIS ACTION IS MADE FINAL. See MPEP § 706.07(a). Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a).
A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action. In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any nonprovisional extension fee (37 CFR 1.17(a)) pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action. In no event, however, will the statutory period for reply expire later than SIX MONTHS from the mailing date of this final action.
Any inquiry concerning this communication or earlier communications from the examiner should be directed to SEHWAN KIM whose telephone number is (571)270-7409. The examiner can normally be reached Mon - Thu 7:00 AM - 5:00 PM.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Michael J Huntley can be reached on (303) 297-4307. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.



/SEHWAN KIM/Examiner, Art Unit 2129
Read full office action
Prosecution Timeline

Jun 23, 2023
Application Filed
Feb 05, 2026
Non-Final Rejection mailed — §103
Mar 22, 2026
Interview Requested
Apr 17, 2026
Applicant Interview (Telephonic)
Apr 17, 2026
Examiner Interview Summary
Apr 22, 2026
Response Filed
May 11, 2026
Final Rejection mailed — §103 (current)
Precedent Cases

Applications granted by this same examiner with similar technology

17/046,492
Patent 12619853
DECISION-MAKING DEVICE, UNMANNED SYSTEM, DECISION-MAKING METHOD, AND PROGRAM
5y 6m to grant Granted May 05, 2026
17/899,124
Patent 12619921
PREDICTIVE FOG DATA CENTER MIGRATION
3y 8m to grant Granted May 05, 2026
18/000,845
Patent 12608592
AUTOMATED ELECTRIC SUBMERSIBLE PUMP (ESP) FAILURE ANALYSIS
3y 4m to grant Granted Apr 21, 2026
15/360,454
Patent 12602595
SYSTEM AND METHOD OF USING A KNOWLEDGE REPRESENTATION FOR FEATURES IN A MACHINE LEARNING CLASSIFIER
9y 4m to grant Granted Apr 14, 2026
16/453,380
Patent 12602580
Dataset Dependent Low Rank Decomposition Of Neural Networks
6y 9m to grant Granted Apr 14, 2026
Study what changed to get past this examiner. Based on 5 most recent grants.
Strategy Recommendation AI-generated — please review before filing

Get a prosecution strategy drawn from examiner precedents, rejection analysis, and claim mapping.
Typically takes 5-10 seconds — AI-generated, attorney review required before filing
Prosecution Projections

3-4
Expected OA Rounds
60%
Grant Probability
99%
With Interview (+65.9%)
4y 0m (~1y 1m remaining)
Median Time to Grant
Moderate
PTA Risk
Based on 146 resolved cases by this examiner. Grant probability derived from career allowance rate.