Last updated: May 29, 2026
Application No. 18/023,342
METHOD FOR A STATE ENGINEERING FOR A REINFORCEMENT LEARNING SYSTEM, COMPUTER PROGRAM PRODUCT, AND REINFORCEMENT LEARNING SYSTEM

Non-Final OA §102§103§112
Filed
Feb 24, 2023
Priority
Aug 27, 2020 — nonprovisional of PCTEP2020073948
Examiner
WU, NICHOLAS S
Art Unit
2148
Tech Center
2100 — Computer Architecture & Software
Assignee
Siemens Aktiengesellschaft
OA Round
1 (Non-Final)
This examiner grants 51% of cases after interview

— +39.5% interview lift. A telephonic interview to clarify the technical implementation could significantly improve the outcome.
Based on 43 resolved cases, 2023–2026
Examiner Intelligence

WU, NICHOLAS S View full profile →
Grants 51% of resolved cases
Career Allowance Rate
22 granted / 43 resolved
-3.8% vs TC avg
Strong +40% interview lift
Without
With
+39.5%
Interview Lift
resolved cases with interview
Typical timeline
3y 11m
Avg Prosecution
32 currently pending
Career history
Total Applications
across all art units
Statute-Specific Performance

§101
2.8%
-37.2% vs TC avg
§103
94.4%
+54.4% vs TC avg
§112
2.8%
-37.2% vs TC avg
Black line = Tech Center average estimate • Based on career data from 43 resolved cases
Office Action

§102 §103 §112
DETAILED ACTION

Notice of Pre-AIA  or AIA  Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA .

Claim Rejections - 35 USC § 112: Indefiniteness 
The following is a quotation of 35 U.S.C. 112(b):
(b)  CONCLUSION.—The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the inventor or a joint inventor regards as the invention.


The following is a quotation of 35 U.S.C. 112 (pre-AIA ), second paragraph:
The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the applicant regards as his invention.


Claims 8-12 are rejected under 35 U.S.C. 112(b) or 35 U.S.C. 112 (pre-AIA ), second paragraph, as being indefinite for failing to particularly point out and distinctly claim the subject matter which the inventor or a joint inventor (or for applications subject to pre-AIA  35 U.S.C. 112, the applicant), regards as the invention.
Regarding claim 8, the recites the limitation wherein the method is used for a manufacturing system. There is insufficient antecedent basis for this limitation in the claim because the term “the method” lacks antecedent basis.
Regarding claims 9-12, the claims are rejected for at least their dependence on claim 8. 

Claim Rejections - 35 USC § 102
The following is a quotation of the appropriate paragraphs of 35 U.S.C. 102 that form the basis for the rejections under this section made in this Office action:
A person shall be entitled to a patent unless –

(a)(1) the claimed invention was patented, described in a printed publication, or in public use, on sale, or otherwise available to the public before the effective filing date of the claimed invention.


(a)(2) the claimed invention was described in a patent issued under section 151, or in an application for patent published or deemed published under section 122(b), in which the patent or application, as the case may be, names another inventor and was effectively filed before the effective filing date of the claimed invention.

Claim 7 is rejected under 35 U.S.C. 102(a)(1) as being anticipated by Kimura, Non-Patent Literature “DAQN: Deep Auto-encoder and Q-Network” (“Kimura”).
Regarding claim 7, Kimura discloses:
A reinforcement learning system comprising: an autoencoder coupled to a reinforcement learning network (RLN), (Kimura, pg. 1 col. 2, “The contributions of this paper are to clarify the benefit of introducing the generative model into the deep reinforcement learning (especially, the auto-encoder into the deep q-network) [A reinforcement learning system comprising: an autoencoder coupled to a reinforcement learning network (RLN),]”).
the autoencoder including an encoder part and a decoder part, wherein the reinforcement learning system is configured to: train the autoencoder in a first step; (Kimura, pg. 2 col. 1, “The method first trains inputs by a deep auto-encoder for pre-training the network. The auto-encoder has encoder and decoder components [the autoencoder including an encoder part and a decoder part, wherein the reinforcement learning system is configured to: train the autoencoder in a first step;]”).
train the RLN in a second step with values representing a quality of the RLN or the training of the autoencoder; (Kimura, pg. 2 col. 1, “Next, the method removes decoder-layers of the auto-encoder network, and adds a fully-connected layer at the top of encoder-layers for discrete actions. Note that the weights of this added layer are initialized by random values. Then, the method trains the policy by deep q-network [1], [2], which is initialized by the pre-trained network parameters from previous steps [train the RLN in a second step with values representing a quality of the RLN or the training of the autoencoder;].”).
and retrain the encoder part in a third step using results of the second step. (Kimura, pg. 2 col. 1 and see Figure 2, “Next, the method removes decoder-layers of the auto-encoder network, and adds a fully-connected layer at the top of encoder-layers for discrete actions. Note that the weights of this added layer are initialized by random values. Then, the method trains the policy by deep q-network [1], [2], which is initialized by the pre-trained network parameters from previous steps; the encoder part is combined with additional layers to make the q-network, or RLN, thus updating the q-network updates the encoder as well (i.e. and retrain the encoder part in a third step using results of the second step.)”).

Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

Claims 1 and 4 are rejected under 35 U.S.C. 103 as being unpatentable over Kimura, Non-Patent Literature “DAQN: Deep Auto-encoder and Q-Network” (“Kimura”) in view of Gabel, et al., Non-Patent Literature “Scaling Adaptive Agent-Based Reactive Job-Shop Scheduling to Large-Scale Problems” (“Gabel”).
Regarding claim 1, Kimura discloses:
A method for an automatic state engineering for a reinforcement learning system, (Kimura, pg. 1 col. 1, “Also, performing training trials on the actual robot takes time and has risks of hurting the robot and environment. Thus, reducing the number of giving the reward and taking an action is necessary. Moreover, when we consider introducing into the real environment, state inputs will have a wider diversity. For example, if there is a state of “in front of a human”, there are almost infinite variations of the visual representation of the human [A method for an automatic state engineering for a reinforcement learning system,].”).
wherein an autoencoder is coupled to a reinforcement learning network (RLN), (Kimura, pg. 1 col. 2, “The contributions of this paper are to clarify the benefit of introducing the generative model into the deep reinforcement learning (especially, the auto-encoder into the deep q-network) [wherein an autoencoder is coupled to a reinforcement learning network (RLN),]”).
the autoencoder including an encoder part and a decoder part, the method comprising: training the autoencoder; (Kimura, pg. 2 col. 1, “The method first trains inputs by a deep auto-encoder for pre-training the network. The auto-encoder has encoder and decoder components [the autoencoder including an encoder part and a decoder part, the method comprising: training the autoencoder;]”).
training the RLN with values representing a quality of the RLN or the training of the RLN; (Kimura, pg. 2 col. 1, “Next, the method removes decoder-layers of the auto-encoder network, and adds a fully-connected layer at the top of encoder-layers for discrete actions. Note that the weights of this added layer are initialized by random values. Then, the method trains the policy by deep q-network [1], [2], which is initialized by the pre-trained network parameters from previous steps [training the RLN with values representing a quality of the RLN or the training of the RLN;].”).
and retraining the encoder part of the autoencoder using results of the training of the RLN, (Kimura, pg. 2 col. 1 and see Figure 2, “Next, the method removes decoder-layers of the auto-encoder network, and adds a fully-connected layer at the top of encoder-layers for discrete actions. Note that the weights of this added layer are initialized by random values. Then, the method trains the policy by deep q-network [1], [2], which is initialized by the pre-trained network parameters from previous steps; the encoder part is combined with additional layers to make the q-network, or RLN, thus updating the q-network updates the encoder as well (i.e. and retraining the encoder part of the autoencoder using results of the training of the RLN,)”).
While Kimura teaches a system that combines an autoencoder with a reinforcement learning network and a reinforcement learning agent, Kimura does not explicitly teach:
wherein manufacturing scheduling of a product is controlled by a reinforcement learning agent of the reinforcement learning network and learned by the reinforcement learning system, wherein each reinforcement learning agent of the reinforcement learning network is configured to control one product with a job specification, and wherein the method further comprises providing a value representing a suitability of each of the processing entities for an optimization goal.
Gabel teaches:
wherein the method is used for a manufacturing system, the manufacturing system including processing entities that are interconnected, (Gabel, pg. 259 col. 1, “In production scheduling, tasks have to be allocated to a limited number of resources in such a manner that one or more objectives are optimized…Here, each resource is equipped with a scheduling agent that makes the decision on which job to process next based solely on its partial view on the plant. As each agent follows its own decision policy, thus rendering a central control unnecessary [wherein the method is used for a manufacturing system, the manufacturing system including processing entities that are interconnected,]”).
wherein manufacturing scheduling of a product is controlled by a reinforcement learning agent of the reinforcement learning network and learned by the reinforcement learning system, wherein each reinforcement learning agent of the reinforcement learning network is configured to control one product with a job specification, (Gabel, pg. 259 col. 1, “In production scheduling, tasks have to be allocated to a limited number of resources in such a manner that one or more objectives are optimized…Here, each resource is equipped with a scheduling agent that makes the decision on which job to process next based solely on its partial view on the plant. As each agent follows its own decision policy, thus rendering a central control unnecessary [wherein each reinforcement learning agent of the reinforcement learning network is configured to control one product with a job specification,]…employ reinforcement learning (RL, [3]) to let the scheduling agents acquire their control policies on their own on the basis of trial and error by repeated interaction within their environment. After that learning phase, each agent will have obtained a purposive, reactive behavior for the respective environment [wherein manufacturing scheduling of a product is controlled by a reinforcement learning agent of the reinforcement learning network and learned by the reinforcement learning system,].”).
and wherein the method further comprises providing a value representing a suitability of each of the processing entities for an optimization goal. (Gabel, pg. 259 col. 1, “In production scheduling, tasks have to be allocated to a limited number of resources in such a manner that one or more objectives are optimized [and wherein the method further comprises providing a value representing a suitability of each of the processing entities for an optimization goal.]”). 
Kimura and Gabel are both in the same field of endeavor (i.e. reinforcement learning). It would have been obvious for a person having ordinary skill in the art before the effective filing date of the claimed invention to combine Kimura and Gabel to teach the above limitation(s). The motivation for doing so is that having decentralized learning agents improves the adaptability of a system due to a constantly changing environment (cf. Gabel, pg. 266 col. 1, “This way, we obtain a reactive scheduling system, where the final schedule is not calculated beforehand, viz before execution time, where online dispatching decisions are made, and where the local dispatching policies are aligned with the global optimization goal. Hence, not just the adaptation of the agents’ behavior during learning is decentralized, but also decision-making during application proceeds without a centralized control.”). 
Regarding claim 4, Kimura in view of Gabel teaches the method of claim 1. Gabel further teaches:
wherein the manufacturing scheduling is a self-learning manufacturing scheduling, and the manufacturing is a flexible manufacturing system, (Gabel, abstract, “In this work, we adopt an alternative view on scheduling problems where each resource is equipped with an adaptive agent that, independent of other agents, makes job dispatching decisions based on its local view on the plant and employs reinforcement learning to improve its dispatching strategy; having adaptive scheduling is interpreted as a self-learning scheduling and having adaptive scheduling is interpreted as a flexible manufacturing system (i.e. wherein the manufacturing scheduling is a self-learning manufacturing scheduling, and the manufacturing is a flexible manufacturing system,).”).
wherein the method further comprises: producing at least one product; and applying training on an optimization goal of the at least one product. (Gabel, pg. 1 col. 1, “In production scheduling [wherein the method further comprises: producing at least one product;], tasks have to be allocated to a limited number of resources in such a manner that one or more objectives are optimized [and applying training on an optimization goal of the at least one product.].”).
It would have been obvious to one of ordinary skill in the art before the effective filling date of the present application to combine the teachings of Gabel with the teachings of Kimura for the same reasons disclosed in claim 1.

Claim 2 is rejected under 35 U.S.C. 103 as being unpatentable over Kimura, Non-Patent Literature “DAQN: Deep Auto-encoder and Q-Network” (“Kimura”) in view of Gabel, et al., Non-Patent Literature “Scaling Adaptive Agent-Based Reactive Job-Shop Scheduling to Large-Scale Problems” (“Gabel”) and further in view of Lange, et al., Non-Patent Literature “Deep Auto-Encoder Neural Networks in Reinforcement Learning” (“Lange”).
	Regarding claim 2, Kimura in view of Gabel teaches the method of claim 1. Kimura further teaches wherein the training of the autoencoder and the training of the RLN are performed iteratively, (Kimura, pg. 2 col. 1 and see Figure 2, “Next, the method removes decoder-layers of the auto-encoder network, and adds a fully-connected layer at the top of encoder-layers for discrete actions. Note that the weights of this added layer are initialized by random values. Then, the method trains the policy by deep q-network [1], [2], which is initialized by the pre-trained network parameters from previous steps; the encoder part is first pre-trained and then combined with additional layers to make the q-network, or RLN, thus the autoencoder and RLN are trained iteratively (i.e. and retraining the encoder part of the autoencoder using results of the training of the RLN,)”).
However, the combination does not explicitly teach switching between the training of the autoencoder and the training of the RLN after a defined number of steps of training the reinforcement learning agent.
	Lange teaches switching between the training of the autoencoder and the training of the RLN after a defined number of steps of training the reinforcement learning agent. (Lange, pg. 3 col. 2 and see Figure 2, “For an optimal learning curve, the auto-encoder must be re-trained whenever new data is available. But since the expected improvement of the feature space given just a few more observations is rather limited—as a fair compromise between optimal feature spaces and computing time—the re-training of the auto-encoder is only triggered with every doubling of the number of collected observations; Figure 2 shows that the training of the encoder and the RLN follow an iterative approach (i.e. switching between the training of the autoencoder and the training of the RLN after a defined number of steps of training the reinforcement learning agent.).”).
Kimura, in view of Gabel, and Lange are both in the same field of endeavor (i.e. reinforcement learning). It would have been obvious for a person having ordinary skill in the art before the effective filing date of the claimed invention to combine Kimura, in view of Gabel, and Lange to teach the above limitation(s). The motivation for doing so is that retraining the encoder after the initial training helps keep the encoder updated to changes in the environment (cf. Lange, pg. 3 col. 2, “For an optimal learning curve, the auto-encoder must be re-trained whenever new data is available.”).

Claim 3 is rejected under 35 U.S.C. 103 as being unpatentable over Kimura, Non-Patent Literature “DAQN: Deep Auto-encoder and Q-Network” (“Kimura”) in view of Gabel, et al., Non-Patent Literature “Scaling Adaptive Agent-Based Reactive Job-Shop Scheduling to Large-Scale Problems” (“Gabel”) and further in view of Hsaio, et al., Non-Patent Literature “Learning a Multi-Modal Policy via Imitating Demonstrations with Mixed Behaviors” (“Hsaio”).
	Regarding claim 3, Kimura in view of Gabel teaches the method of claim 1. Gabel further teaches wherein there are at least two reinforcement learning agent instantiations of the RLN, wherein each reinforcement learning agent of the at least two reinforcement learning agent instantiations has an optimization goal for the training of the reinforcement learning agent, (Gabel, pg. 259 col. 1, “In production scheduling, tasks have to be allocated to a limited number of resources in such a manner that one or more objectives are optimized…Here, each resource is equipped with a scheduling agent that makes the decision on which job to process next based solely on its partial view on the plant [wherein there are at least two reinforcement learning agent instantiations of the RLN,]. As each agent follows its own decision policy, thus rendering a central control unnecessary…employ reinforcement learning (RL, [3]) to let the scheduling agents acquire their control policies on their own on the basis of trial and error by repeated interaction within their environment. After that learning phase, each agent will have obtained a purposive, reactive behavior for the respective environment [wherein each reinforcement learning agent of the at least two reinforcement learning agent instantiations has an optimization goal for the training of the reinforcement learning agent,].”).
It would have been obvious to one of ordinary skill in the art before the effective filling date of the present application to combine the teachings of Gabel with the teachings of Kimura for the same reasons disclosed in claim 1.
	While the combination teaches separate reinforcement learning agents with individual optimization goals, the combination does not explicitly teach an autoencoder with conditional information: wherein the autoencoder uses condition information about the reinforcement learning agent that separates the encoding of the respective optimization goal of the reinforcement learning agent.
Hsaio teaches wherein the autoencoder uses condition information about the reinforcement learning agent that separates the encoding of the respective optimization goal of the reinforcement learning agent. (Hsaio, pg. 2, “we propose an approach based on the variational autoencoder with a categorical latent variable that jointly learns an encoder and a decoder. The encoder infers discrete latent factors corresponding to different behaviors from demonstrations [wherein the autoencoder uses condition information about the reinforcement learning agent]…We propose to use the categorical latent variable to learn the multi-modal policy for two reasons. First, demonstrations with mixed behaviors are inherently discrete in many cases since typically there exist salient differences between behaviors…Second, using the categorical latent variable makes the learned policy more controllable. The categorical latent variable can discover the salient variations in the data and result in simple representations of different behaviors, namely the categories. As a result, each category corresponds to a specific behavior and the learned policy can be controlled to reproduce a behavior by simply conditioning on a categorical vector [that separates the encoding of the respective optimization goal of the reinforcement learning agent.].”).
Kimura, in view of Gabel, and Hsaio are both in the same field of endeavor (i.e. autoencoders). It would have been obvious for a person having ordinary skill in the art before the effective filing date of the claimed invention to combine Kimura, in view of Gabel, and Hsaio to teach the above limitation(s). The motivation for doing so is that using a conditional variable enables an autoencoder to generate more stable class specific encodings (cf. Hsaio, pg. 2, “using the categorical latent variable makes the learned policy more controllable. The categorical latent variable can discover the salient variations in the data and result in simple representations of different behaviors, namely the categories. As a result, each category corresponds to a specific behavior and the learned policy can be controlled to reproduce a behavior by simply conditioning on a categorical vector.”).

Claim 5 is rejected under 35 U.S.C. 103 as being unpatentable over Kimura, Non-Patent Literature “DAQN: Deep Auto-encoder and Q-Network” (“Kimura”) in view of Gabel, et al., Non-Patent Literature “Scaling Adaptive Agent-Based Reactive Job-Shop Scheduling to Large-Scale Problems” (“Gabel”) and further in view of Wiehe, et al., Non-Patent Literature “Sampled Policy Gradient for Learning to Play the Game Agar.io” (“Wiehe”).
	Regarding claim 5, Kimura in view of Gabel teaches the method of claim 1. Kimura further teaches:
wherein the reinforcement learning agent is a Deep Q-Network DQN-Agent (Kimura, pg. 1 col. 1, “The deep reinforcement learning algorithms, such as deep q-network method [1], [2], deep deterministic policy gradients [3], and asynchronous advantage actor-critic [4], are the suitable methods for implementing interactive robot (agent) [wherein the reinforcement learning agent is a Deep Q-Network DQN-Agent].”).
and gradients that result from each calculation of a reinforcement learning agent network update are then passed to the encoder part of the autoencoder to update weights of the encoder part (Kimura, pg. 2 col. 1 and see Figure 2, “Therefore, the loss function of the deep q-network is, L(θ) = E(st,at,rt+1,st+1)∼D[(y − Q(st, at; θ))2]  (4) y = { rt+1 terminal rt+1 + γ maxa Q(st+1, a; θ −) non-terminal, (5) where θ is the parameters of deep q-network, D is an experience replay memory [11], Q(st, at; θ) will be calculated by the deep structure, and γ is a discount factor. θ − is the weights that are updated fixed duration; this technique was also used in the original deep q-network method [1]; Figure 2 shows that the q-network includes the encoder part thus updating the weights of the q-network using a loss function is interpreted as using gradients to update the encoder (i.e. and gradients that result from each calculation of a reinforcement learning agent network update are then passed to the encoder part of the autoencoder to update weights of the encoder part).”). 
However, the combination does not explicitly teach using a Sampled Policy Gradient algorithm.
Wiehe teaches using a Sampled Policy Gradient algorithm. (Wiehe, abstract, “Sampled Policy Gradient (SPG). SPG samples in the action space to calculate an approximated policy gradient by using the critic to evaluate the samples [using a Sampled Policy Gradient algorithm.].”).
Kimura, in view of Gabel, and Wiehe are both in the same field of endeavor (i.e. reinforcement learning). It would have been obvious for a person having ordinary skill in the art before the effective filing date of the claimed invention to combine Kimura, in view of Gabel, and Wiehe to teach the above limitation(s). The motivation for doing so is that using SPG improves a reinforcement learning model’s ability to search the action space (cf. Wiehe, abstract, “This sampling allows SPG to search the action-Q-value space more globally than deterministic policy gradient (DPG), enabling it to theoretically avoid more local optima.”).

Claims 8 and 11 is rejected under 35 U.S.C. 103 as being unpatentable over Kimura, Non-Patent Literature “DAQN: Deep Auto-encoder and Q-Network” (“Kimura”) in view of Gabel, et al., Non-Patent Literature “Scaling Adaptive Agent-Based Reactive Job-Shop Scheduling to Large-Scale Problems” (“Gabel”) and further in view of Ketz, et al., US Pre-Grant Publication 2020/0134426A1 (“Ketz”).
Regarding claim 8, the claim is similar to claim 1 and Kimura in view of Gabel teaches the similar limitations found in claim 1. However, the combination does not explicitly teach the limitations In a non-transitory computer-readable storage medium that stores instructions executable by one or more processors. 
Ketz teaches In a non-transitory computer-readable storage medium that stores instructions executable by one or more processors (Ketz, ⁋14, “a non-transitory computer-readable storage medium having software instructions stored therein, which, when executed by a processor, cause the processor to [In a non-transitory computer-readable storage medium that stores instructions executable by one or more processors]”).
Kimura, in view of Gabel and Ketz are both in the same field of endeavor (i.e. reinforcement learning). It would have been obvious for a person having ordinary skill in the art before the effective filing date of the claimed invention to combine Kimura, in view of Gabel and Ketz to teach the above limitation(s). The motivation for doing so is that a computer and its components are required in order to run a machine learning process.
Regarding claim 11, the claim is similar to claim 4 and rejected under the same rationales.

Claim 9 is rejected under 35 U.S.C. 103 as being unpatentable over Kimura, Non-Patent Literature “DAQN: Deep Auto-encoder and Q-Network” (“Kimura”) in view of Gabel, et al., Non-Patent Literature “Scaling Adaptive Agent-Based Reactive Job-Shop Scheduling to Large-Scale Problems” (“Gabel”) and further in view of Ketz, et al., US Pre-Grant Publication 2020/0134426A1 (“Ketz”) and Lange, et al., Non-Patent Literature “Deep Auto-Encoder Neural Networks in Reinforcement Learning” (“Lange”).
	Regarding claim 9, Kimura in view of Gabel and Ketz teaches the non-transitory computer-readable storage medium of claim 8. Kimura further teaches wherein the training of the autoencoder and the training of the RLN are performed iteratively, (Kimura, pg. 2 col. 1 and see Figure 2, “Next, the method removes decoder-layers of the auto-encoder network, and adds a fully-connected layer at the top of encoder-layers for discrete actions. Note that the weights of this added layer are initialized by random values. Then, the method trains the policy by deep q-network [1], [2], which is initialized by the pre-trained network parameters from previous steps; the encoder part is first pre-trained and then combined with additional layers to make the q-network, or RLN, thus the autoencoder and RLN are trained iteratively (i.e. and retraining the encoder part of the autoencoder using results of the training of the RLN,)”).
However, the combination does not explicitly teach switching between the training of the autoencoder and the training of the RLN after a defined number of steps of training the reinforcement learning agent.
	Lange teaches switching between the training of the autoencoder and the training of the RLN after a defined number of steps of training the reinforcement learning agent. (Lange, pg. 3 col. 2 and see Figure 2, “For an optimal learning curve, the auto-encoder must be re-trained whenever new data is available. But since the expected improvement of the feature space given just a few more observations is rather limited—as a fair compromise between optimal feature spaces and computing time—the re-training of the auto-encoder is only triggered with every doubling of the number of collected observations; Figure 2 shows that the training of the encoder and the RLN follow an iterative approach (i.e. switching between the training of the autoencoder and the training of the RLN after a defined number of steps of training the reinforcement learning agent.).”).
Kimura, in view of Gabel and Ketz, and Lange are both in the same field of endeavor (i.e. reinforcement learning). It would have been obvious for a person having ordinary skill in the art before the effective filing date of the claimed invention to combine Kimura, in view of Gabel and Ketz, and Lange to teach the above limitation(s). The motivation for doing so is that retraining the encoder after the initial training helps keep the encoder updated to changes in the environment (cf. Lange, pg. 3 col. 2, “For an optimal learning curve, the auto-encoder must be re-trained whenever new data is available.”).

Claim 10 is rejected under 35 U.S.C. 103 as being unpatentable over Kimura, Non-Patent Literature “DAQN: Deep Auto-encoder and Q-Network” (“Kimura”) in view of Gabel, et al., Non-Patent Literature “Scaling Adaptive Agent-Based Reactive Job-Shop Scheduling to Large-Scale Problems” (“Gabel”) and further in view of Ketz, et al., US Pre-Grant Publication 2020/0134426A1 (“Ketz”) and Hsaio, et al., Non-Patent Literature “Learning a Multi-Modal Policy via Imitating Demonstrations with Mixed Behaviors” (“Hsaio”).
Regarding claim 10, Kimura in view of Gabel and Ketz teaches the non-transitory computer-readable storage medium of claim 8. Gabel further teaches wherein there are at least two reinforcement learning agent instantiations of the RLN, wherein each reinforcement learning agent of the at least two reinforcement learning agent instantiations has an optimization goal for the training of the reinforcement learning agent, (Gabel, pg. 259 col. 1, “In production scheduling, tasks have to be allocated to a limited number of resources in such a manner that one or more objectives are optimized…Here, each resource is equipped with a scheduling agent that makes the decision on which job to process next based solely on its partial view on the plant [wherein there are at least two reinforcement learning agent instantiations of the RLN,]. As each agent follows its own decision policy, thus rendering a central control unnecessary…employ reinforcement learning (RL, [3]) to let the scheduling agents acquire their control policies on their own on the basis of trial and error by repeated interaction within their environment. After that learning phase, each agent will have obtained a purposive, reactive behavior for the respective environment [wherein each reinforcement learning agent of the at least two reinforcement learning agent instantiations has an optimization goal for the training of the reinforcement learning agent,].”).
It would have been obvious to one of ordinary skill in the art before the effective filling date of the present application to combine the teachings of Gabel with the teachings of Kimura and Ketz for the same reasons disclosed in claim 8.
	While the combination teaches separate reinforcement learning agents with individual optimization goals, the combination does not explicitly teach an autoencoder with conditional information: wherein the autoencoder uses condition information about the reinforcement learning agent that separates the encoding of the respective optimization goal of the reinforcement learning agent.
Hsaio teaches wherein the autoencoder uses condition information about the reinforcement learning agent that separates the encoding of the respective optimization goal of the reinforcement learning agent. (Hsaio, pg. 2, “we propose an approach based on the variational autoencoder with a categorical latent variable that jointly learns an encoder and a decoder. The encoder infers discrete latent factors corresponding to different behaviors from demonstrations [wherein the autoencoder uses condition information about the reinforcement learning agent]…We propose to use the categorical latent variable to learn the multi-modal policy for two reasons. First, demonstrations with mixed behaviors are inherently discrete in many cases since typically there exist salient differences between behaviors…Second, using the categorical latent variable makes the learned policy more controllable. The categorical latent variable can discover the salient variations in the data and result in simple representations of different behaviors, namely the categories. As a result, each category corresponds to a specific behavior and the learned policy can be controlled to reproduce a behavior by simply conditioning on a categorical vector [that separates the encoding of the respective optimization goal of the reinforcement learning agent.].”).
Kimura, in view of Gabel and Ketz, and Hsaio are both in the same field of endeavor (i.e. autoencoders). It would have been obvious for a person having ordinary skill in the art before the effective filing date of the claimed invention to combine Kimura, in view of Gabel and Ketz, and Hsaio to teach the above limitation(s). The motivation for doing so is that using a conditional variable enables an autoencoder to generate more stable class specific encodings (cf. Hsaio, pg. 2, “using the categorical latent variable makes the learned policy more controllable. The categorical latent variable can discover the salient variations in the data and result in simple representations of different behaviors, namely the categories. As a result, each category corresponds to a specific behavior and the learned policy can be controlled to reproduce a behavior by simply conditioning on a categorical vector.”).

Claim 12 is rejected under 35 U.S.C. 103 as being unpatentable over Kimura, Non-Patent Literature “DAQN: Deep Auto-encoder and Q-Network” (“Kimura”) in view of Gabel, et al., Non-Patent Literature “Scaling Adaptive Agent-Based Reactive Job-Shop Scheduling to Large-Scale Problems” (“Gabel”) and further in view of Ketz, et al., US Pre-Grant Publication 2020/0134426A1 (“Ketz”) and Wiehe, et al., Non-Patent Literature “Sampled Policy Gradient for Learning to Play the Game Agar.io” (“Wiehe”).
Regarding claim 12, Kimura in view of Gabel and Ketz teaches the non-transitory computer-readable storage medium of claim 8. Kimura further teaches:
wherein the reinforcement learning agent is a Deep Q-Network DQN-Agent (Kimura, pg. 1 col. 1, “The deep reinforcement learning algorithms, such as deep q-network method [1], [2], deep deterministic policy gradients [3], and asynchronous advantage actor-critic [4], are the suitable methods for implementing interactive robot (agent) [wherein the reinforcement learning agent is a Deep Q-Network DQN-Agent].”).
and gradients that result from each calculation of a reinforcement learning agent network update are then passed to the encoder part of the autoencoder to update weights of the encoder part (Kimura, pg. 2 col. 1 and see Figure 2, “Therefore, the loss function of the deep q-network is, L(θ) = E(st,at,rt+1,st+1)∼D[(y − Q(st, at; θ))2]  (4) y = { rt+1 terminal rt+1 + γ maxa Q(st+1, a; θ −) non-terminal, (5) where θ is the parameters of deep q-network, D is an experience replay memory [11], Q(st, at; θ) will be calculated by the deep structure, and γ is a discount factor. θ − is the weights that are updated fixed duration; this technique was also used in the original deep q-network method [1]; Figure 2 shows that the q-network includes the encoder part thus updating the weights of the q-network using a loss function is interpreted as using gradients to update the encoder (i.e. and gradients that result from each calculation of a reinforcement learning agent network update are then passed to the encoder part of the autoencoder to update weights of the encoder part).”). 
However, the combination does not explicitly teach using a Sampled Policy Gradient algorithm.
Wiehe teaches using a Sampled Policy Gradient algorithm. (Wiehe, abstract, “Sampled Policy Gradient (SPG). SPG samples in the action space to calculate an approximated policy gradient by using the critic to evaluate the samples [using a Sampled Policy Gradient algorithm.].”).
Kimura, in view of Gabel and Ketz, and Wiehe are both in the same field of endeavor (i.e. reinforcement learning). It would have been obvious for a person having ordinary skill in the art before the effective filing date of the claimed invention to combine Kimura, in view of Gabel and Ketz, and Wiehe to teach the above limitation(s). The motivation for doing so is that using SPG improves a reinforcement learning model’s ability to search the action space (cf. Wiehe, abstract, “This sampling allows SPG to search the action-Q-value space more globally than deterministic policy gradient (DPG), enabling it to theoretically avoid more local optima.”).

Conclusion
Any inquiry concerning this communication or earlier communications from the examiner should be directed to NICHOLAS S WU whose telephone number is (571)270-0939. The examiner can normally be reached Monday - Friday 8:00 am - 4:00 pm EST.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Michelle Bechtold can be reached at 571-431-0762. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.





/N.S.W./Examiner, Art Unit 2148                                                                                                                                                                                                        /MICHELLE T BECHTOLD/Supervisory Patent Examiner, Art Unit 2148
Read full office action
Prosecution Timeline

Feb 24, 2023
Application Filed
Jan 30, 2026
Non-Final Rejection mailed — §102, §103, §112 (current)
Precedent Cases

Applications granted by this same examiner with similar technology

17/231,514
Patent 12619880
METHODS, DEVICES AND MEDIA FOR RE-WEIGHTING TO IMPROVE KNOWLEDGE DISTILLATION
5y 0m to grant Granted May 05, 2026
18/882,311
Patent 12488244
APPARATUS AND METHOD FOR DATA GENERATION FOR USER ENGAGEMENT
1y 2m to grant Granted Dec 02, 2025
17/444,687
Patent 12423576
METHOD AND APPARATUS FOR UPDATING PARAMETER OF MULTI-TASK MODEL, AND STORAGE MEDIUM
4y 1m to grant Granted Sep 23, 2025
17/265,476
Patent 12361280
METHOD AND DEVICE FOR TRAINING A MACHINE LEARNING ROUTINE FOR CONTROLLING A TECHNICAL SYSTEM
4y 5m to grant Granted Jul 15, 2025
17/191,518
Patent 12354017
ALIGNING KNOWLEDGE GRAPHS USING SUBGRAPH TYPING
4y 4m to grant Granted Jul 08, 2025
Study what changed to get past this examiner. Based on 5 most recent grants.
Strategy Recommendation AI-generated — please review before filing

Get a prosecution strategy drawn from examiner precedents, rejection analysis, and claim mapping.
Typically takes 5-10 seconds — AI-generated, attorney review required before filing
Prosecution Projections

1-2
Expected OA Rounds
51%
Grant Probability
91%
With Interview (+39.5%)
3y 11m (~8m remaining)
Median Time to Grant
Low
PTA Risk
Based on 43 resolved cases by this examiner. Grant probability derived from career allowance rate.