DETAILED ACTION
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Response to Amendment
This action is in response to the submission filed 13 August 2025 for application 16/176,903. Claims 1, 10, and 19 are amended. Claims 1-20 are currently pending and have been examined.
The objection to claim 1 has been withdrawn in view of the amendments made.
Response to Arguments
Regarding Applicant’s arguments filed 13 August 2025 (pages 8-10) with respect to the 35 U.S.C. § 103 rejection: Applicant argues that claims 1, 10, and 19 have been amended to recite "wherein the adjusting and the applying the training information to obtain the action are performed in a parallel operation, in which the applying the training information occurs for first set of newly generated weights and the adjusting occurs for a set of weights generated immediately prior to the first set of newly generated weights," that support for the amendments can be found at least at paragraph [0042], and that the cited references do not teach or suggest this feature.
Page 6 of the Office Action indicates that Babaeizadeh does not teach that the adjusting and the applying are performed in parallel. Page 7 cites to Nair as teaching some features, but not the "in parallel" feature described above. Applicants have searched Nair and submit that Nair does not teach this feature. Page 8 of the Office Action cites to Dean as teaching the "in parallel" feature. Applicants submit, however, that Dean does not teach the amended version of this feature, which includes additional detail.
In particular, the operations described as being performed "in parallel" in Dean are very generalized. Page 3, section 3, paragraph 1 states that "DistBelief," the software framework, allows for "distributed computation in neural networks...." User-defined computations take place in each node, and each node can be distributed to a different machine. The operations are generally described as being "parallelized." Importantly, however, Dean does not disclose the specific operations recited in claims 1, 10, and 19 provided above; specifically, that "the adjusting and the applying the training information to obtain the action are performed in a parallel operation, in which the applying the training information occurs for first set of newly generated weights and the adjusting occurs for a set of weights generated immediately prior to the first set of newly generated weights." For the foregoing reasons, Applicant submits that the cited references do not teach each and every feature of claims 1, 10, and 19. Thus, Applicant requests withdrawal of the rejections of claims 1, 10, and 19 and all claims dependent thereon.
Examiner’s Response: Applicant’s arguments with respect to the feature “wherein the adjusting and the applying the training information to obtain the action are performed in a parallel operation, in which the applying the training information occurs for first set of newly generated weights and the adjusting occurs for a set of weights generated immediately prior to the first set of newly generated weights,” as recited in independent claim 1 (and similarly in independent claims 10 and 19), have been fully considered but are moot because the new ground of rejection (citing new reference Atiya et al. (New Results on Recurrent Network Training: Unifying the Algorithms and Accelerating Convergence, 2000) for teaching the new limitation) does not rely on any reference combination applied in the prior rejection of record for any teaching or matter specifically challenged in the argument.
Claim Objections
Claim 1 is objected to because of the following informalities: The phrase “… and and …” (lines 11 and 13) is redundant. Appropriate correction is required.
Claim 1 is objected to because of the following informalities: The phrase “… for first set of newly generated weights …” (line 18) is awkward. Appropriate correction is required.
Claims 1, 10, and 19 are objected to because of the following informalities: The phrase “… the applying the training information …” (last limitation) is awkward. Appropriate correction is required.
Claim Rejections - 35 USC § 103
In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.
The factual inquiries for establishing a background for determining obviousness under 35 U.S.C. 103 are summarized as follows:
1. Determining the scope and contents of the prior art.
2. Ascertaining the differences between the prior art and the claims at issue.
3. Resolving the level of ordinary skill in the pertinent art.
4. Considering objective evidence present in the application indicating obviousness or nonobviousness.
This application currently names joint inventors. In considering patentability of the claims the examiner presumes that the subject matter of the various claims was commonly owned as of the effective filing date of the claimed invention(s) absent any evidence to the contrary. Applicant is advised of the obligation under 37 CFR 1.56 to point out the inventor and effective filing dates of each claim that was not commonly owned as of the effective filing date of the later invention in order for the examiner to consider the applicability of 35 U.S.C. 102(b)(2)(C) for any potential 35 U.S.C. 102(a)(2) prior art against the later invention.
Claims 1-20 are rejected under 35 U.S.C. 103 as being unpatentable over Babaeizadeh et al. (REINFORCEMENT LEARNING THROUGH ASYNCHRONOUS ADVANTAGE ACTOR-CRITIC ON A GPU, 2017) in view of Nair et al. (Massively Parallel Methods for Deep Reinforcement Learning, 2015) and further in view of Atiya et al. (New Results on Recurrent Network Training: Unifying the Algorithms and Accelerating Convergence, 2000).
Regarding claim 1
Babaeizadeh teaches: A method for training a neural network, the method comprising ([Page 1, Section 1, Paragraph 2] The DNN model is constantly queried to guide the actions of agents whose gameplay in turn feeds DNN training):
applying training information, by one or more training cores to a target neural network having weights stored in a target network weight memory ([Page 3, Section 3.2, Paragraph 1] the central server propagates new weights to the agents. [Page 10, Paragraph 3] Each time a trainer updates the DNN weights. Note: Also see Figure 1(b) where bottom right green square corresponds to weights stored in a target network weight memory and top left green square corresponds to training cores);
applying the training information, by one or more inference cores that have a different architecture than the one or more training cores, to a prediction neural network having weights stored in a prediction network weights memory to obtain an action to be performed in a simulated environment ([Page 3, Section 3.2, Paragraph 1] the central server propagates new weights to the agents. [Page 10, Paragraph 3] Each time a trainer updates the DNN weights. Note: Also see Figure 1(b) where top right green square corresponds to weights stored in a prediction network weights memory, the GPU on the right (pink rectangle) corresponds to inference cores, top left green square corresponds to training cores, and bottom right green square corresponds to target network weights memory. Bottom left square shows obtaining an action to be performed in a simulator corresponding to a simulated environment);
applying the action to the simulated environment to obtain resulting information from the simulated environment ([Page 4] Figure 1(b). Note: Bottom left square shows applying the action to a simulator corresponding to a simulated environment to obtain resulting information);
However, Babaeizadeh is not relied upon to teach: adjusting, by the one or more training cores, the weights of the prediction neural network based on the resulting information and the action, wherein the adjusting and the applying the training information to obtain the action are performed in a parallel operation, in which the applying the training information occurs for first set of newly generated weights and the adjusting occurs for a set of weights generated immediately prior to the first set of newly generated weights.
Nair teaches, in an analogous system: adjusting, by the one or more training cores, the weights of the prediction neural network based on the resulting information and the action ([Page 3, Column 1, Paragraph 1] In the reinforcement learning (RL) paradigm, the agent interacts sequentially with an environment, with the goal of maximising cumulative rewards. At each step t the agent observes state st, selects an action at, and receives a reward rt. The agent’s policy π(a|s) maps states to actions and defines its behavior. [Page 3, Column 1, Last Paragraph] One of the core ideas behind reinforcement learning is to represent the action-value function using a function approximator such as a neural network, Q(s, a) ≈ Q(s, a; θ). The parameters θ of the so-called Q-network are optimized so as to approximately solve the Bellman equation. For example, the Q-learning algorithm iteratively updates the action-value function Q(s, a; θ) towards a sample of the Bellman target, r + γ max a′ Q(s′, a′; θ⁻). [Page 3, Section 3.3, Paragraph 2] Second, DQN maintains two separate Q-networks Q(s, a; θ) and Q(s, a; θ⁻) with current parameters θ and old parameters θ⁻, respectively. The current parameters θ may be updated many times per time-step, and are copied into the old parameters θ⁻ after N iterations. At every update iteration i the current parameters θ are updated so as to minimise the mean-squared Bellman error with respect to the old parameters θ⁻, by optimizing the following loss function (DQN Loss). [Page 3, Section 3.3, Paragraph 3] Specifically, θ is adjusted. Note: See Algorithm 1. Also see Figure 2 where Target Q network corresponds to the target neural network. Q network in the Actor corresponds to the prediction neural network).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the method for training a prediction neural network of Babaeizadeh to incorporate the teachings of Nair to adjust, by the one or more training cores, the weights of the prediction neural network based on the resulting information and the action. One would have been motivated to do this modification because doing so would give the benefit of current parameters being updated many times per time-step as taught by Nair [Page 3, Section 3.3, Paragraph 2].
Atiya teaches, in an analogous system: wherein the adjusting and the applying the training information to obtain the action are performed in a parallel operation, in which the applying the training information occurs for first set of newly generated weights and the adjusting occurs for a set of weights generated immediately prior to the first set of newly generated weights ([Page 705, Column 1, Section VI, Paragraphs 1 and 2] the following two methods: 1) the BTT approach, which is the fastest of the existing gradient-descent-based algorithms, and 2) the BTT approach [21], which is a very efficient accelerated technique for recurrent networks. The BTT(h, h') is summarized as follows. 1) Run the network for h steps. 2) Propagate backwards for h' steps (h' > h), and update the weights. 3) Run the network for the next h steps, then propagate backwards h' steps, and update the weights. Continue in a similar manner till the end of the data, and then repeat another cycle. [Page 705, Column 2, Paragraph 3] In the first trial we tune the parameter values for each of the methods (all runs start from the same initial weights). The way we have done the comparison is to perform the following for each method. Several runs each with different parameter values (learning rate, etc.) are performed. We then choose the parameter values that lead to fastest convergence. We then fix the parameters on these values, and run ten more trials each with different initial weights. For a particular trial we fix the initial weight configuration across the five methods to make the comparison fair. We record the number of iterations needed to reach particular error levels, and obtain the average for each method. We note that for recurrent networks it is always better to start with small weights, because if we have long sequences for large initial weights the states tend to wander off into the saturation region.
We have generated the initial weights always in the range from −0.2 to 0.2. We have trained all methods for a maximum of 10 000 iterations. If the method did not reach the prespecified error levels by then, then we declare that it failed to converge on this particular trial. Note: Generating the initial weights always in the range from −0.2 to 0.2 corresponds to adjusting the weights generated immediately prior to the first set of newly generated weights. Running the network for the next h steps, then propagating backwards h' steps, and updating the weights corresponds to the applying the training information occurring for the first set of newly generated weights. Running ten more trials each with different initial weights shows the adjusting of the weights generated immediately prior to the first set of newly generated weights in each trial, where each trial corresponds to applying the training information, thereby performing the applying and the adjusting in a parallel operation.).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the combined teachings of Babaeizadeh and Nair to incorporate the teachings of Atiya wherein the adjusting and the applying the training information to obtain the action are performed in a parallel operation, in which the applying the training information occurs for first set of newly generated weights and the adjusting occurs for a set of weights generated immediately prior to the first set of newly generated weights. One would have been motivated to do this modification because doing so would give the benefit of two methods which are fast and efficient as taught by Atiya [Page 705, Column 1, Section VI, Paragraphs 1 and 2].
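For illustration only (not part of the rejection, and not drawn from Babaeizadeh, Nair, or Atiya), the claimed parallel operation — inference running on the newest weights while training adjusts the immediately prior generation of weights — can be sketched as follows. All function and variable names are the editor's invention:

```python
import threading

def pipeline_step(new_weights, prior_weights, apply_fn, adjust_fn):
    """Run inference on the newest weights while, concurrently, the
    trainer adjusts the weights generated immediately before them."""
    results = {}

    def infer():
        # "the applying the training information occurs for [the] first
        # set of newly generated weights"
        results["action"] = apply_fn(new_weights)

    def train():
        # "the adjusting occurs for a set of weights generated
        # immediately prior to the first set of newly generated weights"
        results["adjusted"] = adjust_fn(prior_weights)

    t1 = threading.Thread(target=infer)
    t2 = threading.Thread(target=train)
    t1.start(); t2.start()
    t1.join(); t2.join()
    return results["action"], results["adjusted"]
```

In this sketch the inference and training workloads run on separate threads; on the hypothetical hardware of the claims they would instead run on inference cores and training cores of different architectures.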
Regarding claim 2
The system of Babaeizadeh, Nair, and Atiya teaches: The method of claim 1 (as shown above).
Nair further teaches: wherein adjusting the weights of the prediction neural network includes: sampling, by one or more training cores, one or more tuples from a replay memory, where each tuple includes a state sj, an action aj, a reward for the action rj, and a subsequent state sj+1 ([Page 4, Column 1, Last Paragraph] For each learner update k, a minibatch of experience tuples e = (s, a, r, s') is sampled from either a local or global experience replay memory D (see above)).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the combined teachings of Babaeizadeh and Atiya to incorporate the teachings of Nair wherein adjusting the weights of the prediction neural network includes: sampling, by one or more training cores, one or more tuples from a replay memory, where each tuple includes a state sj, an action aj, a reward for the action rj, and a subsequent state sj+1. One would have been motivated to do this modification because doing so would give the benefit of current parameters being updated many times per time-step as taught by Nair [Page 3, Section 3.3, Paragraph 2].
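For illustration only, the claimed replay-memory sampling of (state, action, reward, next-state) tuples can be sketched as below. This is an editor's hypothetical, not code from Nair; the class and method names are invented:

```python
import random
from collections import deque

class ReplayMemory:
    """Fixed-capacity store of (sj, aj, rj, sj+1) experience tuples."""

    def __init__(self, capacity, seed=0):
        # deque(maxlen=...) discards the oldest tuples once full
        self.buffer = deque(maxlen=capacity)
        self.rng = random.Random(seed)

    def add(self, state, action, reward, next_state):
        self.buffer.append((state, action, reward, next_state))

    def sample(self, batch_size):
        # A trainer samples a minibatch of tuples for each update,
        # without replacement within the minibatch.
        return self.rng.sample(list(self.buffer), batch_size)
```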
Regarding claim 3
The system of Babaeizadeh, Nair, and Atiya teaches: The method of claim 2 (as shown above).
Nair further teaches: wherein adjusting the weights of the prediction neural network further includes: applying, by the one or more training cores, state sj+1 to the target neural network and obtaining a highest action score output from the target artificial neural network ([Page 3, Column 1, Paragraph 2] The action-value function Qπ(s, a) is the expected return after observing state st and taking an action under a policy π, Qπ(s, a) = E[Rt | st = s, at = a, π], and the optimal action-value function is the maximum possible value that can be achieved by any policy, Q*(s, a) = max π Qπ(s, a). [Page 5, Column 1, Algorithm 1] Execute the action in the environment and observe the reward rt and the next state st+1. Note: Maximum possible value corresponds to highest action score. Also, note in Algorithm 1 the 'for loop' from t=1 to T).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the combined teachings of Babaeizadeh and Atiya to incorporate the teachings of Nair wherein adjusting the weights of the prediction neural network further includes: applying, by the one or more training cores, state sj+1 to the target neural network and obtaining a highest action score output from the target artificial neural network. One would have been motivated to do this modification because doing so would give the benefit of current parameters being updated many times per time-step as taught by Nair [Page 3, Section 3.3, Paragraph 2].
Regarding claim 4
The system of Babaeizadeh, Nair, and Atiya teaches: The method of claim 3 (as shown above).
Nair further teaches: wherein adjusting the weights of the prediction neural network further includes: applying, by the one or more training cores, state sj to the prediction artificial neural network to obtain an action score for action aj ([Page 3, Column 1, Paragraph 2] The action-value function Qπ(s, a) is the expected return after observing state st and taking an action under a policy π, Qπ(s, a) = E[Rt | st = s, at = a, π]. [Page 5, Column 1, Algorithm 1] Initialise the training network for the action-value function Q(s, a; θ) with weights θ and target network Q(s, a; θ⁻) with weights θ⁻ = θ. Note: Action value corresponds to action score).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the combined teachings of Babaeizadeh and Atiya to incorporate the teachings of Nair wherein adjusting the weights of the prediction neural network further includes: applying, by the one or more training cores, state sj to the prediction artificial neural network to obtain an action score for action aj. One would have been motivated to do this modification because doing so would give the benefit of current parameters being updated many times per time-step as taught by Nair [Page 3, Section 3.3, Paragraph 2].
Regarding claim 5
The system of Babaeizadeh, Nair, and Atiya teaches: The method of claim 4 (as shown above).
Nair further teaches: wherein adjusting the weights of the prediction neural network further includes: determining, by the one or more training cores, a loss function based on the highest action score output by the target neural network for state sj+1, the action score for action aj output by the prediction neural network, and a reward score rj ([Page 5, Column 1, Algorithm 1] With probability ε take a random action at, or else at = argmax a Q(st, a; θ). Calculate the loss Lt = (yt − Q(st, at; θ))². Note: Also, see Figure 2. DQN Loss corresponds to loss function).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the combined teachings of Babaeizadeh and Atiya to incorporate the teachings of Nair wherein adjusting the weights of the prediction neural network further includes: determining, by the one or more training cores, a loss function based on the highest action score output by the target neural network for state sj+1, the action score for action aj output by the prediction neural network, and a reward score rj. One would have been motivated to do this modification because doing so would give the benefit of current parameters being updated many times per time-step as taught by Nair [Page 3, Section 3.3, Paragraph 2].
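For illustration only, the loss determination described in claim 5 — a Bellman target built from the target network's highest action score for sj+1, the reward rj, and the prediction network's score for action aj — can be sketched as follows. The function name and signature are the editor's invention:

```python
def dqn_loss(reward, gamma, target_scores, pred_scores, action, terminal=False):
    """Squared Bellman error for one transition.

    target_scores: action scores output by the target network for s_{j+1}
    pred_scores:   action scores output by the prediction network for s_j
    action:        index of the action a_j actually taken
    """
    # Bellman target y_j: reward plus discounted highest action score
    # from the target network (reward alone if the episode ended).
    y = reward if terminal else reward + gamma * max(target_scores)
    # Loss compares y_j against the prediction network's score for a_j.
    return (y - pred_scores[action]) ** 2
```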
Regarding claim 6
The system of Babaeizadeh, Nair, and Atiya teaches: The method of claim 5 (as shown above).
Nair further teaches: wherein adjusting the weights of the prediction neural network further includes: performing, by the one or more training cores, a gradient descent operation on the loss function with respect to the weights of the prediction neural network ([Page 3, Column 2, Section 3.3, Paragraph 3] For each sample (or minibatch), the current parameters θ are updated by a stochastic gradient descent algorithm. Specifically, θ is adjusted in the direction of the sample gradient gi of the loss with respect to θ).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the combined teachings of Babaeizadeh and Atiya to incorporate the teachings of Nair wherein adjusting the weights of the prediction neural network further includes: performing, by the one or more training cores, a gradient descent operation on the loss function with respect to the weights of the prediction neural network. One would have been motivated to do this modification because doing so would give the benefit of current parameters being updated many times per time-step as taught by Nair [Page 3, Section 3.3, Paragraph 2].
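For illustration only, a gradient descent operation on a loss with respect to the prediction-network weights can be sketched as below, using a finite-difference gradient so the example stays self-contained. Both functions are the editor's hypothetical, not code from any cited reference:

```python
def numerical_grad(loss_fn, theta, eps=1e-6):
    """Central-difference estimate of d(loss)/d(theta) per weight."""
    grads = []
    for i in range(len(theta)):
        up = theta[:]; up[i] += eps
        down = theta[:]; down[i] -= eps
        grads.append((loss_fn(up) - loss_fn(down)) / (2 * eps))
    return grads

def sgd_step(theta, loss_fn, lr=0.1):
    """Adjust the weights opposite the sample gradient of the loss."""
    g = numerical_grad(loss_fn, theta)
    return [t - lr * gi for t, gi in zip(theta, g)]
```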
Regarding claim 7
The system of Babaeizadeh, Nair, and Atiya teaches: The method of claim 1 (as shown above).
Nair further teaches: further comprising: periodically updating the weights of the target neural network via a copy engine by copying the weights of the prediction neural network into the target artificial neural network memory ([Page 3, section 3.2] Note: Figure 1 shows "copy every N updates". [Page 4, Figure 4] Note: Figure 4 shows "sync every global N steps" corresponding to copying).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the combined teachings of Babaeizadeh and Atiya to incorporate the teachings of Nair to periodically update the weights of the target neural network via a copy engine by copying the weights of the prediction neural network into the target artificial neural network memory. One would have been motivated to do this modification because doing so would give the benefit of current parameters being updated many times per time-step as taught by Nair [Page 3, Section 3.3, Paragraph 2].
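For illustration only, the periodic "copy every N updates" behavior mapped to claim 7 can be sketched as below; the loop, the stand-in weight update, and all names are the editor's invention:

```python
def train_loop(steps, sync_every):
    """Update prediction weights each step; copy them into the target
    network memory every sync_every updates ('copy every N updates')."""
    prediction, target = [0], [0]
    sync_points = []
    for step in range(1, steps + 1):
        prediction = [w + 1 for w in prediction]  # stand-in for an update
        if step % sync_every == 0:
            target = list(prediction)  # copy engine: prediction -> target
            sync_points.append(step)
    return target, sync_points
```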
Regarding claim 8
The system of Babaeizadeh, Nair, and Atiya teaches: The method of claim 1 (as shown above).
Nair further teaches: further comprising: repeating the applying steps and the adjusting step for each step of an episode of training ([Page 5, Column 1, Algorithm 1] for episode = 1 to M do … for t=1 to T. Note: ‘for’ shows repeating steps for each step t of episodes 1 to M and ‘do’ shows the adjusting).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the combined teachings of Babaeizadeh and Atiya to incorporate the teachings of Nair to repeat the applying steps and the adjusting step for each step of an episode of training. One would have been motivated to do this modification because doing so would give the benefit of current parameters being updated many times per time-step as taught by Nair [Page 3, Section 3.3, Paragraph 2].
Regarding claim 9
The system of Babaeizadeh, Nair, and Atiya teaches: The method of claim 8 (as shown above).
Nair further teaches: further comprising: performing multiple episodes of training to train the prediction neural network ([Page 5, Column 1, Algorithm 1] for episode = 1 to M (corresponds to multiple episodes)).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the combined teachings of Babaeizadeh and Atiya to incorporate the teachings of Nair to perform multiple episodes of training to train the prediction neural network. One would have been motivated to do this modification because doing so would give the benefit of current parameters being updated many times per time-step as taught by Nair [Page 3, Section 3.3, Paragraph 2].
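For illustration only, the nested episode/step structure mapped to claims 8 and 9 (the "for episode = 1 to M … for t = 1 to T" loops of Algorithm 1) can be sketched as follows; the function and its step_fn callback are the editor's invention:

```python
def run_training(num_episodes, steps_per_episode, step_fn):
    """Repeat the applying and adjusting steps for each step t of each
    episode, across multiple episodes of training."""
    history = []
    for episode in range(1, num_episodes + 1):        # for episode = 1 to M
        for t in range(1, steps_per_episode + 1):     # for t = 1 to T
            history.append(step_fn(episode, t))       # apply/adjust per step
    return history
```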
Regarding claim 10
Babaeizadeh teaches: A machine learning device for training a neural network, the machine learning device comprising ([Page 1, Section 1, Paragraph 2] The DNN model is constantly queried to guide the actions of agents whose gameplay in turn feeds DNN training):
a set of memories including a prediction network weights memory, and a target network weight memory (Note: Also see Figure 1(b) where bottom right green square corresponds to target network weight memory, top right green square corresponds to prediction network weight memory and both these squares are separate from each other);
one or more training cores configured to apply training information to a target neural network having weights stored in the target network weight memory ([Page 3, Section 3.2, Paragraph 1] the central server propagates new weights to the agents. [Page 10, Paragraph 3] Each time a trainer updates the DNN weights. Note: Also see Figure 1(b) where bottom right green square corresponds to weights stored in a target network weight memory and top left green square corresponds to training cores);
one or more inference cores that have a different architecture than the one or more training cores, the one or more inference cores configured to apply the training information to a prediction neural network having weights stored in a prediction network weights memory to obtain an action to be performed in a simulated environment ([Page 3, Section 3.2, Paragraph 1] the central server propagates new weights to the agents. [Page 10, Paragraph 3] Each time a trainer updates the DNN weights. Note: Also see Figure 1(b) where top right green square corresponds to weights stored in a prediction network weight memory, the GPU on the right (pink rectangle) corresponds to inference cores, top left green square corresponds to training cores. Bottom left square shows obtaining an action to be performed in a simulator corresponding to a simulated environment),
a control core configured to apply the action to the simulated environment to obtain resulting information from the simulated environment ([Page 4] Figure 1(b). Note: Bottom left square shows a CPU core applying the action corresponding to a control core configured to apply the action to a simulator corresponding to a simulated environment to obtain resulting information);
However, Babaeizadeh is not relied upon to teach: a replay memory, and wherein the one or more training cores are configured to adjust the weights of the prediction neural network based on the resulting information and the action, and wherein the adjusting and the applying the training information to obtain the action are performed in a parallel operation, in which the applying the training information occurs for first set of newly generated weights and the adjusting occurs for a set of weights generated immediately prior to the first set of newly generated weights.
Nair teaches, in an analogous system: a replay memory ([Page 4] Note: Figure 2 shows replay memory),
wherein the one or more training cores are configured to adjust the weights of the prediction neural network based on the resulting information and the action ([Page 3, Column 1, Paragraph 1] In the reinforcement learning (RL) paradigm, the agent interacts sequentially with an environment, with the goal of maximising cumulative rewards. At each step t the agent observes state st, selects an action at, and receives a reward rt. The agent’s policy π(a|s) maps states to actions and defines its behavior. [Page 3, Column 1, Last Paragraph] One of the core ideas behind reinforcement learning is to represent the action-value function using a function approximator such as a neural network, Q(s, a) ≈ Q(s, a; θ). The parameters θ of the so-called Q-network are optimized so as to approximately solve the Bellman equation. For example, the Q-learning algorithm iteratively updates the action-value function Q(s, a; θ) towards a sample of the Bellman target, r + γ max a′ Q(s′, a′; θ⁻). [Page 3, Section 3.3, Paragraph 2] Second, DQN maintains two separate Q-networks Q(s, a; θ) and Q(s, a; θ⁻) with current parameters θ and old parameters θ⁻, respectively. The current parameters θ may be updated many times per time-step, and are copied into the old parameters θ⁻ after N iterations. At every update iteration i the current parameters θ are updated so as to minimise the mean-squared Bellman error with respect to the old parameters θ⁻, by optimizing the following loss function (DQN Loss). [Page 3, Section 3.3, Paragraph 3] Specifically, θ is adjusted. Note: See Algorithm 1. Also see Figure 2 where Target Q network corresponds to the target artificial neural network. Learner server has its own processors corresponding to the training cores and also has its own memory corresponding to the target network weight memory. Q network in the Actor corresponds to the prediction artificial neural network. Actor server also has its own memory corresponding to the prediction network weight memory).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the method for training a prediction neural network of Babaeizadeh to incorporate the teachings of Nair wherein the one or more training cores are configured to adjust the weights of the prediction neural network based on the resulting information and the action. One would have been motivated to do this modification because doing so would give the benefit of current parameters being updated many times per time-step as taught by Nair [Page 3, Section 3.3, Paragraph 2].
Atiya teaches, in an analogous system: wherein the adjusting and the applying the training information to obtain the action are performed in a parallel operation, in which the applying the training information occurs for first set of newly generated weights and the adjusting occurs for a set of weights generated immediately prior to the first set of newly generated weights ([Page 705, Column 1, Section VI, Paragraphs 1 and 2] the following two methods: 1) the BTT approach, which is the fastest of the existing gradient-descent- based algorithms and 2) the BTT approach [21], which is a very efficient accelerated technique for recurrent networks. The BTT (h, h') is summarized as follows. 1) Run h the network for steps. 2) Propagate backwards for h' steps (h' > h), and update the weights. 3) Run the network for the next h steps, then propagate backwards h' steps, and update the weights. Continue in a similar manner till the end of the data, and then repeat another cycle. [Page 705, Column 2, Paragraph 3] In the first trial we tune the parameter values and for each of the methods (all runs start from the same initial weights). The way we have done the comparison is to perform the following for each method. Several runs each with different parameter values (learning rate, etc.) are performed. We then choose the parameter values that lead to fastest convergence. We then fix the parameters on these values, and run ten more trials each with different initial weights. For a particular trial we fix the initial weight configuration across the five methods to make the comparison fair. We record the number of iterations needed to reach particular error levels, and obtain the average for each method. We note that for recurrent networks it is always better to start with small weights, because if we have long sequences for large initial weights the states tend to wander off into the saturation region. 
We have generated the initial weights always in the range from −0.2 to 0.2. We have trained all methods for a maximum of 10 000 iterations. If the method did not reach the prespecified error levels by then, then we declare that it failed to converge on this particular trial. Note: Generating the initial weights always in the range from −0.2 to 0.2 corresponds to adjusting the weights generated immediately prior to the first set of newly generated weights. Run the network for the next h steps, then propagate backwards h' steps, and update the weights corresponds to the applying the training information occurring for a first set of newly generated weights. Run ten more trials each with different initial weights shows the adjusting of weights generated immediately prior to the first set of newly generated weights in each trial, where each trial corresponds to applying the training information, thereby performing the training and adjusting in a parallel operation.).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the combined teachings of Babaeizadeh and Nair to incorporate the teachings of Atiya wherein the adjusting and the applying the training information to obtain the action are performed in a parallel operation, in which the applying the training information occurs for first set of newly generated weights and the adjusting occurs for a set of weights generated immediately prior to the first set of newly generated weights. One would have been motivated to do this modification because doing so would give the benefit of two methods which are fast and efficient as taught by Atiya [Page 705, Column 1, Section VI, Paragraphs 1 and 2].
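For illustration only (hypothetical code, not drawn from any cited reference; the weight-update rule and action threshold are assumptions), the claimed pipelined relationship may be sketched as follows: at each step, the applying of the training information uses the newly generated weight set while the adjusting operates on the set generated immediately prior.

```python
# Hypothetical sketch of the claimed pipelined operation: at step k, the
# inference core applies the newly generated weights w[k] to obtain an
# action, while the training core adjusts the weights w[k-1] generated
# immediately prior (shown sequentially here; in hardware the two would
# run concurrently). The update rule (+0.1) and threshold are illustrative.

def run_pipeline(steps):
    weights = [0.0]          # w[0]: initial weight set
    log = []
    for k in range(1, steps + 1):
        # Training core: adjust the immediately prior weight set w[k-1].
        adjusted = weights[k - 1] + 0.1
        weights.append(adjusted)        # becomes the newly generated set w[k]
        # Inference core: apply the newly generated weights w[k].
        action = 1 if weights[k] > 0.25 else 0
        log.append((k, round(weights[k], 1), action))
    return log

print(run_pipeline(3))  # three apply/adjust pipeline steps
```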
Regarding claim 11
The system of Babaeizadeh, Nair, and Atiya teaches: The machine learning device of claim 10 (as shown above).
Nair further teaches: wherein adjusting the weights of the prediction neural network includes: sampling, by one or more training cores, one or more tuples from the replay memory, where each tuple includes a state sj, an action aj, a reward for the action rj, and a subsequent state sj+1 ([Page 4, Column 1, Last Paragraph] For each learner update k, a minibatch of experience tuples e = (s, a, r, s') is sampled from either a local or global experience replay memory D (see above)).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the machine learning device for training a neural network of Babaeizadeh to incorporate the teachings of Nair wherein adjusting the weights of the prediction neural network includes: sampling, by one or more training cores, one or more tuples from the replay memory, where each tuple includes a state sj, an action aj, a reward for the action rj, and a subsequent state sj+1. One would have been motivated to do this modification because doing so would give the benefit of current parameters being updated many times per time-step as taught by Nair [Page 3, Section 3.3, Paragraph 2].
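For illustration only (hypothetical code, not the claimed device; the class name, capacity, and stored values are assumptions), the experience-replay scheme Nair describes — storing tuples (sj, aj, rj, sj+1) and sampling minibatches — may be sketched as:

```python
import random

# Illustrative replay memory holding (state, action, reward, next_state)
# tuples, with minibatch sampling as in an experience-replay scheme.

class ReplayMemory:
    def __init__(self, capacity):
        self.capacity = capacity
        self.tuples = []

    def store(self, state, action, reward, next_state):
        if len(self.tuples) >= self.capacity:
            self.tuples.pop(0)           # discard the oldest experience
        self.tuples.append((state, action, reward, next_state))

    def sample(self, batch_size):
        # uniform random minibatch of experience tuples
        return random.sample(self.tuples, batch_size)

memory = ReplayMemory(capacity=100)
for j in range(5):
    memory.store(j, j % 2, 1.0, j + 1)   # (sj, aj, rj, sj+1)
batch = memory.sample(3)
print(len(batch))  # 3
```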
Regarding claim 12
The system of Babaeizadeh, Nair, and Atiya teaches: The machine learning device of claim 11 (as shown above).
Nair further teaches: wherein adjusting the weights of the prediction neural network further includes: applying, by the one or more training cores, state sj+1 to the target neural network and obtaining a highest action score output from the target neural network ([Page 3, Column 1, Paragraph 2] The action-value function Qπ(s, a) is the expected return after observing state st and taking an action at under a policy π, Qπ(s, a) = E[Rt | st = s, at = a, π], and the optimal action-value function is the maximum possible value that can be achieved by any policy, Q*(s, a) = maxπ Qπ(s, a). [Page 5, Column 1, Algorithm 1] Execute the action in the environment and observe the reward rt and the next state st+1. Note: Maximum possible value corresponds to highest action score. Also, note in algorithm 1 the 'for loop' from t=1 to T).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the machine learning device for training a neural network of Babaeizadeh to incorporate the teachings of Nair wherein adjusting the weights of the prediction neural network further includes: applying, by the one or more training cores, state sj+1 to a target neural network having weights stored in a target network weight memory and obtaining a highest action score output from the target neural network. One would have been motivated to do this modification because doing so would give the benefit of current parameters being updated many times per time-step as taught by Nair [Page 3, Section 3.3, Paragraph 2].
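For illustration only (hypothetical code; the linear scoring function and weight values are assumptions, not taken from any cited reference), taking the highest action score output by a target network for a given state may be sketched as:

```python
# Illustrative "target network": one score per action for a given state,
# with the maximum over actions taken as the highest action score.

def target_q_scores(state, weights):
    # simple linear per-action scoring, purely for illustration
    return [w * state for w in weights]

def highest_action_score(state, weights):
    # max over actions, analogous to max_a Q_target(s, a)
    return max(target_q_scores(state, weights))

print(highest_action_score(2.0, [0.5, -0.25, 1.5]))  # 3.0
```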
Regarding claim 13
The system of Babaeizadeh, Nair, and Atiya teaches: The machine learning device of claim 12 (as shown above).
Nair further teaches: wherein adjusting the weights of the prediction neural network further includes: applying, by the one or more training cores, state sj to the prediction neural network to obtain an action score for action aj ([Page 3, Column 1, Paragraph 2] The action-value function Qπ(s, a) is the expected return after observing state st and taking an action at under a policy π, Qπ(s, a) = E[Rt | st = s, at = a, π]. [Page 5, Column 1, Algorithm 1] Initialise the training network for the action-value function Q(s, a; θ) with weights θ and target network Q(s, a; θ⁻) with weights θ⁻ = θ. Note: Action value corresponds to action score).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the machine learning device for training a neural network of Babaeizadeh to incorporate the teachings of Nair wherein adjusting the weights of the prediction neural network further includes: applying, by the one or more training cores, state sj to the prediction neural network to obtain an action score for action aj. One would have been motivated to do this modification because doing so would give the benefit of current parameters being updated many times per time-step as taught by Nair [Page 3, Section 3.3, Paragraph 2].
Regarding claim 14
The system of Babaeizadeh, Nair, and Atiya teaches: The machine learning device of claim 13 (as shown above).
Nair further teaches: wherein adjusting the weights of the prediction neural network further includes: determining, by the one or more training cores, a loss function based on the highest action score output by the target neural network for state sj+1, the action score for action aj output by the prediction neural network, and a reward score rj ([Page 5, Column 1, Algorithm 1] With probability ε take a random action at, or else at = argmaxa Q(st, a; θ). Calculate the loss Lt = (yt − Q(st, at; θ))². Note: Also, see Figure 2. DQN Loss corresponds to loss function).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the machine learning device for training a neural network of Babaeizadeh to incorporate the teachings of Nair wherein adjusting the weights of the prediction neural network further includes: determining, by the one or more training cores, a loss function based on the highest action score output by the target neural network for state sj+1, the action score for action aj output by the prediction neural network, and a reward score rj. One would have been motivated to do this modification because doing so would give the benefit of current parameters being updated many times per time-step as taught by Nair [Page 3, Section 3.3, Paragraph 2].
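For illustration only (hypothetical code; the discount factor and numeric values are assumptions), a DQN-style loss of the kind Nair describes — a bootstrapped target y = r + γ·max_a Q_target(s', a) compared against the prediction network's score — may be sketched as:

```python
# Illustrative DQN-style loss: combines the reward r_j, the highest action
# score from the target network for s_j+1, and the prediction network's
# score for action a_j, as L = (y - Q(s_j, a_j))^2.

def dqn_loss(reward, max_target_q, predicted_q, gamma=0.99):
    y = reward + gamma * max_target_q   # bootstrapped target value
    return (y - predicted_q) ** 2       # squared temporal-difference error

print(dqn_loss(reward=1.0, max_target_q=2.0, predicted_q=2.5, gamma=0.5))  # 0.25
```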
Regarding claim 15
The system of Babaeizadeh, Nair, and Atiya teaches: The machine learning device of claim 14 (as shown above).
Nair further teaches: wherein adjusting the weights of the prediction neural network further includes: performing, by the one or more training cores, a gradient descent operation on the loss function with respect to the weights of the prediction neural network ([Page 3, Column 2, Section 3.3, Paragraph 3] For each sample (or minibatch), the current parameters θ are updated by a stochastic gradient descent algorithm. Specifically, θ is adjusted in the direction of the sample gradient gi of the loss with respect to θ).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the machine learning device for training a neural network of Babaeizadeh to incorporate the teachings of Nair wherein adjusting the weights of the prediction neural network further includes: performing, by the one or more training cores, a gradient descent operation on the loss function with respect to the weights of the prediction neural network. One would have been motivated to do this modification because doing so would give the benefit of current parameters being updated many times per time-step as taught by Nair [Page 3, Section 3.3, Paragraph 2].
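For illustration only (hypothetical code; the one-parameter model, learning rate, and targets are assumptions), a gradient-descent step on a squared loss with respect to a weight may be sketched as:

```python
# Illustrative gradient descent on a squared loss L = (y - w*x)^2, whose
# gradient with respect to the weight w is dL/dw = -2*x*(y - w*x).

def sgd_step(w, x, y, lr):
    grad = -2.0 * x * (y - w * x)   # gradient of the loss w.r.t. weight w
    return w - lr * grad            # adjust the weight against the gradient

w = 0.0
for _ in range(100):
    w = sgd_step(w, x=1.0, y=2.0, lr=0.1)
print(round(w, 3))  # converges toward y/x = 2.0
```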
Regarding claim 16
The system of Babaeizadeh, Nair, and Atiya teaches: The machine learning device of claim 10 (as shown above).
Nair further teaches: further comprising: a copy engine configured to periodically update the weights of the target neural network by copying the weights of the prediction neural network into the target artificial neural network memory ([Page 3, section 3.2] Note: Figure 1 shows "copy every N updates". [Page 4, Figure 4] Note: Figure 4 shows "sync every global N steps" corresponding to copying).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the machine learning device for training a neural network of Babaeizadeh to incorporate the teachings of Nair to use a copy engine configured to periodically update the weights of the target neural network by copying the weights of the prediction neural network into the target artificial neural network memory. One would have been motivated to do this modification because doing so would give the benefit of current parameters being updated many times per time-step as taught by Nair [Page 3, Section 3.3, Paragraph 2].
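For illustration only (hypothetical code; the sync interval and weight values are assumptions), the "copy every N updates" behavior shown in Nair's Figure 1 — a copy engine periodically overwriting the target weights with the prediction weights — may be sketched as:

```python
# Illustrative periodic copy: every `sync_every` updates, the target
# network's weights are overwritten with the prediction network's weights.

def train_with_sync(updates, sync_every):
    prediction = 0.0
    target = 0.0
    history = []
    for step in range(1, updates + 1):
        prediction += 1.0                # stand-in for one weight adjustment
        if step % sync_every == 0:       # copy engine fires every N updates
            target = prediction
        history.append(target)
    return history

print(train_with_sync(updates=6, sync_every=3))
```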
Regarding claim 17
The system of Babaeizadeh, Nair, and Atiya teaches: The machine learning device of claim 10 (as shown above).
Nair further teaches: wherein the one or more inference cores, the action selection processor, the tuple storing processor, and the one or more training cores are further configured to: repeat the applying steps and the adjusting for each step of an episode of training ([Page 5, Column 1, Algorithm 1] for episode = 1 to M do … for t=1 to T Note: Shows repeating steps for each step t of episodes 1 to M).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the machine learning device for training a neural network of Babaeizadeh to incorporate the teachings of Nair wherein the one or more inference cores, the action selection processor, the tuple storing processor, and the one or more training cores are further configured to: repeat the applying steps and the adjusting for each step of an episode of training. One would have been motivated to do this modification because doing so would give the benefit of current parameters being updated many times per time-step as taught by Nair [Page 3, Section 3.3, Paragraph 2].
Regarding claim 18
The system of Babaeizadeh, Nair, and Atiya teaches: The machine learning device of claim 17 (as shown above).
Nair further teaches: wherein the one or more inference cores, the action selection processor, the tuple storing processor, and the one or more training cores are further configured to: performing multiple episodes of training to train the prediction neural network ([Page 5, Column 1, Algorithm 1] for episode = 1 to M (corresponds to multiple episodes)).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the machine learning device for training a neural network of Babaeizadeh to incorporate the teachings of Nair wherein the one or more inference cores, the action selection processor, the tuple storing processor, and the one or more training cores are further configured to: performing multiple episodes of training to train the prediction neural network. One would have been motivated to do this modification because doing so would give the benefit of current parameters being updated many times per time-step as taught by Nair [Page 3, Section 3.3, Paragraph 2].
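For illustration only (hypothetical code; the episode and step counts are assumptions), the nested "for episode = 1 to M ... for t = 1 to T" loops of Nair's Algorithm 1 — repeating the applying and the adjusting for each step of each of multiple episodes — may be sketched as:

```python
# Illustrative nested training loops: one apply-and-adjust cycle per step t
# of each episode, repeated over multiple episodes.

def train_episodes(num_episodes, steps_per_episode):
    updates = 0
    for episode in range(1, num_episodes + 1):      # for episode = 1 to M
        for t in range(1, steps_per_episode + 1):   # for t = 1 to T
            updates += 1     # one apply-and-adjust cycle per step
    return updates

print(train_episodes(num_episodes=3, steps_per_episode=4))  # 12 cycles
```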
Regarding claim 19
Babaeizadeh teaches: A computing device for training a prediction neural network, the computing device comprising ([Page 1, Section 1, Paragraph 2] The DNN model is constantly queried to guide the actions of agents whose gameplay in turn feeds DNN training):
a central processor configured to interface with an environment by applying actions to the environment and observing resulting information including states and rewards output by the environment ([Page 2, Section 3.1, Paragraph 1] In standard RL, an agent interacts with an environment over a number of discrete time steps. At each time step t, the agent observes a state st and, in the discrete case, selects an action at from the set of valid actions. An agent is guided by policy π, a function mapping from states st to actions at. After each action, the agent observes the next state st+1 and receives feedback in the form of a reward rt);
and a machine learning device for training the prediction neural network, the machine learning device comprising: a set of memories including a prediction network weights memory, and a target network weight memory (Note: Also see Figure 1(b) where bottom right green square corresponds to target network weight memory, top right green square corresponds to prediction network weight memory and both the green squares are separate from each other);
one or more training cores configured to apply training information to a target neural network having weights stored in the target network weight memory ([Page 3, Section 3.2, Paragraph 1] the central server propagates new weights to the agents. [Page 10, Paragraph 3] Each time a trainer updates the DNN weights. Note: Also see Figure 1(b) where bottom right green square corresponds to weights stored in a target network weight memory and top left green square corresponds to training cores);
one or more inference cores that have a different architecture than the one or more training cores, the one or more inference cores configured to apply the training information to a prediction neural network having weights stored in the prediction network weights memory to obtain an action to be performed in the environment ([Page 3, Section 3.2, Paragraph 1] the central server propagates new weights to the agents. [Page 10, Paragraph 3] Each time a trainer updates the DNN weights. Note: Also see Figure 1(b) where top right green square corresponds to weights stored in a prediction network weight memory, the GPU on the right (pink rectangle) corresponds to inference cores, top left green square corresponds to training cores. Bottom left square shows obtaining an action to be performed in a simulator corresponding to a simulated environment),
wherein the central processor is configured to apply the action to the environment to obtain resulting information from the environment ([Page 4] Figure 1(b). Note: Bottom left square shows applying the action to a simulator corresponding to a simulated environment to obtain resulting information);
However, Babaeizadeh does not teach: a replay memory, and wherein the one or more training cores are configured to adjust the weights of the prediction neural network based on the resulting information and the action, wherein the adjusting and the applying the training information to obtain the action are performed in a parallel operation, in which the applying the training information occurs for first set of newly generated weights and the adjusting occurs for a set of weights generated immediately prior to the first set of newly generated weights.
Nair teaches, in an analogous system: a replay memory ([Page 4] Note: Figure 2 shows replay memory);
wherein the one or more training cores are configured to adjust the weights of the prediction neural network based on the resulting information and the action ([Page 3, Column 1, Paragraph 1] In the reinforcement learning (RL) paradigm, the agent interacts sequentially with an environment, with the goal of maximising cumulative rewards. At each step t the agent observes state st, selects an action at, and receives a reward rt. The agent’s policy (ajs) maps states to actions and defines its behavior. [Page 3, Column 1, Last Paragraph] One of the core ideas behind reinforcement learning is to represent the action-value function using a function approximator such as a neural network, Q(s; a) = Q(s; a; ). The parameters of the so-called Q-network are optimized so as to approximately solve the Bellman equation. For example, the Q-learning algorithm iteratively updates the act