DETAILED ACTION
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Response to Arguments
Applicant's arguments filed 08/08/2025 have been fully considered but they are not persuasive.
Regarding the rejections under 35 U.S.C. 103, after further search and consideration, applicant’s arguments that Duan does not teach the amended limitations are not persuasive. Additionally, applicant’s arguments with respect to Narang are moot, as Narang is no longer relied upon in the rejections.
Alleged failure to teach modifying weights or biases of the policy and value neural networks based on observational data
In Remarks/Arguments pg. 10-11, applicant contends:
“Duan et al., Xu et al., Palanisamy et al., and Narang et al. do not disclose all of the amended elements as recited in claim 1, and similarly claims 11 and 18. The Office Action concedes: Duan in view of Xu and Palanisamy... the combination does not explicitly teach: A. wherein the geographic context data identifies a geographic location in which the modification of the policy network occurred, and wherein the temporal context data identifies at least one of a time of day or a season in which the modification of the policy network occurred (emphasis added). Therefore, Duan et al., Xu et al., Palanisamy et al. does not disclose "modifies at least one of weights or biases of the policy neural network and the value neural network of the trainable neural network model based on observational data of the vehicle functional unit resulting in a modified trainable neural network model;”
The relevant limitations of claim 1 appear to be: wherein the trainable neural network model comprises a policy neural network and a value neural network…modifies at least one of weights or biases of the policy neural network and the value neural network of the trainable neural network model based on observational data of the vehicle functional unit resulting in a modified trainable neural network model. Duan teaches:
(Duan, pg. 2 col. 1, “This method comprehensively considers both high-level maneuver selection and low-level motion control in both lateral and longitudinal directions. We firstly decompose the driving tasks in to three maneuvers, including driving in lane, right lane change and left lane change, and learn the sub-policy for each maneuver. Different state spaces and reward functions are designed for each maneuver. Then, a master policy is learned to choose the maneuver policy to be executed in the current state. All policies including master policy and maneuver policies are represented by fully-connected neural networks”).
(Duan, pg. 3 col. 1, “the master policy only contains one maneuver policy network and one maneuver value network.”).
(Duan, pg. 3 col. 1, “Inspired by the A3C algorithm, we use asynchronous parallel car-learners to train the policy π(a|s; θ) and estimate the state value V (s; w), as shown in Fig. 2 [20]. Each car-learner contains its own policy networks and a value networks, and makes decision according to their policy outputs. The multiple car-learners interact with different parts of the environment in parallel. Meanwhile, each car-learner computes gradients with respect to the parameters of the value network and the policy network at each step. Then, the average gradients are applied to update the shared policy network and shared value network at each step”).
In other words, Duan teaches a trainable neural network model made up of a policy neural network and a value neural network. The master policy is interpreted as the trainable neural network model because it learns the different policies to apply to the vehicle. The cited portions above show that the master policy is made up of policy and value neural networks. Duan further teaches modifying at least one of the weights or biases of the policy neural network and the value neural network. The car-learners are interpreted as the maneuver networks, and the shared networks are interpreted as the trainable neural network model. Therefore, when gradient data from the car-learner policy and value networks is sent back up to the shared networks, this is interpreted as modifying at least a weight or bias of the trainable neural network model, because one of ordinary skill in the art knows that applying gradient updates to a neural network modifies its weights. Additionally, the car-learners being deployed in an environment is interpreted as the modifications, or gradients, being based on observational data because the car-learners learn based on the state of the environment. Therefore, applicant’s arguments are not persuasive.
Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.
Claims 1-2, 5-8, 10-13, 16, 18-20, and 23 are rejected under 35 U.S.C. 103 as being unpatentable over Duan, et al., Non-Patent Literature “Hierarchical Reinforcement Learning for Self-Driving Decision-Making without Reliance on Labeled Driving Data” (“Duan”) in view of Xu, et al., Non-Patent Literature “Knowledge Transfer in Multi-Task Deep Reinforcement Learning for Continuous Control” (“Xu”) and further in view of Palanisamy, et al., US Pre-Grant Publication 2020/0139973A1 (“Palanisamy”) and Siddiqui, et al., US Pre-Grant Publication 2020/0353943A1 (“Siddiqui”).
Regarding claim 1 and analogous claims 11 and 18, Duan discloses:
receives, via an in-vehicle network of a vehicle, a…neural network model from a domain chief associated with a set of devices of the vehicle; (Duan, pg. 2 col. 2, “The H-RL decision making framework consists of two parts: 1) the high-level maneuver selection and 2) low-level motion control. In the maneuver selection level, a master policy; the master policy is interpreted as the domain chief [i.e. from a domain chief] is used to choose the maneuver to be executed in the current state. In the motion control level, the corresponding maneuver policy; the corresponding maneuver policy is interpreted as a neural network model [i.e. receives, via an in-vehicle network of a vehicle, a…neural network model] would be activated and output front wheel angle and acceleration commands to the actuators. [associated with a set of devices of the vehicle;]”).
constructs a trainable neural network model using the…neural network model for domain comprising a subset of the set of devices of the vehicle, wherein the trainable neural network model comprises a policy neural network and a value neural network; (Duan, pg. 2 col. 1, “This method comprehensively considers both high-level maneuver selection and low-level motion control in both lateral and longitudinal directions. We firstly decompose the driving tasks in to three maneuvers, including driving in lane, right lane change and left lane change, and learn the sub-policy for each maneuver. Different state spaces and reward functions are designed for each maneuver. Then, a master policy is learned [constructs a trainable neural network model] to choose the maneuver policy to be executed in the current state; the master policy learning to select the correct maneuver is interpreted as using the maneuver policies in the training process [i.e. using the…neural network model]. All policies including master policy and maneuver policies are represented by fully-connected neural networks”). (Duan, pg. 2 col. 2, “Therefore, each maneuver in this study contains a steer policy network (SP-Net) and an accelerate policy network (AP-Net), which carry out lateral and longitudinal control respectively. [for domain comprising a subset of the set of devices of the vehicle]”). (Duan, pg. 3 col. 1, “the master policy only contains one maneuver policy network and one maneuver value network [wherein the trainable neural network model comprises a policy neural network and a value neural network;].”).
dynamically controls, using the trainable neural network model, a vehicle functional unit to operate the subset of the set of devices of the vehicle; (Duan, pg. 2 col. 2, “The H-RL decision making framework consists of two parts: 1) the high-level maneuver selection and 2) low-level motion control. In the maneuver selection level, a master policy is used to choose the maneuver to be executed in the current state [dynamically controls, using the trainable neural network model,]. In the motion control level, the corresponding maneuver policy would be activated and output front wheel angle and acceleration commands to the actuators [a vehicle functional unit to operate the subset of the set of devices of the vehicle;]. In the example shown in Fig. 1, the master policy chooses maneuver 1 as the current maneuver, then the corresponding maneuver policy would be activated and output front wheel angle and acceleration commands to the actuators.”).
modifies at least one of weights or biases of the policy neural network and the value neural network of the trainable neural network model based on observational data of the vehicle functional unit resulting in a modified trainable neural network model; (Duan, pg. 3 col. 1, “Inspired by the A3C algorithm, we use asynchronous parallel car-learners to train the policy π(a|s; θ) and estimate the state value V (s; w), as shown in Fig. 2 [20]. Each car-learner contains its own policy networks and a value networks, and makes decision according to their policy outputs. The multiple car-learners interact with different parts of the environment in parallel [based on observational data of the vehicle functional unit resulting in a modified trainable neural network model;]. Meanwhile, each car-learner computes gradients with respect to the parameters of the value network and the policy network at each step. Then, the average gradients are applied to update the shared policy network and shared value network at each step; the individual car-learner models are interpreted as the maneuver networks and the shared networks are interpreted as the trainable model thus gradient updates to the shared networks is interpreted as modifying weights of the value and policy neural networks [i.e. modifies at least one of weights or biases of the policy neural network and the value neural network of the trainable neural network model]”).
generates gradient data by comparing a policy neural network and a value neural network of the pre-trained neural network model to the policy neural network and the value neural network of the modified trainable neural network model, wherein the gradient data corresponds to experience gained from modifying the trainable neural network model that is represented in the modified at least one of the weights or the biases of the policy neural network and the value neural network of the trainable neural network model; (Duan, pg. 3 col. 1, “Inspired by the A3C algorithm, we use asynchronous parallel car-learners to train the policy π(a|s; θ) and estimate the state value V (s; w), as shown in Fig. 2 [20]. Each car-learner contains its own policy networks and a value networks, and makes decision according to their policy outputs. The multiple car-learners interact with different parts of the environment in parallel. Meanwhile, each car-learner computes gradients with respect to the parameters of the value network and the policy network at each step. Then, the average gradients are applied to update the shared policy network and shared value network at each step; the individual car-learner models are interpreted as the maneuver networks and the shared networks are interpreted as the trainable model thus gradient updates to the shared networks is interpreted as comparing policy and value networks to the policy and value networks of the modified trainable neural network model [i.e. generates gradient data by comparing a policy neural network and a value neural network of the pre-trained neural network model to the policy neural network and the value neural network of the modified trainable neural network model,]. These car-learners synchronize their local network parameters from the shared networks before they make new decisions; gradients are interpreted as modifications to at least one of weights or biases as the policy and value networks are neural networks [i.e. 
wherein the gradient data corresponds to experience gained from modifying the trainable neural network model that is represented in the modified at least one of the weights or the biases of the policy neural network and the value neural network of the trainable neural network model;]. We refer to this training algorithm as asynchronous parallel reinforcement learning (APRL).”).
sends, via the in-vehicle network, the gradient data and policy metadata as input to a machine learning process of the domain chief that modifies a domain neural network model based on the input, (Duan, pg. 3 col. 1, “Inspired by the A3C algorithm, we use asynchronous parallel car-learners to train the policy π(a|s; θ) and estimate the state value V (s; w), as shown in Fig. 2 [20]. Each car-learner contains its own policy networks and a value networks, and makes decision according to their policy outputs. The multiple car-learners interact with different parts of the environment in parallel. Meanwhile, each car-learner computes gradients with respect to the parameters of the value network and the policy network at each step. Then, the average gradients are applied to update the shared policy network and shared value network at each step; the shared value and policy networks are interpreted as the master policy and therefore the domain model [i.e. sends, via the in-vehicle network, the gradient data and policy metadata as input to a machine learning process of the domain chief that modifies a domain neural network model based on the input,]. These car-learners synchronize their local network parameters from the shared networks before they make new decisions. We refer to this training algorithm as asynchronous parallel reinforcement learning (APRL).”).
While Duan teaches using a domain chief to generate models for controlling a vehicle, Duan does not explicitly teach:
A system, comprising: a memory that stores computer executable components; and a processor that executes at least one of the computer-executable components that:
pre-trained neural network
wherein the policy metadata comprises at least one of geographic context data or temporal context data associated with the dynamic control of the vehicle functional unit related to the modification of the at least one of the weights or the biases of the policy neural network of the trainable neural network model,
wherein the geographic context data identifies a geographic location in which the modification of the at least one of the weights or the biases of the policy neural network occurred, and wherein the temporal context data identifies at least one of a time of day or a season in which the modification of the at least one of the weights or the biases of the policy neural network occurred.
Xu teaches pre-trained neural network (Xu, pg. 2 Section 1, “In this paper, we present a Knowledge Transfer based Multi-task Deep Reinforcement Learning framework (KTM-DRL) for continuous control, which enables a single DRL agent to achieve expert-level performance on multiple different tasks by learning from task-specific teachers. In KTM-DRL, the multi-task agent leverages an offline knowledge transfer algorithm [pre-trained neural network] designed particularly for the actor-critic architecture to quickly learn a control policy from the experience of task-specific teachers.”).
Duan and Xu are both in the same field of endeavor (i.e. reinforcement learning). It would have been obvious to a person having ordinary skill in the art before the effective filing date of the claimed invention to combine Duan and Xu to teach the above limitation(s). The motivation for doing so is that incorporating transfer learning improves the learning speed of a model (cf. Xu, pg. 2 Section 1, “the multi-task agent leverages an offline knowledge transfer algorithm designed particularly for the actor-critic architecture to quickly learn a control policy from the experience of task-specific teachers.”).
While Duan in view of Xu teaches using a domain chief to generate models for controlling a vehicle using policy and value neural networks, the combination does not explicitly teach:
A system, comprising: a memory that stores computer executable components; and a processor that executes at least one of the computer-executable components that:
wherein the policy metadata comprises at least one of geographic context data or temporal context data associated with the dynamic control of the vehicle functional unit related to the modification of the at least one of the weights or the biases of the policy neural network of the trainable neural network model,
wherein the geographic context data identifies a geographic location in which the modification of the at least one of the weights or the biases of the policy neural network occurred, and wherein the temporal context data identifies at least one of a time of day or a season in which the modification of the at least one of the weights or the biases of the policy neural network occurred.
Palanisamy teaches:
A system, comprising: a memory that stores computer executable components; and a processor that executes at least one of the computer-executable components that: (Palanisamy, ⁋54, “The controller 34 includes at least one processor 44 [and a processor that executes at least one of the computer-executable components that:] and a computer readable storage device or media 46. The processor 44 can be any custom made or commercially available processor, a central processing unit (CPU)… The computer-readable storage device or media 46 may be implemented using any of a number of known memory devices such as PROMs (programmable read-only memory), EPROMs (electrically PROM), EEPROMs (electrically erasable PROM), flash memory, or any other electric, magnetic, optical, or combination memory devices capable of storing data, some of which represent executable instructions [A system, comprising: a memory that stores computer executable components;], used by the controller 34 in controlling the autonomous vehicle 10.”).
wherein the policy metadata comprises at least one of geographic context data or temporal context data associated with the dynamic control of the vehicle functional unit related to the modification of the at least one of the weights or the biases of the policy neural network of the trainable neural network model, (Palanisamy, ⁋16, “In one embodiment, the actor-critic network architecture is based on a Deep Recurrent Deterministic Policy Gradient (DRDPG) algorithm that considers both temporal attention at the temporal attention module that learns to weigh the importance of previous frames [or temporal context data associated] of image data at any given frame of the image data, and spatial attention at the spatial attention module that learns the importance of different locations [wherein the policy metadata comprises at least one of geographic context data] in the any given frame of the image data.”). (Palanisamy, ⁋9, “System, methods and a controller are provided for controlling an autonomous vehicle. In one embodiment, a method and a system are provided for learning lane-change policies via an actor-critic network architecture/system. Each lane-change policy describes one or more actions selected to be taken by an autonomous vehicle. The lane-change policies each comprise a high-level action and associated low-level actions. The high-level actions comprise: a left lane-change, lane following, and a right lane-change. Each of the associated low-level actions comprises a steering angle command parameter and an acceleration-brake rate parameter…The actor network includes a convolutional neural network (CNN), a spatial attention module, a temporal attention module, and at least one fully connected layer. [the dynamic control of the vehicle functional unit related to the modification of the at least one of the weights or the biases of the policy neural network of the trainable neural network model,]”).
Duan, as modified by Xu, and Palanisamy are in the same field of endeavor (i.e. reinforcement learning). It would have been obvious to a person having ordinary skill in the art before the effective filing date of the claimed invention to combine Duan, as modified by Xu, with Palanisamy to teach the above limitation(s). The motivation for doing so is that geographical and temporal contexts improve policy selection (cf. Palanisamy, ⁋16, “The spatial attention module and the temporal attention module collectively improve lane-change policy selection of the actor network.”).
While Duan in view of Xu and Palanisamy teaches using a domain chief to generate models for controlling a vehicle with geographic and temporal contexts when modifications to the policy network happen, the combination does not explicitly teach:
wherein the geographic context data identifies a geographic location in which the modification of the at least one of the weights or the biases of the policy neural network occurred, and wherein the temporal context data identifies at least one of a time of day or a season in which the modification of the at least one of the weights or the biases of the policy neural network occurred.
Siddiqui teaches:
wherein the geographic context data identifies a geographic location in which the modification of the at least one of the weights or the biases of the policy neural network occurred, (Siddiqui, ⁋62, “The machine learning network 130 may be of various network types (e.g., an artificial recurrent neural network (RNN), a feedforward neural network (FNN), or other neural network types) [the policy neural network].”). (Siddiqui, ⁋64, “In one embodiment, the system 100 trains the machine learning network 130, with a reinforcement learning method by processing the driving scenario data 122 where the driving scenario data 122 includes dynamic objects with trajectories, an initial condition (e.g. a state) and goals (e.g., an intended destination). The system 100 applies a policy to all of the dynamic objects or a sub-group of the dynamic objects based on a particular type of the dynamic object, and the system 100 runs an iterative learning simulation until the goals are achieved or until the dynamic objects crash into one another. After the iterative learning simulation process is performed by the system 100, the system 100 evaluates trajectories, and scores the trajectories based on a predetermined criteria…Based on the evaluation, the system 100 updates or modifies policy parameters to better fit the predetermined criteria; during learning, states determine the current context and based on the context a policy is selected. The performance of the selected policy determines whether the policy is updated [i.e. in which the modification of the at least one of the weights or the biases of the policy neural network occurred,].”). (Siddiqui, ⁋66, “The system 100 may train the machine learning network 130 using one or more of the following driving categories as input states…geographical location and/or time of day) [wherein the geographic context data identifies a geographic location].”).
and wherein the temporal context data identifies at least one of a time of day or a season in which the modification of the at least one of the weights or the biases of the policy neural network occurred. (Siddiqui, ⁋62, “The machine learning network 130 may be of various network types (e.g., an artificial recurrent neural network (RNN), a feedforward neural network (FNN), or other neural network types) [the policy neural network].”). (Siddiqui, ⁋64, “In one embodiment, the system 100 trains the machine learning network 130, with a reinforcement learning method by processing the driving scenario data 122 where the driving scenario data 122 includes dynamic objects with trajectories, an initial condition (e.g. a state) and goals (e.g., an intended destination). The system 100 applies a policy to all of the dynamic objects or a sub-group of the dynamic objects based on a particular type of the dynamic object, and the system 100 runs an iterative learning simulation until the goals are achieved or until the dynamic objects crash into one another. After the iterative learning simulation process is performed by the system 100, the system 100 evaluates trajectories, and scores the trajectories based on a predetermined criteria…Based on the evaluation, the system 100 updates or modifies policy parameters to better fit the predetermined criteria; during learning, states determine the current context and based on the context a policy is selected. The performance of the selected policy determines whether the policy is updated [i.e. in which the modification of the at least one of the weights or the biases of the policy neural network occurred,].”). (Siddiqui, ⁋66, “The system 100 may train the machine learning network 130 using one or more of the following driving categories as input states…geographical location and/or time of day) [and wherein the temporal context data identifies at least one of a time of day or a season].”).
Duan, as modified by Xu and Palanisamy, and Siddiqui are in the same field of endeavor (i.e. policy learning). It would have been obvious to a person having ordinary skill in the art before the effective filing date of the claimed invention to combine Duan, as modified by Xu and Palanisamy, with Siddiqui to teach the above limitation(s). The motivation for doing so is that training policy networks using environmental factors improves the accuracy of the policy selected (cf. Siddiqui, see ⁋62-63).
Regarding claim 2 and analogous claims 12 and 19, Duan in view of Xu, Palanisamy, and Siddiqui teaches the system of claim 1. Duan further teaches wherein the observational data comprises at least one of input parameter data, output parameter data, or internal state data. (Duan, pg. 2 col. 2, “In addition, if the input to each policy is the whole sensory information [wherein the observational data comprises at least one of input parameter data], the learning algorithm must determine which parts of the information are relevant. Hence, we propose to design different meaningful indicators as the state representation for different policies.”).
Regarding claim 5 and analogous claims 13 and 20, Duan in view of Xu, Palanisamy, and Siddiqui teaches the system of claim 1. Xu teaches the pre-trained process as seen in claim 1. Duan further teaches wherein the gradient data is based on respective weights and respective biases of the policy neural network and the value neural network of the pre-trained neural network model and the policy neural network and the value neural network of the modified trainable neural network model. (Duan, pg. 2 col. 1, “All policies including master policy and maneuver policies are represented by fully-connected neural networks; neural networks are interpreted as having weights and biases”). (Duan, pg. 3 col. 1 and Figure 2, “Inspired by the A3C algorithm, we use asynchronous parallel car-learners to train the policy π(a|s; θ) and estimate the state value V (s; w), as shown in Fig. 2 [20]. Each car-learner contains its own policy networks and a value networks, and makes decision according to their policy outputs. The multiple car-learners interact with different parts of the environment in parallel. Meanwhile, each car-learner computes gradients with respect to the parameters of the value network and the policy network at each step; each car learner is interpreted as a maneuver policy and Figure 2 shows gradient data from each car learner updating the shared networks, or trainable neural network [i.e. wherein the gradient data is based on respective weights and respective biases of the policy neural network and the value neural network of the pre-trained neural network model]. Then, the average gradients are applied to update the shared policy network and shared value network at each step [and the policy neural network and the value neural network of the modified trainable neural network model.]. These car-learners synchronize their local network parameters from the shared networks before they make new decisions. 
We refer to this training algorithm as asynchronous parallel reinforcement learning (APRL).”).
Regarding claim 6, Duan in view of Xu, Palanisamy, and Siddiqui teaches the system of claim 1. Xu teaches the pre-trained process as seen in claim 1. Duan further teaches wherein the at least one of the computer executable components further: replaces the pre-trained neural network model with an updated pre-trained neural network model received via the in-vehicle network from the domain chief. (Duan, pg. 3 col. 1 and Figure 2, “Then, the average gradients are applied to update the shared policy network and shared value network at each step. These car-learners synchronize their local network parameters from the shared networks before they make new decisions; synchronizing parameters is interpreted as updating the neural network model [i.e. wherein the at least one of the computer executable components further: replaces the pre-trained neural network model with an updated pre-trained neural network model received via the in-vehicle network from the domain chief.]. We refer to this training algorithm as asynchronous parallel reinforcement learning (APRL).”).
Regarding claim 7, Duan in view of Xu, Palanisamy, and Siddiqui teaches the system of claim 6. Xu teaches the pre-trained process as seen in claim 1. Duan further teaches wherein the at least one of the computer executable components further: constructs an updated trainable neural network model using the updated pre-trained neural network; and replaces the modified trainable neural network model with the updated trainable neural network model. (Duan, pg. 3 col. 1 and Figure 2, “The multiple car-learners interact with different parts of the environment in parallel. Meanwhile, each car-learner computes gradients with respect to the parameters of the value network and the policy network at each step. Then, the average gradients are applied to update the shared policy network and shared value network at each step; updating the shared networks is interpreted as constructing and replacing the trainable model with an updated trainable model [i.e. wherein the at least one of the computer executable components further: constructs an updated trainable neural network model using the updated pre-trained neural network; and replaces the modified trainable neural network model with the updated trainable neural network model.]”).
Regarding claim 8 and analogous claims 16 and 23, Duan in view of Xu, Palanisamy, and Siddiqui teaches the system of claim 1. Palanisamy teaches the policy metadata as seen in claim 1. Duan further teaches wherein the at least one of the computer executable components further: modifies the domain neural network model based on respective gradient data and respective policy metadata from a plurality of domains comprising the domain within the in-vehicle network. (Duan, pg. 2 col. 2 and Figure 2, “To learn each policy of the H-RL, we should first decompose the driving task into several driving maneuvers, such as driving in lane, left/right lane change, etc. It should be noted that a driving maneuvers may include many similar driving behaviors. For example, the driving-in-lane maneuver is a combination of many similar behaviors including lane keeping, car following and free driving. Then, we can learn the sub-policy, which is also called maneuver policy, for each maneuver driven by independent sub-goals. The reward function for sub-policy learning can be easily designed by considering only the corresponding maneuver instead of the whole driving task. Then we learn a master policy [wherein the at least one of the computer executable components further: modifies the domain neural network model based] to activate certain maneuver policy according to the driving task; each maneuver policy is interpreted as a domain thus multiple maneuvers are interpreted as a plurality of domains [i.e. from a plurality of domains comprising the domain within the in-vehicle network.].
Although we need to consider the entire driving task while training the master policy, the associated reward function can also be very simple because you do not have to worry about how to control the actuators to achieve each maneuver.; In Figure 2, the shared networks are interpreted as the master policy or domain model thus the car learners, or maneuver policy networks, updates the domain model using respective gradient and policy data [i.e. on respective gradient data and respective policy metadata]”).
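For illustration only (hypothetical names, with a toy rule standing in for the learned networks), the hierarchy Duan describes — a master policy that activates one maneuver (sub-)policy, which in turn produces the low-level action — can be sketched as:

```python
def driving_in_lane_policy(obs):
    return {"steer": 0.0, "throttle": 0.3}      # lane keeping / car following

def lane_change_left_policy(obs):
    return {"steer": -0.2, "throttle": 0.2}     # move one lane left

MANEUVER_POLICIES = {
    "in_lane": driving_in_lane_policy,
    "change_left": lane_change_left_policy,
}

def master_policy(obs):
    """Choose which maneuver policy to activate (toy stand-in for a learned network)."""
    return "change_left" if obs.get("slow_car_ahead") else "in_lane"

def act(obs):
    maneuver = master_policy(obs)               # high-level decision
    return maneuver, MANEUVER_POLICIES[maneuver](obs)  # low-level action
```

The reward functions discussed in the passage would be attached per maneuver policy and separately to the master policy; they are omitted here.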
Regarding claim 10, Duan in view of Xu, Palanisamy, and Siddiqui teaches the system of claim 1. Duan further teaches wherein the trainable neural network model is modified using a reinforcement learning technique. (Duan, pg. 3, Figure 1; Figure 1 shows that the master policy is modified through a hierarchical reinforcement learning technique [i.e. wherein the trainable neural network model is modified using a reinforcement learning technique.]).
Claim 9 is rejected under 35 U.S.C. 103 as being unpatentable over Duan, et al., Non-Patent Literature “Hierarchical Reinforcement Learning for Self-Driving Decision-Making without Reliance on Labeled Driving Data” (“Duan”) in view of Xu, et al., Non-Patent Literature “Knowledge Transfer in Multi-Task Deep Reinforcement Learning for Continuous Control” (“Xu”) and further in view of Palanisamy, et al., US Pre-Grant Publication 2020/0139973A1 (“Palanisamy”), Siddiqui, et al., US Pre-Grant Publication 2020/0353943A1 (“Siddiqui”), and Li, et al., US Patent 12072954 B1 (“Li”).
Regarding claim 9, Duan in view of Xu, Palanisamy, and Siddiqui teaches the system of claim 1. Palanisamy teaches the policy metadata as seen in claim 1. Xu teaches the pre-trained process as seen in claim 1.
While Duan in view of Xu, Palanisamy, and Siddiqui teaches a domain chief that uses pre-trained template models, the combination does not explicitly teach wherein the domain chief receives the pre-trained neural network model via the in-vehicle network from a vehicle chief that comprises a vehicle neural network model that is modified based on respective gradient data and respective policy metadata generated by a plurality of domain chiefs that includes the domain chief.
Li teaches wherein the domain chief receives the pre-trained neural network model via the in-vehicle network from a vehicle chief that comprises a vehicle neural network model that is modified based on respective gradient data and respective policy metadata generated by a plurality of domain chiefs that includes the domain chief. (Li, col. 3 lines 19-29, “FIG. 1 illustrates an example of a federated learning system 100 comprising edge devices 102(1) . . . (N) and a central server 104, according to at least one embodiment. In at least one embodiment, edge devices 102 transmit their respective local models as updates 110 to central server 104 and receive global models as updates 112 from central server 104; central server is interpreted as the vehicle chief and the local models are interpreted as the domain chiefs [i.e. wherein the domain chief receives the pre-trained neural network model via the in-vehicle network from a vehicle chief that comprises a vehicle neural network model]. In at least one embodiment, edge devices 102 interface to respective storage 120 for training images, hyperparameter storage 122, and model storage 124. In at least one embodiment, a plurality of edge devices 102 are present and may be distributed among distinct domains [generated by a plurality of domain chiefs that includes the domain chief.]”). (Li, col. 5 lines 23-30, “In at least one embodiment, a plurality of edge devices train local models of neural networks on locally-maintained training data over one or more training rounds and at an end of a training round provide their local models to a central server [that is modified based on respective gradient data and respective policy metadata generated by a plurality of domain chiefs]. In at least one embodiment, a central server aggregates local models to generate a global model. In at least one embodiment, a central server provides a global model to edge devices for use in a subsequent training round.”).
Duan in view of Xu, Palanisamy, and Siddiqui, and Li are in the same field of endeavor (i.e. machine learning). It would have been obvious for a person having ordinary skill in the art before the effective filing date of the claimed invention to combine Duan in view of Xu, Palanisamy, and Siddiqui with Li to teach the above limitation(s). The motivation for doing so is that federated learning improves the security of local data (cf. Li, col. 1, Background, “In some uses of a federated learning system, it is desirable that local training data not be able to be reconstructed from local models uploaded from edge devices, such as where local training data is confidential material such as medical imagery from patients. Processes for training neural networks while maintaining locality of training data can be improved.”).
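For illustration only (hypothetical names and toy parameters), the federated round Li describes — edge devices train local models, upload them, and the central server aggregates them into a global model that is redistributed for the next round — can be sketched as:

```python
def aggregate(local_models):
    """Central server: average the edge devices' local model parameters."""
    n = len(local_models)
    return [sum(m[i] for m in local_models) / n for i in range(len(local_models[0]))]

def federated_round(global_model, edge_updates):
    # Each edge device starts from the global model and trains locally
    # (local training is reduced to adding a toy update vector here).
    locals_ = [[g + u for g, u in zip(global_model, update)]
               for update in edge_updates]
    return aggregate(locals_)                  # new global model, sent back to edges

global_model = [0.0, 0.0]
new_global = federated_round(global_model, [[1.0, 2.0], [3.0, 4.0]])
```

Only the aggregated parameters leave the server, which is the data-locality property the cited Background passage is concerned with.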
Claims 14-15 and 21-22 are rejected under 35 U.S.C. 103 as being unpatentable over Duan, et al., Non-Patent Literature “Hierarchical Reinforcement Learning for Self-Driving Decision-Making without Reliance on Labeled Driving Data” (“Duan”) in view of Xu, et al., Non-Patent Literature “Knowledge Transfer in Multi-Task Deep Reinforcement Learning for Continuous Control” (“Xu”) and further in view of Palanisamy, et al., US Pre-Grant Publication 2020/0139973A1 (“Palanisamy”), Siddiqui, et al., US Pre-Grant Publication 2020/0353943A1 (“Siddiqui”), and Nakabayashi, et al., US Pre-Grant Publication 2021/0383215A1 (“Nakabayashi”).
Regarding claim 14 and analogous claim 21, Duan in view of Xu, Palanisamy, and Siddiqui teaches the computer-implemented method of claim 11. Xu teaches the pre-trained process as seen in claim 11. Duan further teaches further comprising: replacing, by the system, the pre-trained neural network model with an updated pre-trained neural network model received via the in-vehicle network from the domain chief. (Duan, pg. 3 col. 1 and Figure 2, “Then, the average gradients are applied to update the shared policy network and shared value network at each step. These car-learners synchronize their local network parameters from the shared networks before they make new decisions; synchronizing parameters is interpreted as updating the neural network model [i.e. replacing, by the system, the pre-trained neural network model with an updated pre-trained neural network model received via the in-vehicle network from the domain chief.]. We refer to this training algorithm as asynchronous parallel reinforcement learning (APRL).”).
While Duan in view of Xu, Palanisamy, and Siddiqui teaches updating the pre-trained models, the combination does not explicitly teach wherein the updated pre-trained neural network model is sent to the domain chief via an extravehicular network from a model catalog repository that stores template models trained using crowd-sourced data obtained from a plurality of vehicles.
Nakabayashi teaches wherein the updated pre-trained neural network model is sent to the domain chief via an extravehicular network from a model catalog repository that stores template models trained using crowd-sourced data obtained from a plurality of vehicles. (Nakabayashi, ¶¶ 63-64, “In step S14, the server 1 compares model information and vehicle information of trained models of other vehicles stored in the model database; multiple other trained models of other vehicles is interpreted as crowd-sourced data from a plurality of vehicles (i.e. via an extravehicular network from a model catalog repository that stores template models trained using crowd-sourced data obtained from a plurality of vehicles.) with the model information and the vehicle information of the transmission source vehicle, which have been received in step S12. Then, the server 1 selects a trained model of which the training conditions most closely match (equal to or the closest to) the transmission source vehicle from the trained models stored in the model database, as the trained model for transfer learning; selecting for transfer learning is interpreted as sending the updated pre-trained model back to the vehicle through the domain chief (i.e. wherein the updated pre-trained neural network model is sent to the domain chief via an extravehicular network). Here, the selected trained model for transfer learning is an example of a “model of a particular vehicle”. In the present embodiment, the server 1 quantifies the degree of matching of training conditions for each trained model stored in the model database, based on the items for determining the degree of matching of training conditions that are included in the model information and the vehicle information, such as the number of hidden layers and the number of nodes in each hidden layer, the vehicle type, the vehicle specifications, the distance traveled, and so forth. 
Thus, the server 1 selects the trained model with the highest degree of matching as the trained model for transfer learning.”).
Duan in view of Xu, Palanisamy, and Siddiqui, and Nakabayashi are in the same field of endeavor (i.e. machine learning). It would have been obvious for a person having ordinary skill in the art before the effective filing date of the claimed invention to combine Duan in view of Xu, Palanisamy, and Siddiqui with Nakabayashi to teach the above limitation(s). The motivation for doing so is that using a model repository reduces the training time for new models (cf. Nakabayashi, ¶ 5, “a model training system and a server in which the computation amount necessary for training can be reduced, and the amount of time necessary for training can be shortened, with regard to creating a trained model.”).
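For illustration only (hypothetical fields and a toy matching score), the server-side selection Nakabayashi describes — quantifying the degree of matching of training conditions for each stored trained model and selecting the closest match as the model for transfer learning — can be sketched as:

```python
def matching_degree(candidate, source):
    """Toy score: count the training-condition fields that match the source vehicle.
    Nakabayashi's actual items (hidden layers, nodes, vehicle specifications,
    distance traveled, etc.) are reduced to three placeholder fields."""
    fields = ("hidden_layers", "nodes", "vehicle_type")
    return sum(candidate[f] == source[f] for f in fields)

def select_for_transfer(model_db, source):
    # Select the stored trained model with the highest degree of matching.
    return max(model_db, key=lambda m: matching_degree(m, source))

db = [
    {"id": "A", "hidden_layers": 3, "nodes": 64, "vehicle_type": "sedan"},
    {"id": "B", "hidden_layers": 2, "nodes": 64, "vehicle_type": "sedan"},
]
src = {"hidden_layers": 2, "nodes": 64, "vehicle_type": "sedan"}
best = select_for_transfer(db, src)
```

The selected model would then be sent back over the extravehicular network as the updated pre-trained model, per the mapping in the rejection above.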
Regarding claim 15 and analogous claim 22, Duan in view of Xu, Palanisamy, Siddiqui, and Nakabayashi teaches the computer-implemented method of claim 14. Xu teaches the pre-trained process as seen in claim 11. Nakabayashi teaches the updated…neural network from a model repository as seen in claim 14. Duan further teaches further comprising: constructing, by the system, a new trainable neural network model based on the updated pre-trained neural network model; and replacing, by the system, the modified trainable neural network model with the new trainable neural network model. (Duan, pg. 3 col. 1 and Figure 2, “The multiple car-learners interact with different parts of the environment in parallel. Meanwhile, each car-learner computes gradients with respect to the parameters of the value network and the policy network at each step. Then, the average gradients are applied to update the shared policy network and shared value network at each step.; updating the shared networks is interpreted as constructing and replacing the trainable model with a new trainable model [i.e. further comprising: constructing, by the system, a new trainable neural network model based on the updated pre-trained neural network model; and replacing, by the system, the modified trainable neural network model with the new trainable neural network model.]”).
Conclusion
The prior art made of record and not relied upon is considered pertinent to applicant's disclosure. Heess, et al., US Patent 11210585 B1 discloses a multi-level system which includes a high-level controller neural network, a low-level controller neural network, and a subsystem. The high-level controller neural network receives an input observation and processes the input observation to generate a high-level output defining a control signal for the low-level controller. The low-level controller neural network receives a designated component of an input observation and processes the designated component and an input control signal to generate a low-level output that defines an action to be performed by the agent in response to the input observation.
Applicant's amendment necessitated the new ground(s) of rejection presented in this Office action. Accordingly, THIS ACTION IS MADE FINAL. See MPEP § 706.07(a). Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a).
A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action. In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any nonprovisional extension fee (37 CFR 1.17(a)) pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action. In no event, however, will the statutory period for reply expire later than SIX MONTHS from the mailing date of this final action.
Any inquiry concerning this communication or earlier communications from the examiner should be directed to NICHOLAS S WU whose telephone number is (571)270-0939. The examiner can normally be reached Monday - Friday 8:00 am - 4:00 pm EST.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Michelle Bechtold, can be reached at 571-431-0762. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.
/N.S.W./Examiner, Art Unit 2148
/MICHELLE T BECHTOLD/Supervisory Patent Examiner, Art Unit 2148