Prosecution Insights
Last updated: April 19, 2026
Application No. 17/773,378

METHOD FOR GENERATING LANE CHANGING DECISION-MAKING MODEL, METHOD FOR LANE CHANGING DECISION-MAKING OF UNMANNED VEHICLE AND ELECTRONIC DEVICE

Non-Final OA: §101, §102, §103, §112
Filed: Apr 29, 2022
Examiner: HOANG, AMY P
Art Unit: 2143
Tech Center: 2100 — Computer Architecture & Software
Assignee: Momenta (Suzhou) Technology Co. Ltd.
OA Round: 1 (Non-Final)
Grant Probability: 70% (Favorable)
OA Rounds: 1-2
To Grant: 3y 3m
With Interview: 99%

Examiner Intelligence

Career Allow Rate: 70%, above average (163 granted / 232 resolved; +15.3% vs TC avg)
Interview Lift: +64.2% on resolved cases with interview (strong)
Avg Prosecution: 3y 3m typical timeline; 31 applications currently pending
Career History: 263 total applications across all art units
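The headline figures in this panel are simple derived statistics. A quick sketch (values taken from the raw counts above; note the Tech Center average is not reported directly and is only implied by the +15.3% delta) shows how they reconcile:

```python
# Reconciling the examiner panel's headline figures from its raw counts.
granted = 163      # from "163 granted / 232 resolved"
resolved = 232

allow_rate = granted / resolved
print(f"Career allow rate: {allow_rate:.1%}")    # 70.3%, shown as 70%

# The Tech Center average is implied by the "+15.3% vs TC avg" delta
# and, per the report's own footnote, is only an estimate.
implied_tc_avg = allow_rate - 0.153
print(f"Implied TC average: {implied_tc_avg:.1%}")   # 55.0%
```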

Statute-Specific Performance

§101: 15.9% (-24.1% vs TC avg)
§102: 17.0% (-23.0% vs TC avg)
§103: 46.0% (+6.0% vs TC avg)
§112: 13.4% (-26.6% vs TC avg)
Tech Center averages are estimates • Based on career data from 232 resolved cases

Office Action

DETAILED ACTION

Notice of Pre-AIA or AIA Status

The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA. This action is responsive to the application filed on 04/29/2022. Claims 1-3, 6 and 11-18 are presented in the case. Claims 1, 6 and 18 are independent claims.

Priority

Applicant's claim for the benefit of a Chinese Application CN201911181338.0 filed on November 27, 2019 is acknowledged.

Information Disclosure Statement

The information disclosure statement submitted on 04/29/2022 is in compliance with the provisions of 37 CFR 1.97. Accordingly, the information disclosure statement is being considered by the examiner.

Claim Rejections - 35 USC § 112

The following is a quotation of 35 U.S.C. 112(b):

(b) CONCLUSION.—The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the inventor or a joint inventor regards as the invention.

Claims 1-2, 11-12 and 17 are rejected under 35 U.S.C. 112(b) or 35 U.S.C. 112 (pre-AIA), second paragraph, as being indefinite for failing to particularly point out and distinctly claim the subject matter which the inventor or a joint inventor, or for pre-AIA the applicant, regards as the invention.

Claim 1 recites the limitations "the vehicle", "the state variable" and "the target vehicle". There is insufficient antecedent basis for these limitations in the claim.

Claim 2 recites the limitations "the front vehicle", "the present lane" and "the following vehicle". There is insufficient antecedent basis for these limitations in the claim.

Claim 12 recites the limitations "the front vehicle", "the present lane" and "the following vehicle". There is insufficient antecedent basis for these limitations in the claim.

Claims 11 and 17 recite the limitation "the present lane". There is insufficient antecedent basis for this limitation in the claim.

Claim Rejections - 35 USC § 101

35 U.S.C. 101 reads as follows:

Whoever invents or discovers any new and useful process, machine, manufacture, or composition of matter, or any new and useful improvement thereof, may obtain a patent therefor, subject to the conditions and requirements of this title.

Claims 1-3, 6 and 11-18 are rejected under 35 U.S.C. 101 because the claimed invention is directed to an abstract idea without significantly more.

Step 1: Claims 1-3, 11-16, 6 and 17 are directed to a method and claim 18 is directed to a device. Therefore, the claims are eligible under Step 1 for being directed to a process, and a machine, respectively.

Independent claim 1:

Step 2A Prong 1: The claim recites: A method of generating a lane changing decision-making model, comprising: obtaining a training sample set of vehicular lane changing, wherein the training sample set comprises a plurality of training sample groups, each of the training sample groups comprises a training sample under each time step length in a process that the vehicle completes lane changing based on a planned lane changing trajectory, the training sample comprises a group of state variables and corresponding control variables - Under its broadest reasonable interpretation in light of the specification, this limitation encompasses the mental process of evaluating data and selecting data based on judgement, which is observing, evaluating and judging that is practically capable of being performed in the human mind with the assistance of pen and paper.
obtaining the lane changing decision-making model, wherein the lane changing decision-making model enables the state variables of the target vehicle and the corresponding control variables to be correlated - Under its broadest reasonable interpretation in light of the specification, this limitation encompasses the mental process of evaluating data and selecting data based on judgement, which is observing, evaluating and judging that is practically capable of being performed in the human mind with the assistance of pen and paper.

Step 2A Prong 2: This judicial exception is not integrated into a practical application because the claim recites the additional elements: training a decision-making model based on deep reinforcement learning network by use of the training sample set - the step is recited at a high level of generality, and amounts to merely indicating a field of use or technological environment in which the judicial exception is performed (see MPEP § 2106.05(h)). Accordingly, these additional elements do not integrate the abstract idea into a practical application because they do not impose any meaningful limits on practicing the abstract idea. The claims are thus directed to the abstract idea.

Step 2B: The claims do not include additional elements that amount to significantly more than the judicial exception. The additional elements: training a decision-making model based on deep reinforcement learning network by use of the training sample set - generally linking the use of the judicial exception to indicate a field of use or technological environment. Thus, the additional elements, viewed individually or in combination, do not provide an inventive concept or otherwise amount to significantly more than the abstract idea itself. Accordingly, these additional elements do not amount to significantly more than the judicial exception. As such, the claims are ineligible.

Dependent claim 2:

Step 2A Prong 1: The claim recites the abstract ideas of claim 1.
Step 2A Prong 2: This judicial exception is not integrated into a practical application because the claim recites the additional elements: wherein the training sample set is obtained in the following manner: a vehicle is enabled to complete lane changing according to a rule-based optimization algorithm in a simulator to obtain the state variables of the target vehicle, the front vehicle in the present lane of the target vehicle and the following vehicle in the target lane under each time step length during a process of multiple lane changings and the corresponding control variables - These additional elements are recited at a high level of generality, merely invoke generic computer machinery as a tool to perform the underlying abstract ideas and thus fail to integrate the abstract idea into a practical application (see MPEP § 2106.05(f)), and amount to mere data gathering, which is a form of insignificant extra-solution activity (see MPEP § 2106.05(g)). Accordingly, these additional elements do not integrate the abstract idea into a practical application because they do not impose any meaningful limits on practicing the abstract idea. The claims are thus directed to the abstract idea.

Step 2B: The claims do not include additional elements that amount to significantly more than the judicial exception.
The additional elements: wherein the training sample set is obtained in the following manner: a vehicle is enabled to complete lane changing according to a rule-based optimization algorithm in a simulator to obtain the state variables of the target vehicle, the front vehicle in the present lane of the target vehicle and the following vehicle in the target lane under each time step length during a process of multiple lane changings and the corresponding control variables - These additional elements are recited at a high level of generality, merely invoke generic computer machinery as a tool to perform the underlying abstract ideas and thus fail to integrate the abstract idea into a practical application (see MPEP § 2106.05(f)), and amount to mere data gathering, which is well known and is a form of insignificant extra-solution activity (see MPEP § 2106.05(g)). Accordingly, these additional elements do not amount to significantly more than the judicial exception. As such, the claims are ineligible.

Dependent claim 3:

Step 2A Prong 1: The claim recites the abstract ideas of claim 1.

Step 2A Prong 2: This judicial exception is not integrated into a practical application because the claim recites the additional elements: wherein the decision-making model based on deep reinforcement learning network comprises a learning-based prediction network and a pre-trained rule-based target network - the step is recited at a high level of generality, and amounts to merely indicating a field of use or technological environment in which the judicial exception is performed (see MPEP § 2106.05(h)). Accordingly, these additional elements do not integrate the abstract idea into a practical application because they do not impose any meaningful limits on practicing the abstract idea. The claims are thus directed to the abstract idea.

Step 2B: The claims do not include additional elements that amount to significantly more than the judicial exception.
The additional elements: wherein the decision-making model based on deep reinforcement learning network comprises a learning-based prediction network and a pre-trained rule-based target network - generally linking the use of the judicial exception to indicate a field of use or technological environment. Thus, the additional elements, viewed individually or in combination, do not provide an inventive concept or otherwise amount to significantly more than the abstract idea itself. Accordingly, these additional elements do not amount to significantly more than the judicial exception. As such, the claims are ineligible.

Independent claims 6 and 18:

Step 2A Prong 1: The claims recite: A method of lane changing decision-making of an unmanned vehicle, comprising: at a determined lane changing moment, obtaining sensor data in body sensors of a target vehicle - Under its broadest reasonable interpretation in light of the specification, this limitation encompasses the mental process of evaluating data and selecting data based on judgement, which is observing, evaluating and judging that is practically capable of being performed in the human mind with the assistance of pen and paper.

Step 2A Prong 2: This judicial exception is not integrated into a practical application because the claims recite the additional elements: invoking a lane changing decision-making model generated by the method according to claim 1 to obtain a control variable of the target vehicle at each moment during a lane changing process, wherein the lane changing decision-making model enables a state variable of the target vehicle and a corresponding control variable to be correlated - the step is recited at a high level of generality, and amounts to no more than a recitation of the words "apply it" (or an equivalent), i.e., no more than mere instructions to implement an abstract idea or other exception on a computer (see MPEP § 2106.05(f)).
sending the control variable of each moment during a lane changing process to an actuation mechanism to enable the target vehicle to complete lane changing - the step is recited at a high level of generality, and amounts to mere data transmission, which is well known and is a form of insignificant extra-solution activity (see MPEP § 2106.05(g)).

An electronic device comprising one or more processors and a memory, wherein the memory is configured to store program instructions; and the one or more processors are configured to execute the program instructions stored in the memory, and when the one or more processors execute the program instructions stored in the memory, the electronic device is configured to perform the method of lane changing decision-making of an unmanned vehicle according to claim 6 - These limitations amount to components of a general purpose computer that applies a judicial exception, by use of conventional computer functions (see MPEP § 2106.05(b)).

Accordingly, these additional elements do not integrate the abstract idea into a practical application because they do not impose any meaningful limits on practicing the abstract idea. The claims are thus directed to the abstract idea.

Step 2B: The claims do not include additional elements that amount to significantly more than the judicial exception. The additional elements: invoking a lane changing decision-making model generated by the method according to claim 1 to obtain a control variable of the target vehicle at each moment during a lane changing process, wherein the lane changing decision-making model enables a state variable of the target vehicle and a corresponding control variable to be correlated - the step is recited at a high level of generality, and amounts to no more than a recitation of the words "apply it" (or an equivalent), i.e., no more than mere instructions to implement an abstract idea or other exception on a computer (see MPEP § 2106.05(f)).
sending the control variable of each moment during a lane changing process to an actuation mechanism to enable the target vehicle to complete lane changing - which is a well-understood, routine, conventional activity similar to receiving or transmitting data over a network described in MPEP 2106.05(d)(II).

An electronic device comprising one or more processors and a memory, wherein the memory is configured to store program instructions; and the one or more processors are configured to execute the program instructions stored in the memory, and when the one or more processors execute the program instructions stored in the memory, the electronic device is configured to perform the method of lane changing decision-making of an unmanned vehicle according to claim 6 - These limitations amount to components of a general purpose computer that applies a judicial exception, by use of conventional computer functions (see MPEP § 2106.05(b)).

Accordingly, these additional elements do not amount to significantly more than the judicial exception. As such, the claims are ineligible.

Dependent claim 11:

Step 2A Prong 1: The claim recites the abstract ideas of claim 1.

Step 2A Prong 2: This judicial exception is not integrated into a practical application because the claim recites the additional elements: wherein the state variables comprise a pose, a speed and an acceleration of a target vehicle, a pose, a speed and an acceleration of a front vehicle in the present lane of the target vehicle and a pose, a speed and an acceleration of a following vehicle in a target lane; and the control variables comprise a speed and an angular speed of the target vehicle - the step is recited at a high level of generality, and amounts to selecting a particular data source or type of data to be manipulated, which is a form of insignificant extra-solution activity (see MPEP § 2106.05(g)).
Accordingly, these additional elements do not integrate the abstract idea into a practical application because they do not impose any meaningful limits on practicing the abstract idea. The claims are thus directed to the abstract idea.

Step 2B: The claims do not include additional elements that amount to significantly more than the judicial exception. The additional elements: wherein the state variables comprise a pose, a speed and an acceleration of a target vehicle, a pose, a speed and an acceleration of a front vehicle in the present lane of the target vehicle and a pose, a speed and an acceleration of a following vehicle in a target lane; and the control variables comprise a speed and an angular speed of the target vehicle - viewed individually or in combination, these describe selecting a particular data source or type of data to be manipulated, similar to selecting information, based on types of information and availability of information in a power-grid environment, for collection, analysis and display, described in MPEP § 2106.05(g). Accordingly, these additional elements do not amount to significantly more than the judicial exception. As such, the claims are ineligible.

Dependent claim 12:

Step 2A Prong 1: The claim recites the abstract ideas of claim 1.
Step 2A Prong 2: This judicial exception is not integrated into a practical application because the claim recites the additional elements: wherein the training sample set is obtained in the following manner: vehicle data in a vehicular lane changing is sampled from a database storing vehicular lane changing information, wherein the vehicle data comprises the state variables of the target vehicle, the front vehicle in the present lane of the target vehicle and the following vehicle in the target lane under each time step length and the corresponding control variables - the step is recited at a high level of generality, and amounts to selecting a particular data source or type of data to be manipulated, which is a form of insignificant extra-solution activity (see MPEP § 2106.05(g)).

Step 2B: The claims do not include additional elements that amount to significantly more than the judicial exception. The additional elements: wherein the training sample set is obtained in the following manner: vehicle data in a vehicular lane changing is sampled from a database storing vehicular lane changing information, wherein the vehicle data comprises the state variables of the target vehicle, the front vehicle in the present lane of the target vehicle and the following vehicle in the target lane under each time step length and the corresponding control variables - viewed individually or in combination, these describe selecting a particular data source or type of data to be manipulated, similar to selecting information, based on types of information and availability of information in a power-grid environment, for collection, analysis and display, described in MPEP § 2106.05(g). Accordingly, these additional elements do not amount to significantly more than the judicial exception. As such, the claims are ineligible.
Dependent claim 13:

Step 2A Prong 1: The claim recites: for a training sample set pre-added to an experience pool, with any state variable in each group of training samples as an input of the prediction network, obtaining a prediction control variable of the prediction network for a next time step length of the state variable; with a state variable of the next time step length of the state variable in the training sample and a corresponding control variable as an input of the target network, obtaining a value evaluation Q value output by the target network - Under its broadest reasonable interpretation in light of the specification, this limitation encompasses the mental process of evaluating data and selecting data based on judgement, which is observing, evaluating and judging that is practically capable of being performed in the human mind with the assistance of pen and paper.

with the prediction control variable as an input of a pre-constructed environmental simulator, obtaining an environmental reward and a state variable of the next time step length output by the environmental simulator - Under its broadest reasonable interpretation in light of the specification, this limitation encompasses the mental process of evaluating data and selecting data based on judgement, which is observing, evaluating and judging that is practically capable of being performed in the human mind with the assistance of pen and paper.
according to multiple groups of experience data and the Q value output by the target network and corresponding to each group of experience data, calculating and optimizing a loss function to obtain a gradient of change of parameters of the prediction network and updating the parameters of the prediction network until the loss function converges - Under its broadest reasonable interpretation in light of the specification, this limitation encompasses a mathematical concept, namely a mathematical calculation using mathematical methods to determine desirability values.

Step 2A Prong 2: This judicial exception is not integrated into a practical application because the claim recites the additional elements: storing the state variable, the corresponding prediction control variable, the environmental reward and the state variable of the next time step length as a group of experience data into the experience pool - the step is recited at a high level of generality, and amounts to mere data storing, which is well known and is a form of insignificant extra-solution activity (see MPEP § 2106.05(g)). Accordingly, these additional elements do not integrate the abstract idea into a practical application because they do not impose any meaningful limits on practicing the abstract idea. The claims are thus directed to the abstract idea.

Step 2B: The claims do not include additional elements that amount to significantly more than the judicial exception. The additional elements: storing the state variable, the corresponding prediction control variable, the environmental reward and the state variable of the next time step length as a group of experience data into the experience pool - which is a well-understood, routine, conventional activity similar to storing and retrieving information in memory, described in MPEP 2106.05(d)(II). Accordingly, these additional elements do not amount to significantly more than the judicial exception. As such, the claims are ineligible.
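For readers less familiar with the architecture the examiner is characterizing, the combination recited in claims 13-16 (an experience pool, a prediction network, a target network, an environmental-simulator reward, and a mean-square-error loss over Q values) follows the familiar deep-Q/target-network training pattern. The sketch below illustrates that general pattern only; the network shapes, simulator dynamics, reward, discount factor, and learning rate are all invented for illustration and are not taken from the application, and the claimed prediction/target split is simplified to a pair of toy linear Q-networks:

```python
import numpy as np

rng = np.random.default_rng(0)
STATE_DIM, N_CONTROLS = 4, 3   # toy sizes, not from the application
GAMMA, LR = 0.9, 0.01          # assumed discount factor and learning rate

# Toy linear Q-networks: the prediction network is updated by gradient
# descent; the (here fixed) target network supplies stable Q-value targets.
pred_W = rng.normal(scale=0.1, size=(N_CONTROLS, STATE_DIM))
target_W = pred_W.copy()

def q_values(W, state):
    """Value-evaluation Q value for each candidate control variable."""
    return W @ state

def environment_simulator(state, control):
    """Stand-in for the pre-constructed environmental simulator: returns an
    environmental reward and the state variable of the next time step."""
    next_state = 0.9 * np.roll(state, 1)
    next_state[0] += 0.1 * control
    reward = -abs(float(next_state.sum()))   # toy reward signal
    return reward, next_state

# Fill the experience pool with (state, control, reward, next-state) groups.
experience_pool = []
state = rng.normal(size=STATE_DIM)
for _ in range(128):
    control = int(np.argmax(q_values(pred_W, state)))
    reward, next_state = environment_simulator(state, control)
    experience_pool.append((state, control, reward, next_state))
    state = next_state

def mse_loss_and_grad(batch):
    """Mean square error between the prediction network's Q values and the
    target network's Q targets, with its gradient w.r.t. pred_W."""
    grad = np.zeros_like(pred_W)
    total = 0.0
    for s, a, r, s2 in batch:
        target_q = r + GAMMA * float(np.max(q_values(target_W, s2)))
        err = float(q_values(pred_W, s)[a]) - target_q
        total += err ** 2
        grad[a] += 2.0 * err * s
    return total / len(batch), grad / len(batch)

# Update the prediction-network parameters over sampled experience groups.
losses = []
for _ in range(200):
    idx = rng.choice(len(experience_pool), size=16, replace=False)
    loss, grad = mse_loss_and_grad([experience_pool[i] for i in idx])
    pred_W -= LR * grad
    losses.append(loss)
```

Whether such a loop is an unpatentable "mathematical concept" is of course exactly what is in dispute above; the sketch is offered only to make the claimed data flow concrete.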
Dependent claim 14:

Step 2A Prong 1: The claim recites: wherein after the number of the groups of the experience data reaches a first preset number, according to multiple groups of experience data and the Q value output by the target network and corresponding to each group of experience data, calculating and optimizing a loss function to obtain a gradient of change of parameters of the prediction network and updating the parameters of the prediction network until the loss function converges - Under its broadest reasonable interpretation in light of the specification, this limitation encompasses a mathematical concept, namely a mathematical calculation using mathematical methods to determine desirability values.

Step 2A Prong 2 & Step 2B: There are no additional elements recited, so the claims do not provide a practical application and are not considered to be significantly more. As such, the claims are ineligible.

Dependent claim 15:

Step 2A Prong 1: The claim recites: wherein after the number of the groups of the experience data reaches the first preset number, according to the experience data, calculating and optimizing the loss function to obtain the gradient of change of the parameters of the prediction network and updating the parameters of the prediction network until the loss function converges - Under its broadest reasonable interpretation in light of the specification, this limitation encompasses a mathematical concept, namely a mathematical calculation using mathematical methods to determine desirability values.
after the number of the updates of the parameters of the prediction network reaches a second preset number, obtaining a prediction control variable with an environmental reward higher than a preset value and a corresponding state variable in the experience pool, or obtaining prediction control variables with environmental rewards ranked in top third preset number and corresponding state variables in the experience pool - Under its broadest reasonable interpretation in light of the specification, this limitation encompasses the mental process of evaluating data and selecting data based on judgement, which is observing, evaluating and judging that is practically capable of being performed in the human mind with the assistance of pen and paper.

Step 2A Prong 2: This judicial exception is not integrated into a practical application because the claim recites the additional elements: adding the prediction control variables and the corresponding state variables to a target network training sample set of the target network to train and update the parameters of the target network - the step is recited at a high level of generality, and amounts to no more than a recitation of the words "apply it" (or an equivalent), i.e., no more than mere instructions to implement an abstract idea or other exception on a computer (see MPEP § 2106.05(f)). Accordingly, these additional elements do not integrate the abstract idea into a practical application because they do not impose any meaningful limits on practicing the abstract idea. The claims are thus directed to the abstract idea.

Step 2B: The claims do not include additional elements that amount to significantly more than the judicial exception.
The additional elements: adding the prediction control variables and the corresponding state variables to a target network training sample set of the target network to train and update the parameters of the target network - the step is recited at a high level of generality, and amounts to no more than a recitation of the words "apply it" (or an equivalent), i.e., no more than mere instructions to implement an abstract idea or other exception on a computer (see MPEP § 2106.05(f)). Accordingly, these additional elements do not amount to significantly more than the judicial exception. As such, the claims are ineligible.

Dependent claim 16:

Step 2A Prong 1: The claim recites the abstract ideas of claim 1.

Step 2A Prong 2: This judicial exception is not integrated into a practical application because the claim recites the additional elements: wherein the loss function is a mean square error of a first preset number of value evaluation Q values of the prediction network and the value evaluation Q value of the target network, wherein the value evaluation Q value of the prediction network is about an input state variable, a corresponding prediction control variable and a policy parameter of the prediction network; and the value evaluation Q value of the target network is about a state variable of an input training sample, a corresponding control variable and a policy parameter of the target network - the step is recited at a high level of generality, and amounts to selecting a particular data source or type of data to be manipulated, which is a form of insignificant extra-solution activity (see MPEP § 2106.05(g)). Accordingly, these additional elements do not integrate the abstract idea into a practical application because they do not impose any meaningful limits on practicing the abstract idea. The claims are thus directed to the abstract idea.

Step 2B: The claims do not include additional elements that amount to significantly more than the judicial exception.
The additional elements: wherein the loss function is a mean square error of a first preset number of value evaluation Q values of the prediction network and the value evaluation Q value of the target network, wherein the value evaluation Q value of the prediction network is about an input state variable, a corresponding prediction control variable and a policy parameter of the prediction network; and the value evaluation Q value of the target network is about a state variable of an input training sample, a corresponding control variable and a policy parameter of the target network - viewed individually or in combination, these describe selecting a particular data source or type of data to be manipulated, similar to selecting information, based on types of information and availability of information in a power-grid environment, for collection, analysis and display, described in MPEP § 2106.05(g). Accordingly, these additional elements do not amount to significantly more than the judicial exception. As such, the claims are ineligible.

Dependent claim 17:

Step 2A Prong 1: The claim recites the abstract ideas of claim 6.

Step 2A Prong 2: This judicial exception is not integrated into a practical application because the claim recites the additional elements: wherein the sensor data comprises poses, speeds and accelerations of the target vehicle, a front vehicle in the present lane of the target vehicle and a following vehicle in a target lane - the step is recited at a high level of generality, and amounts to selecting a particular data source or type of data to be manipulated, which is a form of insignificant extra-solution activity (see MPEP § 2106.05(g)). Accordingly, these additional elements do not integrate the abstract idea into a practical application because they do not impose any meaningful limits on practicing the abstract idea. The claims are thus directed to the abstract idea.
Step 2B: The claims do not include additional elements that amount to significantly more than the judicial exception. The additional elements: wherein the sensor data comprises poses, speeds and accelerations of the target vehicle, a front vehicle in the present lane of the target vehicle and a following vehicle in a target lane - viewed individually or in combination, these describe selecting a particular data source or type of data to be manipulated, similar to selecting information, based on types of information and availability of information in a power-grid environment, for collection, analysis and display, described in MPEP § 2106.05(g). Accordingly, these additional elements do not amount to significantly more than the judicial exception. As such, the claims are ineligible.

Claim Rejections - 35 USC § 102

The following is a quotation of the appropriate paragraphs of 35 U.S.C. 102 that form the basis for the rejections under this section made in this Office action:

A person shall be entitled to a patent unless – (a)(1) the claimed invention was patented, described in a printed publication, or in public use, on sale or otherwise available to the public before the effective filing date of the claimed invention.

Claims 1-3 and 12 are rejected under 35 U.S.C. 102(a)(1) as being anticipated by Nishi, US 20180011488 A1.

Regarding independent claim 1, Nishi teaches a method of generating a lane changing decision-making model (Abstract), comprising: obtaining a training sample set of vehicular lane changing (Fig. 6, 710; [0057] Referring to FIG.
6, in block 710, the processor 58 may receive passively-collected data relating to a vehicle operation to be performed; [0025] Passively-collected data relevant to a freeway merging operation may include, for example, the speeds and accelerations of vehicles traveling in the right-most lane of the freeway proximate an on-ramp, and the speeds and accelerations of sample vehicles traveling along the on-ramp and entering the right-most lane; [0027] A vehicle operation as autonomously controlled by a control policy may be defined as a driving maneuver or set of driving maneuvers performed to accomplish a specific purpose, such as merging onto a freeway or changing lanes), wherein the training sample set comprises a plurality of training sample groups, each of the training sample groups comprises a training sample under each time step length in a process that the vehicle completes lane changing based on a planned lane changing trajectory ([0025] One example of passively collected data is the dataset described in http://www.fhwa.dot.gov/publications/research/operations/06137/, which describes the acquisition of vehicle trajectories around the entrance of a freeway using cameras mounted on the top of a building. In another example, passively collected data may include data collected by vehicle sensors responsive to a maneuver executed by a human driver. 
Data may be collected and provided to a computing system relating to maneuvers executed by a human driver, the vehicle environmental conditions under which the maneuver was executed, and events occurring in the vehicle surroundings subsequent to the maneuver and/or in response to the maneuver), the training sample comprises a group of state variables and corresponding control variables ([0025] Passively-collected data relevant to a freeway merging operation may include, for example, the speeds and accelerations of vehicles traveling in the right-most lane of the freeway proximate an on-ramp, and the speeds and accelerations of sample vehicles traveling along the on-ramp and entering the right-most lane; [0027] A vehicle operation as autonomously controlled by a control policy may be defined as a driving maneuver or set of driving maneuvers performed to accomplish a specific purpose, such as merging onto a freeway or changing lanes); and obtaining the lane changing decision-making model by training a decision-making model based on deep reinforcement learning network by use of the training sample set (Fig. 6, 720; [0058] In block 720, the processor and/or other elements of the computing system 14 may iteratively apply a passive actor-critic (PAC) reinforcement learning method as described herein to samples of the passively-collected data. By application of the PAC method, a control policy may be learned which enables a vehicle to be controlled to perform the vehicle operation with a minimum expected cumulative cost), wherein the lane changing decision-making model enables the state variables of the target vehicle and the corresponding control variables to be correlated ([0026] A vehicle control dynamics model 87 may be a stimulus-response model describing how a vehicle responds to various inputs. The vehicle control dynamics model 87 may be used to determine (using passively-collected data) a vehicle control dynamics B(x) for the vehicle in given vehicle state x. 
The state cost function q(x) is the cost of the vehicle being in the state x, and may be learned based on known methods such as inverse reinforcement learning. The state cost q(x) and the vehicle control dynamics B(x) may be used for both revising and optimizing a control policy 101 as described herein. The vehicle control dynamics model 87 for any given vehicle may be determined and stored in a memory, such as memory 54; [0027] Referring again to FIG. 1, embodiments of the computing system may also include two learning systems or learning networks, an actor network (or “actor”) 83 and a critic network (or “critic”) 81, that interact with each other. These networks may be implemented using artificial neural networks (ANN's), for example. For purposes described herein, a control policy 101 (also denoted by the variable π) may be defined as a function or other relationship that specifies or determines an action u to be taken by a vehicle in response to each state x of a set of states that the vehicle may be in. Thus, for each state x of the vehicle during execution of an autonomous operation, the vehicle may be controlled to perform an associated action u=π(x). Therefore, the control policy controls operation of the vehicle to autonomously perform an associated operation, for example, freeway merging. The actor 83 may operate on the control policy to revise and/or optimize the policy using information received from the critic and other information. A vehicle operation as autonomously controlled by a control policy may be defined as a driving maneuver or set of driving maneuvers performed to accomplish a specific purpose, such as merging onto a freeway or changing lanes).

Regarding dependent claim 2, Nishi teaches all the limitations as set forth in the rejection of claim 1 that is incorporated.
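For context on the cited technique, the control policy u = π(x) and the actor-critic interaction that Nishi [0027] describes can be reduced to a minimal, purely illustrative sketch. The linear policy, the stand-in critic, and the finite-difference update below are assumptions of this example, not Nishi's implementation:

```python
# Illustrative sketch (NOT Nishi's implementation): a policy pi maps each
# vehicle state x to an action u, i.e. u = pi(x), and an "actor" revises
# the policy weights using a scalar signal from a "critic".

def make_policy(weights):
    """Return pi(x) = sum_i w_i * x_i (a one-dimensional action)."""
    def pi(x):
        return sum(w * xi for w, xi in zip(weights, x))
    return pi

def critic_value(x, u):
    """Stand-in critic: penalizes large gaps/speed differences and large
    actions. A real critic would be a learned value network."""
    return -(x[0] ** 2 + x[1] ** 2) - 0.1 * u ** 2

def actor_update(weights, x, lr=0.01, eps=1e-4):
    """Nudge each policy weight in the direction that raises the critic's
    value (finite-difference policy improvement, illustration only)."""
    updated = []
    for i, w in enumerate(weights):
        bumped = list(weights)
        bumped[i] = w + eps
        u0 = make_policy(weights)(x)
        u1 = make_policy(bumped)(x)
        grad = (critic_value(x, u1) - critic_value(x, u0)) / eps
        updated.append(w + lr * grad)
    return updated

# State x: e.g. (relative position dx12, relative velocity dv12, speed, accel).
x = (5.0, -1.0, 20.0, 0.0)
weights = [0.1, 0.1, 0.0, 0.0]
u = make_policy(weights)(x)         # action chosen in state x: u = pi(x)
weights = actor_update(weights, x)  # actor revises the policy
```

The point of the sketch is only the division of labor the citation relies on: the policy selects the action for each state, and the actor revises the policy using information received from the critic.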
Nishi further teaches wherein the training sample set is obtained in the following manner: a vehicle is enabled to complete lane changing according to a rule-based optimization algorithm in a simulator to obtain the state variables of the target vehicle, the front vehicle in the present lane of the target vehicle and the following vehicle in the target lane under each time step length during a process of multiple lane changings and the corresponding control variables ([0089] Referring to FIGS. 4 and 5, in one example of implementation of an embodiment of the pAC reinforcement learning method described herein, an autonomous freeway merging operation is simulated. Passively-collected data relating to the freeway merging operation is processed as previously described to learn a control policy configured and optimized for controlling the vehicle so as to perform the vehicle operation with a minimum expected cumulative cost. The vehicle may then be controlled in accordance with the learned control policy to perform the freeway merging operation; [0090] The freeway merging operation may have a four-dimensional state space and a one-dimensional action space. The passive dynamics A(xt) of the vehicle environment dynamics and the vehicle control dynamics B(x) may be expressed as: [equation image: passive dynamics A(xt) and vehicle control dynamics B(x)] where the subscript “0” denotes the vehicle labeled “0” that is behind the merging vehicle on the freeway's rightmost lane (referred to as the “following vehicle”), the subscript “1” denotes the vehicle labeled “1” which is the merging automated vehicle on the ramp RR, and the subscript “2” denotes the vehicle labeled “2” (also referred to as the “leading vehicle”) that is in front of the merging vehicle 1 on the right-most lane of the freeway. dx12 and dv12 denote the merging vehicle's relative position and velocity from the leading vehicle, and the term α0 (x) represents the acceleration of the following vehicle 0.
The parameters α, β, and γ are model parameters (for example, as used in the Gazis-Herman-Rothery (GHR) family of car-following models) which may be tuned to approximate human driving behaviors in traffic environments).

Regarding dependent claim 3, Nishi teaches all the limitations as set forth in the rejection of claim 1 that is incorporated. Nishi further teaches wherein the decision-making model based on deep reinforcement learning network comprises a learning-based prediction network and a pre-trained rule-based target network ([0027] Referring again to FIG. 1, embodiments of the computing system may also include two learning systems or learning networks, an actor network (or “actor”) 83 and a critic network (or “critic”) 81, that interact with each other).

Regarding dependent claim 12, Nishi teaches all the limitations as set forth in the rejection of claim 1 that is incorporated. Nishi further teaches wherein the training sample set is obtained in the following manner: vehicle data in a vehicular lane changing is sampled from a database storing vehicular lane changing information, wherein the vehicle data comprises the state variables of the target vehicle, the front vehicle in the present lane of the target vehicle and the following vehicle in the target lane under each time step length and the corresponding control variables ([0025] The memory 54 may contain data 60 and/or instructions 56 (e.g., program logic) executable by the processor(s) 58 to execute various functions.
Data 60 may include passively collected data relating to the vehicle operation to be controlled by the control policy … Passively-collected data relevant to a freeway merging operation may include, for example, the speeds and accelerations of vehicles traveling in the right-most lane of the freeway proximate an on-ramp, and the speeds and accelerations of sample vehicles traveling along the on-ramp and entering the right-most lane; [0027] A vehicle operation as autonomously controlled by a control policy may be defined as a driving maneuver or set of driving maneuvers performed to accomplish a specific purpose, such as merging onto a freeway or changing lanes).

Claim Rejections - 35 USC § 103

The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:

A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

Claims 6, 11 and 17-18 are rejected under 35 U.S.C. 103 as being unpatentable over Nishi, in view of GRAVES, US 20210004006 A1.

Regarding dependent claim 11, Nishi teaches all the limitations as set forth in the rejection of claim 1 that is incorporated.
Nishi further teaches wherein the state variables comprise a pose, a speed and an acceleration of a target vehicle, a pose, a speed and an acceleration of a front vehicle in the present lane of the target vehicle and a pose, a speed and an acceleration of a following vehicle in a target lane ([0071] one or more sensors may be provided which are configured to detect the proximity, distance, speed, and other information relating to vehicles moving adjacent or within a certain distance of the vehicle 11; Fig. 4; [0090] The freeway merging operation may have a four-dimensional state space and a one-dimensional action space. The passive dynamics A(xt) of the vehicle environment dynamics and the vehicle control dynamics B(x) may be expressed as: [equation image: passive dynamics A(xt) and vehicle control dynamics B(x)] where the subscript “0” denotes the vehicle labeled “0” that is behind the merging vehicle on the freeway's rightmost lane (referred to as the “following vehicle”), the subscript “1” denotes the vehicle labeled “1” which is the merging automated vehicle on the ramp RR, and the subscript “2” denotes the vehicle labeled “2” (also referred to as the “leading vehicle”) that is in front of the merging vehicle 1 on the right-most lane of the freeway. dx12 and dv12 denote the merging vehicle's relative position and velocity from the leading vehicle, and the term α0 (x) represents the acceleration of the following vehicle 0. The parameters α, β, and γ are model parameters (for example, as used in the Gazis-Herman-Rothery (GHR) family of car-following models) which may be tuned to approximate human driving behaviors in traffic environments). Nishi does not explicitly disclose the control variables comprise a speed and an angular speed of the target vehicle.
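As background on the α0(x) car-following term quoted above, one common form of the Gazis-Herman-Rothery (GHR) model can be sketched as follows. The functional form, exponents, and parameter values here are illustrative assumptions about the GHR family generally, not taken from Nishi:

```python
# Hedged sketch of a GHR-style car-following term: the following vehicle's
# acceleration responds to its own speed, the speed difference to the
# leader, and the gap, scaled by tunable "model parameters" alpha, beta,
# gamma (cf. the tunable parameters Nishi [0090] mentions).
#     a = alpha * v_f**beta * dv / dx**gamma

def ghr_acceleration(v_follower, dv, dx, alpha=1.0, beta=0.5, gamma=1.5):
    """Acceleration response of the following vehicle (m/s^2).

    v_follower: follower speed (m/s); dv: leader speed minus follower
    speed (m/s); dx: gap to the leader (m). Raises on a non-positive gap.
    """
    if dx <= 0:
        raise ValueError("gap dx must be positive")
    return alpha * (v_follower ** beta) * dv / (dx ** gamma)

# Follower at 20 m/s, leader 2 m/s faster, 30 m gap: a mild acceleration.
a0 = ghr_acceleration(v_follower=20.0, dv=2.0, dx=30.0)
```

Tuning α, β, and γ against observed trajectories is what lets such a term "approximate human driving behaviors," as the citation puts it.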
However, in the same field of endeavor, GRAVES teaches the control variables comprise a speed and an angular speed of the target vehicle ([0031] The present disclosure describes examples of training and deploying (or testing) a set of predictors that may be used to generate a set of predictions that are used to generate a vehicle action. A vehicle action may be a vehicle action for steering control (referred to hereinafter as a steering control action), a vehicle action for speed control (referred to hereinafter as a speed control action), and a vehicle action for steering and speed control (referred to as a steering and speed control action), based on a current state of the vehicle which includes current and past digital images of the surrounding environment, and a most recent (e.g. last) vehicle action. A steering control action may be for changing a steering angle of the vehicle. A speed control action may be for changing a target speed of the vehicle. A steering and speed control action may be for changing both a steering angle and a target speed of the vehicle; [0057] The predictive control system 400 receives sensor data from the environment sensors 110 (e.g. sensors mounted on the vehicle and the vehicle sensor integrated into the vehicle and vehicle data 188 generated by the vehicle control system 115 as described above, and generates vehicle actions to be executed. The vehicle actions (e.g., steering control action, a speed control action, or a steering and speed control action) may be processed by the vehicle control system 115, and inputted to the drive control system 150. In some embodiments, a steering control action causes a change in vehicle steering and speed control action causes change in vehicle speed). 
It would have been obvious to a person of ordinary skill in the art before the effective filing date of the claimed invention to incorporate the teaching of generating vehicle actions for steering control and speed control of an autonomous vehicle based on a state of the vehicle, which is based on data collected by sensors, as suggested in GRAVES into Nishi’s system because both of these systems are addressing autonomously controlling a vehicle to perform a vehicle operation. This modification would have been motivated by the desire to provide improvements in techniques for generating steering control using digital images (GRAVES, [0005]).

Regarding independent claim 6, Nishi teaches a lane changing decision-making model generated by the method according to claim 1 as set forth above. Nishi does not explicitly disclose a method of lane changing decision-making of an unmanned vehicle, comprising: at a determined lane changing moment, obtaining sensor data in body sensors of a target vehicle; invoking a lane changing decision-making model generated by the method according to claim 1 to obtain a control variable of the target vehicle at each moment during a lane changing process, wherein the lane changing decision-making model enables a state variable of the target vehicle and a corresponding control variable to be correlated; and sending the control variable of each moment during a lane changing process to an actuation mechanism to enable the target vehicle to complete lane changing. However, in the same field of endeavor, GRAVES teaches a method of lane changing decision-making of an unmanned vehicle ([0007] the present disclosure describes a method for predictive control of an autonomous vehicle; [0138] FIG.
7 is a flowchart illustrating an example method 700 for deployment of the trained predictive control system 400), comprising: at a determined lane changing moment, obtaining sensor data in body sensors of a target vehicle ([0140] At 702, the state sub-module 410 determines the state st at the current time t, using sensor data gathered from the sensors 110 (in particular, a digital image received from the camera(s) 112, and vehicle data obtained from information received from the vehicle sensors 111)); invoking a lane changing decision-making model ([0141] At 704, a set of lane centeredness predictors 404 generate a set of future lane centeredness predictions Plane, and a set of road angle predictors 406 generates a set of future road angle predictions Pangle, based on the determined current state st), wherein the lane changing decision-making model enables a state variable of the target vehicle and a corresponding control variable to be correlated ([0142] At 706, the predictive state vector p is outputted by the predictive perception module 402. The predictive state vector contains the predictions 416. For examples where the controller 412 implements a PID controller, the integral and derivative terms for the predictions may be computed by the predictive perception module 402 and also stored in the predictive state vector p; [0143] At 708, the controller 412 generates a vehicle action at, based on the received predictive state vector p. The vehicle action at may be a steering control action, a speed control action, or a steering and speed control action. The vehicle action includes control signals for controlling the steering and target speed); and sending the control variable of each moment during a lane changing process to an actuation mechanism to enable the target vehicle to complete lane changing ([0144] At 712, the vehicle action at is executed. 
For example, the vehicle action generated by the controller 412 may be outputted to be processed by the vehicle control system 115, in order to be executed by the drive control system 150. The drive control system 150 in turn generates control signals for actuating the electromechanical system 190, to enable the vehicle 105 to perform the actions). It would have been obvious to a person of ordinary skill in the art before the effective filing date of the claimed invention to incorporate the teaching of an example method for deployment of the trained predictive control system as suggested in GRAVES into Nishi’s system because both of these systems are addressing autonomously controlling a vehicle to perform a vehicle operation. This modification would have been motivated by the desire to provide improvements in techniques for generating steering control using digital images (GRAVES, [0005]). Regarding dependent claim 17, the combination of Nishi and GRAVES teaches all the limitations as set forth in the rejection of claim 6 that is incorporated. Nishi further teaches wherein the sensor data comprises poses, speeds and accelerations of the target vehicle, a front vehicle in the present lane of the target vehicle and a following vehicle in a target lane ([0071] one or more sensors may be provided which are configured to detect the proximity, distance, speed, and other information relating to vehicles moving adjacent or within a certain distance of the vehicle 11; Fig. 4; [0090] The freeway merging operation may have a four-dimensional state space and a one-dimensional action space. 
The passive dynamics A(xt) of the vehicle environment dynamics and the vehicle control dynamics B(x) may be expressed as: [equation image: passive dynamics A(xt) and vehicle control dynamics B(x)] where the subscript “0” denotes the vehicle labeled “0” that is behind the merging vehicle on the freeway's rightmost lane (referred to as the “following vehicle”), the subscript “1” denotes the vehicle labeled “1” which is the merging automated vehicle on the ramp RR, and the subscript “2” denotes the vehicle labeled “2” (also referred to as the “leading vehicle”) that is in front of the merging vehicle 1 on the right-most lane of the freeway. dx12 and dv12 denote the merging vehicle's relative position and velocity from the leading vehicle, and the term α0 (x) represents the acceleration of the following vehicle 0. The parameters α, β, and γ are model parameters (for example, as used in the Gazis-Herman-Rothery (GHR) family of car-following models) which may be tuned to approximate human driving behaviors in traffic environments).

Regarding independent claim 18, it is a device claim that corresponds to the method of claim 6. Therefore, it is rejected for the same reason as claim 6 above. GRAVES further teaches an electronic device (Fig. 2, 115) comprising one or more processors (Fig. 2, 102) and a memory (Fig. 2, 126), wherein the memory is configured to store program instructions ([0048]); and the one or more processors are configured to execute the program instructions stored in the memory, and when the one or more processors execute the program instructions stored in the memory, the electronic device is configured to perform the method of lane changing decision-making of an unmanned vehicle according to claim 6 ([0048]).

Claims 13-16 are rejected under 35 U.S.C. 103 as being unpatentable over Nishi as applied in claim 1, in view of Hu et al. (hereinafter Hu), US 20190266489 A1.
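The deployment flow recited in claim 6 (obtain body-sensor data at the lane-changing moment, invoke the model for a control variable at each time step, send each control variable to the actuation mechanism) can be sketched structurally as follows. Every name is hypothetical and the stand-in model is illustrative only:

```python
# Structural sketch of the claim 6 flow; not the applicant's or GRAVES's
# implementation. read_sensors / model / actuate are caller-supplied
# stand-ins for the body sensors, the trained lane changing
# decision-making model, and the actuation mechanism.

def run_lane_change(read_sensors, model, actuate, done, max_steps=50):
    """Drive one lane change; returns the control variables that were sent."""
    sent = []
    for _ in range(max_steps):
        state = read_sensors()    # poses/speeds/accelerations (cf. claim 17)
        control = model(state)    # e.g. (speed, angular speed) per claim 11
        actuate(control)          # hand off to the actuation mechanism
        sent.append(control)
        if done(state):           # lane change completed
            break
    return sent

# Toy stand-ins: sensors just count time steps; the "model" commands a
# constant speed and angular speed; actuation logs what it receives.
steps = {"n": 0}
def read_sensors():
    steps["n"] += 1
    return {"t": steps["n"]}

log = []
sent = run_lane_change(read_sensors,
                       model=lambda s: (15.0, 0.02),
                       actuate=log.append,
                       done=lambda s: s["t"] >= 5)
```

The sketch shows only the claimed ordering: sense, invoke the model, actuate, at every time step until the maneuver completes.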
Regarding dependent claim 13, Nishi teaches all the limitations as set forth in the rejection of claim 1 that is incorporated. Nishi does not explicitly disclose wherein the step of obtaining the lane changing decision-making model by training the decision-making model based on deep reinforcement learning network by use of the training sample set comprises: for a training sample set pre-added to an experience pool, with any state variable in each group of training samples as an input of the prediction network, obtaining a prediction control variable of the prediction network for a next time step length of the state variable; with a state variable of the next time step length of the state variable in the training sample and a corresponding control variable as an input of the target network, obtaining a value evaluation Q value output by the target network; with the prediction control variable as an input of a pre-constructed environmental simulator, obtaining an environmental reward and a state variable of the next time step length output by the environmental simulator; storing the state variable, the corresponding prediction control variable, the environmental reward and the state variable of the next time step length as a group of experience data into the experience pool; and according to multiple groups of experience data and the Q value output by the target network and corresponding to each group of experience data, calculating and optimizing a loss function to obtain a gradient of change of parameters of the prediction network and updating the parameters of the prediction network until the loss function converges. However, in the same field of endeavor, Hu teaches wherein the step of obtaining the lane changing decision-making model by training the decision-making model based on deep reinforcement learning network by use of the training sample set (Abstract; [0003] a method for interaction-aware decision making; [0055] FIG. 
1 is an exemplary component diagram of a system 100 for cooperative multi-goal, multi-agent, multi-stage (CM3) reinforcement learning; [0168] The traffic simulator 1112 may be a deep Q-learning system, which implements reinforcement learning based on the state input generated attributes for the simulated autonomous vehicle and the simulation environment provided by the traffic simulator 1112) comprises: for a training sample set pre-added to an experience pool, with any state variable in each group of training samples as an input of the prediction network, obtaining a prediction control variable of the prediction network for a next time step length of the state variable ([0059] In any event, when the CM3 policy network 140 is stored on the storage device of the vehicle, this enables the controller to autonomously drive the vehicle around based on the CM3 policy network 140, and to make autonomous driving decisions according to the CM3 reinforcement learning which occurred within the simulator 108 because the CM3 policy network 140 may be indicative of one or more of the policies or decisions which should be made based on the training or the simulation. For example, the CM3 network policy may receive an input of an observation associated with the first autonomous vehicle or the second autonomous vehicle (e.g., a vehicle state or an environment state) and output a suggested action; [0080] The processor 102 or the simulator 108 may generate a CM3 network policy based on the first agent neural network and the second agent neural network. The simulator 108 may bridge the two stages (e.g., stage one and stage two) by modular augmentation of neural network policy and value functions. The CM3 network policy may be indicative of data which may be utilized to direct the controller of the autonomous vehicle(s) of FIG. 1 to operation in an autonomous fashion. 
For example, the CM3 network policy may receive an input of an observation associated with the first autonomous vehicle or the second autonomous vehicle (e.g., a vehicle state or an environment state) and output a suggested action, which may include the no-operation action, the acceleration action, the deceleration action, the shift left one sub-lane action, and the shift right one sub-lane action, similarly to the actions used during simulation and provided by the simulator 108; Fig. 11; [0166] In greater detail, the state input generator 1108 may generate a set of attributes associated with an autonomous vehicle undergoing training (e.g., the simulated autonomous vehicle). For example, the set of attributes may include the current velocity v associated with the autonomous vehicle, a lane position l associated with the autonomous vehicle, and a distance d2g from the autonomous vehicle to a goal, which may be a desired destination. Additionally, the set of attributes or the position information associated with the vehicle may be represented as an occupancy grid. The set of attributes may be state information which is indicative or representative of a state (S) or scenario associated with the autonomous vehicle); with a state variable of the next time step length of the state variable in the training sample and a corresponding control variable as an input of the target network, obtaining a value evaluation Q value output by the target network ([0163] According to one aspect, the traffic simulator 1112 may control the other vehicles within the simulation environment to avoid collisions with one another, but not with the simulated autonomous vehicle (e.g., the agent). The Q-masker 1114 may be implemented via a low-level controller and be part of a deep Q-learning system which learns policies which enable the autonomous vehicle to make decisions on a tactical level.
The deep Q-learning system may learn a mapping between states and Q-values associated with each potential action; [0164] In a Q-learning network, a mapping between states and Q-values associated to each action may be learned); with the prediction control variable as an input of a pre-constructed environmental simulator, obtaining an environmental reward and a state variable of the next time step length output by the environmental simulator ([0060] FIG. 2 is an exemplary component diagram of the simulator 108 for the system 100 for cooperative multi-goal, multi-agent, multi-stage reinforcement learning of FIG. 1. In FIG. 2, the simulator 108 of the system 100 for CM3 reinforcement learning of FIG. 1 may be seen. Here, the agent may take the action in the environment. This may be interpreted, by the critic, as the reward or penalty and a representation of the state, which may be then fed back into the agent. The agent may interact with the environment by taking the action at a discrete time step. At each time step, the agent may receive an observation which may include the reward. The agent may select one action from a set of available actions, which results in a new state and a new reward for a subsequent time step. The goal of the agent is generally to collect the greatest amount of rewards possible; [0169] The simulation environment may be the world or the environment through which the simulated autonomous vehicle moves. The traffic simulator 1112 simulates the simulated environment and uses the simulated autonomous vehicle's current state and action (e.g., for a given time interval) as an input, and returns the simulated autonomous vehicle's reward, described below, and next state as an output. 
For example, the traffic simulator 1112 may take the vehicle's current state (e.g., 50 mph) and action (e.g., deceleration), and apply the laws of physics to determine the simulated autonomous vehicle's next state (e.g., 45 mph)); storing the state variable, the corresponding prediction control variable, the environmental reward and the state variable of the next time step length as a group of experience data into the experience pool ([0153] Double replay buffers B1 and B2 may be used as a heuristic to improve training stability for all algorithms on stage two. Instead of storing each environment transition immediately, an additional episode buffer may be used to store all transitions encountered during each episode. At the end of each episode, the cumulative reward of all agents may be compared against a threshold (e.g., 32), to determine whether the transitions in the episode buffer should be stored into B1 or B2. For training, half of the minibatch is sampled from each of B1 and B2; [0186] the action generator 1116 may store one or more of the explored set of actions associated with the one or more additional time intervals as one or more corresponding trajectories. As previously discussed, a trajectory may be a sequence of states and/or actions which include those states; [0188] During training, actions may be taken in an epsilon-greedy manner and E may be annealed. The action generator 1116 may simulate full trajectories until the terminal state and classify the trajectories as either good or bad (i.e., the good buffer is associated with the simulated autonomous vehicle making it to the goal without being involved in collisions, exceeding the speed limit, etc.). 
Explained another way, all transitions (i.e., state, action, and reward tuples from successful trajectories) are saved in the good buffer while transitions from failed trajectories (i.e., not making it to the goal) are saved in the bad buffer); and according to multiple groups of experience data and the Q value output by the target network and corresponding to each group of experience data, calculating and optimizing a loss function to obtain a gradient of change of parameters of the prediction network and updating the parameters of the prediction network until the loss function converges ([0115] The simulator 108 may train these reduced actor and critic networks until convergence; [0189] For any transition, the expected reward may be back calculated from the terminal reward, given by the following: [equation image: expected reward back-calculated from the terminal reward] where γ is the discount factor; [0190] The network may be optimized using the following loss function, using a mini batch of transitions equally sampled from the good and bad buffer: [equation image: loss function over a mini-batch sampled equally from the good and bad buffers] [0191] The two separate buffers help maintain a decent exposure to successful executions when the exploration might constantly lead to failed trajectories, thus avoiding the network getting stuck in a local minima; [0192] In this way, the autonomous vehicle policy generation system 1100 provides a framework that leverages the strengths of deep reinforcement learning for high-level tactical decision making and demonstrates a more structured and data efficient alternative to end-to-end complete policy learning on problems where a high-level policy may be difficult to formulate using traditional optimization or rule based methods, but where well-designed low-level controllers (e.g., the controller implementing the Q-masker 1114) are available.
The autonomous vehicle policy generation system 1100 uses deep reinforcement learning to obtain a high-level policy for tactical decision making, while maintaining a tight integration with the low-level controller). It would have been obvious to a person of ordinary skill in the art before the effective filing date of the claimed invention to incorporate the teaching of a method for interaction-aware decision making as suggested in Hu into Nishi’s system because both of these systems are addressing autonomously controlling a vehicle to perform a vehicle operation. This modification would have been motivated by the desire to enhance the efficiency of the learning, greatly reducing the amount of training time associated with training a CM3 system (Hu, [0076]).

Regarding dependent claim 14, the combination of Nishi and Hu teaches all the limitations as set forth in the rejection of claim 13 that is incorporated. Hu further teaches wherein after the number of the groups of the experience data reaches a first preset number, according to multiple groups of experience data and the Q value output by the target network and corresponding to each group of experience data, calculating and optimizing a loss function to obtain a gradient of change of parameters of the prediction network and updating the parameters of the prediction network until the loss function converges ([0207] The Q-masker 1114 may mask a subset of output Q-values (e.g., an action set) to be simulated by the simulator 108. The action generator 1116 may train the first agent based on a remaining set of actions by excluding the masked set of actions from the set of possible actions.
Therefore, only the Q-values associated with a remaining subset of actions are considered by the simulator 108 during simulation, thereby mitigating the amount of processing power and/or computing resources utilized during simulation and training of the autonomous vehicle in autonomous vehicle policy generation; [0208] Based on the remaining subset of actions (e.g., of a set of possible actions, the subset of actions excluding the masked subset), the action generator 1116 may explore the remaining actions and determine the autonomous vehicle policy accordingly. This may be repeated across different time intervals. The Q-masker 1114 may thereby 'force' the simulated autonomous vehicle to explore only the non-masked states, and thus, only learn actions associated with a subset of the space of associated Q-values (indicative of the long-term return of an action (a) under policy (π) on state (s))).

Regarding dependent claim 15, the combination of Nishi and Hu teaches all the limitations as set forth in the rejection of claim 14, which is incorporated.
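The Q-masking mechanism quoted above ([0207]-[0208]) amounts to suppressing disallowed actions before the max operator is applied to the output Q-values. A minimal sketch under assumed names (the `ACTIONS` list and both helpers are illustrative, not from the reference):

```python
import math

# Illustrative Q-masking sketch: a lower-level module flags actions that
# are invalid in the current state, their Q-values are masked to -inf
# before the max operator, and the agent therefore only selects (and
# only learns about) the remaining subset of actions.

ACTIONS = ["keep_lane", "change_left", "change_right", "accelerate", "brake"]

def mask_invalid(q_values, invalid_actions):
    """Replace the Q-values of masked actions with -inf."""
    return [
        -math.inf if action in invalid_actions else q
        for action, q in zip(ACTIONS, q_values)
    ]

def select_action(q_values, invalid_actions):
    """Max operator applied after masking: only non-masked actions compete."""
    masked = mask_invalid(q_values, invalid_actions)
    best = max(range(len(ACTIONS)), key=lambda i: masked[i])
    return ACTIONS[best]
```

For example, if a lane change to the left is disallowed, `select_action([0.2, 0.9, 0.1, 0.4, 0.3], {"change_left"})` ignores the highest raw Q-value and returns `"accelerate"`, mirroring how the cited Q-masker narrows the simulator's search space.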
Hu further teaches wherein after the number of the groups of the experience data reaches the first preset number, according to the experience data, calculating and optimizing the loss function to obtain the gradient of change of the parameters of the prediction network and updating the parameters of the prediction network until the loss function converges, is performed, the method further comprises: after the number of the updates of the parameters of the prediction network reaches a second preset number, obtaining a prediction control variable with an environmental reward higher than a preset value and a corresponding state variable in the experience pool, or obtaining prediction control variables with environmental rewards ranked in top third preset number and corresponding state variables in the experience pool, and adding the prediction control variables and the corresponding state variables to a target network training sample set of the target network to train and update the parameters of the target network ([0164] In a Q-learning network, a mapping between states and Q-values associated to each action may be learned. According to one aspect, Q-masking, in the form of a mask that is applied on the output Q-values before a max (or soft max) operator may be applied on the output layer of Q-values to pick the ‘best’ action. In this regard, the direct effect of the Q-masker 1114 is that when taking the max operation to choose the ‘best’ action, only the Q-values associated with a subset of actions, which are dictated by a lower-level module, are considered; [0165] Thus, the Q-masker 1114 may mask a subset of output Q-values which are to be simulated by the traffic simulator 1112. 
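The claim 15 limitation discussed above describes promoting high-reward experiences from the experience pool into a training sample set for the target network, either by a reward threshold or by a top-N ranking. A minimal sketch of that bookkeeping (the thresholds, names, and the `(state, control, reward)` tuple layout are assumptions for illustration, not drawn from either reference):

```python
# Illustrative sketch of the claim 15 bookkeeping: once the prediction
# network has been updated a preset number of times, experiences whose
# environmental reward exceeds a preset value (or ranks in the top N)
# are copied from the experience pool into the target network's
# training sample set.

REWARD_THRESHOLD = 0.8   # the "preset value" for the reward filter (assumed)
TOP_N = 3                # the "third preset number" for the ranked variant (assumed)

def select_by_threshold(experience_pool, threshold=REWARD_THRESHOLD):
    """Variant 1: keep (state, control) pairs whose reward exceeds the preset value."""
    return [(s, c) for (s, c, r) in experience_pool if r > threshold]

def select_top_n(experience_pool, n=TOP_N):
    """Variant 2: keep the (state, control) pairs with the top-n rewards."""
    ranked = sorted(experience_pool, key=lambda t: t[2], reverse=True)
    return [(s, c) for (s, c, r) in ranked[:n]]

def extend_target_training_set(training_set, experience_pool, use_top_n=False):
    """Add the selected pairs to the target network's training sample set."""
    picked = (select_top_n(experience_pool) if use_top_n
              else select_by_threshold(experience_pool))
    training_set.extend(picked)
    return training_set
```

Either variant feeds the same downstream step: the selected pairs become training samples used to update the target network's parameters.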
Therefore, only the Q-values associated with a remaining subset of actions are considered by the traffic simulator 1112 during simulation, thereby mitigating the amount of processing power and/or computing resources utilized during simulation and training of the autonomous vehicle in autonomous vehicle policy generation. Based on the remaining subset of actions (e.g., of a set of possible actions, the subset of actions excluding the masked subset), the action generator 1116 may explore the remaining actions and determine the autonomous vehicle policy accordingly. This may be repeated across one or more time intervals. The Q-masker 1114 may thereby 'force' the simulated autonomous vehicle to explore only the non-masked states, and thus, only learn a subset of the space of associated Q-values (which is indicative of the long-term return of an action (a) under policy (π) on state (s))).

Regarding dependent claim 16, the combination of Nishi and Hu teaches all the limitations as set forth in the rejection of claim 14, which is incorporated. Hu further teaches wherein the loss function is a mean square error of a first preset number of value evaluation Q values of the prediction network and the value evaluation Q value of the target network, wherein the value evaluation Q value of the prediction network is about an input state variable, a corresponding prediction control variable and a policy parameter of the prediction network; and the value evaluation Q value of the target network is about a state variable of an input training sample, a corresponding control variable and a policy parameter of the target network ([0190] The network may be optimized using the following loss function, using a mini batch of transitions equally sampled from the good and bad buffer: [equation image omitted]).

Conclusion

The prior art made of record and not relied upon is considered pertinent to applicant's disclosure. Applicant is required under 37 C.F.R.
§ 1.111(c) to consider these references fully when responding to this action. YANG et al. (US 20170364083 A1) discloses a local trajectory planning method and apparatus for a smart vehicle, pre-acquiring path planning information from a starting location to a destination.

It is noted that any citation to specific pages, columns, lines, or figures in the prior art references and any interpretation of the references should not be considered to be limiting in any way. A reference is relevant for all it contains and may be relied upon for all that it would have reasonably suggested to one having ordinary skill in the art. In re Heck, 699 F.2d 1331, 1332-33, 216 U.S.P.Q. 1038, 1039 (Fed. Cir. 1983) (quoting In re Lemelson, 397 F.2d 1006, 1009, 158 U.S.P.Q. 275, 277 (C.C.P.A. 1968)).

Any inquiry concerning this communication or earlier communications from the examiner should be directed to AMY P HOANG whose telephone number is (469) 295-9134. The examiner can normally be reached M-Th 8:30-5:00 PM. Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO-supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice. If attempts to reach the examiner by telephone are unsuccessful, the examiner's supervisor, JENNIFER WELCH, can be reached at 571-272-7212. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.

Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov.
Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.

/AMY P HOANG/
Examiner, Art Unit 2143

/JENNIFER N WELCH/
Supervisory Patent Examiner, Art Unit 2143

Prosecution Timeline

Apr 29, 2022
Application Filed
Feb 02, 2026
Non-Final Rejection — §101, §102, §103, §112 (current)

Precedent Cases

Applications granted by this same examiner with similar technology

Patent 12602596
APPARATUS AND METHOD FOR VALIDATING DATASET BASED ON FEATURE COVERAGE
2y 5m to grant Granted Apr 14, 2026
Patent 12572263
ACCESS CARD WITH CONFIGURABLE RULES
2y 5m to grant Granted Mar 10, 2026
Patent 12536432
PRE-TRAINING METHOD OF NEURAL NETWORK MODEL, ELECTRONIC DEVICE AND MEDIUM
2y 5m to grant Granted Jan 27, 2026
Patent 12475669
METHOD AND APPARATUS WITH NEURAL NETWORK OPERATION FOR DATA NORMALIZATION
2y 5m to grant Granted Nov 18, 2025
Patent 12461595
SYSTEM AND METHOD FOR EMBEDDED COGNITIVE STATE METRIC SYSTEM
2y 5m to grant Granted Nov 04, 2025


Prosecution Projections

1-2
Expected OA Rounds
70%
Grant Probability
99%
With Interview (+64.2%)
3y 3m
Median Time to Grant
Low
PTA Risk
Based on 232 resolved cases by this examiner. Grant probability derived from career allow rate.
