Detailed Action
This Office Action is in response to the remarks entered on 10/07/2025. Claims 1-7, 9-16, 18-25 and 27 are currently pending.
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Claim Rejections - 35 USC § 101
Amended claims were received on 10/07/2025. The rejection under 35 U.S.C. 101 has been withdrawn.
Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.
Claims 1-2, 5-6, 9-11, 14-15, 18-20, 23-24, and 27 are rejected under 35 U.S.C. 103 as being unpatentable over Na (Na & Oh, “Hybrid Control for Autonomous Mobile Robot Navigation Using Neural Network Based Behavior Modules and Environment Classification”, 2003, hereinafter ‘Na’) in view of Heess (US 11210585 B1, hereinafter ‘Heess’) and further in view of Johnson et al. (US 2004/0025153 A1, hereinafter ‘Johnson’).
Regarding claim 1, Na teaches:
A system for autonomous behavior generation, the system comprising: ([Na, Abstract, line 1-3 and 14-16] discloses that the architecture selects behaviors based on a classification)
an autonomous mobile platform having one or more mobile platform actuators; and ([Na, page 195, Fig. 3] and [Na, page 198, right col, last para, line 20 – page 199, right col, line 25] discloses selecting a behavior (the neural network controller for the selected behavior, which corresponds to the tactical neural net) to generate the steering signal based on the results of the environment classification neural network (strategic neural net). The signal is sent to the Actuator)
one or more processors and one or more associated memories, each associated memory being a non-transitory computer-readable medium having executable instructions encoded thereon such that when executed, an associated one or more processors perform operations of ([Na, Abstract, line 1-3 and 14-16] discloses that the architecture selects behaviors based on a classification. The simulation is performed using a computer and an actual robot MORIS is used to test the architecture):
while in a behavior selection state, receiving, by a high-level controller, observations from a physical environment around the autonomous mobile platform and, using a strategic neural net, selecting a high-level behavior; ([Na, page 198, right col, last para, line 20 – page 199, right col, line 25] discloses selecting a behavior (the neural network controller for the selected behavior, which corresponds to the tactical neural net) to generate the steering signal based on the results of the environment classification neural network (strategic neural net). [Na, page 196, left col, 3.1. Environment Classification Network, line 1 – right col, line 12] discloses that the environment classification neural network determines the high-level behavior in conjunction with the modified potential field algorithm. The environment classification neural network in conjunction with the modified potential field algorithm is the high-level controller);
generating, by a low-level controller, an output command for a scripted action in the physical environment based on the selected high-level behavior, the scripted action being a precise maneuver of the autonomous mobile platform in the physical environment and the output command being a command for a series of actuator movements that, when implemented, cause the autonomous mobile platform to perform the scripted action; ([Na, page 198, right col, last para, line 20 – page 199, right col, line 25] discloses selecting a behavior (the neural network controller for the selected behavior, which corresponds to the tactical neural net) to generate the steering signal based on the results of the environment classification neural network (strategic neural net). [Na, page 201, left col, line 1-6] and [Na, page 200, Fig. 13] discloses that the steering angle (precise maneuver of the autonomous mobile platform) is determined based on the environment [Na, page 197, left col, 3.2.1. Neural Network Based Controller Capturing Human Basic Behaviors]. [Na, page 196, left col, 3.1. Environment Classification Network, line 1 – right col, line 12] discloses that the environment classification neural network determines the high-level behavior in conjunction with the modified potential field algorithm)
sending the output command to one or more mobile platform actuators of the autonomous mobile platform; ([Na, page 194, right col, 2. Behavior-Based Control Architecture, line 1-3 and 8-11], [Na, page 197, right col, line 16-19] and [Na, page 199, right col, 3.4. Action Selection for Competitive Coordination, line 1-6] collectively disclose generating the steering signal to perform the selected behavior and adding the signal to change the output behavior. [Na, Abstract, line 1-3 and 14-16] An actual robot MORIS is used to test the robot navigation architecture)
actuating the one or more mobile platform actuators to cause the autonomous mobile platform to perform the scripted action and associated precise maneuver in the physical environment, ([Na, page 195, Fig. 3] and [Na, page 198, right col, last para, line 20 – page 199, right col, line 25] discloses selecting a behavior (the neural network controller for the selected behavior, which corresponds to the tactical neural net) to generate the steering signal based on the results of the environment classification neural network (strategic neural net). [Na, page 201, left col, line 1-6] and [Na, page 200, Fig. 13] discloses that the steering angle (precise maneuver of the autonomous mobile platform) is determined based on the environment. [Na, Abstract, line 1-3 and 14-16] further discloses that an actual robot MORIS is used to test the architecture)
wherein each high-level behavior has a separate and distinct tactical neural net ([Na, page 197, left col, 3.2.1. Neural Network Based Controller Capturing Human Basic Behaviors] discloses that three behaviors have been implemented using a neural network and each neural net controller is trained under the environments listed in Table 1. Since Table 1 discloses three different behaviors trained under three different conditions, respectively, the paragraph indicates that each behavior has a separate controller (separate tactical neural net). [Na, page 199, right col, 3.4. Action Selection for Competitive Coordination, line 1-31; Figure 11] discloses that “First, the local environmental configuration is recognized as being one of the 16 prototypes based on the sensory information. If it is a U-shaped local, then the backup behavior is first selected to get out of it … Otherwise, the past behavior continues to control the robot” (maintaining the current high-level behavior).)
Na does not specifically disclose:
wherein the high-level controller selects behaviors at a lower frequency than a frequency at which the low-level controller generates output commands for scripted actions such that in selecting a behavior at each time-step i, if i modulo Q==0, then the high-level controller selects a behavior in [0,N], otherwise a same behavior is used from a previous iteration, wherein Q is a ratio between low-level and high-level behavior selection frequencies and N is a number of scripted behaviors;
after each time-step, the tactical neural net corresponding to a current high-level behavior makes a determination as to maintaining the current high-level behavior or transitioning back to a behavior selection state.
Heess teaches:
after each time-step, the tactical neural net corresponding to a current high-level behavior makes a determination as to maintaining the current high-level behavior or transitioning back to a behavior selection state. ([Heess, col 5, line 47-63] The system receives the current observation characterizing a current state of the environment, and then determines whether the criteria for generating a new control signal are satisfied. Once the criteria are satisfied, the system provides the current observation to the high-level controller (i.e., a recurrent neural network) and updates the recurrent state (behavior selection state). If the criteria are not satisfied, the system maintains the current control signal)
Before the effective filing date of the claimed invention, it would have been obvious to a person of ordinary skill in the art, having the teachings of Na and Heess, to use the method of determining whether to maintain or change the behavior after each time-step, as taught by Heess, to implement the behavior generation system of Na. Na teaches determining whether to maintain or change the selected behavior, but is silent on making this determination at each time-step. Heess discloses how to efficiently make this determination at each time-step, which allows the behavior generation system to interact with the agent (e.g., a robot or aircraft) in a structured manner.
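For illustration only (this sketch is not part of the claim language or the cited references), the per-time-step determination discussed above can be expressed as a simple control loop. All names here (control_loop, select_high_level, tactical_step, should_transition) are hypothetical stand-ins for the claimed high-level controller, tactical neural net, and transition determination:

```python
def control_loop(observations, select_high_level, tactical_step, should_transition):
    """Hypothetical sketch: after each time-step, the active tactical module
    decides whether to keep the current high-level behavior or return to the
    behavior-selection state (behavior = None)."""
    behavior = None
    commands = []
    for obs in observations:
        if behavior is None:                  # behavior-selection state
            behavior = select_high_level(obs)
        commands.append(tactical_step(behavior, obs))
        if should_transition(behavior, obs):  # tactical module's determination
            behavior = None                   # transition back to selection state
    return commands
```

The sketch only models the control flow (maintain vs. transition); it does not model any neural network internals of Na or Heess.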
Na in view of Heess does not specifically disclose:
wherein the high-level controller selects behaviors at a lower frequency than a frequency at which the low-level controller generates output commands for scripted actions such that in selecting a behavior at each time-step i, if i modulo Q==0, then the high-level controller selects a behavior in [0,N], otherwise a same behavior is used from a previous iteration, wherein Q is a ratio between low-level and high-level behavior selection frequencies and N is a number of scripted behaviors.
Johnson teaches:
selects behavior such that in selecting a behavior at each time-step i, if i modulo Q==0, then the high-level controller selects a behavior in [0,N], otherwise a same behavior is used from a previous iteration, wherein Q is a ratio between low-level and high-level behavior selection frequencies and N is a number of scripted behaviors (The limitation merely recites a loop operation that performs a new operation every Q steps. [Johnson, 0007] discloses a frequent path and an infrequent path. The ‘less frequently executed branch paths’ are interpreted as the low-level behavior, and the ‘modulo scheduled portion of the loop’ is interpreted as the high-level behavior. [Johnson, 0020-0021] teaches performing a branch operation if a flow control condition is true (modulo schedule, if modulo Q==0). The branch operation is not performed if the condition is not met (same behavior is used from a previous iteration). [Johnson, 0037] also teaches performing a new operation (branches) to avoid resource overlap based on modulo scheduling);
Before the effective filing date of the claimed invention, it would have been obvious to a person of ordinary skill in the art, having the teachings of Na, Heess, and Johnson, to use the method of selecting a behavior based on a modulo operation, as taught by Johnson, to implement the behavior generation system of Na. The suggestion and/or motivation to do so is to improve the efficiency of the system by maximizing resource utilization and avoiding resource overlap, as disclosed in [Johnson, 0005].
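Purely as an illustration of the recited modulo-scheduled selection (this sketch is not part of the record, and the names run and select_behavior are hypothetical), the limitation amounts to the following loop, in which the high-level selection fires only every Q-th time-step while the low-level controller runs every step:

```python
import random

def run(num_steps, Q, N, select_behavior=None):
    """Hypothetical sketch: at each time-step i, select a new high-level
    behavior only when i % Q == 0; otherwise reuse the previous behavior.
    Q is the ratio between low- and high-level frequencies, N the number
    of scripted behaviors (indices in [0, N])."""
    if select_behavior is None:
        select_behavior = lambda: random.randint(0, N)  # behavior index in [0, N]
    behavior = None
    trace = []
    for i in range(num_steps):
        if i % Q == 0:
            behavior = select_behavior()  # high-level selection (low frequency)
        trace.append(behavior)            # low-level command issued every step
    return trace
```

Note the loop itself is the claimed scheduling; what select_behavior computes (e.g., a strategic neural net) is left abstract.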
Claim 10 is a computer program product claim having limitations similar to those of claim 1. Therefore, it is rejected under the same rationale as claim 1 above.
Claim 19 is a method claim having limitations similar to those of claim 1. Therefore, it is rejected under the same rationale as claim 1 above.
Regarding claim 2, Heess teaches:
further comprising an operation of training the neural net using reinforcement learning ([Heess, col 6, line 21-38] The high-level controller neural network and the low-level controller neural network are trained using a reinforcement learning method).
Claim 11 is a computer program product claim having limitations similar to those of claim 2. Therefore, it is rejected under the same rationale as claim 2 above.
Claim 20 is a method claim having limitations similar to those of claim 2. Therefore, it is rejected under the same rationale as claim 2 above.
Regarding claim 5, Na in view of Heess teaches:
wherein the neural net that selects behaviors also produces a state value output for use as a starting value for reinforcement learning ([Heess, col 5, line 7-11] The high-level controller receives a new observation 226 and updates its current recurrent state value to a new recurrent state 208. The process is also disclosed in Figure 2. Because the new recurrent state does not have a state value before the updating process, it is interpreted as the starting value).
Claim 14 is a computer program product claim having limitations similar to those of claim 5. Therefore, it is rejected under the same rationale as claim 5 above.
Claim 23 is a method claim having limitations similar to those of claim 5. Therefore, it is rejected under the same rationale as claim 5 above.
Regarding claim 6, Na in view of Heess teaches:
further comprising an additional neural net that produces a state value output for use as a starting value for reinforcement learning ([Heess, col 5, line 7-11] The high-level controller receives a new observation 226 and updates its current recurrent state value to a new recurrent state 208. The process is also disclosed in Figure 2. Because the new recurrent state does not have a state value before the updating process, it is interpreted as the starting value).
Regarding claim 15, Na teaches:
further comprising instructions encoded on the non-transitory medium for causing the one or more processors to use ([Na, Abstract, line 1-3 and 14-16] discloses that the architecture selects behaviors based on a classification. The simulation is performed using a computer and an actual robot MORIS is used to test the architecture). Claim 15 is a computer program product claim having limitations similar to those of claim 6. Therefore, it is rejected under the same rationale as claim 6 above.
Claim 24 is a method claim having limitations similar to those of claim 6. Therefore, it is rejected under the same rationale as claim 6 above.
Regarding claim 9, Na in view of Heess teaches:
wherein an additional set of neural nets restricts when the high-level controller switches to a different behavior ([Heess, col 4, line 22-33] The high-level controller 204 includes one or more neural networks (e.g., one or more recurrent networks) that integrate observations at every time step and produce a high-level output that defines a new control signal every K time steps (i.e., control interval), which corresponds to the process of determining when to switch the behavior).
Claim 18 is a computer program product claim having limitations similar to those of claim 9. Therefore, it is rejected under the same rationale as claim 9 above.
Claim 27 is a method claim having limitations similar to those of claim 9. Therefore, it is rejected under the same rationale as claim 9 above.
Claims 3, 12, and 21 are rejected under 35 U.S.C. 103 as being unpatentable over Na in view of Heess and Johnson, and further in view of Rodriguez-Ramos (Rodriguez-Ramos et al., “A Deep Reinforcement Learning Strategy for UAV Autonomous Landing on a Moving Platform”, 2019, hereinafter ‘Rodriguez-Ramos’).
Regarding claim 3, Na in view of Heess and Johnson teaches:
The system as set forth in Claim 1.
Na in view of Heess and further in view of Johnson does not specifically disclose:
wherein the scripted action includes parameters specifying aircraft direction and speed, and wherein causing a device to perform the scripted action includes controlling aircraft in a flight scenario by operating the one or more mobile platform actuators to conform to the specified aircraft direction and speed.
Rodriguez-Ramos teaches:
wherein the scripted action includes parameters specifying aircraft direction and speed, and wherein causing a device to perform the scripted action includes controlling aircraft in a flight scenario by operating the one or more mobile platform actuators to conform to the specified aircraft direction and speed. ([Rodriguez-Ramos, page 356, left col, 3.2 Reinforcement Learning Based Formulation, line 1 – right col, line 26] discloses utilizing reinforcement learning to train the UAV (aircraft) to conform to the desired direction and speed using reference velocities. Velocity includes direction as well as speed. [Rodriguez-Ramos, page 351, left col, 1 Introduction, line 3-10] and [Rodriguez-Ramos, page 361, right col, Table 1] collectively disclose that the UAV is trained to follow the simulated flight scenarios)
Before the effective filing date of the claimed invention, it would have been obvious to a person of ordinary skill in the art, having the teachings of Na, Heess, Johnson, and Rodriguez-Ramos, to use the method of causing a device to perform the scripted action by controlling aircraft in a flight scenario, as taught by Rodriguez-Ramos, to implement the behavior generation system of Na. The suggestion and/or motivation to do so is to extend the behavior generation system to flight control, as the device must be able to control the aircraft in order to implement a flight behavior generation system.
Claim 12 is a computer program product claim having limitations similar to those of claim 3. Therefore, it is rejected under the same rationale as claim 3 above.
Claim 21 is a method claim having limitations similar to those of claim 3. Therefore, it is rejected under the same rationale as claim 3 above.
Claims 4, 7, 13, 16, 22, and 25 are rejected under 35 U.S.C. 103 as being unpatentable over Na in view of Heess and Johnson, and further in view of Ionescu et al. (US 2020/0104645 A1, hereinafter ‘Ionescu’).
Regarding claim 4, Na in view of Heess and further in view of Johnson teaches:
The system as set forth in claim 1.
Na in view of Heess and further in view of Johnson does not specifically disclose further comprising an operation of using a softmax learning function to train a reinforcement learning agent within the high-level controller to produce probabilities of selecting different high-level behaviors.
Ionescu teaches:
further comprising an operation of using a softmax learning function to train a reinforcement learning agent within the high-level controller to produce probabilities of selecting different high-level behaviors ([Ionescu, 0077] A score or Q-value may be provided as an output from a neural network or a neural network may provide an output defining a probability distribution from which a score or Q-value may be selected. A selection dependent upon a score or Q-value may be made by choosing a selection with the highest score or Q-value. Alternatively a probability may be determined for each possible selection of a predetermined set to define a probability distribution across the possible selections, e.g. by processing respective scores or Q-values with a softmax function, and the selection may be made by sampling from this probability distribution. A selection made by the option controller 304, meta-controller 310, or task controller 320 may be made according to an epsilon-greedy policy, which makes a random selection with a probability ϵ and a selection based on a determined score or Q-value with a probability 1−ϵ).
Before the effective filing date of the claimed invention, it would have been obvious to a person of ordinary skill in the art, having the teachings of Na, Heess, Johnson, and Ionescu, to use a softmax learning function to train a reinforcement learning agent to produce probabilities of selecting different high-level behaviors, as taught by Ionescu, to implement the behavior generation system of Na. The suggestion and/or motivation to do so is to improve the performance of the behavior generation system, as generating probabilities using a softmax layer enables the system to choose the best action to perform in a given situation.
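As a non-authoritative illustration of softmax-based selection of the kind Ionescu [0077] describes (this sketch is not drawn from any cited reference, and the function names softmax and select_behavior are hypothetical), action values are converted into a probability distribution and a behavior index is sampled from it:

```python
import math
import random

def softmax(action_values, temperature=1.0):
    """Convert raw action values (e.g., Q-values) into selection probabilities."""
    m = max(action_values)  # subtract the max for numerical stability
    exps = [math.exp((v - m) / temperature) for v in action_values]
    total = sum(exps)
    return [e / total for e in exps]

def select_behavior(action_values, rng=random.random):
    """Sample a high-level behavior index from the softmax distribution."""
    probs = softmax(action_values)
    r, cum = rng(), 0.0
    for i, p in enumerate(probs):
        cum += p
        if r < cum:
            return i
    return len(probs) - 1
```

Sampling from the distribution (rather than always taking the argmax) corresponds to the probabilistic selection described in Ionescu, and the alternative epsilon-greedy policy mentioned there could be substituted for select_behavior.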
Regarding claim 13, Na teaches:
further comprising instructions encoded on the non-transitory medium for causing the one or more processors to perform ([Na, Abstract, line 1-3 and 14-16] discloses that the architecture selects behaviors based on a classification. The simulation is performed using a computer and an actual robot MORIS is used to test the architecture). Claim 13 is a computer program product claim having limitations similar to those of claim 4. Therefore, it is rejected under the same rationale as claim 4 above.
Claim 22 is a method claim having limitations similar to those of claim 4. Therefore, it is rejected under the same rationale as claim 4 above.
Regarding claim 7, Na in view of Heess, Johnson, and Ionescu teaches:
wherein the neural net produces action value outputs based on the environment observations, and wherein the high-level controller selects the high-level behavior using a softmax function on the action value outputs ([Ionescu, 0077] A score or Q-value may be provided as an output from a neural network or a neural network may provide an output defining a probability distribution from which a score or Q-value may be selected. A selection dependent upon a score or Q-value may be made by choosing a selection with the highest score or Q-value. Alternatively a probability may be determined for each possible selection of a predetermined set to define a probability distribution across the possible selections, e.g. by processing respective scores or Q-values with a softmax function, and the selection may be made by sampling from this probability distribution. A selection made by the option controller 304, meta-controller 310, or task controller 320 may be made according to an epsilon-greedy policy, which makes a random selection with a probability ϵ and a selection based on a determined score or Q-value with a probability 1−ϵ).
Claim 16 is a computer program product claim having limitations similar to those of claim 7. Therefore, it is rejected under the same rationale as claim 7 above.
Claim 25 is a method claim having limitations similar to those of claim 7. Therefore, it is rejected under the same rationale as claim 7 above.
Response to Arguments
Applicant's arguments filed 10/07/2025 have been fully considered and are addressed below.
Claim Objections
Amended claims were received on 04/22/2025. The claim objections have been withdrawn.
Response to Arguments under 35 U.S.C. 101
Applicant’s arguments, see [Remarks, page 11-16], filed 10/07/2025, with respect to 35 U.S.C. 101 rejection have been fully considered and are persuasive. The 35 U.S.C. 101 rejections of claims 1-7, 9-16, 18-25 and 27 have been withdrawn.
Response to Arguments under 35 U.S.C. 103
Arguments: Applicant asserts that neither of the cited prior art references teach or suggest the claim limitations of currently amended Claims 1, 10 and 19.
Examiner’s Response: Applicant’s arguments with respect to claims 1, 10 and 19 have been considered but are moot because the new ground of rejection does not rely on any reference applied in the prior rejection of record for any teaching or matter specifically challenged in the argument.
Conclusion
Any inquiry concerning this communication or earlier communications from the examiner should be directed to JUN KWON whose telephone number is (571)272-2072. The examiner can normally be reached Monday – Friday 7:30AM – 4:30PM ET.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Abdullah Kawsar can be reached at (571)270-3169. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.
/JUN KWON/Examiner, Art Unit 2127
/ABDULLAH AL KAWSAR/Supervisory Patent Examiner, Art Unit 2127