Last updated: May 29, 2026
Application No. 18/105,180
CONTROLLING COMPUTING DEVICES USING HIERARCHICAL AGENTS

Final Rejection §101 §103
Filed: Feb 02, 2023
Examiner: LAU, KAITLYN RENEE
Art Unit: 2148
Tech Center: 2100 - Computer Architecture & Software
Assignee: Deepmind Technologies Limited
OA Round: 2 (Final)
Interview Optional

+100.0% interview lift. An interview has already been conducted in this application's prosecution history. This examiner has a 67% grant rate with a +100.0% interview lift. Since an interview has already been tried, a written response with narrowed claims, based on precedent claim evolution patterns, is recommended.
Based on 3 resolved cases, 2023–2026
Examiner Intelligence

LAU, KAITLYN RENEE
Career Allowance Rate: 67% (above average); 2 granted / 3 resolved; +11.7% vs TC avg
Interview Lift: +100.0% (strong), based on resolved cases with interview
Avg Prosecution (typical timeline): 3y 10m; 13 currently pending
Career history
Total Applications
across all art units
Statute-Specific Performance

§101: 21.6% (-18.4% vs TC avg)
§103: 66.7% (+26.7% vs TC avg)
§102: 5.9% (-34.1% vs TC avg)
§112: 5.9% (-34.1% vs TC avg)
Based on career data from 3 resolved cases
Office Action

§101 §103
DETAILED ACTION
This action is in response to the application filed 02/25/2026. Claims 1-6 and 9-22 are pending and have been examined.
	
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Claim Rejections - 35 USC § 101
35 U.S.C. 101 reads as follows:
Whoever invents or discovers any new and useful process, machine, manufacture, or composition of matter, or any new and useful improvement thereof, may obtain a patent therefor, subject to the conditions and requirements of this title.


Claims 1-6 and 9-22 are rejected under 35 U.S.C. 101 because the claimed invention is directed to an abstract idea without significantly more.
Regarding Claim 1:
Subject Matter Eligibility Analysis Step 1:
	Claim 1 recites a method and is thus a process, one of the four statutory categories of patentable subject matter.
Subject Matter Eligibility Analysis Step 2A Prong 1:
	Claim 1 recites
selecting a gesture class for the time step from a plurality of gesture classes (This limitation is a mental process as it encompasses a human mentally selecting a gesture class.)
processing a mid-level input derived from the observation … to generate a mid-level output that comprises parameters that define a gesture from the selected gesture class, wherein the parameters comprise one or more of: at least one touch position on a display of the one or more computing devices, or a cardinal direction along the display of the one or more computing devices (This limitation is a mental process as it encompasses a human mentally processing input to generate parameters that define a gesture.)
processing a low-level input derived from at least the parameters … to generate a policy output that defines a sequence of one or more actions from a plurality of actions for interacting with the one or more computing devices to perform the gesture defined by the mid-level output (This limitation is a mental process as it encompasses a human mentally processing input to generate a policy.)
Therefore, claim 1 recites an abstract idea.
Subject Matter Eligibility Analysis Step 2A Prong 2:
	Claim 1 further recites additional elements of
A method for controlling one or more computing devices to perform a task (This element does not integrate the abstract idea into a practical application because it recites a technological environment in which to apply a judicial exception (see MPEP 2106.05(h)).)
receiving an observation characterizing a state of the one or more computing devices at the time step (This element does not integrate the abstract idea into a practical application because it recites insignificant extra-solution activity of data gathering (see MPEP 2106.05(g)).)
using a high-level agent (This element does not integrate the abstract idea into a practical application because it amounts to mere “apply it on a computer” (see MPEP 2106.05(f)).)
using a mid-level agent neural network conditioned on the selected gesture class (This element does not integrate the abstract idea into a practical application because it amounts to mere “apply it on a computer” (see MPEP 2106.05(f)).)
using a low-level agent neural network (This element does not integrate the abstract idea into a practical application because it amounts to mere “apply it on a computer” (see MPEP 2106.05(f)).)
performing the sequence of one or more actions to interact with the one or more computing devices (This element does not integrate the abstract idea into a practical application because it amounts to mere “apply it on a computer” (see MPEP 2106.05(f)).)
Therefore, claim 1 is not integrated into a practical application.
Subject Matter Eligibility Analysis Step 2B:
	The additional elements of claim 1 do not provide significantly more than the abstract idea itself, taken alone and in combination because
A method for controlling one or more computing devices to perform a task specifies a particular technological environment to perform the abstract idea and cannot provide significantly more (see MPEP 2106.05(h)).
receiving an observation characterizing a state of the one or more computing devices at the time step is the well understood, routine, and conventional activity of “transmitting or receiving data over a network” (see MPEP 2106.05(d)(II); OIP Techs., Inc., v. Amazon.com, Inc., 788 F.3d 1359, 1363, 115 USPQ2d 1090, 1093 (Fed. Cir. 2015) (sending messages over a network)).
using a high-level agent uses a computer as a tool to perform the abstract idea and cannot provide significantly more (see MPEP 2106.05(f)).
using a mid-level agent neural network conditioned on the selected gesture class uses a computer as a tool to perform the abstract idea and cannot provide significantly more (see MPEP 2106.05(f)).
using a low-level agent neural network uses a computer as a tool to perform the abstract idea and cannot provide significantly more (see MPEP 2106.05(f)).
performing the sequence of one or more actions to interact with the one or more computing devices uses a computer as a tool to perform the abstract idea and cannot provide significantly more (see MPEP 2106.05(f)).
Therefore, claim 1 is subject-matter ineligible.
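
For orientation only, the three-stage flow recited in claim 1 (a high-level agent selects a gesture class, a mid-level agent generates gesture parameters, and a low-level agent generates the action sequence) can be sketched as below. This is a minimal sketch under stated assumptions, not the applicant's implementation: the class `HierarchicalController`, the stub functions, the grid-shaped observation, and the `(touch type, x, y)` action format are all hypothetical, and plain Python callables stand in for the claimed neural networks.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Sequence, Tuple

# Hypothetical types, chosen only for illustration.
Observation = Sequence[Sequence[int]]   # a grid of pixel values
Params = Dict[str, Tuple[int, int]]     # gesture parameters, e.g. a touch position
Action = Tuple[str, int, int]           # (touch type, x, y)


@dataclass
class HierarchicalController:
    """Chains the three levels described in claim 1 for one time step."""
    high_level: Callable[[Observation], str]
    mid_level: Callable[[Observation, str], Params]
    low_level: Callable[[str, Params], List[Action]]

    def step(self, observation: Observation) -> List[Action]:
        # 1. The high-level agent selects a gesture class.
        gesture_class = self.high_level(observation)
        # 2. The mid-level agent, conditioned on that class, produces parameters.
        params = self.mid_level(observation, gesture_class)
        # 3. The low-level agent turns the parameters into a sequence of actions.
        return self.low_level(gesture_class, params)


# Stub agents (assumptions): fixed rules in place of trained networks.
def pick_tap(observation: Observation) -> str:
    return "tap"


def centre_position(observation: Observation, gesture_class: str) -> Params:
    height, width = len(observation), len(observation[0])
    return {"touch_position": (width // 2, height // 2)}


def tap_actions(gesture_class: str, params: Params) -> List[Action]:
    x, y = params["touch_position"]
    return [("touch", x, y), ("lift", x, y)]


controller = HierarchicalController(pick_tap, centre_position, tap_actions)
screen = [[0] * 4 for _ in range(6)]    # 6 rows by 4 columns of blank pixels
actions = controller.step(screen)
```

Passing the gesture class to the low-level stub alongside the parameters is a design choice of this sketch; the claim only requires that the low-level input be derived from at least the parameters.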

Regarding Claim 2:
Subject Matter Eligibility Analysis Step 2A Prong 1:
	Claim 2 recites
selecting, from the plurality of gesture classes, a gesture class that was determined to be a best performing gesture class for the task during training of the high-level agent (This limitation is a mental process as it encompasses a human mentally selecting a gesture class.)
Therefore, claim 2 recites an abstract idea.
Subject Matter Eligibility Analysis Step 2A Prong 2:
	Claim 2 does not further recite any additional elements. Therefore, claim 2 is not integrated into a practical application.
Subject Matter Eligibility Analysis Step 2B:
	Since there are no additional elements, claim 2 does not provide significantly more than the abstract idea itself, taken alone and in combination. Therefore, claim 2 is subject-matter ineligible.

Regarding Claim 3:
Subject Matter Eligibility Analysis Step 2A Prong 1:
	Claim 3 recites
processing a high-level input derived from the observation… to generate a high-level output that comprises a respective score for each gesture class of the plurality of gesture classes; (This limitation is a mental process as it encompasses a human mentally processing input to generate a score.)
selecting, using the high-level output, a gesture class from the plurality of gesture classes. (This limitation is a mental process as it encompasses a human mentally selecting a gesture class.)
Therefore, claim 3 recites an abstract idea.
Subject Matter Eligibility Analysis Step 2A Prong 2:
	Claim 3 further recites additional elements of
wherein the high-level agent comprises a high-level agent neural network, (This element does not integrate the abstract idea into a practical application because it amounts to mere “apply it on a computer” (see MPEP 2106.05(f)).)
using the high-level agent neural network (This element does not integrate the abstract idea into a practical application because it amounts to mere “apply it on a computer” (see MPEP 2106.05(f)).)
Therefore, claim 3 is not integrated into a practical application.
Subject Matter Eligibility Analysis Step 2B:
	The additional elements of claim 3 do not provide significantly more than the abstract idea itself, taken alone and in combination because
wherein the high-level agent comprises a high-level agent neural network uses a computer as a tool to perform the abstract idea and cannot provide significantly more (see MPEP 2106.05(f)).
using the high-level agent neural network uses a computer as a tool to perform the abstract idea and cannot provide significantly more (see MPEP 2106.05(f)).
Therefore, claim 3 is subject-matter ineligible.


Regarding Claim 4:
Subject Matter Eligibility Analysis Step 2A Prong 1:
	Claim 4 recites
wherein the plurality of gesture classes includes one or more of: a tap gesture class, a swipe gesture class, or a fling gesture class (This limitation further describes the mental process of selecting a gesture class from claim 1.)
Therefore, claim 4 recites an abstract idea.
Subject Matter Eligibility Analysis Step 2A Prong 2:
	Claim 4 does not further recite any additional elements. Therefore, claim 4 is not integrated into a practical application.
Subject Matter Eligibility Analysis Step 2B:
	Since there are no additional elements, claim 4 does not provide significantly more than the abstract idea itself, taken alone and in combination. Therefore, claim 4 is subject-matter ineligible.

Regarding Claim 5:
Subject Matter Eligibility Analysis Step 2A Prong 1:
	Claim 5 recites the same abstract idea as claim 1. Therefore, claim 5 recites an abstract idea.
Subject Matter Eligibility Analysis Step 2A Prong 2:
	Claim 5 further recites additional elements of
wherein the observation is an image of a display of the one or more computing devices (This element does not integrate the abstract idea into a practical application because it recites insignificant extra-solution activity of data gathering (see MPEP 2106.05(g)).)
Therefore, claim 5 is not integrated into a practical application.
Subject Matter Eligibility Analysis Step 2B:
	The additional elements of claim 5 do not provide significantly more than the abstract idea itself, taken alone and in combination because
wherein the observation is an image of a display of the one or more computing devices is the well understood, routine, and conventional activity of “transmitting or receiving data over a network” (see MPEP 2106.05(d)(II); OIP Techs., Inc., v. Amazon.com, Inc., 788 F.3d 1359, 1363, 115 USPQ2d 1090, 1093 (Fed. Cir. 2015) (sending messages over a network)).
Therefore, claim 5 is subject-matter ineligible.


Regarding Claim 6:
Subject Matter Eligibility Analysis Step 2A Prong 1:
	Claim 6 recites
wherein the action is a touch input to a display of the one or more computing devices. (This limitation further describes the mental process of generating a policy that defines a sequence of one or more actions from claim 1.)
Therefore, claim 6 recites an abstract idea.
Subject Matter Eligibility Analysis Step 2A Prong 2:
	Claim 6 does not further recite any additional elements. Therefore, claim 6 is not integrated into a practical application.
Subject Matter Eligibility Analysis Step 2B:
	Since there are no additional elements, claim 6 does not provide significantly more than the abstract idea itself, taken alone and in combination. Therefore, claim 6 is subject-matter ineligible.

Regarding Claim 9:
Subject Matter Eligibility Analysis Step 2A Prong 1:
	Claim 9 recites
wherein the low-level input further comprises a touch position on a display of the one or more computing devices of a preceding action of a previous time step. (This limitation further describes the mental process of processing a low-level input from claim 1.)
Therefore, claim 9 recites an abstract idea.
Subject Matter Eligibility Analysis Step 2A Prong 2:
	Claim 9 does not further recite any additional elements. Therefore, claim 9 is not integrated into a practical application.
Subject Matter Eligibility Analysis Step 2B:
	Since there are no additional elements, claim 9 does not provide significantly more than the abstract idea itself, taken alone and in combination. Therefore, claim 9 is subject-matter ineligible.

Regarding Claim 10:
Subject Matter Eligibility Analysis Step 2A Prong 1:
	Claim 10 recites
wherein the low-level input comprises a one-hot encoding of each of the parameters and a one-hot encoding of the touch position on a display of the one or more computing devices of a preceding action of a previous time step. (This limitation further describes the mental process of processing a low-level input from claim 9 and thus claim 1.)
Therefore, claim 10 recites an abstract idea.
Subject Matter Eligibility Analysis Step 2A Prong 2:
	Claim 10 does not further recite any additional elements. Therefore, claim 10 is not integrated into a practical application.
Subject Matter Eligibility Analysis Step 2B:
	Since there are no additional elements, claim 10 does not provide significantly more than the abstract idea itself, taken alone and in combination. Therefore, claim 10 is subject-matter ineligible.
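
As an illustration of the one-hot encoding recited in claim 10, the sketch below encodes a touch position, a cardinal direction, and the touch position of the preceding action, then concatenates them into one low-level input vector. It rests on assumptions that do not come from the claims: the display is discretised into a `width` by `height` grid, positions are flattened in row-major order, the direction set is the hypothetical `DIRECTIONS` tuple, and both parameter types are present at once.

```python
from typing import List, Tuple

# Hypothetical set of cardinal directions; the claims do not enumerate them.
DIRECTIONS = ("north", "east", "south", "west")


def one_hot(index: int, size: int) -> List[int]:
    """Return a list of `size` zeros with a single 1 at `index`."""
    if not 0 <= index < size:
        raise ValueError(f"index {index} is outside [0, {size})")
    return [1 if i == index else 0 for i in range(size)]


def encode_position(position: Tuple[int, int], width: int, height: int) -> List[int]:
    """One-hot encode an (x, y) grid cell, flattened in row-major order."""
    x, y = position
    if not (0 <= x < width and 0 <= y < height):
        raise ValueError(f"position {position} is outside the {width}x{height} grid")
    return one_hot(y * width + x, width * height)


def encode_low_level_input(
    touch_position: Tuple[int, int],
    direction: str,
    previous_touch: Tuple[int, int],
    width: int,
    height: int,
) -> List[int]:
    """Concatenate the one-hot encodings of each parameter and the previous touch."""
    return (
        encode_position(touch_position, width, height)
        + one_hot(DIRECTIONS.index(direction), len(DIRECTIONS))
        + encode_position(previous_touch, width, height)
    )


encoded = encode_low_level_input((1, 0), "east", (0, 1), width=2, height=2)
```

On the 2 by 2 grid used here, each position contributes four entries and the direction contributes four more, so the encoded vector has twelve entries.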

Regarding Claim 11:
Subject Matter Eligibility Analysis Step 2A Prong 1:
	Claim 11 recites
wherein each action in the sequence comprises a touch position on a display of the one or more computing devices (This limitation further describes the mental process of generating a policy that defines a sequence of one or more actions from claim 1.)
Therefore, claim 11 recites an abstract idea.
Subject Matter Eligibility Analysis Step 2A Prong 2:
	Claim 11 does not further recite any additional elements. Therefore, claim 11 is not integrated into a practical application.
Subject Matter Eligibility Analysis Step 2B:
	Since there are no additional elements, claim 11 does not provide significantly more than the abstract idea itself, taken alone and in combination. Therefore, claim 11 is subject-matter ineligible.

Regarding Claim 12:
Subject Matter Eligibility Analysis Step 2A Prong 1:
	Claim 12 recites
process the mid-level input derived from the observation (This limitation is a mental process as it encompasses a human mentally processing input.)
processing the observation…to generate a feature representation for each class (This limitation is a mental process as it encompasses a human mentally processing the observation to generate a feature representation.)
processing each feature representation… to generate a respective score for each of the parameters for each of the gesture classes (This limitation is a mental process as it encompasses a human mentally processing each feature representation to generate a score.)
Therefore, claim 12 recites an abstract idea.
Subject Matter Eligibility Analysis Step 2A Prong 2:
	Claim 12 further recites additional elements of
wherein the mid-level agent neural network is configured to process the mid-level input (This element does not integrate the abstract idea into a practical application because it amounts to mere “apply it on a computer” (see MPEP 2106.05(f)).)
using an encoder neural network for each gesture class (This element does not integrate the abstract idea into a practical application because it amounts to mere “apply it on a computer” (see MPEP 2106.05(f)).)
using a decoder neural network (This element does not integrate the abstract idea into a practical application because it amounts to mere “apply it on a computer” (see MPEP 2106.05(f)).)
Therefore, claim 12 is not integrated into a practical application.
Subject Matter Eligibility Analysis Step 2B:
	The additional elements of claim 12 do not provide significantly more than the abstract idea itself, taken alone and in combination because
wherein the mid-level agent neural network is configured to process the mid-level input uses a computer as a tool to perform the abstract idea and cannot provide significantly more (see MPEP 2106.05(f)).
Using an encoder neural network for each gesture class uses a computer as a tool to perform the abstract idea and cannot provide significantly more (see MPEP 2106.05(f)).
Using a decoder neural network uses a computer as a tool to perform the abstract idea and cannot provide significantly more (see MPEP 2106.05(f)).
Therefore, claim 12 is subject-matter ineligible.


Regarding Claim 13:
Subject Matter Eligibility Analysis Step 2A Prong 1:
	Claim 13 recites
wherein each gesture class of the plurality of gesture classes has a respective set of parameters that each have a respective set of possible values (This limitation further describes the mental process of selecting a gesture class from claim 1.)
process the mid-level input to generate a respective score for each possible value of each of the parameters for each of the gesture classes, (This limitation is a mental process as it encompasses a human mentally processing the input to generate a score.)
generating the mid-level output by selecting, for the selected gesture class, a respective value for each of the parameters for the selected gesture class using the respective scores for each of the possible values for the parameter. (This limitation is a mental process as it encompasses a human mentally generating the output by selecting a value.)
Therefore, claim 13 recites an abstract idea.
Subject Matter Eligibility Analysis Step 2A Prong 2:
	Claim 13 further recites additional elements of
wherein the mid-level agent neural network is configured to process the mid-level input (This element does not integrate the abstract idea into a practical application because it amounts to mere “apply it on a computer” (see MPEP 2106.05(f)).)
Therefore, claim 13 is not integrated into a practical application.
Subject Matter Eligibility Analysis Step 2B:
	The additional elements of claim 13 do not provide significantly more than the abstract idea itself, taken alone and in combination because
wherein the mid-level agent neural network is configured to process the mid-level input uses a computer as a tool to perform the abstract idea and cannot provide significantly more (see MPEP 2106.05(f)).
Therefore, claim 13 is subject-matter ineligible.
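
The selection step recited in claim 13 (a score for each possible value of each parameter for each gesture class, followed by choosing one value per parameter for the selected class) can be sketched as below. This is a hedged illustration: the nested-dictionary layout of `scores` and the function name `select_parameters` are hypothetical, and picking the highest-scoring value is only one way to select "using the respective scores"; sampling in proportion to the scores would be another.

```python
from typing import Dict, Hashable

# Hypothetical layout: gesture class -> parameter name -> possible value -> score.
Scores = Dict[str, Dict[str, Dict[Hashable, float]]]


def select_parameters(scores: Scores, gesture_class: str) -> Dict[str, Hashable]:
    """For the selected class, pick the highest-scoring value of each parameter."""
    return {
        name: max(value_scores, key=value_scores.get)
        for name, value_scores in scores[gesture_class].items()
    }


# Made-up scores for two gesture classes, for illustration only.
example_scores: Scores = {
    "tap": {"touch_position": {(0, 0): 0.1, (1, 2): 0.7, (3, 3): 0.2}},
    "swipe": {
        "touch_position": {(0, 0): 0.6, (1, 2): 0.4},
        "direction": {"north": 0.2, "east": 0.5, "south": 0.3},
    },
}

swipe_params = select_parameters(example_scores, "swipe")
tap_params = select_parameters(example_scores, "tap")
```

Only the entries for the selected gesture class are read, which mirrors the claim language that a value is selected "for the selected gesture class" even though scores exist for every class.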

Regarding Claim 14:
Subject Matter Eligibility Analysis Step 2A Prong 1:
	Claim 14 recites the same abstract ideas as claim 1. Therefore, claim 14 recites an abstract idea.
Subject Matter Eligibility Analysis Step 2A Prong 2:
	Claim 14 further recites additional elements of
wherein the low-level agent neural network comprises a respective neural network for each gesture class (This element does not integrate the abstract idea into a practical application because it recites generic computing components on which to perform the abstract idea (see MPEP 2106.05(f)).)
Therefore, claim 14 is not integrated into a practical application.
Subject Matter Eligibility Analysis Step 2B:
	The additional elements of claim 14 do not provide significantly more than the abstract idea itself, taken alone and in combination because
wherein the low-level agent neural network comprises a respective neural network for each gesture class uses a computer as a tool to perform the abstract idea and cannot provide significantly more (see MPEP 2106.05(f)).
Therefore, claim 14 is subject-matter ineligible.


Regarding Claim 15:
Subject Matter Eligibility Analysis Step 2A Prong 1:
	Claim 15 recites the same abstract ideas as claim 1. Therefore, claim 15 recites an abstract idea.
Subject Matter Eligibility Analysis Step 2A Prong 2:
	Claim 15 further recites additional elements of
wherein the high-level agent and the mid-level agent neural network have been trained through reinforcement learning on training data for the task (This element does not integrate the abstract idea into a practical application because it recites generic computing components on which to perform the abstract idea (see MPEP 2106.05(f)).)
Therefore, claim 15 is not integrated into a practical application.
Subject Matter Eligibility Analysis Step 2B:
	The additional elements of claim 15 do not provide significantly more than the abstract idea itself, taken alone and in combination because
wherein the high-level agent and the mid-level agent neural network have been trained through reinforcement learning on training data for the task uses a computer as a tool to perform the abstract idea and cannot provide significantly more (see MPEP 2106.05(f)).
Therefore, claim 15 is subject-matter ineligible.

Regarding Claim 16:
Subject Matter Eligibility Analysis Step 2A Prong 1:
	Claim 16 recites the same abstract ideas as claim 15. Therefore, claim 16 recites an abstract idea.
Subject Matter Eligibility Analysis Step 2A Prong 2:
	Claim 16 further recites additional elements of
wherein the low-level agent neural network has been pre-trained prior to the training of the high-level agent and mid-level agent neural network (This element does not integrate the abstract idea into a practical application because it recites generic computing components on which to perform the abstract idea (see MPEP 2106.05(f)).)
Therefore, claim 16 is not integrated into a practical application.
Subject Matter Eligibility Analysis Step 2B:
	The additional elements of claim 16 do not provide significantly more than the abstract idea itself, taken alone and in combination because
wherein the low-level agent neural network has been pre-trained prior to the training of the high-level agent and mid-level agent neural network uses a computer as a tool to perform the abstract idea and cannot provide significantly more (see MPEP 2106.05(f)).
Therefore, claim 16 is subject-matter ineligible.

Regarding Claim 17:
Subject Matter Eligibility Analysis Step 2A Prong 1:
	Claim 17 recites the same abstract ideas as claim 15. Therefore, claim 17 recites an abstract idea.
Subject Matter Eligibility Analysis Step 2A Prong 2:
	Claim 17 further recites additional elements of
wherein the mid-level agent neural network has been trained prior to the training of the high-level agent (This element does not integrate the abstract idea into a practical application because it recites generic computing components on which to perform the abstract idea (see MPEP 2106.05(f)).)
Therefore, claim 17 is not integrated into a practical application.
Subject Matter Eligibility Analysis Step 2B:
	The additional elements of claim 17 do not provide significantly more than the abstract idea itself, taken alone and in combination because
wherein the mid-level agent neural network has been trained prior to the training of the high-level agent uses a computer as a tool to perform the abstract idea and cannot provide significantly more (see MPEP 2106.05(f)).
Therefore, claim 17 is subject-matter ineligible.

Regarding Claim 18:
Subject Matter Eligibility Analysis Step 2A Prong 1:
	Claim 18 recites the same abstract ideas as claim 15. Therefore, claim 18 recites an abstract idea.
Subject Matter Eligibility Analysis Step 2A Prong 2:
	Claim 18 further recites additional elements of
wherein the mid-level agent neural network has been trained using randomly chosen gesture classes (This element does not integrate the abstract idea into a practical application because it recites generic computing components on which to perform the abstract idea (see MPEP 2106.05(f)).)
Therefore, claim 18 is not integrated into a practical application.
Subject Matter Eligibility Analysis Step 2B:
	The additional elements of claim 18 do not provide significantly more than the abstract idea itself, taken alone and in combination because
wherein the mid-level agent neural network has been trained using randomly chosen gesture classes uses a computer as a tool to perform the abstract idea and cannot provide significantly more (see MPEP 2106.05(f)).
Therefore, claim 18 is subject-matter ineligible.

Regarding Claim 19:
Subject Matter Eligibility Analysis Step 2A Prong 1:
	Claim 19 recites the same abstract ideas as claim 15. Therefore, claim 19 recites an abstract idea.
Subject Matter Eligibility Analysis Step 2A Prong 2:
	Claim 19 further recites additional elements of
wherein the mid-level agent neural network has been trained jointly with the high-level agent. (This element does not integrate the abstract idea into a practical application because it recites generic computing components on which to perform the abstract idea (see MPEP 2106.05(f)).)
Therefore, claim 19 is not integrated into a practical application.
Subject Matter Eligibility Analysis Step 2B:
	The additional elements of claim 19 do not provide significantly more than the abstract idea itself, taken alone and in combination because
wherein the mid-level agent neural network has been trained jointly with the high-level agent uses a computer as a tool to perform the abstract idea and cannot provide significantly more (see MPEP 2106.05(f)).
Therefore, claim 19 is subject-matter ineligible.

Regarding Claim 20:
Subject Matter Eligibility Analysis Step 1:
	Claim 20 recites a system and is thus an apparatus, one of the four statutory categories of patentable subject matter.
Subject Matter Eligibility Analysis Step 2A Prong 1:
	Claim 20 recites
perform operations (This limitation is a mental process as it encompasses a human mentally performing operations.)
selecting a gesture class for the time step from a plurality of gesture classes (This limitation is a mental process as it encompasses a human mentally selecting a gesture class.)
processing a mid-level input derived from the observation … to generate a mid-level output that comprises parameters that define a gesture from the selected gesture class, wherein the parameters comprise one or more of: at least one touch position on a display of the one or more computing devices, or a cardinal direction along the display of the one or more computing devices (This limitation is a mental process as it encompasses a human mentally processing input to generate parameters that define a gesture.)
processing a low-level input derived from at least the parameters … to generate a policy output that defines a sequence of one or more actions from a plurality of actions for interacting with the one or more computing devices to perform the gesture defined by the mid-level output (This limitation is a mental process as it encompasses a human mentally processing input to generate a policy.)
Therefore, claim 20 recites an abstract idea.
Subject Matter Eligibility Analysis Step 2A Prong 2:
	Claim 20 further recites additional elements of
A system comprising one or more computers and one or more storage devices storing instructions that when executed by the one or more computers cause the one or more computers to perform operations (This element does not integrate the abstract idea into a practical application because it amounts to mere “apply it on a computer” (see MPEP 2106.05(f)).)
receiving an observation characterizing a state of the one or more computing devices at the time step (This element does not integrate the abstract idea into a practical application because it recites insignificant extra-solution activity of data gathering (see MPEP 2106.05(g)).)
using a high-level agent (This element does not integrate the abstract idea into a practical application because it amounts to mere “apply it on a computer” (see MPEP 2106.05(f)).)
using a mid-level agent neural network conditioned on the selected gesture class (This element does not integrate the abstract idea into a practical application because it amounts to mere “apply it on a computer” (see MPEP 2106.05(f)).)
using a low-level agent neural network (This element does not integrate the abstract idea into a practical application because it amounts to mere “apply it on a computer” (see MPEP 2106.05(f)).)
performing the sequence of one or more actions to interact with the one or more computing devices (This element does not integrate the abstract idea into a practical application because it amounts to mere “apply it on a computer” (see MPEP 2106.05(f)).)
Therefore, claim 20 is not integrated into a practical application.
Subject Matter Eligibility Analysis Step 2B:
	The additional elements of claim 20 do not provide significantly more than the abstract idea itself, taken alone and in combination because
A system comprising one or more computers and one or more storage devices storing instructions that when executed by the one or more computers cause the one or more computers to perform operations uses a computer as a tool to perform the abstract idea and cannot provide significantly more (see MPEP 2106.05(f)).
receiving an observation characterizing a state of the one or more computing devices at the time step is the well understood, routine, and conventional activity of “transmitting or receiving data over a network” (see MPEP 2106.05(d)(II); OIP Techs., Inc., v. Amazon.com, Inc., 788 F.3d 1359, 1363, 115 USPQ2d 1090, 1093 (Fed. Cir. 2015) (sending messages over a network)).
using a high-level agent uses a computer as a tool to perform the abstract idea and cannot provide significantly more (see MPEP 2106.05(f)).
using a mid-level agent neural network conditioned on the selected gesture class uses a computer as a tool to perform the abstract idea and cannot provide significantly more (see MPEP 2106.05(f)).
using a low-level agent neural network uses a computer as a tool to perform the abstract idea and cannot provide significantly more (see MPEP 2106.05(f)).
performing the sequence of one or more actions to interact with the one or more computing devices uses a computer as a tool to perform the abstract idea and cannot provide significantly more (see MPEP 2106.05(f)).
Therefore, claim 20 is subject-matter ineligible.

Regarding claim 21, claim 21 recites substantially similar limitations to claim 2, and is therefore rejected under the same analysis.

Regarding claim 22, claim 22 recites substantially similar limitations to claim 3, and is therefore rejected under the same analysis.



Claim Rejections - 35 USC § 103
In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis (i.e., changing from AIA to pre-AIA) for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

The factual inquiries for establishing a background for determining obviousness under 35 U.S.C. 103 are summarized as follows:
1. Determining the scope and contents of the prior art.
2. Ascertaining the differences between the prior art and the claims at issue.
3. Resolving the level of ordinary skill in the pertinent art.
4. Considering objective evidence present in the application indicating obviousness or nonobviousness.
This application currently names joint inventors. In considering patentability of the claims the examiner presumes that the subject matter of the various claims was commonly owned as of the effective filing date of the claimed invention(s) absent any evidence to the contrary.  Applicant is advised of the obligation under 37 CFR 1.56 to point out the inventor and effective filing dates of each claim that was not commonly owned as of the effective filing date of the later invention in order for the examiner to consider the applicability of 35 U.S.C. 102(b)(2)(C) for any potential 35 U.S.C. 102(a)(2) prior art against the later invention.
Claims 1, 4-6, 9-11, 13-15, 18, and 20 are rejected under 35 U.S.C. 103 as being unpatentable over Tomaya et al. (“AndroidEnv: A Reinforcement Learning Platform for Android”) (hereafter referred to as Tomaya) in view of Shi et al. (“World of Bits: An Open-Domain Platform for Web-Based Agents”) (hereafter referred to as Shi).

Regarding claim 1, Tomaya teaches 
A method for controlling one or more computing devices to perform a task, the method comprising, at each of a plurality of time steps (Tomaya, page 1, abstract, “We introduce AndroidEnv, an open-source platform for Reinforcement Learning (RL) research built on top of the Android ecosystem. AndroidEnv allows RL agents to interact with a wide variety of apps and services commonly used by humans through a universal touchscreen interface” where “The agent-environment in AndroidEnv matches that of a user and a real device: the screen pixels constitute the observations, the action space is defined by touchscreen gestures, the interaction is real-time, and actions are executed asynchronously, while the environment runs at its own time scale” (Tomaya, page 1, 2nd paragraph) and Tomaya, page 7, Figure 7, 
): 
receiving an observation characterizing a state of the one or more computing devices at the time step (Tomaya, page 7, Figure 7, 
where “The agent-environment in AndroidEnv matches that of a user and a real device: the screen pixels constitute the observations, the action space is defined by touchscreen gestures, the interaction is real-time, and actions are executed asynchronously, while the environment runs at its own time scale” (Tomaya, page 1, 2nd paragraph). Examiner notes that the observation is the pixels of the screen on the computing device. Examiner further notes that Figure 7 shows the agent receiving an observation at the time step.); 
selecting a gesture class for the time step from a plurality of gesture classes using a high-level agent (Tomaya, page 2, Section 2.2. Action interface, 1st paragraph, “Raw action space. The native action space of the environment consists of a tuple consisting of a position (x,y) ∈ [0,1] x [0,1], determining the location of the action on the screen, and a discrete value ActionType ∈ {TOUCH, LIFT, REPEAT} indicating whether the agent opts for touching the screen at the indicated location, lifting the point from the screen, or repeating the last chosen action, respectively.” Examiner notes that determining the discrete value ActionType is selecting a gesture class.); 
processing a mid-level input derived from the observation … to generate a mid-level output that comprises parameters that define a gesture from the selected gesture class wherein the parameters comprise one or more of: at least one touch position on a display of the one or more computing devices, or a cardinal direction along the display of the one or more computing devices (Tomaya, page 2, 1st paragraph “It is more useful for agents to control Android applications via gestures, such as pressing, long pressing, swiping, scrolling, or drag-and-drop. Each of these correspond to a particular sequence of raw actions: for example, a screen touch at a particular location, followed by a lift of the imaginary finger is a sequence that Android can interpret as a press of a button. Similarly, Android will interpret a sequence of aligned touches as scrolling” and “Raw action space. The native action space of the environment consists of a tuple consisting of a position (x,y) ∈ [0,1] x [0,1], determining the location of the action on the screen, and a discrete value ActionType ∈ {TOUCH, LIFT, REPEAT} indicating whether the agent opts for touching the screen at the indicated location, lifting the point from the screen, or repeating the last chosen action, respectively” (Tomaya, page 2, Section 2.2. Action interface, 1st paragraph). Examiner notes that processing the mid-level input is interpreting the sequence of raw actions, and that the particular sequence of raw actions constitutes the parameters that define a gesture. Examiner further notes that by tracking the positions of screen touches in scrolling, the parameters that define a gesture from the selected gesture class comprise a cardinal direction along a display.);
processing a low-level input derived from at least the parameters … to generate a policy output that defines a sequence of one or more actions from a plurality of actions for interacting with the one or more computing devices to perform the gesture defined by the mid-level output (Tomaya, page 2, 3rd paragraph, “Another notable feature of AndroidEnv is the spatial correlation between actions and observations. Often an action can result in local changes in the pixels near the location of the action, or the position of certain items in the observation might hint at the next best location to take an action. In particular, the screen is often suggestive of the kind of gestures the application expects: smartphone users would often find it intuitive to tap where they see an item in the shape of a button, or to scroll where they see a drop-down menu” where “It is more useful for agents to control Android applications via gestures, such as pressing, long pressing, swiping, scrolling, or drag-and-drop. Each of these correspond to a particular sequence of raw actions: for example, a screen touch at a particular location, followed by a lift of the imaginary finger is a sequence that Android can interpret as a press of a button. Similarly, Android will interpret a sequence of aligned touches as scrolling” (Tomaya, page 2, 1st paragraph). Examiner notes that the policy output is the hint at the next best location to take an action. Examiner further notes that the action comprises gestures.);
and performing the sequence of one or more actions to interact with the one or more computing devices (Tomaya, page 2, 1st paragraph “It is more useful for agents to control Android applications via gestures, such as pressing, long pressing, swiping, scrolling, or drag-and-drop. Each of these correspond to a particular sequence of raw actions: for example, a screen touch at a particular location, followed by a lift of the imaginary finger is a sequence that Android can interpret as a press of a button. Similarly, Android will interpret a sequence of aligned touches as scrolling.”).
Tomaya does not explicitly disclose the use of neural networks. However, Shi discloses
using a mid-level agent neural network conditioned on the selected gesture class (Shi, page 5, 2nd paragraph, “Our model first processes the image using a Convolutional Neural Network (CNN). For DOM, we compute a text feature map based on the matching between query and DOM. Then the two maps are concatenated into a join representation. On top of this we develop two variants: first we flatten the features and feed them directly through a fully-connected layer (GlobalCNN). Since we had the intuition that local feature alone should suffice to characterize the action, we also examine a LocalCNN architecture to capture the intuition that agent should attend to where cursor is. So the mouse distribution is used as soft attention (Bahdanau et al., 2014) to average the CNN features into a global representation to predict mouse buttons and keyboard events” where “We train models on web tasks by sequencing behavior cloning and reinforcement learning” (Shi, page 5, 1st column, Section 3.2 Optimization) where “Supervised Learning. We obtain a behavior cloning policy by training on the demonstrations using Adam” (Shi, page 5, 2nd column, 4th paragraph) and “Demonstration data. We collected 10 minutes of human demonstrations on each of the 100 MiniWoB environments (about 17 hours total). Unlike the FormWoB and QAWoB settings, the MiniWoB dataset contains interactions that require dragging and hovering (e.g., to trigger a menu expansion). Therefore, we process the demonstrations at regular 83 millisecond intervals (12 frames per second) to extract approximately 720,000 state-action pairs. With gridpoints spaced 8 pixels across the 160 pixel area, we obtain a 20x20 grid and 3 possible actions (move, drag, click), leading to a total of 20x20x3 = 1200 possible actions” (Shi, page 5, 1st column, last paragraph – 2nd column, 1st paragraph).  Examiner notes that the CNN is the mid-level neural network. 
Examiner further notes that the CNN is trained, or conditioned, using actions from demonstrations through the behavior cloning.)
using a low-level agent neural network (Shi, page 5, 2nd paragraph, “Our model first processes the image using a Convolutional Neural Network (CNN). For DOM, we compute a text feature map based on the matching between query and DOM. Then the two maps are concatenated into a join representation. On top of this we develop two variants: first we flatten the features and feed them directly through a fully-connected layer (GlobalCNN). Since we had the intuition that local feature alone should suffice to characterize the action, we also examine a LocalCNN architecture to capture the intuition that agent should attend to where cursor is. So the mouse distribution is used as soft attention (Bahdanau et al., 2014) to average the CNN features into a global representation to predict mouse buttons and keyboard events.” Examiner notes that the LocalCNN is the low-level neural network.)
Tomaya and Shi are considered analogous to the claimed invention because they both use reinforcement learning to control a computing device. It would have been obvious to one having ordinary skill in the art prior to the effective filing date to have modified Tomaya to include the use of neural networks as in Shi. Thus, this would be applying a known technique (reinforcement learning) to a known device (neural network) ready for improvement to yield predictable results (controlling a computing device) (MPEP 2143 I. (C) Use of known technique to improve similar devices (methods, or product) in the same way.).
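For illustration only, the three-level structure recited in claim 1 (a high-level agent selecting a gesture class, a mid-level stage producing gesture parameters, and a low-level stage producing a sequence of raw actions) can be sketched as follows. This is a minimal sketch under stated assumptions and is not taken from the application, Tomaya, or Shi: the names `high_level_agent`, `mid_level_agent`, `low_level_agent`, `GestureParams`, and the swipe length and step count are hypothetical, and the random and rule-based placeholders stand in for the trained agents and neural networks. Only the raw action space (a position in [0,1] x [0,1] plus an ActionType in {TOUCH, LIFT, REPEAT}) follows the passage quoted from Tomaya.

```python
import random
from dataclasses import dataclass
from enum import Enum
from typing import List, Tuple


class ActionType(Enum):
    """Raw action types as quoted from Tomaya, Section 2.2."""
    TOUCH = 0
    LIFT = 1
    REPEAT = 2


# A raw action: (action type, x, y) with x and y in [0, 1].
RawAction = Tuple[ActionType, float, float]

# Hypothetical gesture classes and cardinal directions for this sketch.
GESTURE_CLASSES = ("tap", "swipe")
DIRECTIONS = {"up": (0.0, -1.0), "down": (0.0, 1.0),
              "left": (-1.0, 0.0), "right": (1.0, 0.0)}


@dataclass
class GestureParams:
    """Mid-level output: a touch position and, for swipes, a direction."""
    x: float
    y: float
    direction: str = ""


def clamp(value: float) -> float:
    """Keep a coordinate inside the [0, 1] screen range."""
    return max(0.0, min(1.0, value))


def high_level_agent(observation, rng: random.Random) -> str:
    # Placeholder policy (assumption): pick a gesture class at random.
    return rng.choice(GESTURE_CLASSES)


def mid_level_agent(observation, gesture_class: str,
                    rng: random.Random) -> GestureParams:
    # Placeholder for a network conditioned on the selected gesture class.
    x, y = rng.random(), rng.random()
    if gesture_class == "swipe":
        return GestureParams(x, y, rng.choice(sorted(DIRECTIONS)))
    return GestureParams(x, y)


def low_level_agent(gesture_class: str, params: GestureParams,
                    steps: int = 4, length: float = 0.2) -> List[RawAction]:
    # Placeholder for a low-level policy: expand the gesture into raw actions.
    if gesture_class == "tap":
        return [(ActionType.TOUCH, params.x, params.y),
                (ActionType.LIFT, params.x, params.y)]
    dx, dy = DIRECTIONS[params.direction]
    actions: List[RawAction] = []
    x, y = params.x, params.y
    for i in range(steps):
        t = i / (steps - 1)
        x = clamp(params.x + dx * length * t)
        y = clamp(params.y + dy * length * t)
        actions.append((ActionType.TOUCH, x, y))  # aligned touches
    actions.append((ActionType.LIFT, x, y))       # finger lifted at the end
    return actions


def control_step(observation, rng: random.Random) -> List[RawAction]:
    """One time step: gesture class, then parameters, then raw actions."""
    gesture_class = high_level_agent(observation, rng)
    params = mid_level_agent(observation, gesture_class, rng)
    return low_level_agent(gesture_class, params)
```

In this sketch a tap expands to a TOUCH followed by a LIFT at the same position, and a swipe expands to a run of aligned TOUCH actions followed by a LIFT, mirroring the gesture descriptions quoted from Tomaya.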

Regarding claim 4, Tomaya in view of Shi teaches the method of claim 1. Tomaya further teaches
wherein the plurality of gesture classes includes one or more of: a tap gesture class, a swipe gesture class, or a fling gesture class (Tomaya, page 3, 1st paragraph, “It is more useful for agents to control Android applications via gestures, such as pressing, long pressing, swiping, scrolling, or drag-and-drop.”).

Regarding claim 5, Tomaya in view of Shi teaches the method of claim 1. Tomaya further teaches
wherein the observation is an image of a display of the one or more computing devices (Tomaya, page 1, 2nd paragraph, “The agent-environment in AndroidEnv matches that of a user and a real device: the screen pixels constitute the observations.”).

Regarding claim 6, Tomaya in view of Shi teaches the method of claim 1. Tomaya further teaches
wherein the action is a touch input to a display of the one or more computing devices (Tomaya, page 2, Section 2.2. Action interface, 1st paragraph, “The native action space of the environment consists of a tuple consisting of a position (x,y) ∈ [0,1] x [0,1], determining the location of the action on the screen, and a discrete value ActionType ∈ {TOUCH, LIFT, REPEAT} indicating whether the agent opts for touching the screen at the indicated location, lifting the point from the screen, or repeating the last chosen action, respectively.”).

Regarding claim 9, Tomaya in view of Shi teaches the method of claim 1. Tomaya further teaches
wherein the low-level input further comprises a touch position on a display of the one or more computing devices of a preceding action of a previous time step (Tomaya, page 5, 2nd to last paragraph, “In this case, we discretised the screen as a 6x9 grid, resulting in 108 possible actions, corresponding to a choice of ActionType among (LIFT, TOUCH) combined with any of the 54 cells in the grid. To help memoryless agents, we augmented the current observation with a one-hot encoding of the location of the last taken action, which provides a more informative input for learning.” Examiner notes that the observation is the low-level input. Examiner further notes that the location of the last taken action is the touch position on a display of the one or more computing devices of a preceding action of a previous time step.).
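For illustration only, the arithmetic in the quoted Tomaya passage (a 6x9 grid giving 54 cells, combined with two action types to give 108 actions, plus a one-hot encoding of the last action location) can be checked with a short sketch. The function names, the choice of 9 rows by 6 columns, and the flat index layout (action type first, then row, then column) are assumptions made for this sketch and are not stated in Tomaya.

```python
from typing import List, Tuple

# Assumed orientation of the "6x9 grid": 6 columns by 9 rows.
GRID_ROWS, GRID_COLS = 9, 6
ACTION_TYPES = ("LIFT", "TOUCH")
NUM_CELLS = GRID_ROWS * GRID_COLS


def num_actions() -> int:
    """Total discrete actions: action types times grid cells."""
    return len(ACTION_TYPES) * NUM_CELLS


def decode_action(index: int) -> Tuple[str, int, int]:
    """Map a flat action index to (action type, row, column)."""
    if not 0 <= index < num_actions():
        raise ValueError("action index out of range")
    type_idx, cell = divmod(index, NUM_CELLS)
    row, col = divmod(cell, GRID_COLS)
    return ACTION_TYPES[type_idx], row, col


def one_hot_last_location(row: int, col: int) -> List[int]:
    """One-hot vector over grid cells marking the last action location."""
    vec = [0] * NUM_CELLS
    vec[row * GRID_COLS + col] = 1
    return vec
```

Under these assumptions, `num_actions()` evaluates to 2 x 54 = 108, matching the count in the quoted passage, and the one-hot vector has one entry per grid cell.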

Regarding claim 10, Tomaya in view of Shi teaches the method of claim 1. Tomaya further teaches
wherein the low-level input comprises a one-hot encoding of each of the parameters and a one-hot encoding of the touch position on a display of the one or more computing devices of a preceding action of a previous time step (Tomaya, page 5, 2nd to last paragraph, “In this case, we discretised the screen as a 6x9 grid, resulting in 108 possible actions, corresponding to a choice of ActionType among (LIFT, TOUCH) combined with any of the 54 cells in the grid. To help memoryless agents, we augmented the current observation with a one-hot encoding of the location of the last taken action, which provides a more informative input for learning.” Examiner notes that the observation is the low-level input. Examiner further notes that location of the last taken action is the touch position on a display of the one or more computing devices of a preceding action of a previous time step.).

Regarding claim 11, Tomaya in view of Shi teaches the method of claim 1. Tomaya further teaches
wherein each action in the sequence comprises a touch position on a display of the one or more computing devices (Tomaya, page 2, Section 2.2. Action interface, 1st paragraph, “The native action space of the environment consists of a tuple consisting of a position (x,y) ∈ [0,1] x [0,1], determining the location of the action on the screen, and a discrete value ActionType ∈ {TOUCH, LIFT, REPEAT} indicating whether the agent opts for touching the screen at the indicated location, lifting the point from the screen, or repeating the last chosen action, respectively.”).

Regarding claim 13, Tomaya in view of Shi teaches the method of claim 1. Tomaya further teaches
wherein each gesture class of the plurality of gesture classes has a respective set of parameters that each have a respective set of possible values (Tomaya, page 2, Section 2.2. Action interface, 1st paragraph, “Raw action space. The native action space of the environment consists of a tuple consisting of a position (x,y) ∈ [0,1] x [0,1], determining the location of the action on the screen, and a discrete value ActionType ∈ {TOUCH, LIFT, REPEAT} indicating whether the agent opts for touching the screen at the indicated location, lifting the point from the screen, or repeating the last chosen action, respectively.” Examiner notes that determining the discrete value ActionType is selecting a gesture class.) 
Tomaya does not explicitly disclose, but Shi does disclose
wherein the mid-level agent neural network is configured to process the mid-level input to generate a respective score for each possible value of each of the parameters for each of the gesture classes, and wherein processing the mid-level input comprises (Shi, page 5, 2nd paragraph, “Our model first processes the image using a Convolutional Neural Network (CNN). For DOM, we compute a text feature map based on the matching between query and DOM. Then the two maps are concatenated into a join representation. On top of this we develop two variants: first we flatten the features and feed them directly through a fully-connected layer (GlobalCNN). Since we had the intuition that local feature alone should suffice to characterize the action, we also examine a LocalCNN architecture to capture the intuition that agent should attend to where cursor is. So the mouse distribution is used as soft attention (Bahdanau et al., 2014) to average the CNN features into a global representation to predict mouse buttons and keyboard events” where “We train models on web tasks by sequencing behavior cloning and reinforcement learning” (Shi, page 5, 1st column, Section 3.2 Optimization) where “Supervised Learning. We obtain a behavior cloning policy by training on the demonstrations using Adam” (Shi, page 5, 2nd column, 4th paragraph) and “Demonstration data. We collected 10 minutes of human demonstrations on each of the 100 MiniWoB environments (about 17 hours total). Unlike the FormWoB and QAWoB settings, the MiniWoB dataset contains interactions that require dragging and hovering (e.g., to trigger a menu expansion). Therefore, we process the demonstrations at regular 83 millisecond intervals (12 frames per second) to extract approximately 720,000 state-action pairs. 
With gridpoints spaced 8 pixels across the 160 pixel area, we obtain a 20x20 grid and 3 possible actions (move, drag, click), leading to a total of 20x20x3 = 1200 possible actions” (Shi, page 5, 1st column, last paragraph – 2nd column, 1st paragraph).  Examiner notes that the CNN is the mid-level neural network. Examiner further notes that the CNN is trained to output a behavior policy where the behavior policy is the score.): 
generating the mid-level output by selecting, for the selected gesture class, a respective value for each of the parameters for the selected gesture class using the respective scores for each of the possible values for the parameter (Shi, page 5, 2nd paragraph, “Our model first processes the image using a Convolutional Neural Network (CNN). For DOM, we compute a text feature map based on the matching between query and DOM. Then the two maps are concatenated into a join representation. On top of this we develop two variants: first we flatten the features and feed them directly through a fully-connected layer (GlobalCNN). Since we had the intuition that local feature alone should suffice to characterize the action, we also examine a LocalCNN architecture to capture the intuition that agent should attend to where cursor is. So the mouse distribution is used as soft attention (Bahdanau et al., 2014) to average the CNN features into a global representation to predict mouse buttons and keyboard events” where “We train models on web tasks by sequencing behavior cloning and reinforcement learning” (Shi, page 5, 1st column, Section 3.2 Optimization) where “Supervised Learning. We obtain a behavior cloning policy by training on the demonstrations using Adam” (Shi, page 5, 2nd column, 4th paragraph) and “Demonstration data. We collected 10 minutes of human demonstrations on each of the 100 MiniWoB environments (about 17 hours total). Unlike the FormWoB and QAWoB settings, the MiniWoB dataset contains interactions that require dragging and hovering (e.g., to trigger a menu expansion). Therefore, we process the demonstrations at regular 83 millisecond intervals (12 frames per second) to extract approximately 720,000 state-action pairs. 
With gridpoints spaced 8 pixels across the 160 pixel area, we obtain a 20x20 grid and 3 possible actions (move, drag, click), leading to a total of 20x20x3 = 1200 possible actions” (Shi, page 5, 1st column, last paragraph – 2nd column, 1st paragraph).  Examiner notes that the CNN is the mid-level neural network. Examiner further notes that the CNN is trained to output a behavior policy where the behavior policy is the score. Examiner further notes that the respective value is the behavior policy.).
Tomaya and Shi are considered analogous to the claimed invention because they both use reinforcement learning to control a computing device. It would have been obvious to one having ordinary skill in the art prior to the effective filing date to have modified Tomaya to include the use of neural networks as in Shi. Thus, this would be applying a known technique (reinforcement learning) to a known device (neural network) ready for improvement to yield predictable results (controlling a computing device) (MPEP 2143 I. (C) Use of known technique to improve similar devices (methods, or product) in the same way.).
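For illustration only, the figures in the quoted Shi passage and the idea of selecting a value from per-value scores can be sketched as follows. The grid size (160 pixels at 8-pixel spacing giving 20x20), the three action kinds, the 12 frames per second, and the 10 minutes on each of 100 environments come from the quoted text. The function names, the flat score layout (action kind first, then row, then column), and the use of a simple argmax are assumptions made for this sketch and do not describe how Shi's model or the claimed mid-level network actually selects outputs.

```python
from typing import List, Tuple

AREA_PIXELS = 160       # from the quoted Shi passage
GRID_SPACING = 8        # from the quoted Shi passage
ACTION_KINDS = ("move", "drag", "click")
GRID_SIZE = AREA_PIXELS // GRID_SPACING  # 20 gridpoints per side


def total_actions() -> int:
    """Size of the discrete action space: 20 x 20 x 3."""
    return GRID_SIZE * GRID_SIZE * len(ACTION_KINDS)


def select_action(scores: List[float]) -> Tuple[str, int, int]:
    """Pick the highest-scoring entry and decode it (assumed layout)."""
    if len(scores) != total_actions():
        raise ValueError("expected one score per possible action")
    best = max(range(len(scores)), key=scores.__getitem__)
    kind_idx, cell = divmod(best, GRID_SIZE * GRID_SIZE)
    row, col = divmod(cell, GRID_SIZE)
    return ACTION_KINDS[kind_idx], row, col


def demonstration_pairs(envs: int = 100, minutes_per_env: int = 10,
                        fps: int = 12) -> int:
    """State-action pairs from sampling demonstrations at a fixed rate."""
    return envs * minutes_per_env * 60 * fps
```

Under these assumptions, `total_actions()` gives 1200 and `demonstration_pairs()` gives 720,000, consistent with the quoted figures; 100 environments at 10 minutes each is 1000 minutes, or roughly 17 hours, and 12 frames per second corresponds to an interval of about 83 milliseconds.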

Regarding claim 14, Tomaya in view of Shi teaches the method of claim 1. Tomaya in view of Shi further teaches
wherein the low-level agent neural network comprises a respective neural network for each gesture class (Shi, page 5, 2nd paragraph, “Our model first processes the image using a Convolutional Neural Network (CNN). For DOM, we compute a text feature map based on the matching between query and DOM. Then the two maps are concatenated into a join representation. On top of this we develop two variants: first we flatten the features and feed them directly through a fully-connected layer (GlobalCNN). Since we had the intuition that local feature alone should suffice to characterize the action, we also examine a LocalCNN architecture to capture the intuition that agent should attend to where cursor is. So the mouse distribution is used as soft attention (Bahdanau et al., 2014) to average the CNN features into a global representation to predict mouse buttons and keyboard events.” Examiner notes that the LocalCNN is the low-level neural network. Examiner further notes that the LocalCNN is a neural network for all mouse actions and keyboard events, or gesture classes.). 
Tomaya and Shi are considered analogous to the claimed invention because they both use reinforcement learning to control a computing device. It would have been obvious to one having ordinary skill in the art prior to the effective filing date to have modified Tomaya to include the use of neural networks as in Shi. Thus, this would be applying a known technique (reinforcement learning) to a known device (neural network) ready for improvement to yield predictable results (controlling a computing device) (MPEP 2143 I. (C) Use of known technique to improve similar devices (methods, or product) in the same way.).

Regarding claim 15, Tomaya in view of Shi teaches the method of claim 1. Tomaya further teaches
wherein the high-level agent … have been trained through reinforcement learning on training data for the task (Tomaya, page 1, abstract, “we introduce AndroidEnv, an open-source platform for Reinforcement Learning (RL) research built on top of the Android ecosystem….we present an empirical evaluation of some popular reinforcement learning agents on a set of tasks built on this platform.”)
Tomaya does not teach, but Shi does teach
wherein … the mid-level agent neural network have been trained through reinforcement learning on training data for the task (Shi, page 7, 2nd paragraph, “Reinforcement Learning. We fine-tune the models using RL on each of the environments separately. For every episode, we sample randomly from the set of queries and run the model at 8 FPS” and “To foster reinforcement learning research in such settings, we introduce the World of Bits (WoB), a platform in which agents complete tasks on the Internet by performing low-level keyboard and mouse actions.”(Shi, page 1, abstract).)
Tomaya and Shi are considered analogous to the claimed invention because they both use reinforcement learning to control a computing device. It would have been obvious to one having ordinary skill in the art prior to the effective filing date to have modified Tomaya to include the use of neural networks as in Shi. Thus, this would be applying a known technique (reinforcement learning) to a known device (neural network) ready for improvement to yield predictable results (controlling a computing device) (MPEP 2143 I. (C) Use of known technique to improve similar devices (methods, or product) in the same way.).

Regarding claim 18, Tomaya in view of Shi teaches the method of claim 15. Tomaya in view of Shi further teaches
wherein the mid-level agent neural network has been trained using randomly chosen gesture classes (Shi, page 5, 2nd paragraph, “Our model first processes the image using a Convolutional Neural Network (CNN). For DOM, we compute a text feature map based on the matching between query and DOM. Then the two maps are concatenated into a join representation. On top of this we develop two variants: first we flatten the features and feed them directly through a fully-connected layer (GlobalCNN). Since we had the intuition that local feature alone should suffice to characterize the action, we also examine a LocalCNN architecture to capture the intuition that agent should attend to where cursor is. So the mouse distribution is used as soft attention (Bahdanau et al., 2014) to average the CNN features into a global representation to predict mouse buttons and keyboard events” where “We train models on web tasks by sequencing behavior cloning and reinforcement learning” (Shi, page 5, 1st column, Section 3.2 Optimization) where “Supervised Learning. We obtain a behavior cloning policy by training on the demonstrations using Adam” (Shi, page 5, 2nd column, 4th paragraph) and “Demonstration data. We collected 10 minutes of human demonstrations on each of the 100 MiniWoB environments (about 17 hours total). Unlike the FormWoB and QAWoB settings, the MiniWoB dataset contains interactions that require dragging and hovering (e.g., to trigger a menu expansion). Therefore, we process the demonstrations at regular 83 millisecond intervals (12 frames per second) to extract approximately 720,000 state-action pairs. With gridpoints spaced 8 pixels across the 160 pixel area, we obtain a 20x20 grid and 3 possible actions (move, drag, click), leading to a total of 20x20x3 = 1200 possible actions” (Shi, page 5, 1st column, last paragraph – 2nd column, 1st paragraph).  Examiner notes that the CNN is the mid-level neural network. 
Examiner further notes that the CNN is trained, or conditioned, using all actions, or gesture classes, from demonstrations through the behavior cloning.). 
Tomaya and Shi are considered analogous to the claimed invention because they both use reinforcement learning to control a computing device. It would have been obvious to one having ordinary skill in the art prior to the effective filing date to have modified Tomaya to include the use of neural networks as in Shi. Thus, this would be applying a known technique (reinforcement learning) to a known device (neural network) ready for improvement to yield predictable results (controlling a computing device) (MPEP 2143 I. (C) Use of known technique to improve similar devices (methods, or product) in the same way.).

Regarding claim 20, Tomaya teaches 
A system comprising one or more computers and one or more storage devices storing instructions that when executed by the one or more computers cause the one or more computers to perform operations comprising, at each of a plurality of time steps (Tomaya, page 1, abstract, “We introduce AndroidEnv, an open-source platform for Reinforcement Learning (RL) research built on top of the Android ecosystem. AndroidEnv allows RL agents to interact with a wide variety of apps and services commonly used by humans through a universal touchscreen interface” where “The agent-environment in AndroidEnv matches that of a user and a real device: the screen pixels constitute the observations, the action space is defined by touchscreen gestures, the interaction is real-time, and actions are executed asynchronously, while the environment runs at its own time scale” (Tomaya, page 1, 2nd paragraph) where “we described AndroidEnv, an AI platform based on the Android Operating System, which provided tasks based on its large app ecosystem”(Tomaya, page 10, last paragraph) and Tomaya, page 7, Figure 7, 
): 
receiving an observation characterizing a state of the one or more computing devices at the time step (Tomaya, page 7, Figure 7, 
where “The agent-environment in AndroidEnv matches that of a user and a real device: the screen pixels constitute the observations, the action space is defined by touchscreen gestures, the interaction is real-time, and actions are executed asynchronously, while the environment runs at its own time scale” (Tomaya, page 1, 2nd paragraph). Examiner notes that the observation is the pixels of the screen on the computing device. Examiner further notes that Figure 7 shows the agent receiving an observation at the time step.); 
selecting a gesture class for the time step from a plurality of gesture classes using a high-level agent (Tomaya, page 2, Section 2.2. Action interface, 1st paragraph, “Raw action space. The native action space of the environment consists of a tuple consisting of a position (x,y) ∈ [0,1] x [0,1], determining the location of the action on the screen, and a discrete value ActionType ∈ {TOUCH, LIFT, REPEAT} indicating whether the agent opts for touching the screen at the indicated location, lifting the point from the screen, or repeating the last chosen action, respectively.” Examiner notes that determining the discrete value ActionType is selecting a gesture class.); 
processing a mid-level input derived from the observation … to generate a mid-level output that comprises parameters that define a gesture from the selected gesture class wherein the parameters comprise one or more of: at least one touch position on a display of the one or more computing devices, or a cardinal direction along the display of the one or more computing devices (Tomaya, page 2, 1st paragraph “It is more useful for agents to control Android applications via gestures, such as pressing, long pressing, swiping, scrolling, or drag-and-drop. Each of these correspond to a particular sequence of raw actions: for example, a screen touch at a particular location, followed by a lift of the imaginary finger is a sequence that Android can interpret as a press of a button. Similarly, Android will interpret a sequence of aligned touches as scrolling” and “Raw action space. The native action space of the environment consists of a tuple consisting of a position (x,y) ∈ [0,1] x [0,1], determining the location of the action on the screen, and a discrete value ActionType ∈ {TOUCH, LIFT, REPEAT} indicating whether the agent opts for touching the screen at the indicated location, lifting the point from the screen, or repeating the last chosen action, respectively” (Tomaya, page 2, Section 2.2. Action interface, 1st paragraph). Examiner notes that processing the mid-level input is interpreting the sequence of raw actions, and that the particular sequence of raw actions constitutes the parameters that define a gesture. Examiner further notes that by tracking the positions of screen touches in scrolling, the parameters that define a gesture from the selected gesture class comprise a cardinal direction along a display.);
processing a low-level input derived from at least the parameters … to generate a policy output that defines a sequence of one or more actions from a plurality of actions for interacting with the one or more computing devices to perform the gesture defined by the mid-level output (Tomaya, page 2, 3rd paragraph, “Another notable feature of AndroidEnv is the spatial correlation between actions and observations. Often an action can result in local changes in the pixels near the location of the action, or the position of certain items in the observation might hint at the next best location to take an action. In particular, the screen is often suggestive of the kind of gestures the application expects: smartphone users would often find it intuitive to tap where they see an item in the shape of a button, or to scroll where they see a drop-down menu” where “It is more useful for agents to control Android applications via gestures, such as pressing, long pressing, swiping, scrolling, or drag-and-drop. Each of these correspond to a particular sequence of raw actions: for example, a screen touch at a particular location, followed by a lift of the the imaginary finger is a sequence that Android can interpret as a press of a button. Similarly, Android will interpret a sequence of aligned touches as scrolling” (Tomaya, page 2, 1st paragraph). Examiner notes that the policy output is hinting the next best location to take an action. Examiner further notes that the action comprises gestures.)
and performing the sequence of one or more actions to interact with the one or more computing devices (Tomaya, page 2, 1st paragraph “It is more useful for agents to control Android applications via gestures, such as pressing, long pressing, swiping, scrolling, or drag-and-drop. Each of these correspond to a particular sequence of raw actions: for example, a screen touch at a particular location, followed by a lift of the the imaginary finger is a sequence that Android can interpret as a press of a button. Similarly, Android will interpret a sequence of aligned touches as scrolling.”).
Tomaya does not explicitly disclose the use of neural networks. Shi does disclose
using a mid-level agent neural network conditioned on the selected gesture class (Shi, page 5, 2nd paragraph, “Our model first processes the image using a Convolutional Neural Network (CNN). For DOM, we compute a text feature map based on the matching between query and DOM. Then the two maps are concatenated into a join representation. On top of this we develop two variants: first we flatten the features and feed them directly through a fully-connected layer (GlobalCNN). Since we had the intuition that local feature alone should suffice to characterize the action, we also examine a LocalCNN architecture to capture the intuition that agent should attend to where cursor is. So the mouse distribution is used as soft attention (Bahdanau et al., 2014) to average the CNN features into a global representation to predict mouse buttons and keyboard events” where “We train models on web tasks by sequencing behavior cloning and reinforcement learning” (Shi, page 5, 1st column, Section 3.2 Optimization) where “Supervised Learning. We obtain a behavior cloning policy by training on the demonstrations using Adam” (Shi, page 5, 2nd column, 4th paragraph) and “Demonstration data. We collected 10 minutes of human demonstrations on each of the 100 MiniWoB environments (about 17 hours total). Unlike the FormWoB and QAWoB settings, the MiniWoB dataset contains interactions that require dragging and hovering (e.g., to trigger a menu expansion). Therefore, we process the demonstrations at regular 83 millisecond intervals (12 frames per second) to extract approximately 720,000 state-action pairs. With gridpoints spaced 8 pixels across the 160 pixel area, we obtain a 20x20 grid and 3 possible actions (move, drag, click), leading to a total of 20x20x3 = 1200 possible actions” (Shi, page 5, 1st column, last paragraph – 2nd column, 1st paragraph).  Examiner notes that the CNN is the mid-level neural network. 
Examiner further notes that the CNN is trained, or conditioned, using actions from demonstrations through the behavior cloning.)
using a low-level agent neural network (Shi, page 5, 2nd paragraph, “Our model first processes the image using a Convolutional Neural Network (CNN). For DOM, we compute a text feature map based on the matching between query and DOM. Then the two maps are concatenated into a join representation. On top of this we develop two variants: first we flatten the features and feed them directly through a fully-connected layer (GlobalCNN). Since we had the intuition that local feature alone should suffice to characterize the action, we also examine a LocalCNN architecture to capture the intuition that agent should attend to where cursor is. So the mouse distribution is used as soft attention (Bahdanau et al., 2014) to average the CNN features into a global representation to predict mouse buttons and keyboard events.” Examiner notes that the LocalCNN is the low-level neural network.)
Tomaya and Shi are considered analogous to the claimed invention because they both use reinforcement learning to control a computing device. It would have been obvious to one having ordinary skill in the art prior to the effective filing date to have modified Tomaya to include the use of neural networks as in Shi. Thus, this would be applying a known technique (reinforcement learning) to a known device (neural network) ready for improvement to yield predictable results (controlling a computing device) (MPEP 2143 I. (C) Use of known technique to improve similar devices (methods, or product) in the same way.).
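
For illustration only, the following minimal Python sketch shows how a gesture defined by parameters (a touch position, optionally with a cardinal direction) could be expanded into a sequence of raw actions in the action space quoted from Tomaya, i.e., a position (x, y) in [0,1] x [0,1] and an ActionType in {TOUCH, LIFT, REPEAT}. The names `RawAction`, `tap`, and `swipe`, the swipe length, and the step count are hypothetical and are not taken from Tomaya, Shi, or the application.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List


class ActionType(Enum):
    """Discrete raw action types, as listed in the quoted Tomaya passage."""
    TOUCH = 0
    LIFT = 1
    REPEAT = 2


@dataclass(frozen=True)
class RawAction:
    """One raw action: a normalized screen position plus an action type."""
    x: float
    y: float
    action_type: ActionType


# Hypothetical unit vectors for cardinal directions (y grows downward).
DIRECTIONS = {
    "up": (0.0, -1.0),
    "down": (0.0, 1.0),
    "left": (-1.0, 0.0),
    "right": (1.0, 0.0),
}


def _clamp(value: float) -> float:
    """Keep a coordinate inside the normalized range [0, 1]."""
    return min(1.0, max(0.0, value))


def tap(x: float, y: float) -> List[RawAction]:
    """A press: touch at a location, then lift at the same location."""
    return [RawAction(x, y, ActionType.TOUCH), RawAction(x, y, ActionType.LIFT)]


def swipe(x: float, y: float, direction: str,
          length: float = 0.3, steps: int = 3) -> List[RawAction]:
    """A swipe: a run of aligned touches along a direction, then a lift.

    `length` and `steps` are assumed values chosen for this sketch.
    """
    dx, dy = DIRECTIONS[direction]
    actions = []
    for i in range(steps + 1):
        frac = i / steps
        actions.append(RawAction(_clamp(x + dx * length * frac),
                                 _clamp(y + dy * length * frac),
                                 ActionType.TOUCH))
    last = actions[-1]
    actions.append(RawAction(last.x, last.y, ActionType.LIFT))
    return actions
```

In this sketch the gesture parameters play the role of the mid-level output, and `tap` and `swipe` stand in for whatever component produces the raw action sequence; a hand-written expansion like this is not a neural network and is shown only to make the data flow concrete.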


Claim(s) 2-3, 12, 16-17, and 21-22 is/are rejected under 35 U.S.C. 103 as being unpatentable over Tomaya in view of Shi in further view of Pateria et al. (“Hierarchical Reinforcement Learning: A Comprehensive Survey”) (hereafter referred to as Pateria).

Regarding claim 2, Tomaya in view of Shi teaches the method of claim 1. Tomaya in view of Shi does not teach, but Pateria teaches
wherein selecting a gesture class for the time step from a plurality of gesture classes using a high-level agent comprises: selecting, from the plurality of gesture classes, a gesture class that was determined to be a best performing gesture class for the task during training of the high-level agent (Pateria, page 6, Figure 2 [image media_image2.png omitted] and “The reward obtained in response to performing the subtask wt starting from state st is denoted as R(st, wt), calculated as follows [Equation (4): image media_image3.png omitted] Equation (4) indicates that the reward R(st, wt) is equal to the expected cumulative reward obtained while following the subtask policy πwt from time t until the termination of wt after cwt timesteps. Now, an optimal task policy would be the one that leads to the following desired maximum Q-value: [maximum Q-value equation: image media_image4.png omitted]” (Pateria, page 6, 1st – 2nd paragraph) and where “Hierarchical Reinforcement Learning (HRL) decomposes a long-horizon reinforcement learning task into a hierarchy of subproblems or subtasks such that a higher-level policy learns to perform the task by choosing optimal subtasks as the higher-level actions.” (Pateria, page 2, 2nd paragraph). Examiner notes that Figure 2 shows the highest task level choosing an optimal subtask for the next level.).
Tomaya, Shi, and Pateria are considered analogous to the claimed invention because they use reinforcement learning in a hierarchical fashion to control a computing device. It would have been obvious to one having ordinary skill in the art prior to the effective filing date to have modified Tomaya in view of Shi to use the high level agent like in Pateria. Thus, this would be applying a known technique (hierarchical reinforcement learning) to a known device (high-level agent) ready for improvement to yield predictable results (selecting a subtask) (MPEP 2143 I. (C) Use of known technique to improve similar devices (methods, or product) in the same way.).

Regarding claim 3, Tomaya in view of Shi teaches the method of claim 1. Tomaya in view of Shi does not teach, but Pateria does teach
wherein the high-level agent comprises a high-level agent neural network, and wherein selecting a gesture class for the time step from a plurality of gesture classes using a high-level agent comprises (Pateria, page 6, Figure 2 [image media_image2.png omitted] and “Vezhnevets et al. [99] propose a feudal hierarchy of neural networks in which a higher level network called the ‘Manager’ samples a subgoal in a learned latent subgoal space” (Pateria, page 17, 3rd paragraph). Examiner notes that the highest level comprises a neural network.):
processing a high-level input derived from the observation using the high-level agent neural network to generate a high-level output that comprises a respective score for each gesture class of the plurality of gesture classes (Pateria, page 6, Figure 2 [image media_image2.png omitted] and “The reward obtained in response to performing the subtask wt starting from state st is denoted as R(st, wt), calculated as follows [Equation (4): image media_image3.png omitted] Equation (4) indicates that the reward R(st, wt) is equal to the expected cumulative reward obtained while following the subtask policy πwt from time t until the termination of wt after cwt timesteps. Now, an optimal task policy would be the one that leads to the following desired maximum Q-value: [maximum Q-value equation: image media_image4.png omitted]” (Pateria, page 6, 1st – 2nd paragraph) and where “Hierarchical Reinforcement Learning (HRL) decomposes a long-horizon reinforcement learning task into a hierarchy of subproblems or subtasks such that a higher-level policy learns to perform the task by choosing optimal subtasks as the higher-level actions.” (Pateria, page 2, 2nd paragraph). Examiner notes that Figure 2 shows the highest task level choosing an optimal subtask for the next level. Examiner further notes that the optimal task policy is the respective score and the subtasks are the gesture classes.);
and selecting, using the high-level output, a gesture class from the plurality of gesture classes (Pateria, page 6, Figure 2 [image media_image2.png omitted] and “The reward obtained in response to performing the subtask wt starting from state st is denoted as R(st, wt), calculated as follows [Equation (4): image media_image3.png omitted] Equation (4) indicates that the reward R(st, wt) is equal to the expected cumulative reward obtained while following the subtask policy πwt from time t until the termination of wt after cwt timesteps. Now, an optimal task policy would be the one that leads to the following desired maximum Q-value: [maximum Q-value equation: image media_image4.png omitted]” (Pateria, page 6, 1st – 2nd paragraph) and where “Hierarchical Reinforcement Learning (HRL) decomposes a long-horizon reinforcement learning task into a hierarchy of subproblems or subtasks such that a higher-level policy learns to perform the task by choosing optimal subtasks as the higher-level actions.” (Pateria, page 2, 2nd paragraph). Examiner notes that Figure 2 shows the highest task level choosing an optimal subtask for the next level. Examiner further notes that the optimal task policy is the respective score and the subtasks are the gesture classes.).
Tomaya, Shi, and Pateria are considered analogous to the claimed invention because they use reinforcement learning in a hierarchical fashion to control a computing device. It would have been obvious to one having ordinary skill in the art prior to the effective filing date to have modified Tomaya in view of Shi to use the high level agent like in Pateria. Thus, this would be applying a known technique (hierarchical reinforcement learning) to a known device (high-level agent) ready for improvement to yield predictable results (selecting a subtask) (MPEP 2143 I. (C) Use of known technique to improve similar devices (methods, or product) in the same way.).
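
For illustration only, selecting a gesture class from per-class scores, which the Examiner maps to choosing the subtask with the maximum Q-value, can be sketched in Python as follows. This is a minimal sketch that assumes the scores are already available as a dictionary; the function names, gesture class names, and score values are hypothetical, and no neural network is modeled.

```python
import math
from typing import Dict


def select_gesture_class(scores: Dict[str, float]) -> str:
    """Greedy selection: return the gesture class with the highest score."""
    if not scores:
        raise ValueError("scores must contain at least one gesture class")
    return max(scores, key=scores.get)


def softmax(scores: Dict[str, float]) -> Dict[str, float]:
    """Turn raw scores into a probability distribution over gesture classes."""
    peak = max(scores.values())  # subtract the max for numerical stability
    exps = {name: math.exp(value - peak) for name, value in scores.items()}
    total = sum(exps.values())
    return {name: value / total for name, value in exps.items()}


# Hypothetical per-class scores for one time step.
example_scores = {"tap": 0.1, "swipe": 0.7, "scroll": 0.2}
chosen = select_gesture_class(example_scores)
probabilities = softmax(example_scores)
```

Greedy selection corresponds to always taking the highest-scoring class; sampling from the softmax distribution instead is one common alternative when exploration is desired, and is included here only as an assumption about how scores might be used.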

Regarding claim 12, Tomaya in view of Shi teaches the method of claim 1. Tomaya in view of Shi further teaches
wherein the mid-level agent neural network is configured to process the mid-level input derived from the observation, and wherein processing the mid-level input comprises (Shi, page 5, 2nd paragraph, “Our model first processes the image using a Convolutional Neural Network (CNN). For DOM, we compute a text feature map based on the matching between query and DOM. Then the two maps are concatenated into a join representation. On top of this we develop two variants: first we flatten the features and feed them directly through a fully-connected layer (GlobalCNN). Since we had the intuition that local feature alone should suffice to characterize the action, we also examine a LocalCNN architecture to capture the intuition that agent should attend to where cursor is. So the mouse distribution is used as soft attention (Bahdanau et al., 2014) to average the CNN features into a global representation to predict mouse buttons and keyboard events” where “We train models on web tasks by sequencing behavior cloning and reinforcement learning” (Shi, page 5, 1st column, Section 3.2 Optimization) where “Supervised Learning. We obtain a behavior cloning policy by training on the demonstrations using Adam” (Shi, page 5, 2nd column, 4th paragraph) and “Demonstration data. We collected 10 minutes of human demonstrations on each of the 100 MiniWoB environments (about 17 hours total). Unlike the FormWoB and QAWoB settings, the MiniWoB dataset contains interactions that require dragging and hovering (e.g., to trigger a menu expansion). Therefore, we process the demonstrations at regular 83 millisecond intervals (12 frames per second) to extract approximately 720,000 state-action pairs. With gridpoints spaced 8 pixels across the 160 pixel area, we obtain a 20x20 grid and 3 possible actions (move, drag, click), leading to a total of 20x20x3 = 1200 possible actions” (Shi, page 5, 1st column, last paragraph – 2nd column, 1st paragraph).  
Examiner notes that the CNN is the mid-level neural network. Examiner further notes that the CNN is trained, or conditioned, using actions from demonstrations through the behavior cloning, where the demonstrations are the observation.):
Tomaya and Shi are considered analogous to the claimed invention because they both use reinforcement learning to control a computing device. It would have been obvious to one having ordinary skill in the art prior to the effective filing date to have modified Tomaya to include the use of neural networks as in Shi. Thus, this would be applying a known technique (reinforcement learning) to a known device (neural network) ready for improvement to yield predictable results (controlling a computing device) (MPEP 2143 I. (C) Use of known technique to improve similar devices (methods, or product) in the same way.).
	Tomaya in view of Shi does not disclose, but Pateria does disclose
processing the observation using an encoder neural network for each gesture class to generate a feature representation for each gesture class (Pateria, page 21, last paragraph, “Co-Reyes et al. [77] proposed an approach called Self Consistent Trajectory Autoencoder (SeCTAR), in which an encoder LSTM [40] embeds the state transition trajectories…into a low-dimensional continuous latent vector space and a decoder LSTM learns to decode a latent vector into a policy. A latent vector represents similar trajectories, hence, the policy decoded from the latent vector is considered to represent a skill.” Examiner notes that the feature representation is the latent vector);
and processing each feature representation using a decoder neural network to generate a respective score for each of the parameters for each of the gesture classes (Pateria, page 21, last paragraph, “Co-Reyes et al. [77] proposed an approach called Self Consistent Trajectory Autoencoder (SeCTAR), in which an encoder LSTM [40] embeds the state transition trajectories…into a low-dimensional continuous latent vector space and a decoder LSTM learns to decode a latent vector into a policy. A latent vector represents similar trajectories, hence, the policy decoded from the latent vector is considered to represent a skill.” Examiner notes that the policy is the score.).
Tomaya, Shi, and Pateria are considered analogous to the claimed invention because they use reinforcement learning in a hierarchical fashion to control a computing device. It would have been obvious to one having ordinary skill in the art prior to the effective filing date to have modified Tomaya in view of Shi to use an autoencoder like in Pateria. Doing so is advantageous because “new skills can be easily interpolated in such a continuous space” (Pateria, page 21, last paragraph).

Regarding claim 16, Tomaya in view of Shi teaches the method of claim 15. Tomaya in view of Shi does not teach, but Pateria does teach
wherein the low-level agent neural network has been pre-trained prior to the training of the high-level agent and mid-level agent neural network (Pateria, page 20, 2nd paragraph, “Sukhbaatar et al. [88] proposed an approach called Hierarchical Self Play (HSP) to learn continuous embedding of subgoals using asymmetric self-play [89]. Asymmetric self-play is an unsupervised pre-training phase that is illustrated as follows: At the beginning of asymmetric self-play, two standard (non-hierarchical) RL policies are initialized, say Alice (πA) and Bob (πB)” where “The subgoal encoder of HSP…basically encodes various target states (s*) into a low-dimensional subgoal space. Once a subgoal space has been discovered in the pre-training phase of asymmetric self-play, it is simply used as the continuous action space of a higher-level policy of an HRL agent used to perform a particular task. The lower level of the agent is initialized using the Bob policy and then fine-tuned on the task.” (Pateria, page 20, 3rd paragraph). Examiner notes that the low level agent is pretrained using the Bob policy and then used as the continuous action space of the high level agents.).
Tomaya, Shi, and Pateria are considered analogous to the claimed invention because they use reinforcement learning in a hierarchical fashion to control a computing device. It would have been obvious to one having ordinary skill in the art prior to the effective filing date to have modified Tomaya in view of Shi to pretrain the low level agent like in Pateria. Doing so is advantageous because “this results in an asymmetric self-play mechanism in which Bob learns to reach targets set by Alice and Alice learns to achieve new, unexplored targets that are beyond the reach of Bob. In this way, an agent consisting of these two policies explores the environment effectively without external supervision” (Pateria, page 20, 2nd paragraph).

Regarding claim 17, Tomaya in view of Shi teaches the method of claim 15. Tomaya in view of Shi does not teach, but Pateria does teach
wherein the mid-level agent neural network has been trained prior to the training of the high-level agent (Pateria, page 7, Section 2.2.3 Problem Definition of HRL, “the policies at various levels of πhierarchy can be learned simultaneously in an end-to-end manner [3, 21, 52, 54, 69] or they may be learned one level at a time in a bottom-to-top manner [25, 28, 60]”). 
Tomaya, Shi, and Pateria are considered analogous to the claimed invention because they use reinforcement learning in a hierarchical fashion to control a computing device. It would have been obvious to one having ordinary skill in the art prior to the effective filing date to have modified Tomaya in view of Shi to train the mid-level agent neural network prior to the high-level agent like in Pateria. Thus, this would be applying a known technique (hierarchical reinforcement learning) to a known device (neural networks) ready for improvement to yield predictable results (training neural networks) (MPEP 2143 I. (C) Use of known technique to improve similar devices (methods, or product) in the same way.).

Regarding claim 19, Tomaya in view of Shi teaches the method of claim 15. Tomaya in view of Shi does not teach, but Pateria does teach
wherein the mid-level agent neural network has been trained jointly with the high-level agent (Pateria, page 7, Section 2.2.3 Problem Definition of HRL, “the policies at various levels of πhierarchy can be learned simultaneously in an end-to-end manner [3, 21, 52, 54, 69] or they may be learned one level at a time in a bottom-to-top manner [25, 28, 60]”). 
Tomaya, Shi, and Pateria are considered analogous to the claimed invention because they use reinforcement learning in a hierarchical fashion to control a computing device. It would have been obvious to one having ordinary skill in the art prior to the effective filing date to have modified Tomaya in view of Shi to train the mid-level agent neural network jointly with the high-level agent like in Pateria. Thus, this would be applying a known technique (hierarchical reinforcement learning) to a known device (neural networks) ready for improvement to yield predictable results (training neural networks) (MPEP 2143 I. (C) Use of known technique to improve similar devices (methods, or product) in the same way.).

Regarding claim 21, claim 21 recites substantially similar limitations to claim 2, and is therefore rejected under the same analysis.

Regarding claim 22, claim 22 recites substantially similar limitations to claim 3, and is therefore rejected under the same analysis.

Response to Arguments
	On page 8, Applicant argues:
As described in the Specification:

"The described techniques control a computing device to perform a task by
exploiting a multi-level hierarchy of agents and agent neural networks. In
particular, the system uses a high-level agent to select among gesture classes, a
mid-level agent neural network to select gestures, and a low-level agent neural
network to execute gestures, allowing for efficient and effective control of the
computing device for a variety of tasks. The hierarchical decomposition also
provides abstraction for the high-level agent and mid-level agent neural network.

The system also provides temporal abstraction. The agents and agent neural
networks can be designed in isolation and can be trained at different stages or
using different techniques. For example, the mid-level agent neural network can
select among a discrete set of gestures, while the agent can control a computing
device that operates with a continuous or otherwise much larger action space."
[Specification, page 1, lines 18-28]

Thus, the claims achieve advantages such as reducing consumption of computational resources, compared to conventional systems that include a single agent controlling a computing device with a continuous or large action space, and allowing for effective control of a computing device.
Claim 1 includes limitations that enable these improvements, such as "selecting a gesture class for the time step from a plurality of gesture classes using a high-level agent", "processing a mid-level input derived from the observation using a mid-level agent neural network conditioned on the selected gesture class to generate a mid-level output that comprises parameters that define a gesture from the selected gesture class, wherein the parameters comprise one or more of: at least one touch position on a display of the one or more computing devices, or a cardinal direction along the display of the one or more computing devices", and "processing a low-level input derived from at least the parameters using a low-level agent neural network to generate a policy output" (as recited in claim 1).
This allows the system to control a computing device using the high-level agent, the mid-level agent neural network, and the low-level agent neural network to perform a hierarchy of sub-tasks (Specification at page 7 lines 10-13). In particular, the high-level agent performs the sub-task of selecting a gesture class (Specification at page 7 lines 14-15). The mid-level agent neural network performs the sub-task of selecting a gesture by determining parameters that can include, for example, at least one touch position on a display of the one or more computing devices, or a cardinal direction along the display of the one or more computing devices (Specification at page 9 line 24-page 10 line 6). The low-level agent neural network performs the sub-task of determining a sequence of actions that will result in a gesture being performed that meets the parameters (Specification at page 11 lines 3-6). By using the high-level agent, the mid-level agent neural network, and the low-level agent neural network to perform a hierarchy of sub-tasks, the claims provide for efficient and effective control of the one or more computing devices.

Regarding the Applicant’s argument that these elements provide an improvement, Examiner respectfully disagrees. Specifically, Examiner notes that the Applicant provides a bare assertion of an improvement without the detail necessary for the improvement to be apparent to one of ordinary skill in the art; such an assertion cannot establish an improvement (MPEP 2106.04(d)(1)). The assertion that the claims achieve advantages such as reducing consumption of computational resources and allowing for effective control of a computing device is not supported by the cited paragraphs, nor are those advantages claimed.

On pages 10-11, Applicant argues:
Applicant respectfully submits that the cited portion of Shi does not disclose or suggest "processing a mid-level input derived from the observation using a mid-level agent neural network conditioned on the selected gesture class to generate a mid-level output that comprises parameters that define a gesture from the selected gesture class, wherein the parameters comprise one or more of: at least one touch position on a display of the one or more computing devices, or a cardinal direction along the display of the one or more computing devices; processing a low level input derived from at least the parameters using a low-level agent neural network to generate a policy output that defines a sequence of one or more actions from a plurality of actions for interacting with the one or more computing devices to perform the gesture defined by the mid-level output."
In particular, the cited portion of Shi does not disclose or suggest processing a mid-level input using a mid-level agent neural network to generate a mid-level output that comprises parameters that define a gesture from the selected gesture class, and processing a low-level input using a low-level agent neural network to generate a policy output that defines a sequence of one or more actions to perform the gesture defined by the mid-level output. Although Shi describes a "Global CNN" and a "Local CNN," these convolutional neural networks are variants of each other. Shi describes that the Global CNN processes an image and Document Object Model (DOM) features to predict mouse and keyboard events. Shi describes that the Local CNN also processes an image and DOM features to predict mouse and keyboard events. Thus, Shi describes two neural networks that process the same type of input, i.e., an image and DOM features, to generate the same type of output, i.e., mouse and keyboard events, rather than a mid-level agent neural network that "process[es] a mid-level input derived from the observation" "to generate a mid-level output that comprises parameters that define a gesture from the selected gesture class, wherein the parameters comprise one or more of: at least one touch position on a display of the one or more computing devices, or a cardinal direction along the display of the one or more computing devices" and a low-level agent neural network that "process[es] a low-level input derived from at least the parameters" "to generate a policy output that defines a sequence of one or more actions from a plurality of actions for interacting with the one or more computing devices to perform the gesture defined by the mid-level output," as recited in amended claim 1.
Accordingly, Applicant respectfully requests that the Section 103 rejection to claims 1 and 20 and their respective dependent claims be withdrawn.

Regarding the Applicant’s argument that the prior art of record does not teach amended claim 1, Examiner respectfully disagrees. Specifically, Examiner notes that the combination of Tomaya and Shi teaches claim 1. Tomaya specifically discloses “processing a mid-level input derived from the observation … to generate a mid-level output that comprises parameters that define a gesture from the selected gesture class wherein the parameters comprise one or more of: at least one touch position on a display of the one or more computing devices, or a cardinal direction along the display of the one or more computing devices” and “processing a low-level input derived from at least the parameters … to generate a policy output that defines a sequence of one or more actions from a plurality of actions for interacting with the one or more computing devices to perform the gesture defined by the mid-level output.” Tomaya does not explicitly disclose neural networks, but Shi discloses “using a mid-level agent neural network conditioned on the selected gesture class” and “using a low-level agent neural network.” As such, Tomaya and Shi are considered analogous to the claimed invention because they both use reinforcement learning to control a computing device. It would have been obvious to one having ordinary skill in the art prior to the effective filing date to have modified Tomaya to include the use of neural networks as in Shi. Thus, this would be applying a known technique (reinforcement learning) to a known device (neural network) ready for improvement to yield predictable results (controlling a computing device) (MPEP 2143 I. (C) Use of known technique to improve similar devices (methods, or product) in the same way.). Examiner respectfully points the applicant to the above 103 rejections.

Conclusion
The prior art made of record and not relied upon is considered pertinent to applicant's disclosure. Levy et al. (“Learning Multi-Level Hierarchies with Hindsight”) also discloses a three level hierarchical reinforcement learning architecture.
THIS ACTION IS MADE FINAL. Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a).
A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action. In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any nonprovisional extension fee (37 CFR 1.17(a)) pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action. In no event, however, will the statutory period for reply expire later than SIX MONTHS from the mailing date of this final action.
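
For illustration only, the reply dates described in the preceding paragraph can be estimated with the following minimal Python sketch. It assumes a mailing date of April 21, 2026, taken from the prosecution timeline on this page, and it does not account for weekend or federal holiday rollover, extension fees, or the advisory-action timing rule; the helper names `add_months` and `reply_deadlines` are hypothetical.

```python
import calendar
from datetime import date
from typing import Dict


def add_months(start: date, months: int) -> date:
    """Add calendar months, clamping the day to the end of a shorter month."""
    month_index = start.month - 1 + months
    year = start.year + month_index // 12
    month = month_index % 12 + 1
    day = min(start.day, calendar.monthrange(year, month)[1])
    return date(year, month, day)


def reply_deadlines(mailing_date: date) -> Dict[str, date]:
    """Key dates measured from the mailing date of a final action."""
    return {
        "two_month_mark": add_months(mailing_date, 2),
        "shortened_statutory_period": add_months(mailing_date, 3),
        "statutory_maximum": add_months(mailing_date, 6),
    }


# Assumed mailing date, read from the prosecution timeline on this page.
mailing = date(2026, 4, 21)
deadlines = reply_deadlines(mailing)
for label, when in deadlines.items():
    print(f"{label}: {when.isoformat()}")
```

The output of a sketch like this should be checked against the mailing date printed on the action itself before being relied on for docketing.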
Any inquiry concerning this communication or earlier communications from the examiner should be directed to KAITLYN R HAEFNER whose telephone number is (571)272-1429. The examiner can normally be reached Monday - Thursday: 7:15 am - 5:15 pm EST.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Michelle Bechtold, can be reached at (571) 431-0762. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.





/K.R.H./Examiner, Art Unit 2148
/MICHELLE T BECHTOLD/Supervisory Patent Examiner, Art Unit 2148
Prosecution Timeline

Feb 02, 2023: Application Filed
Nov 13, 2025: Non-Final Rejection mailed (§101, §103)
Feb 10, 2026: Examiner Interview Summary
Feb 10, 2026: Applicant Interview (Telephonic)
Feb 25, 2026: Response Filed
Apr 21, 2026: Final Rejection mailed (§101, §103) (current)
Precedent Cases

Applications granted by this same examiner with similar technology

17/493,365
Patent 12572828
METHOD FOR INDUSTRY TEXT INCREMENT AND ELECTRONIC DEVICE
4y 5m to grant; granted Mar 10, 2026
Study what changed to get past this examiner. Based on the 1 most recent grant.
Prosecution Projections

Expected OA Rounds: 3-4
Grant Probability: 67%
With Interview (+100.0%): 99%
Median Time to Grant: 3y 10m (~6m remaining)
PTA Risk: Moderate
Based on 3 resolved cases by this examiner. Grant probability derived from career allowance rate.