Last updated: May 29, 2026

Application No. 18/539,171

REAL-WORLD ROBOT CONTROL USING TRANSFORMER NEURAL NETWORKS

Non-Final OA: §101, §102, §103, §112

Filed

Dec 13, 2023

Priority

Dec 13, 2022 — provisional 63/432,373

Examiner

NILSSON, ERIC

Art Unit

2151

Tech Center

2100 — Computer Architecture & Software

Assignee

Google LLC

OA Round

1 (Non-Final)

Interview Optional

Examiner has a relatively high allowance rate (83%) and a +17.7% interview lift. A written response may suffice.

Based on 501 resolved cases, 2023–2026

Examiner Intelligence

NILSSON, ERIC

Grants 83% — above average

Career Allowance Rate

415 granted / 501 resolved

+27.8% vs TC avg
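The headline allowance rate can be checked directly from the two counts shown above. The short sketch below only reproduces that arithmetic; the variable names are illustrative.

```python
# Counts as shown in the examiner profile above.
granted = 415
resolved = 501

allowance_rate = granted / resolved
print(f"{allowance_rate:.1%}")   # 82.8%
print(f"{allowance_rate:.0%}")   # 83%
```

The page's 83% figure matches the ratio rounded to the nearest whole percent.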

Interview Lift: +17.7% (strong), comparing resolved cases with and without an interview

Typical timeline

3y 1m

Avg Prosecution

26 currently pending

Career history

528

Total Applications

across all art units

Statute-Specific Performance

§101: 14.4% (-25.6% vs TC avg)

§103: 63.9% (+23.9% vs TC avg)

§102: 7.7% (-32.3% vs TC avg)

§112: 1.3% (-38.7% vs TC avg)

Tech Center average is an estimate. Based on career data from 501 resolved cases
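As a rough sanity check on the statute figures, the sketch below backs out the Tech Center baseline implied by each row. It assumes each "vs TC avg" value is a percentage-point difference (rate minus baseline); the page does not state this, and the names used are illustrative.

```python
# (rate %, delta vs TC avg) as listed above. Treating each delta as a
# percentage-point difference is an assumption, not something the page states.
STATUTE_STATS = {
    "101": (14.4, -25.6),
    "103": (63.9, +23.9),
    "102": (7.7, -32.3),
    "112": (1.3, -38.7),
}

def implied_baseline(rate, delta):
    """Baseline that would produce `delta` if delta = rate - baseline."""
    return round(rate - delta, 1)

baselines = {s: implied_baseline(r, d) for s, (r, d) in STATUTE_STATS.items()}
print(baselines)  # {'101': 40.0, '103': 40.0, '102': 40.0, '112': 40.0}
```

Under that reading, all four rows point to the same 40.0% figure, which would be consistent with a single flat estimate serving as the baseline for every statute.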

Office Action

§101 §102 §103 §112

DETAILED ACTION
This action is in response to claims filed 13 December 2023 for application 18539171 filed 13 December 2023. Currently claims 1-29 are pending.
Notice of Pre-AIA  or AIA  Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA .
Claim Rejections - 35 USC § 112
The following is a quotation of 35 U.S.C. 112(d):
(d) REFERENCE IN DEPENDENT FORMS.—Subject to subsection (e), a claim in dependent form shall contain a reference to a claim previously set forth and then specify a further limitation of the subject matter claimed. A claim in dependent form shall be construed to incorporate by reference all the limitations of the claim to which it refers.

The following is a quotation of pre-AIA  35 U.S.C. 112, fourth paragraph:
Subject to the following paragraph [i.e., the fifth paragraph of pre-AIA  35 U.S.C. 112], a claim in dependent form shall contain a reference to a claim previously set forth and then specify a further limitation of the subject matter claimed. A claim in dependent form shall be construed to incorporate by reference all the limitations of the claim to which it refers.

Claim 16 is rejected under 35 U.S.C. 112(d) or pre-AIA 35 U.S.C. 112, 4th paragraph, as being of improper dependent form for failing to further limit the subject matter of the claim upon which it depends, or for failing to include all the limitations of the claim upon which it depends. Claim 16 depends on claim 15, which depends on claim 14, which depends on claim 1; however, claim 16 also recites "when dependent on claim 8," and claim 8 is not in that chain of dependency. Applicant may cancel the claim(s), amend the claim(s) to place the claim(s) in proper dependent form, rewrite the claim(s) in independent form, or present a sufficient showing that the dependent claim(s) complies with the statutory requirements.
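The dependency problem behind this rejection can be illustrated with a small sketch that walks the chain from claim 16 up to its independent claim. The mapping covers only the claims relevant here, as recited in this action; the function and variable names are hypothetical.

```python
# Claim dependencies as recited in the Office action (claim -> parent claim).
# Only the claims relevant to the claim 16 rejection are listed.
DEPENDS_ON = {16: 15, 15: 14, 14: 1, 8: 1}

def dependency_chain(claim, depends_on):
    """Return the claim followed by every claim it depends on, up to the independent claim."""
    chain = [claim]
    while chain[-1] in depends_on:
        chain.append(depends_on[chain[-1]])
    return chain

chain_16 = dependency_chain(16, DEPENDS_ON)
print(chain_16)        # [16, 15, 14, 1]
print(8 in chain_16)   # False
```

Because claim 8 never appears in the chain, the limitations of claim 8 are not incorporated into claim 16 by reference.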

Claim Rejections - 35 USC § 101
35 U.S.C. 101 reads as follows:
Whoever invents or discovers any new and useful process, machine, manufacture, or composition of matter, or any new and useful improvement thereof, may obtain a patent therefor, subject to the conditions and requirements of this title.


Claim 29 is rejected under 35 U.S.C. 101 because the claimed invention is directed to non-statutory subject matter. The claim does not fall within at least one of the four categories of patent eligible subject matter because the storage media can be interpreted to be a transitory signal as described in p25 lines 3-15 of the specification.
Claim Rejections - 35 USC § 102
In the event the determination of the status of the application as subject to AIA  35 U.S.C. 102 and 103 (or as subject to pre-AIA  35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis (i.e., changing from AIA  to pre-AIA ) for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.  
The following is a quotation of the appropriate paragraphs of 35 U.S.C. 102 that form the basis for the rejections under this section made in this Office action:
A person shall be entitled to a patent unless –

(a)(1) the claimed invention was patented, described in a printed publication, or in public use, on sale, or otherwise available to the public before the effective filing date of the claimed invention.


Claim(s) 1-8, 10-24 and 26-29 is/are rejected under 35 U.S.C. 102(a)(1) as being anticipated by Guhur et al. (Instruction-driven history-aware policies for robotic manipulations).

Regarding claims 1, 28 and 29, Guhur discloses: 
A method performed by one or more computers and for controlling an agent interacting with an environment, the method comprising:
receiving a natural language text sequence that characterizes a task to be performed by the agent in the environment (p3 §3 natural language instruction, Fig 2 text input);
generating an encoded representation of the natural language text sequence (Fig 2 encoder language from NL input); and
at each of a plurality of time steps (Fig 2 step 1-step t):
obtaining an observation image characterizing a state of the environment at the time step (p3 §3 visual observations, Fig 2 RGBD input images);
processing the observation image using an image encoder neural network that is conditioned on the encoded representation of the natural language text sequence to generate an encoded representation of the observation image (Fig 2 encoder UNet encodes images, §4 feature encoding module generates token embeddings for instructions, visual observations and previous actions);
generating, from at least the encoded representation of the observation image, a sequence of input tokens (Fig 2 encoder UNet encodes images, §4 feature encoding module generates token embeddings for instructions, visual observations and previous actions);
processing the sequence of input tokens using a Transformer neural network to generate a policy output that defines an action to be performed by the agent in response to the observation image (Fig 2 Transformer uses output of encoding module, §4 “Then, the multimodal transformer (Sec. 4.2) learns relationships between the instruction, current multi-camera observations and history.”);
selecting an action to be performed by the agent using the policy output (§4 “Finally, the action prediction module (Sec. 4.3) utilizes a convolutional network (CNN) to predict the next rotation q_{t+1} and gripper state c_{t+1}, and adopts a UNet decoder [20] to predict the next position p_{t+1}.”); and
causing the agent to perform the selected action (§5.4 Real-Robot experiments, the robot is able to predict and execute a button press).

Regarding claim 2, Guhur discloses: The method of claim 1, wherein the environment is a real-world environment and the agent is a robot (§5.4 Real-Robot experiments).

Regarding claim 3, Guhur discloses: The method of claim 1, wherein generating, from at least the encoded representation of the observation image, a sequence of input tokens comprises:
generating, from the encoded representation of the observation image, a sequence of image tokens for the observation image (§4.1 ¶1 “We encode the instruction, visual observations, and actions as a sequence of tokens.”).

Regarding claim 4, Guhur discloses: The method of claim 3, wherein generating, from at least the encoded representation of the observation image, a sequence of input tokens comprises:
generating the sequence of input tokens by combining the sequence of image tokens for the observation image with respective sequences of image tokens for one or more earlier observation images received at one or more earlier time steps (§4.2 the tokens are conditioned on the text, images and history of actions/observations).

Regarding claim 5, Guhur discloses: The method of claim 4, wherein combining the sequence of image tokens for the observation image with respective sequences of image tokens for each of one or more earlier observation images received at one or more earlier time steps comprises applying positional encodings to each image token in the sequence of image tokens for the observation image and the respective sequences of image tokens for the one or more earlier observation images (§4.3 positional encoding is used for each image in the sequence of images over a history of time steps to predict the next position, see also Fig 2).
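To make the limitation in claims 4 and 5 concrete, the sketch below concatenates image tokens from earlier time steps with those of the current step and adds a positional encoding to each token. The sinusoidal encoding is an assumption chosen for illustration; neither the claims nor the cited passages specify which encoding is used, and all names are hypothetical.

```python
import math

def sinusoidal_encoding(position, dim):
    """Standard sinusoidal encoding: sin on even indices, cos on odd indices."""
    enc = []
    for i in range(dim):
        angle = position / (10000 ** ((2 * (i // 2)) / dim))
        enc.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
    return enc

def combine_with_history(history_tokens, current_tokens):
    """Concatenate earlier and current image tokens, then add a positional encoding to each."""
    sequence = [tok for frame in history_tokens for tok in frame] + list(current_tokens)
    dim = len(sequence[0])
    return [
        [x + p for x, p in zip(tok, sinusoidal_encoding(pos, dim))]
        for pos, tok in enumerate(sequence)
    ]

# Two earlier frames and one current frame, two 4-dim tokens each (toy values).
history = [[[0.0] * 4, [0.0] * 4], [[0.0] * 4, [0.0] * 4]]
current = [[0.0] * 4, [0.0] * 4]
tokens = combine_with_history(history, current)
print(len(tokens))   # 6
print(tokens[0])     # [0.0, 1.0, 0.0, 1.0]
```

With all-zero toy tokens the output is just the encoding itself, which makes it easy to see that each position in the combined sequence receives a distinct offset.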

Regarding claim 6, Guhur discloses: The method of claim 3, wherein the encoded representation comprises a feature map that includes a respective feature vector for each of a plurality of regions in the observation image, and wherein generating, from the encoded representation of the observation image, a sequence of image tokens for the observation image (§5.2 visual tokens are different channels in a feature map, previous steps are used (a sequence)) comprises:
generating an initial input sequence by flattening the feature map into a sequence of feature vectors (p14 Appendix A, the CNN uses a flattened vector of feature maps).

Regarding claim 7, Guhur discloses: The method of claim 6, wherein generating, from the encoded representation of the observation image, a sequence of image tokens for the observation image comprises:
processing the initial input sequence of feature vectors using a learned module that maps the initial input sequence to a reduced input sequence that includes a smaller number of feature tokens (p14 §Appendix A, a feature map reduces an image size).
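The two steps recited in claims 6 and 7 can be sketched as follows: a feature map is flattened into a sequence of feature vectors, and a module with learned weights maps that sequence to a smaller number of tokens. The attention-style pooling used for the reduction is an assumption for illustration only; the claims do not say how the learned module works, and the weights here are hand-set stand-ins for learned parameters.

```python
import math

def flatten_feature_map(feature_map):
    """Turn an H x W grid of feature vectors into a sequence of H*W vectors (row-major)."""
    return [vec for row in feature_map for vec in row]

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def reduce_tokens(sequence, weights):
    """Map N feature vectors to K tokens; each token is a softmax-weighted average of the inputs.

    `weights` is a K x C matrix standing in for learned parameters (hypothetical).
    """
    dim = len(sequence[0])
    reduced = []
    for w in weights:
        scores = [sum(wi * xi for wi, xi in zip(w, vec)) for vec in sequence]
        attn = softmax(scores)
        reduced.append([sum(a * vec[d] for a, vec in zip(attn, sequence)) for d in range(dim)])
    return reduced

# Toy 2 x 2 feature map with 3-dim feature vectors.
feature_map = [
    [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]],
    [[0.0, 0.0, 1.0], [1.0, 1.0, 1.0]],
]
sequence = flatten_feature_map(feature_map)
tokens = reduce_tokens(sequence, weights=[[0.0, 0.0, 0.0], [1.0, 0.0, 0.0]])
print(len(sequence), len(tokens))  # 4 2
print(tokens[0])                   # [0.5, 0.5, 0.5]
```

The first weight row is all zeros, so its token is a plain average of the four inputs; a non-zero row shifts the average toward the positions it scores highly.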

Regarding claim 8, Guhur discloses: The method of claim 1, wherein the image encoder neural network comprises one or more conditioning layers that are each configured to receive a respective intermediate output of a respective intermediate layer of the image encoder neural network and the encoded representation of the natural language instruction and to (i) update the respective intermediate output of the image encoder neural network using the encoded representation of the natural language instruction and (ii) provide the updated respective intermediate output as input to a respective subsequent intermediate layer of the image encoder neural network (§4.1 images are encoded with a CNN having multiple layers, an image is encoded in a feature map, then concatenated, then pooled).

Regarding claim 10, Guhur discloses: The method of claim 8, wherein the image encoder neural network is a convolutional neural network and the respective intermediate layer, the respective subsequent layer, or both are convolutional layers (§4.1 images are encoded with a CNN having multiple layers, an image is encoded in a feature map, then concatenated, then pooled).

Regarding claim 11, Guhur discloses: The method of claim 1, wherein the Transformer is a decoder-only Transformer (Fig 2 transformer, p14 Appendix A UNet decoder is used for the transformer step).

Regarding claim 12, Guhur discloses: The method of claim 1, wherein the policy output comprises, for each of a plurality of action dimensions, a respective categorical distribution over possible values for the action dimension (p14 Appendix B, tasks are split into categories which can be selected).

Regarding claim 13, Guhur discloses: The method of claim 12, wherein selecting an action to be performed by the agent using the policy output comprises selecting a respective value for one or more of the action dimensions using the respective categorical distributions (p14 Appendix B, tasks are split into categories which can be selected).
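For readers weighing the mapping of claims 12 and 13, the sketch below shows one reading of the claim language: the policy output holds a separate categorical distribution for each action dimension, and an action is selected by choosing a value per dimension. The dimension names, bin counts, logits, and the greedy (argmax) selection rule are all assumptions for illustration.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def select_action(policy_logits):
    """Pick, for each action dimension, the bin with the highest probability."""
    action = {}
    for dim_name, logits in policy_logits.items():
        probs = softmax(logits)   # categorical distribution for this dimension
        action[dim_name] = max(range(len(probs)), key=probs.__getitem__)
    return action

# Hypothetical policy output: three action dimensions, four discrete bins each.
policy_logits = {
    "x": [0.1, 2.0, 0.3, -1.0],
    "y": [1.5, 0.2, 0.2, 0.1],
    "gripper": [-0.5, -0.5, 0.0, 3.0],
}
print(select_action(policy_logits))  # {'x': 1, 'y': 0, 'gripper': 3}
```

Sampling from each distribution instead of taking the argmax would also fit the claim wording; the greedy choice is used here only to keep the output deterministic.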

Regarding claim 14, Guhur discloses: The method of claim 1, wherein the image encoder neural network and the Transformer neural network have been trained jointly on a set of training data (Fig 2 Encoder and transformer are used jointly, §3 ¶1 training a policy jointly uses the encoder and transformer, §5.4 Real robot experiments uses simulation data, real robot examples in a fine-tuning phase).

Regarding claim 15, Guhur discloses: The method of claim 14, wherein, prior to the joint training, the image encoder neural network has been pre-trained on an image classification task (§5.4 ¶Results pretraining of the encoder is utilized).

Regarding claim 16, Guhur discloses: The method of claim 15, when dependent on claim 8, wherein the image encoder neural network does not include the one or more conditioning layers for the pre-training (p14 Appendix A, the image encoder is a pretrained CNN which would not have any conditional information from the current task).

Regarding claim 17, Guhur discloses: The method of claim 8, wherein each conditioning layer is, prior to the joint training, initialized to act as an identity transformation to the corresponding respective intermediate output (p4 §4.2 a contextualized representation is conditioned on the encoded instruction and history to learn relationships).

Regarding claim 18, Guhur discloses: The method of claim 7, wherein the learned module has also been trained as part of the joint training (Fig 2 Encoder and transformer are used jointly, §3 ¶1 training a policy jointly uses the encoder and transformer).

Regarding claim 19, Guhur discloses: The method of claim 14, wherein the joint training comprises training through imitation learning and the training data comprises expert interaction data characterizing interactions of one or more expert agents with a corresponding environment (§5.4 Real robot experiments uses simulation data, real robot examples in a fine-tuning phase).

Regarding claim 20, Guhur discloses: The method of claim 19, wherein the expert interaction data includes simulation data (§5.4 Real robot experiments uses simulation data, real robot examples in a fine-tuning phase).

Regarding claim 21, Guhur discloses: The method of claim 19, wherein the expert interaction data includes real-world data (§5.4 Real robot experiments uses simulation data, real robot examples in a fine-tuning phase).

Regarding claim 22, Guhur discloses: The method of claim 20, wherein the expert interaction data includes both simulation data and real-world data (§5.4 Real robot experiments uses simulation data, real robot examples in a fine-tuning phase).

Regarding claim 23, Guhur discloses: The method of claim 1, wherein generating an encoded representation of the natural language text sequence comprises processing the natural language text sequence using a text encoder neural network to generate an embedding of the natural language text sequence (Fig 2 encoder language from NL input).

Regarding claim 24, Guhur discloses: The method of claim 22, wherein the text encoder neural network is pre-trained on a text representation learning task (p4 §4.1 ¶1 “We employ a pre-trained language encoder to tokenize and encode the sentence instruction.”).

Regarding claim 26, Guhur discloses: The method of claim 14, wherein the text encoder neural network is held frozen during the joint training (§4.1 the text encoder is pretrained).

Regarding claim 27, Guhur discloses: The method of claim 1, wherein the agent is a robot and the one or more computers are on-board the robot (p8 §5.4 real-robot experiments).

Claim Rejections - 35 USC § 103
In the event the determination of the status of the application as subject to AIA  35 U.S.C. 102 and 103 (or as subject to pre-AIA  35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis (i.e., changing from AIA  to pre-AIA ) for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.  
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

The factual inquiries for establishing a background for determining obviousness under 35 U.S.C. 103 are summarized as follows:
1. Determining the scope and contents of the prior art.
2. Ascertaining the differences between the prior art and the claims at issue.
3. Resolving the level of ordinary skill in the pertinent art.
4. Considering objective evidence present in the application indicating obviousness or nonobviousness.
This application currently names joint inventors. In considering patentability of the claims the examiner presumes that the subject matter of the various claims was commonly owned as of the effective filing date of the claimed invention(s) absent any evidence to the contrary.  Applicant is advised of the obligation under 37 CFR 1.56 to point out the inventor and effective filing dates of each claim that was not commonly owned as of the effective filing date of the later invention in order for the examiner to consider the applicability of 35 U.S.C. 102(b)(2)(C) for any potential 35 U.S.C. 102(a)(2) prior art against the later invention.
Claim 9 is rejected under 35 U.S.C. 103 as being unpatentable over Guhur in view of Birnbaum et al. (Temporal FiLM: Capturing Long-Range Sequence Dependencies with Feature-Wise Modulation).

Regarding claim 9, Guhur discloses conditioning layers but does not explicitly disclose feature-wise linear modulation layers.

Birnbaum teaches: feature-wise Linear Modulation (FiLM) layers (“We observe that TFiLM significantly improves the performance of deep neural networks on a wide range of discriminative classification tasks as well as on complex high-dimensional time series super-resolution problems. Interestingly, our model is domain-agnostic, yet outperforms more specialized approaches that use domain-specific features” p2 ¶3).

Guhur and Birnbaum are in the same field of endeavor of learning models and are analogous. Guhur discloses a system with text and image encoders combined with a transformer for robot control. Birnbaum teaches FiLM layers for use in deep learning and CNNs. It would have been obvious to one of ordinary skill in the art before the effective filing date to modify the known structure including a CNN encoder as disclosed by Guhur to utilize the known FiLM layer as taught by Birnbaum to yield the predictable results of improved performance of deep neural networks. 
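For context on what a feature-wise linear modulation layer does, the sketch below scales and shifts each feature channel using values projected from a text embedding. It is a generic illustration, not code from Guhur, Birnbaum, or the application. The `(1 + gamma)` form and the zero-initialized projections are assumptions, chosen to show one way such a layer can start out as an identity transformation, as claim 17 recites; all names are hypothetical.

```python
def film(features, text_embedding, w_gamma, w_beta):
    """Feature-wise linear modulation: out[c] = (1 + gamma[c]) * x[c] + beta[c].

    gamma and beta are linear projections of the text embedding; with both
    projection matrices at zero the layer returns its input unchanged.
    """
    gamma = [sum(w * t for w, t in zip(row, text_embedding)) for row in w_gamma]
    beta = [sum(w * t for w, t in zip(row, text_embedding)) for row in w_beta]
    return [
        [(1.0 + g) * x + b for x, g, b in zip(position, gamma, beta)]
        for position in features
    ]

channels, text_dim = 3, 2
zeros = [[0.0] * text_dim for _ in range(channels)]

features = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]   # two spatial positions, three channels
text_embedding = [0.5, -1.0]

print(film(features, text_embedding, zeros, zeros) == features)  # True

w_gamma = [[1.0, 0.0], [0.0, 0.0], [0.0, 0.0]]   # scales channel 0 only
print(film(features, text_embedding, w_gamma, zeros)[0])  # [1.5, 2.0, 3.0]
```

The same scale and shift are applied at every spatial position, which is what distinguishes feature-wise modulation from the concatenation and pooling cited against claim 8.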

Claim 25 is rejected under 35 U.S.C. 103 as being unpatentable over Guhur in view of Toledo (Transfer Learning with Joint Fine-Tuning for Multimodal Sentiment Analysis).

Regarding claim 25, Guhur discloses a text encoder but does not explicitly disclose: wherein the text encoder neural network is fine-tuned during the joint training.

Toledo teaches: wherein the text encoder neural network is fine-tuned during the joint training (p2 §3 Joint fine tuning of pretrained text and image encoders is used, Fig 1).

Guhur and Toledo are in the same field of endeavor of learning models and are analogous. Guhur discloses a system with text and image encoders combined with a transformer for robot control. Toledo teaches fine-tuning of encoders. It would have been obvious to one of ordinary skill in the art before the effective filing date to modify the known structure including text and image encoders as disclosed by Guhur to utilize the known fine tuning of joint text and image encoders as taught by Toledo to yield the predictable results of improved performance of the encoders. 

Conclusion
The prior art made of record and not relied upon is considered pertinent to applicant's disclosure. Bousmalis et al. (US 20200279134) discloses training robots using simulation and joint training. Liu et al. (INSTRUCTION-FOLLOWING AGENTS WITH JOINTLY PRE-TRAINED VISION-LANGUAGE MODELS) discloses a very similar structure but appears to be by the same applicant with different inventors.
Any inquiry concerning this communication or earlier communications from the examiner should be directed to ERIC NILSSON whose telephone number is (571)272-5246. The examiner can normally be reached M-F: 7-3.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, James Trujillo can be reached at (571)-272-3677. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.





/ERIC NILSSON/           Primary Examiner, Art Unit 2151

Prosecution Timeline

Dec 13, 2023

Application Filed

Mar 17, 2026

Non-Final Rejection mailed: §101, §102, §103, §112 (current)

Precedent Cases

Applications granted by this same examiner with similar technology

18/211,153

Patent 12626169

BAYESIAN CAUSAL RELATIONSHIP NETWORK MODELS FOR HEALTHCARE DIAGNOSIS AND TREATMENT BASED ON PATIENT DATA

2y 11m to grant Granted May 12, 2026

17/471,124

Patent 12619869

LEARNING APPARATUS, LEARNING METHOD, AND A NON-TRANSITORY COMPUTER-READABLE STORAGE MEDIUM

4y 7m to grant Granted May 05, 2026

17/792,580

Patent 12619925

CONTEXT-LEVEL FEDERATED LEARNING

3y 9m to grant Granted May 05, 2026

17/781,539

Patent 12608613

PARAMETER OPTIMIZATION DEVICE, PARAMETER OPTIMIZATION METHOD, AND PARAMETER OPTIMIZATION PROGRAM

3y 10m to grant Granted Apr 21, 2026

17/954,485

Patent 12607972

Method and Apparatus for Monitoring Machine Learning Models

3y 6m to grant Granted Apr 21, 2026

Based on 5 most recent grants.

Prosecution Projections

1-2

Expected OA Rounds

83%

Grant Probability

99%

With Interview (+17.7%)

3y 1m (~7m remaining)

Median Time to Grant

Low

PTA Risk

Based on 501 resolved cases by this examiner. Grant probability derived from career allowance rate.