Prosecution Insights
Last updated: April 19, 2026
Application No. 18/359,550

ROBOTIC REASONING THROUGH PLANNING WITH LANGUAGE MODELS

Non-Final OA — §101, §103
Filed: Jul 26, 2023
Examiner: HOQUE, SHAHEDA SHABNAM
Art Unit: 3658
Tech Center: 3600 — Transportation & Electronic Commerce
Assignee: Google LLC
OA Round: 3 (Non-Final)
Grant Probability: 43% (Moderate)
Expected OA Rounds: 3-4
Time to Grant: 3y 1m
Grant Probability with Interview: 81%

Examiner Intelligence

Career Allow Rate: 43% (25 granted / 58 resolved), -8.9% vs TC avg
Interview Lift: +37.9% (strong), resolved cases with an interview vs. without
Typical Timeline: 3y 1m average prosecution, 38 applications currently pending
Career History: 96 total applications across all art units
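
The figures above are simple ratios. Below is a minimal sketch of how they appear to fit together, assuming the interview lift is the with-interview grant probability (81%, from the projections later in this report) minus the baseline career allow rate; the variable names are ours.

```python
# Sketch of the examiner-statistics arithmetic shown above.
# Assumption: "interview lift" = with-interview grant probability minus
# the baseline career allow rate. Figures are taken from this report.

granted, resolved = 25, 58                      # 25 granted / 58 resolved
career_allow_rate = granted / resolved          # ~0.431 -> shown as 43%

with_interview_rate = 0.81                      # 81% "With Interview"
interview_lift = with_interview_rate - career_allow_rate   # ~+0.379

# "-8.9% vs TC avg" implies a Tech Center average near 52%.
implied_tc_average = career_allow_rate + 0.089

print(f"Career allow rate:  {career_allow_rate:.1%}")
print(f"Interview lift:     {interview_lift:+.1%}")
print(f"Implied TC average: {implied_tc_average:.1%}")
```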

Statute-Specific Performance

§101: 10.5% (-29.5% vs TC avg)
§103: 61.8% (+21.8% vs TC avg)
§102: 16.9% (-23.1% vs TC avg)
§112: 10.2% (-29.8% vs TC avg)
Compared against Tech Center average estimates • Based on career data from 58 resolved cases
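
The per-statute deltas are all consistent with a single Tech Center average estimate of roughly 40%. The small sketch below reproduces the comparison; the 40% figure is back-solved from the printed deltas and is not stated anywhere in the report.

```python
# Sketch of the statute-specific comparison above. The ~40% Tech Center
# average is an inference from the deltas (e.g., 10.5% - 40.0% = -29.5%),
# not a number reported elsewhere in this document.

examiner_rates = {"§101": 0.105, "§103": 0.618, "§102": 0.169, "§112": 0.102}
tc_average_estimate = 0.40

for statute, rate in examiner_rates.items():
    delta = rate - tc_average_estimate
    print(f"{statute}: {rate:.1%} ({delta:+.1%} vs TC avg)")
```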

Office Action

§101, §103
DETAILED ACTION Notice of Pre-AIA or AIA Status The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA . Response to Arguments 35 USC § 101 rejection is maintained as the amendments are missing an active control step and incorporating it into the claim language will help overcome the rejection. Applicant's arguments filed on 12/17/2025 with respect to claim(s) 1-9, and 13-22 have been fully considered but they are not persuasive or moot in view of new ground of rejection provided below which was necessitated based on Applicant’s amendments to the claims. The new ground of rejection for independent claim is based on Paxton in view of Pramanick as disclosed below. Claim Rejections - 35 USC § 101 35 U.S.C. 101 reads as follows: Whoever invents or discovers any new and useful process, machine, manufacture, or composition of matter, or any new and useful improvement thereof, may obtain a patent therefor, subject to the conditions and requirements of this title. Claim 1 is rejected under 35 U.S.C. 101 because the claimed invention is directed to an abstract idea without significantly more. 101 Analysis – Step 1 Claim 1 is directed to a method (i.e., a process). Therefore, claim 1 is within at least one of the four statutory categories. 101 Analysis – Step 2A, Prong I Regarding Prong I of the Step 2A analysis in the 2019 PEG, the claims are to be analyzed to determine whether they recite subject matter that falls within one of the follow groups of abstract ideas: a) mathematical concepts, b) certain methods of organizing human activity, and/or c) mental processes. Independent claim 1 includes limitations that recite an abstract idea (emphasized below) and will be used as a representative claim for the remainder of the 101 rejection. Claim 1 recites: A method implemented by one or more processors, the method comprising: identifying an instruction for a robot to perform a task in an environment, the instruction being a free-form natural language instruction; determining, based on processing sensor data from one or more sensors of the robot, textual feedback that describes a current state of the environment of the robot; processing the instruction and the textual feedback using a large language model (LLM) to generate first LLM output that is dependent on the instruction; wherein the first LLM output includes an unstructured and LLM-generated textual question posed about the environment; identifying an unstructured textual answer to the question, wherein the unstructured textual answer to the LLM-generated question is generated based on: a response provided by a human operator, or processing the question using another model different from the LLM to generate the unstructured textual answer; processing the instruction and the unstructured textual answer to the question using the LLM to generate second LLM output that is dependent on the instruction and that indicates one or more sub-tasks for performing the task; identifying a robotic skill that is performable by the robot, and a textual skill description of the robotic skill; determining, based on comparing the second LLM output to the skill description, to implement the robotic skill; and in response to determining to implement the robotic skill: causing the robot to implement the robotic skill in the environment. 
The examiner submits that the foregoing bolded limitation(s) constitute mental processes because under its broadest reasonable interpretation, the claim covers performance of the limitation in the human mind. For example, “identifying a robotic skill…”, and “determining, based on comparing the LLM…” in the context of this claim encompasses performing judgment which can be used to control the robot. Accordingly, the claim recites at least one abstract idea. 101 Analysis – Step 2A, Prong II Regarding Prong II of the Step 2A analysis in the 2019 PEG, the claims are to be analyzed to determine whether the claim, as a whole, integrates the abstract idea into a practical application. As noted in the 2019 PEG, it must be determined whether any additional elements in the claim beyond the abstract idea integrate the exception into a practical application in a manner that imposes a meaningful limit on the judicial exception. The courts have indicated that additional elements merely using a computer to implement an abstract idea, adding insignificant extra solution activity, or generally linking use of a judicial exception to a particular technological environment or field of use do not integrate a judicial exception into a “practical application.” In the present case, the additional limitations beyond the above-noted abstract idea are as follows (where the underlined portions are the “additional limitations” while the bolded portions continue to represent the “abstract idea”): A method implemented by one or more processors, the method comprising: identifying an instruction for a robot to perform a task in an environment, the instruction being a free-form natural language instruction; determining, based on processing sensor data from one or more sensors of the robot, textual feedback that describes a current state of the environment of the robot; processing the instruction and the textual feedback using a large language model (LLM) to generate first LLM output that is dependent on the instruction; wherein the first LLM output includes an unstructured and LLM-generated textual question posed about the environment; identifying an unstructured textual answer to the question, wherein the unstructured textual answer to the LLM-generated question is generated based on: a response provided by a human operator, or processing the question using another model different from the LLM to generate the unstructured textual answer; identifying a robotic skill that is performable by the robot, and a textual skill description of the robotic skill; determining, based on comparing the second LLM output to the skill description, to implement the robotic skill; and in response to determining to implement the robotic skill: causing the robot to implement the robotic skill in the environment. Regarding the additional limitations of “identifying an instruction for a robot …”, “determining, based on processing sensor data from one or more sensors of the robot...”, “processing the instruction and the textual feedback …”, “identifying an unstructured textual answer to the question…”, “in response to determining to implement the robotic skill …”, the examiner submits that these limitations are insignificant extra-solution activities which does not integrate the abstract idea into practical application. The “causing the robot to implement the robotic skill …” merely describes how to generally “apply” the otherwise mental processes in a generic or general purpose robot control environment. 
The robot control system is recited at a high level of generality. Thus, taken alone, the additional elements do not integrate the abstract idea into a practical application. Further, looking at the additional limitation(s) as an ordered combination or as a whole, the limitation(s) add nothing that is not already present when looking at the elements taken individually. For instance, there is no indication that the additional elements, when considered as a whole, reflect an improvement in the functioning of a computer or an improvement to another technology or technical field, apply or use the above-noted judicial exception to effect a particular treatment or prophylaxis for a disease or medical condition, implement/use the above-noted judicial exception with a particular machine or manufacture that is integral to the claim, effect a transformation or reduction of a particular article to a different state or thing, or apply or use the judicial exception in some other meaningful way beyond generally linking the use of the judicial exception to a particular technological environment, such that the claim as a whole is not more than a drafting effort designed to monopolize the exception (MPEP § 2106.05). Accordingly, the additional limitation(s) do/does not integrate the abstract idea into a practical application because it does not impose any meaningful limits on practicing the abstract idea. 101 Analysis – Step 2B Regarding Step 2B of the 2019 PEG, representative independent claim 1 does not include additional elements (considered both individually and as an ordered combination) that are sufficient to amount to significantly more than the judicial exception for the same reasons to those discussed above with respect to determining that the claim does not integrate the abstract idea into a practical application. Further, a conclusion that an additional element is insignificant extra-solution activity in Step 2A should be re-evaluated in Step 2B to determine if they are more than what is well-understood, routine, conventional activity in the field. The additional limitations of “identifying an instruction for a robot …”, “determining, based on processing sensor data from one or more sensors of the robot...”, “processing the instruction and the textual feedback …”, “identifying an unstructured textual answer to the question…” , “in response to determining to implement the robotic skill …”, are well-understood, routine, and conventional activities and the specification does not provide any indication that controlling the robot is anything other than a conventional computer within a robot. MPEP 2106.05(d)(II), and the cases cited therein, including Intellectual Ventures I, LLC v. Symantec Corp., 838 F.3d 1307, 1321 (Fed. Cir. 2016), TLI Communications LLC v. AV Auto. LLC, 823 F.3d 607, 610 (Fed. Cir. 2016), and OIP Techs., Inc., v. Amazon.com, Inc., 788 F.3d 1359, 1363 (Fed. Cir. 2015), indicate that mere collection or receipt of data over a network is a well‐understood, routine, and conventional function when it is claimed in a merely generic manner. Dependent claim(s) 2-9, and 11-22 do not recite any further limitations that cause the claim(s) to be patent eligible. Rather, the limitations of dependent claims are directed toward additional aspects of the judicial exception and/or well-understood, routine and conventional additional elements that do not integrate the judicial exception into a practical application. 
Therefore, dependent claims 2-9, and 11-22 are not patent eligible under the same rationale as provided for in the rejection of independent claim 1. The examiner also submits that dependent claim 2 constitutes mental processes. For example, “identifying an additional robotic skill…”, and “determining, based on comparing the updated LLM…” in the context of this claim encompasses performing observation or judgment to obtain certain results to control the robot. Accordingly, the claim recites at least one abstract idea and are rejected under 101 with the same analysis provided above for claim 1. Claim Rejections - 35 USC § 103 The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action: A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made. The factual inquiries for establishing a background for determining obviousness under 35 U.S.C. 103 are summarized as follows: 1. Determining the scope and contents of the prior art. 2. Ascertaining the differences between the prior art and the claims at issue. 3. Resolving the level of ordinary skill in the pertinent art. 4. Considering objective evidence present in the application indicating obviousness or nonobviousness. Claim(s) 1, 2, 3, 7, 8, 9, 14, 18, 19, 20, and 21 are rejected under 35 U.S.C. 103 as being unpatentable over Paxton et al. (US 2023/0297074 A1) (Hereinafter Paxton) in view of Pramanick et al. (Enabling human-like task identification from natural conversation, Pradip Pramanick, Chayan Sarkar, Balamuralidhar P, Ajay Kattepur, Indrajit Bhattacharya, Arpan Pal) (Hereinafter Pramanick). Regarding Claim 1, Paxton teaches a method implemented by one or more processors, the method comprising: identifying an instruction for a robot to perform a task in an environment, the instruction being a free-form natural language instruction (See at least Speech Recognition 202, Para [0006] “FIG. 4 illustrates an interface for collecting human language data, according to at least one embodiment;”, Para [0027] “…In order for robots to be able to assist humans on a daily basis with respect to a wide variety of tasks, however, the robots must be able to interpret human language to determine the instruction, and determine at least one sequence of tasks that can be performed in order to execute that instruction.”); determining, based on processing sensor data from one or more sensors of the robot (See at least Para [0032] “In addition to the language input, camera input 206 may also be provided as input to this system 200. 
One or more cameras (or other such sensors or devices) can be positioned relative to an environment in order to be able to capture images or video of at least a portion or region of the environment, such as where a task is to be performed…”), textual feedback that describes a current state of the environment of the robot (See at least Para [0049] “Concurrently in this example, image data (as may include a sequence of images or frames of video data) can be captured 508 or otherwise obtained or received for an environment, as may be captured using one or more cameras or sensors. This image data can be analyzed 510 to extract and encode relevant image features into a latent image space. The inferred actions or goals, as well as the encoded image features, can then be analyzed 512, such as by using a predicate predictor, to generate a set of task symbols or predicates that may be used to represent at least a current set of conditions for the environment,…”, Para [0038]); processing the instruction and the textual feedback using a large language model (LLM) to generate first LLM output that is dependent on the instruction; … one or more sub-tasks for performing the task (See at least Para [0033] “A symbolic planner 212 can be a process, module, or algorithm that takes in a set of symbols or predicates, in a format that can be understood by the planner, and then generate a sequence of discrete tasks to be performed by the robot in order to complete the complex task or achieve the goal of the instruction in the original language input.”, Para [0036] “In this example system, the visual features and the textual data from the speech recognition module 252 can be fed into a natural language understanding module (NLU) 254. The NLU module in this example can attempt to correlate instructions with objects in the environment, but will not attempt to correlate specific types of learned objects with the spoken instruction as in the example of FIG. 2B. In this example, the NLU will attempt to use image understanding and visual information extracted from the various object segmentations for the scene. Thus, the NLU might not be trained to recognize specific types of objects, but may be trained to correlate textual information with objects in the scene that have certain characteristics or placements. This correlation can then be used to determine one or more goals or actions associated with a specified task, and correlate those goals or actions with specific objects in the environment.”, Fig. 5B item 564 Utilize visual planner to determine a sequence of discrete tasks to be performed); identifying a robotic skill that is performable by the robot (See at least Para [0034] “…those actions can be determined from a respective library or domain for that type of robot or automated device…”), and a textual skill description of the robotic skill (See at least Para [0048] “…The audio data can be analyzed 504, such as by using speech-to-text or speech recognition, to generate a textual representation of that spoken instruction. 
That textual representation can then be analyzed 506, such as by using natural language understanding, to infer actions or goals corresponding to the spoken instruction…”); … in response to determining to implement the robotic skill: causing the robot to implement the robotic skill in the environment (See at least Para [0020] “…This can involve use of a symbolic or visual planner in at least some embodiments, as well as a search of possible sequences of actions available for the robot or automated device. Executable instructions can then be generated for execution to cause the robot or automated device to perform the sequence of instructions…”). However, Paxton does not explicitly spell out … wherein the first LLM output includes an unstructured and LLM-generated textual question posed about the environment; identifying an unstructured textual answer to the question, wherein the unstructured textual answer to the LLM-generated question is generated based on: a response provided by a human operator, or processing the question using another model different from the LLM to generate the unstructured textual answer; processing the instruction and the unstructured textual answer to the question using the LLM to generate second LLM output that is dependent on the instruction and that indicate … determining, based on comparing the LLM output to the skill description, to implement the robotic skill; and … Pramanick teaches … wherein the first LLM output includes an unstructured and LLM-generated textual question posed about the environment (See at least Page 3 Col 1 Para 1 “1) Dialogue engine: It consists of a set of generic NLP toolset that extract the features from the utterance text received from the user and a question framer that formulate relevant questions (only if it is necessary) for the user to resolve ambiguity in task understanding. For example, if the user says, “Take the book from the table”, the NLP tools process it to provide the following output consisting of grammatical features…”, Page 5 Col 1 “The dialogue for the same example is shown below. H1: add some water to the bowl R1: Is this task similar to placing? H2: I didn’t understand R2: Do you want me to put some water in the bowl? H3: Yes R3: Got it”); identifying an unstructured textual answer to the question(See at least Fig 2, Page 5 Col 1 “The dialogue for the same example is shown below. H1: add some water to the bowl R1: Is this task similar to placing? H2: I didn’t understand R2: Do you want me to put some water in the bowl? H3: Yes R3: Got it”), wherein the unstructured textual answer to the LLM-generated question is generated based on: a response provided by a human operator (See at least Page 5 Col 1 “The dialogue for the same example is shown below. H1: add some water to the bowl R1: Is this task similar to placing? H2: I didn’t understand R2: Do you want me to put some water in the bowl? H3: Yes R3: Got it”), or processing the question using another model different from the LLM to generate the unstructured textual answer; processing the instruction and the unstructured textual answer to the question using the LLM to generate second LLM output that is dependent on the instruction (See at least Page 5 Col 1 “The dialogue for the same example is shown below. H1: add some water to the bowl R1: Is this task similar to placing? H2: I didn’t understand R2: Do you want me to put some water in the bowl? 
H3: Yes R3: Got it”) and that indicate … determining, based on comparing the LLM output to the skill description, to implement the robotic skill (See at least Page 5 Col 1 Para 4 “A task given as a natural language instruction expresses a specific goal to be fulfilled. Also, the task can be assumed to be given at a hypothetical world state, which is the initial state for the task. We store the templates of the initial and the goal states for each of task types in the KB. Table I shows some examples of such templates. From such a template, we generate the grounded initial and the goal states of the planning problem by first grounding the variables of the fluents using the arguments extracted from the instruction and then by validating using the world model. We create the templates from the definition of the frame that models the task, also considering the predicates of the PDDL domain. The effort for template creation is manageable because templates are created once and there is a small number of task types to be considered.”); and … Therefore, it would have been obvious to one of the ordinary skill in the art before the effective filing date of the claimed invention to combine the teachings of Paxton with the teachings of Pramanick and include the feature of an unstructured textual question about the environment, identifying an unstructured textual answer to the question and processing the instruction and the unstructured textual answer to the question using the LLM to generate second LLM output that is dependent on the instruction and determining, based on comparing the LLM output to the skill description, to implement the robotic skill, thereby providing realistic environment with the option of having a conversation with the robot in order to mimic actual human behavior and provide higher robot operability (See at least Page 6 Col 2 Para 2 “By adding the dialogue module to get the missing arguments (TCAR-IdP0), the performance improves by a high degree, as shown in Table V.”). Regarding Claim 2, Paxton teaches all the elements of claim 1. Paxton further teaches the method of claim 1, subsequent to causing the robot to implement the robotic skill in the environment and further comprising: determining, based on processing updated sensor data from the one or more sensors of the robot, updated textual feedback that describes an updated state of the environment of the robot (See at least Para [0034] “…The one or more cameras can capture image or video data of this performance, as well as other monitoring or feedback data, which can be used to analyze the performance of those tasks, as well as provided as camera input to the encoder 208 in order to update the current state of the environment…”, Para [0049] “Concurrently in this example, image data (as may include a sequence of images or frames of video data) can be captured 508 or otherwise obtained or received for an environment, as may be captured using one or more cameras or sensors. This image data can be analyzed 510 to extract and encode relevant image features into a latent image space. 
The inferred actions or goals, as well as the encoded image features, can then be analyzed 512, such as by using a predicate predictor, to generate a set of task symbols or predicates that may be used to represent at least a current set of conditions for the environment,…”, Para [0038]), Para [0033] “A symbolic planner 212 can be a process, module, or algorithm that takes in a set of symbols or predicates, in a format that can be understood by the planner, and then generate a sequence of discrete tasks to be performed by the robot in order to complete the complex task or achieve the goal of the instruction in the original language input.”); processing the instruction and the updated textual feedback using the LLM to generate updated LLM output (See at least Para [0114] “In at least one embodiment, training pipeline 1404 (FIG. 14) may include a scenario where facility 1302 is training their own machine learning model, or has an existing machine learning model that needs to be optimized or updated…”, Para [0115] “…In at least one embodiment, a machine learning model may then be retrained, or updated, at any number of other facilities, and a retrained or updated model may be made available in model registry 1324…”); identifying an additional robotic skill that is performable by the robot (See at least Para [0052] “…This can involve capturing additional image or sensor data to determine progress of the sequence, as well as any changes to the state of the environment that may require or warrant modification of at least a portion of this sequence. Once the task is completed, results can be provided 572 and/or data logged regarding performance of the instructions…”), and an additional textual skill description of the additional robotic skill (See at least Para [0048] “…The audio data can be analyzed 504, such as by using speech-to-text or speech recognition, to generate a textual representation of that spoken instruction. That textual representation can then be analyzed 506, such as by using natural language understanding, to infer actions or goals corresponding to the spoken instruction…”); determining, based on comparing the updated LLM output and the additional textual skill description, to implement the additional robotic skill; and in response to determining to implement the additional robotic skill: causing the robot to implement the additional robotic skill in the environment (See at least Para [0020] “…This can involve use of a symbolic or visual planner in at least some embodiments, as well as a search of possible sequences of actions available for the robot or automated device. Executable instructions can then be generated for execution to cause the robot or automated device to perform the sequence of instructions…”). 
However, Paxton does not explicitly spell out … determining, based on comparing the updated LLM output and the additional textual skill description, to implement the additional robotic skill; and … Gu teaches … determining, based on comparing the updated LLM output and the additional textual skill description, to implement the additional robotic skill (See at least Abstract “…based on the input data, a pre-training language model is adopted to obtain a corresponding prediction result list, and a reasoning result language description list is extracted; a robot skill list is obtained, and the reasoning result language description list and the robot skill list are compared and analyzed; taking a robot skill with the maximum similarity in the robot skill list as a current robot reasoning decision result…”, Page 2 Para 5 “Further, it is determined that a robot skill with the greatest similarity in the robot skill list is specifically:”); and … Therefore, it would have been obvious to one of the ordinary skill in the art before the effective filing date of the claimed invention to combine the teachings of Paxton with the teachings of Gu and include the feature of determining, based on comparing the updated LLM output and the additional textual skill description, to implement the robotic skill, thereby providing higher robotic operability (See at least Abstract “…the method has the advantages of higher operability, convenience, easiness in use and the like”). Regarding Claim 3, modified Paxton teaches all elements of claim 1. However, Paxton does not explicitly spell out the method of claim 1, wherein the textual feedback includes task specific feedback. Pramanick teaches the method of claim 1, wherein the textual feedback includes task specific feedback (See at least Page 5 Col 1 “The dialogue for the same example is shown below. H1: add some water to the bowl R1: Is this task similar to placing? H2: I didn’t understand R2: Do you want me to put some water in the bowl? H3: Yes R3: Got it”). Therefore, it would have been obvious to one of the ordinary skill in the art before the effective filing date of the claimed invention to combine Paxton with the teachings of Pramanick and include the feature of textual task specific feedback, thereby provide easy, useful, and effective robot-human interaction with improved performance (See at least By adding the dialogue module to get the missing arguments (TCAR-IdP0), the performance improves by a high degree, as shown in Table V.). Regarding Claim 7, modified Paxton teaches all the elements of claim 1. Paxton further teaches method of claim 1, wherein the textual feedback includes passive scene description feedback (See at least Para [0033] “The encoded image features, representative of a current state of the environment, and the task data from the NLU 204 can be provided to a predicate goal predictor 210 or other such module. The predicate predictor can include a neural network, such as a long short-term memory (LSTM) network, trained to infer one or more predicates for a task with respect to an environment. An LSTM is a type of recurrent neural network (RNN) that can learn order dependence in prediction problems, and in this case can be used to predict a set of symbols in a language that can be understood by a symbolic planner 212. In this context, a predicate can refer to a condition or statement about the environment. 
This can include a condition of the environment at the present time, as well as a desired condition of the environment after performance of the task or once the goal is achieved… In this example, the predicate predictor can generate predictions using one or more symbols or tokens specifying that there is a white block, that is an object, that is on the left side of the table, but in a machine language that can be understood by the planner. These symbols can then be passed to a symbolic planner 212 or other such module or process. The symbols in at least one embodiment are passed as a string of text, including a sequence of textual symbols. A symbolic planner 212 can be a process, module, or algorithm that takes in a set of symbols or predicates, in a format that can be understood by the planner, and then generate a sequence of discrete tasks to be performed by the robot in order to complete the complex task or achieve the goal of the instruction in the original language input.”). Regarding Claim 8, modified Paxton teaches all the elements of claim 7. Paxton further teaches the method of claim 7, wherein the passive scene description feedback includes an indication of one or more objects detected in the environment (See at least Para [0028] “In a situation such as that illustrated in FIGS. 1 A through 1 D, image data (or video or map data, etc.) can be captured of an environment, in addition to audio data that may contain spoken instructions. This image data can be analyzed to identify objects in an environment…”). Regarding Claim 9, modified Paxton teaches all the elements of claim 8. Paxton further teaches the method of claim 8, wherein the sensor data from the one or more sensors of the robot includes one or more instances of vision data from one or more vision sensors of the robot (See at least Para [0032] “In addition to the language input, camera input 206 may also be provided as input to this system 200…”, Para [0035] “…These segmentations can take various forms, as may include one or more object masks specific to one or more objects determined from that environment. This segmentation information can be passed to a visual feature extractor 260, which can extract features representing the spatial position and characteristics of a respective object…”), and wherein determining the passive scene description feedback comprises: processing the one or more instances of vision data using an object detection model to generate the indication of the one or more objects detected in the environment (See at least Para [0028] “In a situation such as that illustrated in FIGS. 1 A through 1 D, image data (or video or map data, etc.) can be captured of an environment, in addition to audio data that may contain spoken instructions. This image data can be analyzed to identify objects in an environment…”, Para [0051] “…This image data can be analyzed 510 to generate one or more segmentation masks for the environment, where each detected object might have a respective segmentation generated. Visual features can then be extracted 562 from these various segmentation masks, which can be provided to assist in analyzing the textual representation of the instruction to infer the various goals or actions.”). Regarding Claim 14, modified Paxton teaches all the elements of claim 1. 
Paxton further teaches the method of claim 1, wherein the textual feedback includes task specific feedback (See at least Col 10 Lines 40-52 “The observation system 235 is graphically depicted as a camera, however other systems suitable for recording the desired observations can be used in other implementations. Recorded observations can include audio signals, still images, video images sequences, electromagnetic tracking data, and textual information, depending upon the nature of a particular task. Recorded observations can additionally or alternatively include data from sensors on the robot or the target object of the task, for example data from strain gauges, torque sensors (e.g., back EMF sensors), inertial sensors (e.g., gyroscopes, accelerometers), optical sensors, radio frequency sensors, magnetic wave detectors, haptic sensors, air pressure sensors, and piezoelectric sensors.”) and passive scene description feedback (See at least Para [0028] “In a situation such as that illustrated in FIGS. 1 A through 1 D, image data (or video or map data, etc.) can be captured of an environment, in addition to audio data that may contain spoken instructions. This image data can be analyzed to identify objects in an environment…”, Para [0033]). Regarding Claim 18, Paxton teaches all the elements of claim 1. Paxton further teaches the method of claim 1, wherein the task is a long-horizon task, and wherein the long- horizon task cannot be implemented, by the robot, in a single robotic skill (See at least Para [0030] “…Such a system can combine symbolic planning, for example, with natural language understanding in order to accomplish long-horizon tasks.”, discloses accomplishing long- horizon task which cannot be implemented, by the robot, in a single robotic skill). Regarding Claim 19, Paxton teaches all the elements of claim 1. Paxton further teaches the method of claim 1, wherein the environment is a simulation (See at least Para [0030] “…A system in accordance with at least one embodiment can move from language and vision inputs to a set of actions, in order to perform a set or sequence of tasks, partially or wholly in simulation as well as in a real environment…”). Regarding Claim 20, Paxton teaches all the elements of claim 1. Paxton further teaches the method of claim 1, wherein the environment is a real world environment (See at least Para [0030] “…A system in accordance with at least one embodiment can move from language and vision inputs to a set of actions, in order to perform a set or sequence of tasks, partially or wholly in simulation as well as in a real environment…”). Regarding Claim 21, Paxton teaches all the elements of claim 1. Paxton further teaches the method of claim 1, wherein the task is a manipulation task (See at least Para [0048] “FIG. 5A illustrates an example process 500 for determining a sequence of discrete tasks for performance of a complex task or long-term, high level manipulation goal, which can be performed in accordance with various embodiments…”). Claim(s) 4, 5, and 6 are rejected under 35 U.S.C. 103 as being unpatentable over Paxton et al. (US 2023/0297074 A1) (Hereinafter Paxton) in view of Pramanick et al. (Enabling human-like task identification from natural conversation, Pradip Pramanick, Chayan Sarkar, Balamuralidhar P, Ajay Kattepur, Indrajit Bhattacharya, Arpan Pal) (Hereinafter Pramanick), and further in view of Beckman et al. (US 11584008 B1) (Hereinafter Beckman). Regarding Claim 4, modified Paxton teaches all elements of claim 3. 
However, Paxton does not explicitly spell out the method of claim 3, wherein the task specific feedback includes an indication of whether the robot successfully implemented a previous robotic skill. Beckman teaches the method of claim 3, wherein the task specific feedback includes an indication of whether the robot successfully implemented a previous robotic skill (See at least Col 11 Lines 58-63 “…This performance is evaluated, either by a human or a programmatic module of the feedback engine 210, to indicate its level of success. Block 302 can involve acquiring recorded observations of the robotic task performance from the observation system 235 for automated evaluation…”). Therefore, it would have been obvious to one of the ordinary skill in the art before the effective filing date of the claimed invention to combine the teachings of Paxton with the teachings of Beckman and include the feature of task specific feedback includes an indication of whether the robot successfully implemented a previous robotic skill, thereby provide easy, useful, and effective robot-human interaction (See at least Col 1 Lines 20-21 “…enabling artificial intelligence systems to learn complex tasks or adapt to changing environments…”). Regarding Claim 5, modified Paxton teaches all the elements of claim 4. Paxton further teaches the method of claim 4, wherein the sensor data from the one or more sensors of the robot includes one or more instances of vision data from one or more vision sensors of the robot (See at least Para [0032] “In addition to the language input, camera input 206 may also be provided as input to this system 200…”, Para [0035] “…These segmentations can take various forms, as may include one or more object masks specific to one or more objects determined from that environment. This segmentation information can be passed to a visual feature extractor 260, which can extract features representing the spatial position and characteristics of a respective object…”), and wherein determining the task specific feedback comprises: However, Paxton does not explicitly spell out … processing the one or more instances of vision data using a success detection model to generate the indication of whether the robot successfully implemented the previous robotic skill. Beckman teaches … processing the one or more instances of vision data using a success detection model to generate the indication of whether the robot successfully implemented the previous robotic skill (See at least Col 11 Lines 58-63 “…This performance is evaluated, either by a human or a programmatic module of the feedback engine 210, to indicate its level of success. Block 302 can involve acquiring recorded observations of the robotic task performance from the observation system 235 for automated evaluation…”, Col 10 Lines 40-52 “The observation system 235 is graphically depicted as a camera, however other systems suitable for recording the desired observations can be used in other implementations. Recorded observations can include audio signals, still images, video images sequences, electromagnetic tracking data, and textual information, depending upon the nature of a particular task. 
Recorded observations can additionally or alternatively include data from sensors on the robot or the target object of the task, for example data from strain gauges, torque sensors (e.g., back EMF sensors), inertial sensors (e.g., gyroscopes, accelerometers), optical sensors, radio frequency sensors, magnetic wave detectors, haptic sensors, air pressure sensors, and piezoelectric sensors.”). Therefore, it would have been obvious to one of the ordinary skill in the art before the effective filing date of the claimed invention to combine the teachings of Paxton with the teachings of Beckman and include the feature of processing the one or more instances of vision data using a success detection model to generate the indication of whether the robot successfully implemented the previous robotic skill, thereby provide easy, useful, and effective robot-human interaction (See at least Col 1 Lines 20-21 “…enabling artificial intelligence systems to learn complex tasks or adapt to changing environments…”). Regarding Claim 6, modified Paxton teaches all the elements of claim 4. However, Paxton does not explicitly spell out the method of claim 4, wherein the sensor data from the one or more sensors of the robot includes one or more instances of force sensor data from one or more force sensors of an end effector of the robot, and wherein determining the task specific feedback comprises: processing the one or more instances of force sensor data using a success detection model to generate the indication of whether the robot successfully implemented the previous robotic skill. Beckman teaches the method of claim 4, wherein the sensor data from the one or more sensors of the robot includes one or more instances of force sensor data from one or more force sensors of an end effector of the robot (See at least Col 10 Lines 40-52 “The observation system 235 is graphically depicted as a camera, however other systems suitable for recording the desired observations can be used in other implementations. Recorded observations can include audio signals, still images, video images sequences, electromagnetic tracking data, and textual information, depending upon the nature of a particular task. Recorded observations can additionally or alternatively include data from sensors on the robot or the target object of the task, for example data from strain gauges, torque sensors (e.g., back EMF sensors), inertial sensors (e.g., gyroscopes, accelerometers), optical sensors, radio frequency sensors, magnetic wave detectors, haptic sensors, air pressure sensors, and piezoelectric sensors.”), and wherein determining the task specific feedback comprises: processing the one or more instances of force sensor data using a success detection model to generate the indication of whether the robot successfully implemented the previous robotic skill (See at least Col 11 Lines 58-63 “…This performance is evaluated, either by a human or a programmatic module of the feedback engine 210, to indicate its level of success. Block 302 can involve acquiring recorded observations of the robotic task performance from the observation system 235 for automated evaluation…”, Col 10 Lines 40-52 “The observation system 235 is graphically depicted as a camera, however other systems suitable for recording the desired observations can be used in other implementations. Recorded observations can include audio signals, still images, video images sequences, electromagnetic tracking data, and textual information, depending upon the nature of a particular task. 
Recorded observations can additionally or alternatively include data from sensors on the robot or the target object of the task, for example data from strain gauges, torque sensors (e.g., back EMF sensors), inertial sensors (e.g., gyroscopes, accelerometers), optical sensors, radio frequency sensors, magnetic wave detectors, haptic sensors, air pressure sensors, and piezoelectric sensors.”). Therefore, it would have been obvious to one of the ordinary skill in the art before the effective filing date of the claimed invention to combine the teachings of Paxton with the teachings of Beckman and include the feature of one or more instances of force sensor data from one or more force sensors of an end effector of the robot and processing the one or more instances of force sensor data using a success detection model to generate the indication of whether the robot successfully implemented the previous robotic skill, thereby provide easy, useful, and effective robot-human interaction (See at least Col 1 Lines 20-21 “…enabling artificial intelligence systems to learn complex tasks or adapt to changing environments…”). Claim(s) 13, 15, 16 and 17 are rejected under 35 U.S.C. 103 as being unpatentable over Paxton et al. (US 2023/0297074 A1) (Hereinafter Paxton) in view of Pramanick et al. (Enabling human-like task identification from natural conversation, Pradip Pramanick, Chayan Sarkar, Balamuralidhar P, Ajay Kattepur, Indrajit Bhattacharya, Arpan Pal) (Hereinafter Pramanick), and further in view of Stepputtis et al. (STEPPUTTIS, SIMON, et al.; Language-Conditioned Imitation Learning for Robot Manipulation Tasks; 34th Conference on Neural Information Processing Systems (NeurIPS 2020); 12 pages; dated 22 Oct 2020) (Hereinafter Stepputtis). Regarding Claim 13, modified Paxton teaches all elements of claim 1. However, Paxton does not explicitly spell out the method of claim 1, wherein the unstructured textual answer to the question generated by the LLM is generated based on processing the question using a Visual Question Answering model to generate the unstructured textual answer. Stepputtis teaches the method of claim 1, wherein the unstructured textual answer to the question provided by the LLM is generated based on processing the question using a Visual Question Answering model to generate the unstructured textual answer (See at least Page 2 “5 Conclusion - demonstrated that our model, trained with imitation learning, achieved a high success rate on both synthetic instructions and unstructured human instructions.”, Page 9 “We present an approach for end-to-end imitation learning of robot manipulation policies that combines language, vision, and control. The extracted language-conditioned policies provided a simple and intuitive interface to a human user for providing unstructured commands.”, Page 3 “”2 Background - …Beyond robotics, the combination of vision and language has received ample attention in visual question-and-answering systems (VQA) [22, 5] and vision-and-language navigation (VNL) [34, 18, 10]. Our approach is most similar to [1]. However, unlike our model, the work in [1] used a fixed alphabet and required information about the task to be extracted from the sentence before being used for control. In contrast, our model can extract a variety of information directly from natural language.). 
Therefore, it would have been obvious to one of the ordinary skill in the art before the effective filing date of the claimed invention to combine the teachings of Paxton with the teachings of Stepputtis and include the feature of the unstructured textual answer to the question generated by the LLM is generated based on processing the question using a Visual Question Answering model to generate the unstructured textual answer, thereby enhance clarity providing speech instruction to the robot regarding the visual data (See at least “Abstract - …The resulting language-conditioned visuomotor policies can be conditioned at runtime on new human commands and instructions, which allows for more fine-grained control over the trained policies while also reducing situational ambiguity.”). Regarding Claim 15, modified Paxton teaches all elements of claim 1. However, Paxton does not explicitly spell out the method of claim 1, wherein the textual feedback includes task specific feedback and active scene description feedback. Pramanick teaches the method of claim 1, wherein the textual feedback includes task specific feedback (See at least Page 5 Col 1 “The dialogue for the same example is shown below. H1: add some water to the bowl R1: Is this task similar to placing? H2: I didn’t understand R2: Do you want me to put some water in the bowl? H3: Yes R3: Got it”). Therefore, it would have been obvious to one of the ordinary skill in the art before the effective filing date of the claimed invention to combine Paxton with the teachings of Pramanick and include the feature of textual task specific feedback, thereby provide easy, useful, and effective robot-human interaction with improved performance (See at least By adding the dialogue module to get the missing arguments (TCAR-IdP0), the performance improves by a high degree, as shown in Table V.).However, neither Paxton nor Beckman explicitly spell out … active scene description feedback. Stepputtis teaches … active scene description feedback (See at least Page 2 “5 Conclusion - demonstrated that our model, trained with imitation learning, achieved a high success rate on both synthetic instructions and unstructured human instructions.”, Page 9 “We present an approach for end-to-end imitation learning of robot manipulation policies that combines language, vision, and control. The extracted language-conditioned policies provided a simple and intuitive interface to a human user for providing unstructured commands.”). Therefore, it would have been obvious to one of the ordinary skill in the art before the effective filing date of the claimed invention to combine the teachings of Paxton with the teachings of Stepputtis and include the feature of textual feedback including active scene description feedback, thereby enhance clarity providing speech instruction to the robot (See at least “Abstract - …The resulting language-conditioned visuomotor policies can be conditioned at runtime on new human commands and instructions, which allows for more fine-grained control over the trained policies while also reducing situational ambiguity.”). Regarding Claim 16, modified Paxton teaches all elements of claim 1. Paxton further teaches the method of claim 1, wherein the textual feedback includes passive scene description feedback (See at least Para [0028] “In a situation such as that illustrated in FIGS. 1 A through 1 D, image data (or video or map data, etc.) can be captured of an environment, in addition to audio data that may contain spoken instructions. 
This image data can be analyzed to identify objects in an environment…”, Para [0033]) and … However, Paxton does not explicitly spell out … active scene description feedback. Stepputtis teaches … active scene description feedback (See at least Page 2 “5 Conclusion – demonstrated that our model, trained with imitation learning, achieved a high success rate on both synthetic instructions and unstructured human instructions.”, Page 9 “We present an approach for end-to-end imitation learning of robot manipulation policies that combines language, vision, and control. The extracted language-conditioned policies provided a simple and intuitive interface to a human user for providing unstructured commands.”). Therefore, it would have been obvious to one of the ordinary skill in the art before the effective filing date of the claimed invention to combine the teachings of Paxton with the teachings of Stepputtis and include the feature of textual feedback including active scene description feedback, thereby enhance clarity providing speech instruction to the robot (See at least “Abstract - …The resulting language-conditioned visuomotor policies can be conditioned at runtime on new human commands and instructions, which allows for more fine-grained control over the trained policies while also reducing situational ambiguity.”). Regarding Claim 17, modified Paxton teaches all elements of claim 1. Paxton further teaches … passive scene description feedback (See at least Para [0028] “In a situation such as that illustrated in FIGS. 1 A through 1 D, image data (or video or map data, etc.) can be captured of an environment, in addition to audio data that may contain spoken instructions. This image data can be analyzed to identify objects in an environment…”, Para [0033]), and ... However, Paxton does not explicitly spell out the method of claim 1, wherein the textual feedback includes task specific feedback … active scene description feedback. Beckman teaches the method of claim 1, wherein the textual feedback includes task specific feedback (See at least Page 5 Col 1 “The dialogue for the same example is shown below. H1: add some water to the bowl R1: Is this task similar to placing? H2: I didn’t understand R2: Do you want me to put some water in the bowl? H3: Yes R3: Got it”). Therefore, it would have been obvious to one of the ordinary skill in the art before the effective filing date of the claimed invention to combine Paxton with the teachings of Pramanick and include the feature of textual task specific feedback, thereby provide easy, useful, and effective robot-human interaction with improved performance (See at least By adding the dialogue module to get the missing arguments (TCAR-IdP0), the performance improves by a high degree, as shown in Table V.).Stepputtis teaches … active scene description feedback (See at least Page 2 “5 Conclusion – demonstrated that our model, trained with imitation learning, achieved a high success rate on both synthetic instructions and unstructured human instructions.”, Page 9 “We present an approach for end-to-end imitation learning of robot manipulation policies that combines language, vision, and control. The extracted language-conditioned policies provided a simple and intuitive interface to a human user for providing unstructured commands.”). 
Therefore, it would have been obvious to one of the ordinary skill in the art before the effective filing date of the claimed invention to combine the teachings of Paxton with the teachings of Stepputtis and include the feature of textual feedback including active scene description feedback, thereby enhance clarity providing speech instruction to the robot (See at least “Abstract - …The resulting language-conditioned visuomotor policies can be conditioned at runtime on new human commands and instructions, which allows for more fine-grained control over the trained policies while also reducing situational ambiguity.”). Claim 22 is rejected under 35 U.S.C. 103 as being unpatentable over Paxton et al. (US 2023/0297074 A1) (Hereinafter Paxton) in view of Pramanick et al. (Enabling human-like task identification from natural conversation, Pradip Pramanick, Chayan Sarkar, Balamuralidhar P, Ajay Kattepur, Indrajit Bhattacharya, Arpan Pal) (Hereinafter Pramanick), and further in view of Cherian et al. (US 2024/0300096 A1). Regarding Claim 22, Paxton teaches all the elements of claim 1. However, Paxton does not explicitly spell out the method of claim 1, wherein the task is a navigation task. Cherian teaches the method of claim 1, wherein the task is a navigation task (See at least Para [0013] “Some embodiments are based on the recognition that a Self-Motivated Communication Agent (SCoA) may be able to query for help and use natural language feedback to assist its navigation...”, Para [0011] Some embodiments are based on the recognition that audio produced by an audio source may be sporadic, intermittent, or garbled with other sounds such that localization may be challenging. Therefore, rather than taking incorrect navigation steps in such uncertain circumstances and thus delaying progress on the navigation task, the robot may benefit from natural language interaction with an oracle to seek help for navigation guidance.”). Therefore, it would have been obvious to one of the ordinary skill in the art before the effective filing date of the claimed invention to combine the teachings of Paxton with the teachings of Cherian and include the task being a navigation task, thereby providing added flexibility to the robot of being able to move around and perform other tasks as needed accurately with efficiency (See at least Para [0231] “…the accuracy and efficiency of navigation tasks in may be highly improved…”). Conclusion The prior art made of record and not relied upon is considered pertinent to applicant's disclosure: Groz et al. (US 11654552 B2) teaches method generating training data based on the sensor data and results of execution of the task for updating the AI model. Any inquiry concerning this communication or earlier communications from the examiner should be directed to SHAHEDA HOQUE whose telephone number is (571)270-5310. The examiner can normally be reached Monday-Friday 8:00 am- 5:00 pm. Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice. If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Ramon Mercado can be reached on 571-270-5744. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300. 
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000. /SHAHEDA HOQUE/Examiner, Art Unit 3658 /Ramon A. Mercado/Supervisory Patent Examiner, Art Unit 3658
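
For orientation, the method recited in claim 1 (instruction, environment feedback, an LLM-generated question about the environment, an answer from a human operator or another model, a second LLM output indicating sub-tasks, then matching against textual skill descriptions and execution) can be read as roughly the control flow below. This is an illustrative sketch only, not the applicant's implementation and not code from any cited reference; every name and interface here is hypothetical.

```python
# Illustrative sketch of the claim 1 flow. All names (robot, llm,
# answer_question, skills, etc.) are hypothetical stand-ins, not the
# applicant's code or any reference's API.

def _overlap(a: str, b: str) -> float:
    """Crude word-overlap similarity, for illustration only."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / max(len(wa | wb), 1)

def plan_and_act(robot, llm, answer_question, skills):
    # Free-form natural language instruction for a task in the environment.
    instruction = robot.get_instruction()

    # Textual feedback describing the current state, from robot sensor data.
    feedback = robot.describe_environment()

    # First LLM output: an unstructured, LLM-generated question about
    # the environment, conditioned on the instruction and feedback.
    question = llm(
        f"Instruction: {instruction}\nEnvironment: {feedback}\n"
        "Ask one question about the environment."
    )

    # Unstructured textual answer, from a human operator or another model.
    answer = answer_question(question)

    # Second LLM output: sub-tasks for performing the task.
    sub_tasks = llm(
        f"Instruction: {instruction}\nAnswer: {answer}\n"
        "List the sub-tasks needed to perform the task."
    )

    # Compare the second LLM output to each skill's textual description,
    # pick the best match, and cause the robot to implement that skill.
    best_skill = max(skills, key=lambda s: _overlap(sub_tasks, s.description))
    robot.execute(best_skill)
    return best_skill
```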

Prosecution Timeline

Jul 26, 2023
Application Filed
Apr 10, 2025
Non-Final Rejection — §101, §103
Jul 18, 2025
Response Filed
Jul 18, 2025
Applicant Interview (Telephonic)
Jul 18, 2025
Examiner Interview Summary
Sep 08, 2025
Final Rejection — §101, §103
Dec 08, 2025
Interview Requested
Dec 16, 2025
Examiner Interview Summary
Dec 16, 2025
Applicant Interview (Telephonic)
Dec 17, 2025
Request for Continued Examination
Jan 20, 2026
Response after Non-Final Action
Feb 12, 2026
Non-Final Rejection — §101, §103 (current)
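
As a quick check on the projections, the sketch below computes how long the application has been pending as of the current Office Action and how that compares with the 3y 1m median time to grant reported further down; dates are taken from the timeline above, and the day-count conversion is an approximation.

```python
# Pendency check against the timeline above. Dates are from the timeline;
# the 3y 1m median time-to-grant figure is from the projections section.

from datetime import date

filed = date(2023, 7, 26)          # Application Filed
current_oa = date(2026, 2, 12)     # Current Non-Final Rejection

pending_days = (current_oa - filed).days
print(f"Pending at current OA: {pending_days} days "
      f"(~{pending_days / 365.25:.1f} years)")

# 3 years 1 month, expressed in days (an approximation, not a prediction).
median_to_grant_days = round(3 * 365.25 + 30.44)
print(f"Days to reach the median time to grant: "
      f"{median_to_grant_days - pending_days}")
```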

Precedent Cases

Applications granted by this same examiner with similar technology

Patent 12569992
AUTOMATIC DETERMINATION OF ROBOT SETTLING STATES
2y 5m to grant • Granted Mar 10, 2026
Patent 12539597
ROBOT SYSTEM, AND CONTROL METHOD FOR SAME
2y 5m to grant • Granted Feb 03, 2026
Patent 12514143
AGRICULTURAL MACHINE, AGRICULTURAL WORK ASSISTANCE APPARATUS, AND AGRICULTURAL WORK ASSISTANCE SYSTEM
2y 5m to grant • Granted Jan 06, 2026
Patent 12485538
METHOD AND SYSTEM FOR DETERMINING A WORKPIECE LOADING LOCATION IN A CNC MACHINE WITH A ROBOTIC ARM
2y 5m to grant • Granted Dec 02, 2025
Patent 12479107
METHOD AND AN ASSEMBLY UNIT FOR PERFORMING ASSEMBLING OPERATIONS
2y 5m to grant • Granted Nov 25, 2025
Study what changed to get these applications past this examiner. Based on the examiner's 5 most recent grants.


Prosecution Projections

Expected OA Rounds: 3-4
Grant Probability: 43%
With Interview: 81% (+37.9%)
Median Time to Grant: 3y 1m
PTA Risk: High
Based on 58 resolved cases by this examiner. Grant probability derived from career allow rate.
