Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Continued Examination Under 37 CFR 1.114
A request for continued examination under 37 CFR 1.114, including the fee set forth in 37 CFR 1.17(e), was filed in this application after final rejection. Since this application is eligible for continued examination under 37 CFR 1.114, and the fee set forth in 37 CFR 1.17(e) has been timely paid, the finality of the previous Office action has been withdrawn pursuant to 37 CFR 1.114. Applicant's submission filed on 3/18/2026 has been entered.
Response to Arguments
Applicant’s arguments, see pages 7-8 of the Response, filed 3/11/2026, with respect to the rejection(s) of claim(s) 1-24 under 35 USC 102 have been fully considered and are persuasive. Therefore, the rejection has been withdrawn. However, upon further consideration, a new ground(s) of rejection is made in view of Huang et al (NPL: Inner Monologue: Embodied Reasoning through Planning with Language Models) in view of Guo et al (US Publication No.: 202502093338).
Claim Rejections - 35 USC § 112
The following is a quotation of 35 U.S.C. 112(b):
(b) CONCLUSION.—The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the inventor or a joint inventor regards as the invention.
The following is a quotation of 35 U.S.C. 112 (pre-AIA), second paragraph:
The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the applicant regards as his invention.
Claim limitation “means for pre-processing training data” invokes 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph. However, the written description fails to disclose the corresponding structure, material, or acts for performing the entire claimed function and to clearly link the structure, material, or acts to the function. Paragraph 56 discloses pre-processing training data such that initial rewards are replaced with higher reward values, but the paragraph does not disclose any structure or apparatus that performs the recited function, as is required for an apparatus claim. Therefore, the claim is indefinite and is rejected under 35 U.S.C. 112(b) or pre-AIA 35 U.S.C. 112, second paragraph.
Applicant may:
(a) Amend the claim so that the claim limitation will no longer be interpreted as a limitation under 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph;
(b) Amend the written description of the specification such that it expressly recites what structure, material, or acts perform the entire claimed function, without introducing any new matter (35 U.S.C. 132(a)); or
(c) Amend the written description of the specification such that it clearly links the structure, material, or acts disclosed therein to the function recited in the claim, without introducing any new matter (35 U.S.C. 132(a)).
If applicant is of the opinion that the written description of the specification already implicitly or inherently discloses the corresponding structure, material, or acts and clearly links them to the function so that one of ordinary skill in the art would recognize what structure, material, or acts perform the claimed function, applicant should clarify the record by either:
(a) Amending the written description of the specification such that it expressly recites the corresponding structure, material, or acts for performing the claimed function and clearly links or associates the structure, material, or acts to the claimed function, without introducing any new matter (35 U.S.C. 132(a)); or
(b) Stating on the record what the corresponding structure, material, or acts, which are implicitly or inherently set forth in the written description of the specification, perform the claimed function. For more information, see 37 CFR 1.75(d) and MPEP §§ 608.01(o) and 2181.
Claim Rejections - 35 USC § 103
In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis (i.e., changing from AIA to pre-AIA) for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.
Claim(s) 1,2,4-8,11-13,14,16-19,20,22-24 is/are rejected under 35 U.S.C. 103 as being unpatentable over Huang et al (NPL: Inner Monologue: Embodied Reasoning through Planning with Language Models) in view of Guo et al (US Publication No.: 202502093338).
Claim 1, Huang et al discloses
Training a language model to perform decision making tasks using a vocabulary to express one or more of actions, states, or returns as a sequence of separated values (Caption of Fig. 3 discloses “… uses a pre-trained language model to take in a human instruction and decompose it into a sequence of actionable steps (decision making tasks to express one or more actions, states or returns as a sequence of separated values) by the agent …”. Section A.2 discloses using InstructGPT (language model), wherein such model is trained using a data set as described in reference 91, wherein Section 3.2 of Reference 91 discloses dataset for training the language model (vocabulary).);
Receiving, by the language model (Section A.2), an input comprising a previous language stream (Fig. 3, label tabletop rearrangement (Real) shows a previous language stream such as “Human: Sort fruits and condiments in different plate”. Fig. 1 shows an example of the robot planning and interaction where a previous language stream is “Can you bring me the drink from the table?”.) including at least a state of an environment or an agent in the environment (Fig. 3, label Tabletop Rearrangement (Real) includes “Scene: objects …”. This indicates a state of an environment, i.e., the state of the tabletop.);
Generating, by the language model (Section A.2), an output language stream according to the vocabulary (Section A.2 discloses the language model is InstructGPT, where Section 3.2 of Reference 91 discloses the language model is trained using a dataset.) based on the input (Caption of Fig. 3 discloses “Sharing across the domain is the same Inner Monologue formulation that uses a pre-trained language model to take in a human instruction and decompose it into a sequence of actionable steps by the agent…”. Section 4.2 discloses “… an LLM to parse target objects from a language command (e.g., given by the planner) and then executes the scripted suction …”. Fig. 1, caption discloses “robot planning with large language models ….”.);
Detecting, by the language model (Section A.2), a well-formed action in response to patterns in the output language stream (Caption of Fig. 3 discloses “…. A pre-trained language model to take in a human instruction and decompose it into a sequence of actionable steps by the agent …”. This indicates a well-formed action is detected (actionable steps) in response to patterns or content of the output language stream (human instruction). Fig. 1, label can you bring me the drink from the table as a well-formed action with subsequent question as a result of input to the robot.);
Performing an operation, by the environment, in response to detecting the well-formed action, the operation returning a result (Fig. 3, label Tabletop Rearrangement (real) performs an operation (“Robot: robot.pickplace(strawberry, green plate)”) by an environment, such as the LLM performing the scripted suction as per Section 4.2. The action of the robot is according to, or by, the environment as indicated in Fig. 3, “Scene: objects …”. Fig. 1, labels “action: go to table” and “action: pick up the coke” indicate operations performed in response to detecting the well-formed actions “Can you bring me the drink from the table?” and “coke please”.);
Appending the result to the output language stream to obtain an updated output language stream (Fig. 3, label tabletop rearrangement (real) label “Success Detector: False”. The language model will continue to run, performing operation in addition or with the appended result of “Success Detector: False” (result of the output language stream). Fig. 1, label action was not successful indicates the result of the output language stream. This is inputted to the robot for continued action, which indicates appending the result to the output language stream.); and
Repeating the generating, with the updated output language stream as the input, the detecting, the performing, and the appending until a termination condition is satisfied (Fig. 3, label tabletop rearrangement (real) shows repetition of the generating, with the updated output language stream as the input of “Success Detector: False”, the detecting, the performing and the appending until label “Robot: robot.stop()”. Fig. 1, labels “action: pick up the coke”, “action was successful” and “action: bring it to you” indicate repeating the generating, with the updated output language stream as the input, the detecting, the performing and the appending until a termination condition, such as success of an action and a completed task or job, is satisfied.).
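For illustration only, the generate, detect, perform, append and repeat sequence recited in the limitations above can be sketched as a short loop. This is a minimal sketch under stated assumptions and is not taken from Huang et al or from the application: `toy_model`, `ToyEnvironment`, `run` and the `ACTION` pattern are hypothetical names, the language model is replaced by a scripted stand-in, and the text lines merely imitate the “Robot: robot.pickplace(...)” and “Success Detector: False” lines quoted from Fig. 3.

```python
import re

# Hypothetical action syntax, imitating the "Robot: robot.pickplace(...)" line quoted from Fig. 3.
ACTION = re.compile(r"Robot: robot\.(\w+)\((.*?)\)")


def toy_model(stream: str) -> str:
    """Scripted stand-in for a language model (assumption, not a real model)."""
    if "Success Detector: True" in stream:
        return "Robot: robot.stop()\n"
    return "Robot: robot.pickplace(strawberry, green plate)\n"


class ToyEnvironment:
    """Stand-in environment: the first attempt fails, the second succeeds."""

    def __init__(self) -> None:
        self.attempts = 0

    def perform(self, name: str, args: str) -> str:
        self.attempts += 1
        return f"Success Detector: {self.attempts >= 2}\n"


def run(stream: str, model, env, max_steps: int = 10) -> str:
    for _ in range(max_steps):
        chunk = model(stream)            # generating, with the current stream as input
        stream += chunk
        action = ACTION.search(chunk)    # detecting a well-formed action
        if action is None:
            continue
        if action.group(1) == "stop":    # termination condition
            break
        # performing the operation and appending its result to the stream
        stream += env.perform(action.group(1), action.group(2))
    return stream


transcript = run(
    "Human: Sort fruits and condiments in different plates.\n"
    "Scene: objects = [strawberry, green plate]\n",
    toy_model,
    ToyEnvironment(),
)
```

In this sketch the result of each operation is appended to the same text stream that is fed back to the model, so the next generation step sees the outcome of the previous action.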
Although Huang et al discloses training of the language model (Caption of Fig. 3), Huang et al fails to disclose pre-processing as recited in the limitation.
Guo et al discloses
pre-processing training data (Fig. 1, label training system), prior to training a neural network to perform decision making tasks (Fig. 1, label training system trains the policy neural network to make decision making tasks such as selecting actions the agent, label 104, will perform) using a vocabulary, the vocabulary expressing one or more actions, states or returns as a sequence of separated values (Fig. 1, label 190 includes the training data. Paragraph 15 discloses the training data includes trajectories, wherein the trajectories comprise a series of actions previously performed by the agent at successive time-steps (actions and states) and corresponding observations (returns). A series of actions, states or returns indicates separate entities or values, each indicating an action or state at a time-step and an observation or return corresponding to the trajectory.),
the pre-processing comprising converting an initial reward into a new reward value that is between the initial reward value and an ideal reward value such that initial reward values are replaced by higher reward values (Fig. 1, label 192 determines rewards for the training data. Paragraph 89 discloses “The intrinsic reward value for each action is combined with the task reward for the corresponding action, to give a total reward value for the action. This total reward value is used by the … update unit 193 to update the policy neural network 122 in one iteration of training the policy neural network 122.” Paragraph 9 discloses “choosing a control policy so as to increase the expected intrinsic reward term corresponds to adapting the control policy so that it is more likely to choose actions which enable the predictions to be improved …”. Fig. 1, label 190 with 192 indicates rewards are assigned according to label 192 corresponding to actions of trajectories found in the training database. Depending on the computation by label 192, the reward is updated, or an initial reward is replaced with a higher reward.).
Guo et al discloses training a neural network to select actions or tasks to be performed by an agent (Fig. 1, label 100,108,104) via iterative training based on observations in the environment, or returns or outcomes of the agent’s actions (Fig. 1, label 110,112,190). Huang et al discloses a task-based language model in which a set of tasks or actions is planned and performed by the language model (Fig. 1,3). Hence, it would have been obvious to one skilled in the art before the effective filing date of the application to modify Huang et al’s language model, which performs planned tasks or actions, by incorporating Guo et al’s iterative reinforcement training of the task decision-making neural network, or policy neural network, so as to improve the selection of actions, thereby improving the performance of the agent, such as a language model.
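For illustration only, one possible reading of the recited conversion, in which the new reward value lies between the initial reward value and the ideal reward value, is a linear interpolation. This is a minimal sketch under stated assumptions: the function name `relabel_reward`, the parameter `alpha` and the interpolation formula are hypothetical and are not taken from the claims, the specification or Guo et al.

```python
def relabel_reward(initial: float, ideal: float, alpha: float = 0.5) -> float:
    """Return a reward between `initial` and `ideal` (hypothetical interpolation).

    `alpha` is an assumed mixing weight in the open interval (0, 1); a larger
    value moves the new reward closer to the ideal reward.
    """
    if not 0.0 < alpha < 1.0:
        raise ValueError("alpha must be strictly between 0 and 1")
    if initial >= ideal:
        # Nothing to raise: the initial reward already meets the ideal value.
        return initial
    return initial + alpha * (ideal - initial)


# Example: an initial reward of 2.0 with an ideal reward of 10.0.
new_reward = relabel_reward(2.0, 10.0, alpha=0.5)
```

Under this assumed formula, every initial reward below the ideal value is replaced by a strictly higher value that never exceeds the ideal value.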
Claim 2, Huang et al discloses the operation is a state transition in the environment and the result includes a new state and/or a reward (Fig. 3, label tabletop rearrangement (real), “Success detector: false” indicates the state transition, where success detection is part of the operation. The success detector outcome is a result of the operation and indicates a new state of the operation.).
Claim 4, Huang et al discloses detecting the well-formed action by parsing using a pointer to indicate an end of a last well-formed action. (Fig. 1, label “coke please.”, where the well-formed actions are “action: pick up the coke” and “bring it to you”. Parsing, or determining whether or not the action was successful, indicates a pointer to an end of a last well-formed action, i.e., the action “pick up the coke” or “bring it to you” is completed or ended.)
Claim 5, Huang et al discloses the last well-formed action is included in a regular expression. (Fig. 1, label “pick up the coke” and “bring it to you” as the last well-formed action includes a regular expression.)
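For illustration only, the limitations of claims 4 and 5 can be read as a parser that matches actions with a regular expression and keeps a pointer at the end of the last well-formed action, so that only newly generated text is searched. This is a minimal sketch under stated assumptions: the class `ActionParser`, the `ACTION` pattern and the `action: "..."` text format are hypothetical and are not taken from Huang et al or from the application.

```python
import re
from typing import Optional

# Hypothetical pattern for a well-formed action such as: action: "pick up the coke"
ACTION = re.compile(r'action: "([^"]+)"')


class ActionParser:
    """Tracks a pointer to the end of the last well-formed action in a stream."""

    def __init__(self) -> None:
        self.pointer = 0  # index just past the last matched action

    def next_action(self, stream: str) -> Optional[str]:
        # Search only the text after the pointer, so earlier actions are not re-detected.
        match = ACTION.search(stream, self.pointer)
        if match is None:
            return None
        self.pointer = match.end()
        return match.group(1)


parser = ActionParser()
stream = 'action: "go to table"\n'
first = parser.next_action(stream)    # detects the first action
again = parser.next_action(stream)    # nothing new after the pointer
stream += 'action was successful\naction: "pick up the coke"\n'
second = parser.next_action(stream)   # detects only the newly appended action
```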
Claim 6, Huang et al discloses pretrained language model decomposes a human instruction into a sequence of actionable steps (Fig. 3 caption), but fails to disclose pre-processing the training data, prior to training, to replace a first return encoding with a random value.
Guo et al discloses preprocessing the training data (Fig. 1, label training system processes training data, label 191, to include reward, label 192,193.), prior to training (Fig. 1, label training system processes training data prior to training label 122), to replace a first return encoding with a random value (Fig. 1, label 192 determines the intrinsic reward, label 112 determines the extrinsic reward. The first return encoding is the reward, extrinsic and/or intrinsic. Both the intrinsic reward and the extrinsic reward are determined based on observation of the environment once the agent performs the selected actions from label 122. This indicates the reward can be updated or adjusted to a random value, wherein the value of the reward depends on observations, where such observations have no pattern or plan, hence resulting in a random value of the reward, intrinsic and/or extrinsic.).
Guo et al discloses training a neural network to select actions or tasks to be performed by an agent (Fig. 1, label 100,108,104) via iterative training based on observations in the environment, or returns or outcomes of the agent’s actions (Fig. 1, label 110,112,190). Huang et al discloses a task-based language model in which a set of tasks or actions is planned and performed by the language model (Fig. 1,3). Hence, it would have been obvious to one skilled in the art before the effective filing date of the application to modify Huang et al’s language model, which performs planned tasks or actions, by incorporating Guo et al’s iterative reinforcement training of the task decision-making neural network, or policy neural network, so as to improve the selection of actions, thereby improving the performance of the agent, such as a language model.
Claim 7, Huang et al discloses
Receive, by the language model (Section A.2) after training (Caption of Fig. 3 discloses “… uses a pre-trained language model to take in a human instruction and decompose it into a sequence of actionable steps (decision making tasks to express one or more actions, states or returns as a sequence of separated values) by the agent …”.), an input comprising a previous language stream (Fig. 3, label tabletop rearrangement (Real) shows a previous language stream such as “Human: Sort fruits and condiments in different plate”. Fig. 1 shows an example of the robot planning and interaction where a previous language stream is “Can you bring me the drink from the table?”.) including at least a state of an environment or an agent in the environment (Fig. 3, label Tabletop Rearrangement (Real) includes “Scene: objects …”. This indicates a state of an environment, i.e., the state of the tabletop.);
Generate, by the language model (Section A.2) after training (Caption of Fig. 3 discloses “… uses a pre-trained language model to take in a human instruction and decompose it into a sequence of actionable steps (decision making tasks to express one or more actions, states or returns as a sequence of separated values) by the agent …”.), an output language stream according to the vocabulary (Section A.2 discloses the language model is InstructGPT, where Section 3.2 of Reference 91 discloses the language model is trained using a dataset.) based on the input (Caption of Fig. 3 discloses “Sharing across the domain is the same Inner Monologue formulation that uses a pre-trained language model to take in a human instruction and decompose it into a sequence of actionable steps by the agent…”. Section 4.2 discloses “… an LLM to parse target objects from a language command (e.g., given by the planner) and then executes the scripted suction …”. Fig. 1, caption discloses “robot planning with large language models ….”.);
Detect, by the language model (Section A.2) after training (Caption of Fig. 3 discloses “… uses a pre-trained language model to take in a human instruction and decompose it into a sequence of actionable steps (decision making tasks to express one or more actions, states or returns as a sequence of separated values) by the agent …”.), a well-formed action in response to patterns in the output language stream (Caption of Fig. 3 discloses “…. A pre-trained language model to take in a human instruction and decompose it into a sequence of actionable steps by the agent …”. This indicates a well-formed action is detected (actionable steps) in response to patterns or content of the output language stream (human instruction). Fig. 1, label can you bring me the drink from the table as a well-formed action with subsequent question as a result of input to the robot.);
Perform an operation, by the environment, in response to detecting the well-formed action, the operation returning a result (Fig. 3, label Tabletop Rearrangement (real) performs an operation (“Robot: robot.pickplace(strawberry, green plate)”) by an environment, such as the LLM performing the scripted suction as per Section 4.2. The action of the robot is according to, or by, the environment as indicated in Fig. 3, “Scene: objects …”. Fig. 1, labels “action: go to table” and “action: pick up the coke” indicate operations performed in response to detecting the well-formed actions “Can you bring me the drink from the table?” and “coke please”.);
Append the result to the output language stream to obtain an updated output language stream (Fig. 3, label tabletop rearrangement (real) label “Success Detector: False”. The language model will continue to run, performing operation in addition or with the appended result of “Success Detector: False” (result of the output language stream). Fig. 1, label action was not successful indicates the result of the output language stream. This is inputted to the robot for continued action, which indicates appending the result to the output language stream.); and
Repeat the generating, with the updated output language stream as the input, the detecting, the performing, and the appending until a termination condition is satisfied (Fig. 3, label tabletop rearrangement (real) shows repetition of the generating, with the updated output language stream as the input of “Success Detector: False”, the detecting, the performing and the appending until label “Robot: robot.stop()”. Fig. 1, labels “action: pick up the coke”, “action was successful” and “action: bring it to you” indicate repeating the generating, with the updated output language stream as the input, the detecting, the performing and the appending until a termination condition, such as success of an action and a completed task or job, is satisfied.).
Although Huang et al discloses training of the language model (Caption of Fig. 3), Huang et al fails to disclose pre-processing as recited in the limitation and at least one memory; and at least one processor coupled to the at least one memory.
Guo et al discloses
At least one memory (paragraphs 152,158); and
At least one processor coupled to the at least one memory (paragraph 158) configured to include a system with a neural network and training (Fig. 1), wherein the system comprises
pre-processing training data (Fig. 1, label training system), prior to training a neural network to perform decision making tasks (Fig. 1, label training system trains the policy neural network to make decision making tasks such as selecting actions the agent, label 104, will perform) using a vocabulary, the vocabulary expressing one or more actions, states or returns as a sequence of separated values (Fig. 1, label 190 includes the training data. Paragraph 15 discloses the training data includes trajectories, wherein the trajectories comprise a series of actions previously performed by the agent at successive time-steps (actions and states) and corresponding observations (returns). A series of actions, states or returns indicates separate entities or values, each indicating an action or state at a time-step and an observation or return corresponding to the trajectory.),
the pre-processing comprising converting an initial reward into a new reward value that is between the initial reward value and an ideal reward value such that initial reward values are replaced by higher reward values (Fig. 1, label 192 determines rewards for the training data. Paragraph 89 discloses “The intrinsic reward value for each action is combined with the task reward for the corresponding action, to give a total reward value for the action. This total reward value is used by the … update unit 193 to update the policy neural network 122 in one iteration of training the policy neural network 122.” Paragraph 9 discloses “choosing a control policy so as to increase the expected intrinsic reward term corresponds to adapting the control policy so that it is more likely to choose actions which enable the predictions to be improved …”. Fig. 1, label 190 with 192 indicates rewards are assigned according to label 192 corresponding to actions of trajectories found in the training database. Depending on the computation by label 192, the reward is updated, or an initial reward is replaced with a higher reward.).
Guo et al discloses training a neural network to select actions or tasks to be performed by an agent (Fig. 1, label 100,108,104) via iterative training based on observations in the environment, or returns or outcomes of the agent’s actions (Fig. 1, label 110,112,190). Huang et al discloses a task-based language model in which a set of tasks or actions is planned and performed by the language model (Fig. 1,3). Hence, it would have been obvious to one skilled in the art before the effective filing date of the application to modify Huang et al’s language model, which performs planned tasks or actions, by incorporating Guo et al’s iterative reinforcement training of the task decision-making neural network, or policy neural network, so as to improve the selection of actions, thereby improving the performance of the agent, such as a language model.
Claim 8, Huang et al discloses the operation is a state transition in the environment and the result includes a new state and/or a reward (Fig. 3, label tabletop rearrangement (real), “Success detector: false” indicates the state transition, where success detection is part of the operation. The success detector outcome is a result of the operation and indicates a new state of the operation.).
Claim 10, Huang et al discloses detecting the well-formed action by parsing using a pointer to indicate an end of a last well-formed action. (Fig. 1, label “coke please.”, where the well-formed actions are “action: pick up the coke” and “bring it to you”. Parsing, or determining whether or not the action was successful, indicates a pointer to an end of a last well-formed action, i.e., the action “pick up the coke” or “bring it to you” is completed or ended.)
Claim 11, Huang et al discloses the last well-formed action is included in a regular expression. (Fig. 1, label “pick up the coke” and “bring it to you” as the last well-formed action includes a regular expression.)
Claim 12, Huang et al discloses pretrained language model decomposes a human instruction into a sequence of actionable steps (Fig. 3 caption), but fails to disclose pre-processing the training data, prior to training, to replace a first return encoding with a random value.
Guo et al discloses preprocessing the training data (Fig. 1, label training system processes training data, label 191, to include reward, label 192,193.), prior to training (Fig. 1, label training system processes training data prior to training label 122), to replace a first return encoding with a random value (Fig. 1, label 192 determines the intrinsic reward, label 112 determines the extrinsic reward. The first return encoding is the reward, extrinsic and/or intrinsic. Both the intrinsic reward and the extrinsic reward are determined based on observation of the environment once the agent performs the selected actions from label 122. This indicates the reward can be updated or adjusted to a random value, wherein the value of the reward depends on observations, where such observations have no pattern or plan, hence resulting in a random value of the reward, intrinsic and/or extrinsic.).
Guo et al discloses training a neural network to select actions or tasks to be performed by an agent (Fig. 1, label 100,108,104) via iterative training based on observations in the environment, or returns or outcomes of the agent’s actions (Fig. 1, label 110,112,190). Huang et al discloses a task-based language model in which a set of tasks or actions is planned and performed by the language model (Fig. 1,3). Hence, it would have been obvious to one skilled in the art before the effective filing date of the application to modify Huang et al’s language model, which performs planned tasks or actions, by incorporating Guo et al’s iterative reinforcement training of the task decision-making neural network, or policy neural network, so as to improve the selection of actions, thereby improving the performance of the agent, such as a language model.
Claim 13, Huang et al discloses
Means for (such limitation invokes 35 USC 112f, hence incorporates the apparatus performing the claimed function. Section 1 discloses processor in the form of a robot or embodied agent performing the claimed function) training a language model to perform decision making tasks using a vocabulary to express one or more of actions, states or returns as a sequence of separated values (Caption of Fig. 3 discloses “… uses a pre-trained language model to take in a human instruction and decompose it into a sequence of actionable steps (decision making tasks to express one or more actions, states or returns as a sequence of separated values) by the agent …”. Section A2 discloses using InstructGPT (language model), wherein such model is trained using a data set as described in reference 91, wherein Section 3.2 of Reference 91 discloses dataset for training the language model (vocabulary).);
Means for (such limitation invokes 35 USC 112f, hence incorporates the apparatus performing the claimed function. Section 1 discloses processor in the form of a robot or embodied agent performing the claimed function) receiving, by the language model (Section A.2), an input comprising a previous language stream (Fig. 3, label tabletop rearrangement (Real) shows a previous language stream such as “Human: Sort fruits and condiments in different plate”. Fig. 1 shows an example of the robot planning and interaction where a previous language stream is “Can you bring me the drink from the table?”.) including at least a state of an environment or an agent in the environment (Fig. 3, label Tabletop Rearrangement (Real) includes “Scene: objects …”. This indicates a state of an environment, i.e., the state of the tabletop.);
Means for (such limitation invokes 35 USC 112f, hence incorporates the apparatus performing the claimed function. Section 1 discloses processor in the form of a robot or embodied agent performing the claimed function) generating, by the language model (Section A.2), an output language stream according to the vocabulary (Section A.2 discloses the language model is InstructGPT, where Section 3.2 of Reference 91 discloses the language model is trained using a dataset.) based on the input (Caption of Fig. 3 discloses “Sharing across the domain is the same Inner Monologue formulation that uses a pre-trained language model to take in a human instruction and decompose it into a sequence of actionable steps by the agent…”. Section 4.2 discloses “… an LLM to parse target objects from a language command (e.g., given by the planner) and then executes the scripted suction …”. Fig. 1, caption discloses “robot planning with large language models ….”.);
Means for (such limitation invokes 35 USC 112f, hence incorporates the apparatus performing the claimed function. Section 1 discloses processor in the form of a robot or embodied agent performing the claimed function) Detecting, by the language model (Section A.2), a well-formed action in response to patterns in the output language stream (Caption of Fig. 3 discloses “…. A pre-trained language model to take in a human instruction and decompose it into a sequence of actionable steps by the agent …”. This indicates a well-formed action is detected (actionable steps) in response to patterns or content of the output language stream (human instruction). Fig. 1, label can you bring me the drink from the table as a well-formed action with subsequent question as a result of input to the robot.);
Means for (such limitation invokes 35 USC 112f, hence incorporates the apparatus performing the claimed function. Section 1 discloses processor in the form of a robot or embodied agent performing the claimed function) performing an operation, by the environment, in response to detecting the well-formed action, the operation returning a result (Fig. 3, label Tabletop Rearrangement (real) performs an operation (“Robot: robot.pickplace(strawberry, green plate)”) by an environment, such as the LLM performing the scripted suction as per Section 4.2. The action of the robot is according to, or by, the environment as indicated in Fig. 3, “Scene: objects …”. Fig. 1, labels “action: go to table” and “action: pick up the coke” indicate operations performed in response to detecting the well-formed actions “Can you bring me the drink from the table?” and “coke please”.);
Means for (such limitation invokes 35 USC 112f, hence incorporates the apparatus performing the claimed function. Section 1 discloses processor in the form of a robot or embodied agent performing the claimed function) appending the result to the output language stream to obtain an updated output language stream (Fig. 3, label tabletop rearrangement (real) label “Success Detector: False”. The language model will continue to run, performing operation in addition or with the appended result of “Success Detector: False” (result of the output language stream). Fig. 1, label action was not successful indicates the result of the output language stream. This is inputted to the robot for continued action, which indicates appending the result to the output language stream.); and
Means for (such limitation invokes 35 USC 112f, hence incorporates the apparatus performing the claimed function. Section 1 discloses processor in the form of a robot or embodied agent performing the claimed function) repeating the generating, with the updated output language stream as the input, the detecting, the performing, and the appending until a termination condition is satisfied (Fig. 3, label tabletop rearrangement (real) shows repetition of the generating, with the updated output language stream as the input of “Success Detector: False”, the detecting, the performing and the appending until label “Robot: robot.stop()”. Fig. 1, labels “action: pick up the coke”, “action was successful” and “action: bring it to you” indicate repeating the generating, with the updated output language stream as the input, the detecting, the performing and the appending until a termination condition, such as success of an action and a completed task or job, is satisfied.).
Although Huang et al discloses training of the language model (Caption of Fig. 3), Huang et al fails to disclose pre-processing as recited in the limitation and a system comprising a processor.
Guo et al discloses a system comprising neural network selecting actions performed by an agent and training of the neural network (Fig. 1) includes a processor (Paragraphs 158,157,152), wherein the system comprises
Means for (such limitation invokes 35 USC 112f, hence incorporates the apparatus performing the claimed function. Due to the 35 USC 112(f) indicated above, the claimed language is interpreted as a processor as the apparatus. Paragraph 157,158 discloses general or special purpose microprocessors or both or any other kind of central processing unit performing the described processes and logic flows.) pre-processing training data (Fig. 1, label training system), prior to training a neural network to perform decision making tasks (Fig. 1, label training system trains policy neural network to make decision making tasks such as selecting actions the agent, label 104, will perform) vocabulary, the vocabulary expressing one or more actions, states or returns as a sequence of separated values (Fig. 1, label 190 includes the training data. Paragraph 15 discloses the training data includes trajectories, wherein trajectories comprises a series of actions previously performed by the agent at successive time-steps (actions and states) and corresponding observations (returns). A series of actions, states or returns indicates separate entities or values indicating an action, state at a time-step and observation or return corresponding to the trajectory.),
the pre-processing comprising converting an initial reward into a new reward value that is between the initial reward value and an ideal reward value such that initial reward values are replaced by higher reward values (Fig. 1, label 192 determines rewards for the training data. Paragraph 89 discloses “The intrinsic reward value for each action is combined with the task reward for the corresponding action, to give a total reward value for the action. This total reward value is used by the … update unit 193 to update the policy neural network 122 in one iteration of training the policy neural network 122.” Paragraph 9 discloses “choosing a control policy so as to increase the expected intrinsic reward term corresponds to adapting the control policy so that it is more likely to choose actions which enable the predictions to be improved …”. Fig. 1, label 190 with 192 indicates rewards are assigned by label 192 to the actions of trajectories found in the training database. Depending on the computation by label 192, the initial reward is updated, or replaced, with a higher reward.).
Guo et al discloses training a neural network for selecting actions or tasks to be performed by an agent (Fig. 1, label 100,108,104) via iterative training based on observations in the environment, or returns or outcomes of the agent’s actions (Fig. 1, label 110,112,190), and Huang et al discloses a task-based language model, where a set of tasks or actions are planned and performed by the language model (Fig. 1,3). Hence, it would be obvious to one skilled in the art before the effective filing date of the application to modify Huang et al’s language model performing planned tasks or actions by incorporating Guo et al’s iterative reinforcement training of a task decision-making neural network or policy neural network so as to improve the selection of actions, hence improving the performance of the agent, such as a language model.
Claim 14, Huang et al discloses the operation is a state transition in the environment and the result includes a new state and/or a reward (Fig. 3, label tabletop rearrangement (real), “Success detector: false” indicates the state transition, where success detection is part of the operation. The success detector outcome is a result of the operation and indicates a new state of the operation.).
Claim 16, Huang et al discloses detecting the well-formed action by parsing using a pointer to indicate an end of a last well-formed action (Fig. 1, label “coke please.”, where the well-formed actions are “action: pick up the coke” and “action: bring it to you”. Parsing, or determining whether the action was successful or not, indicates a pointer to an end of a last well-formed action, i.e., the action “pick up the coke” or “bring it to you” is completed or ended.).
Claim 17, Huang et al discloses the last well-formed action is included in a regular expression. (Fig. 1, label “pick up the coke” and “bring it to you” as the last well-formed action includes a regular expression.)
Claim 18 recites similar limitations as recited in claim 12 and is rejected on the same grounds as claim 12.
Claim 19, Huang et al discloses
Receiving, by the language model (Section A.2) after training (Caption of Fig. 3 discloses “… uses a pre-trained language model to take in a human instruction and decompose it into a sequence of actionable steps (decision making tasks to express one or more actions, states or returns as a sequence of separated values) by the agent …”.), an input comprising a previous language stream (Fig. 3, label tabletop rearrangement (Real) shows a previous language stream such as “Human: Sort fruits and condiments in different plate”. Fig. 1 shows an example of the robot planning and interaction where a previous language stream is “Can you bring me the drink from the table?”.) including at least a state of an environment or an agent in the environment (Fig. 3, label Tabletop Rearrangement (Real) includes “Scene: objects …”. Such indicates a state of an environment, i.e., the state of the tabletop.);
Generating, by the language model (Section A.2) after training (Caption of Fig. 3 discloses “… uses a pre-trained language model to take in a human instruction and decompose it into a sequence of actionable steps (decision making tasks to express one or more actions, states or returns as a sequence of separated values) by the agent …”.), an output language stream according to the vocabulary (Section A.2 discloses the language model is InstructGPT, where Section 3.2 of Reference 91 discloses the language model is trained using a dataset.) based on the input (Caption of Fig. 3 discloses “Sharing across the domain is the same Inner Monologue formulation that uses a pre-trained language model to take in a human instruction and decompose it into a sequence of actionable steps by the agent…”. Section 4.2 discloses “… an LLM to parse target objects from a language command (e.g., given by the planner) and then executes the scripted suction …”. Fig. 1, caption discloses “robot planning with large language models ….”.);
Detecting, by the language model (Section A.2) after training (Caption of Fig. 3 discloses “… uses a pre-trained language model to take in a human instruction and decompose it into a sequence of actionable steps (decision making tasks to express one or more actions, states or returns as a sequence of separated values) by the agent …”.), a well-formed action in response to patterns in the output language stream (Caption of Fig. 3 discloses “…. A pre-trained language model to take in a human instruction and decompose it into a sequence of actionable steps by the agent …”. This indicates a well-formed action is detected (actionable steps) in response to patterns or content of the output language stream (human instruction). Fig. 1, label can you bring me the drink from the table as a well-formed action with subsequent question as a result of input to the robot.);
Performing an operation, by the environment, in response to detecting the well-formed action, the operation returning a result (Fig. 3, label Tabletop Rearrangement (real) performs an operation (“Robot: robot.pickplace(strawberry, green plate)”) by an environment, such as the LLM performing the scripted suction as per Section 4.2. The action of the robot is according to or by the environment as indicated in Fig. 3, “Scene: objects …”. Fig. 1, labels action: “go to table” and action: “pick up the coke” indicate operations performed in response to detecting the well-formed actions “Can you bring me the drink from the table?” and “coke please”.);
Appending the result to the output language stream to obtain an updated output language stream (Fig. 3, label tabletop rearrangement (real) label “Success Detector: False”. The language model will continue to run, performing operation in addition or with the appended result of “Success Detector: False” (result of the output language stream). Fig. 1, label action was not successful indicates the result of the output language stream. This is inputted to the robot for continued action, which indicates appending the result to the output language stream.); and
Repeating the generating, with the updated output language stream as the input, the detecting, the performing, and the appending until a termination condition is satisfied (Fig. 3, label tabletop rearrangement (real) shows repetition of the generating, with the updated output language stream as the input of “Success Detector: False”, the detecting, the performing and the appending until label “Robot: robot.stop()”. Fig. 1, labels “action: pick up the coke”, “action was successful” and “action: bring it to you” indicate repeating the generating, with the updated output language stream as the input, the detecting, the performing and the appending until a termination condition, such as success of an action and a completed task or job, is satisfied.).
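For illustration only, the generating, detecting, performing, appending and repeating steps mapped above can be sketched in Python as follows. This is a minimal sketch under stated assumptions and is not taken from Huang et al or from the application: the names toy_model, make_toy_environment, run_episode and ACTION_PATTERN, the "Robot: name(args)" action syntax, and the rule that the first attempt fails are all hypothetical. The sketch also assumes the well-formed action is detected with a regular expression and a pointer marking the end of the last well-formed action.

```python
import re

# Hypothetical action syntax (an assumption, not defined by the cited references).
ACTION_PATTERN = re.compile(r"Robot: (\w+)\(([^)]*)\)")


def toy_model(stream):
    """Stand-in for a language model: proposes the next action from the stream."""
    if stream.rstrip().endswith("Success Detector: False"):
        return "Robot: pick(coke)\n"  # retry after a reported failure
    if "Success Detector: True" in stream:
        return "Robot: stop()\n"  # task done, request termination
    return "Robot: pick(coke)\n"


def make_toy_environment():
    """Stand-in environment: the first attempt fails, later attempts succeed."""
    attempts = {"n": 0}

    def step(name, args):
        attempts["n"] += 1
        succeeded = attempts["n"] >= 2
        return f"Success Detector: {succeeded}\n"

    return step


def run_episode(model, environment, prompt, max_steps=10):
    stream = prompt
    pointer = len(stream)  # end of the last well-formed action handled so far
    for _ in range(max_steps):
        stream += model(stream)                          # generating
        match = ACTION_PATTERN.search(stream, pointer)   # detecting
        if match is None:
            continue
        pointer = match.end()
        name, args = match.group(1), match.group(2)
        if name == "stop":                               # termination condition
            break
        stream += environment(name, args)                # performing + appending
    return stream
```

Calling run_episode(toy_model, make_toy_environment(), "Human: bring me the drink\n") returns a stream in which a failed attempt is followed by a retry, a success report, and a final stop action.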
Although Huang et al discloses training of the language model (Caption of Fig. 3), Huang et al fails to disclose pre-processing as recited in the limitation.
Guo et al discloses
program code to perform the recited limitations (paragraphs 157-158);
Program code (paragraph 157-158) to pre-process training data (Fig. 1, label training system), prior to training a neural network to perform decision making tasks (Fig. 1, label training system trains policy neural network to make decision making tasks such as selecting actions the agent, label 104, will perform) vocabulary, the vocabulary expressing one or more actions, states or returns as a sequence of separated values (Fig. 1, label 190 includes the training data. Paragraph 15 discloses the training data includes trajectories, wherein trajectories comprises a series of actions previously performed by the agent at successive time-steps (actions and states) and corresponding observations (returns). A series of actions, states or returns indicates separate entities or values indicating an action, state at a time-step and observation or return corresponding to the trajectory.),
the pre-processing comprising converting an initial reward into a new reward value that is between the initial reward value and an ideal reward value such that initial reward values are replaced by higher reward values (Fig. 1, label 192 determines rewards for the training data. Paragraph 89 discloses “The intrinsic reward value for each action is combined with the task reward for the corresponding action, to give a total reward value for the action. This total reward value is used by the … update unit 193 to update the policy neural network 122 in one iteration of training the policy neural network 122.” Paragraph 9 discloses “choosing a control policy so as to increase the expected intrinsic reward term corresponds to adapting the control policy so that it is more likely to choose actions which enable the predictions to be improved …”. Fig. 1, label 190 with 192 indicates rewards are assigned by label 192 to the actions of trajectories found in the training database. Depending on the computation by label 192, the initial reward is updated, or replaced, with a higher reward.).
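For illustration only, the recited conversion of an initial reward into a new reward value lying between the initial reward value and an ideal reward value can be sketched in Python as follows. This is a minimal sketch under stated assumptions: neither the claim language nor Guo et al specifies a formula, so the linear interpolation, the fraction parameter, and the names relabel_reward and preprocess_trajectory are hypothetical.

```python
def relabel_reward(initial, ideal, fraction=0.5):
    """Return a reward strictly between `initial` and `ideal` (assumed linear rule)."""
    if not 0.0 < fraction < 1.0:
        raise ValueError("fraction must lie strictly between 0 and 1")
    if initial >= ideal:
        return initial  # already at or above the ideal value; nothing to raise
    return initial + fraction * (ideal - initial)


def preprocess_trajectory(trajectory, ideal, fraction=0.5):
    """Replace each (state, action, reward) reward with its relabeled value."""
    return [
        (state, action, relabel_reward(reward, ideal, fraction))
        for state, action, reward in trajectory
    ]
```

Under this assumed rule, every initial reward below the ideal value is replaced by a higher value that still does not reach the ideal value, and the pre-processing is applied to the stored trajectories before any training step.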
Guo et al discloses training a neural network for selecting actions or tasks to be performed by an agent (Fig. 1, label 100,108,104) via iterative training based on observations in the environment, or returns or outcomes of the agent’s actions (Fig. 1, label 110,112,190), and Huang et al discloses a task-based language model, where a set of tasks or actions are planned and performed by the language model (Fig. 1,3). Hence, it would be obvious to one skilled in the art before the effective filing date of the application to modify Huang et al’s language model performing planned tasks or actions by incorporating Guo et al’s iterative reinforcement training of a task decision-making neural network or policy neural network so as to improve the selection of actions, hence improving the performance of the agent, such as a language model.
Claim 20, Huang et al discloses the operation is a state transition in the environment and the result includes a new state and/or a reward (Fig. 3, label tabletop rearrangement (real), “Success detector: false” indicates the state transition, where success detection is part of the operation. The success detector outcome is a result of the operation and indicates a new state of the operation.).
Claim 22, Huang et al discloses detecting the well-formed action by parsing using a pointer to indicate an end of a last well-formed action (Fig. 1, label “coke please.”, where the well-formed actions are “action: pick up the coke” and “action: bring it to you”. Parsing, or determining whether the action was successful or not, indicates a pointer to an end of a last well-formed action, i.e., the action “pick up the coke” or “bring it to you” is completed or ended.).
Claim 23, Huang et al discloses the last well-formed action is included in a regular expression. (Fig. 1, label “pick up the coke” and “bring it to you” as the last well-formed action includes a regular expression.)
Claim 24 recites similar limitations as recited in claim 12 and is rejected on the same grounds as claim 12.
Claim(s) 3,9,15,21 is/are rejected under 35 U.S.C. 103 as being unpatentable over Huang et al (NPL: Inner Monologue: Embodied Reasoning through Planning with Language Models) in view of Guo et al (US Publication No.: 202502093338), further in view of Ouyang et al (NPL: Training language models to follow instructions with human feedback).
Claim 3, Huang et al discloses using a large language model for real world tabletop rearrangement as shown in Fig. 3 (page 15, Section A.2, Large Language Model), wherein the LLM is InstructGPT and is trained using training data (Section A.2, Reference 91’s Section 3.2 discloses training data used to train the LLM.), but fails to disclose training data comprising randomly generated crops of sub-sequences.
Ouyang et al discloses InstructGPT trained using training data comprising randomly generated crops of sub-sequences (Section 3.2 discloses a prompt dataset or training data used to train initial InstructGPT models and a training set for fine-tuning InstructGPT. The dataset for training initial InstructGPT models comprises text prompts submitted by customers. The dataset used for fine-tuning comprises a supervised fine-tuning (SFT) dataset, a reward model (RM) dataset and a proximal policy optimization (PPO) dataset. Depending on the dataset as indicated above, such training data includes randomly generated data since labelers or customers generate the dataset without organization or purpose from the InstructGPT model. Depending on the labelers’ or customers’ input, the dataset includes sub-sequences such as sentences or any form of input as inputted by the customer or labeler.). It would be obvious to one skilled in the art before the effective filing date of the application for the language model as disclosed by Huang et al to include the training dataset as disclosed by Ouyang et al so as to improve truthfulness and reduce toxic output generation, hence improving the user’s experience with automation.
Claim 9, Huang et al discloses using a large language model for real world tabletop rearrangement as shown in Fig. 3 (page 15, Section A.2, Large Language Model), wherein the LLM is InstructGPT and is trained using training data (Section A.2, Reference 91’s Section 3.2 discloses training data used to train the LLM.), but fails to disclose training data comprising randomly generated crops of sub-sequences.
Ouyang et al discloses InstructGPT trained using training data comprising randomly generated crops of sub-sequences (Section 3.2 discloses a prompt dataset or training data used to train initial InstructGPT models and a training set for fine-tuning InstructGPT. The dataset for training initial InstructGPT models comprises text prompts submitted by customers. The dataset used for fine-tuning comprises a supervised fine-tuning (SFT) dataset, a reward model (RM) dataset and a proximal policy optimization (PPO) dataset. Depending on the dataset as indicated above, such training data includes randomly generated data since labelers or customers generate the dataset without organization or purpose from the InstructGPT model. Depending on the labelers’ or customers’ input, the dataset includes sub-sequences such as sentences or any form of input as inputted by the customer or labeler.). It would be obvious to one skilled in the art before the effective filing date of the application for the language model as disclosed by Huang et al to include the training dataset as disclosed by Ouyang et al so as to improve truthfulness and reduce toxic output generation, hence improving the user’s experience with automation.
Claim 15 recites similar limitations as recited in claim 9 and is rejected on the same grounds as claim 9.
Claim 21 recites similar limitations as recited in claim 9 and is rejected on the same grounds as claim 9.
Conclusion
Any inquiry concerning this communication or earlier communications from the examiner should be directed to LINDA WONG whose telephone number is (571)272-6044. The examiner can normally be reached 9-5.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Andrew C Flanders, can be reached at 571-272-7516. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.
/LINDA WONG/Primary Examiner, Art Unit 2655