DETAILED ACTION
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA. In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 is incorrect, any correction of the statutory basis (i.e., changing from AIA to pre-AIA) for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
Status of Claims
Claims 1-16 are currently pending and are examined herein.
Joint Inventors
This application currently names joint inventors. In considering patentability of the claims the examiner presumes that the subject matter of the various claims was commonly owned as of the effective filing date of the claimed invention(s) absent any evidence to the contrary. Applicant is advised of the obligation under 37 CFR 1.56 to point out the inventor and effective filing dates of each claim that was not commonly owned as of the effective filing date of the later invention in order for the examiner to consider the applicability of 35 U.S.C. 102(b)(2)(C) for any potential 35 U.S.C. 102(a)(2) prior art against the later invention.
Domestic Benefit
Acknowledgment is made of applicant’s claim for domestic benefit to provisional application 63/647,926. The provisional application is nearly identical to the non-provisional application and provides support for the claims being examined herein; therefore, the effective filing date being considered by the examiner is 15 May 2024.
Information Disclosure Statement
The information disclosure statement (IDS) submitted on 16 July 2024 has been considered by the examiner.
Claim Interpretation
The following is a quotation of 35 U.S.C. 112(f):
(f) Element in Claim for a Combination. – An element in a claim for a combination may be expressed as a means or step for performing a specified function without the recital of structure, material, or acts in support thereof, and such claim shall be construed to cover the corresponding structure, material, or acts described in the specification and equivalents thereof.
The claims in this application are given their broadest reasonable interpretation using the plain meaning of the claim language in light of the specification as it would be understood by one of ordinary skill in the art. The broadest reasonable interpretation of a claim element (also commonly referred to as a claim limitation) is limited by the description in the specification when 35 U.S.C. 112(f) is invoked.
As explained in MPEP § 2181, subsection I, claim limitations that meet the following three-prong test will be interpreted under 35 U.S.C. 112(f):
(A) the claim limitation uses the term “means” or “step” or a term used as a substitute for “means” that is a generic placeholder (also called a nonce term or a non-structural term having no specific structural meaning) for performing the claimed function;
(B) the term “means” or “step” or the generic placeholder is modified by functional language, typically, but not always linked by the transition word “for” (e.g., “means for”) or another linking word or phrase, such as “configured to” or “so that”; and
(C) the term “means” or “step” or the generic placeholder is not modified by sufficient structure, material, or acts for performing the claimed function.
Use of the word “means” (or “step”) in a claim with functional language creates a rebuttable presumption that the claim limitation is to be treated in accordance with 35 U.S.C. 112(f). The presumption that the claim limitation is interpreted under 35 U.S.C. 112(f) is rebutted when the claim limitation recites sufficient structure, material, or acts to entirely perform the recited function.
Absence of the word “means” (or “step”) in a claim creates a rebuttable presumption that the claim limitation is not to be treated in accordance with 35 U.S.C. 112(f). The presumption that the claim limitation is not interpreted under 35 U.S.C. 112(f) is rebutted when the claim limitation recites function without reciting sufficient structure, material or acts to entirely perform the recited function.
Claim limitations in this application that use the word “means” (or “step”) are being interpreted under 35 U.S.C. 112(f) except as otherwise indicated in an Office action. Conversely, claim limitations in this application that do not use the word “means” (or “step”) are not being interpreted under 35 U.S.C. 112(f), except as otherwise indicated in an Office action.
This application includes one or more claim limitations that do not use the word “means,” but are nonetheless being interpreted under 35 U.S.C. 112(f) because the claim limitation(s) uses a generic placeholder that is coupled with functional language without reciting sufficient structure to perform the recited function and the generic placeholder is not preceded by a structural modifier. Such claim limitation(s) is/are:
“input interface” in Claim 1.
“LLM decoder” in Claim 1.
“multimodal transformer” in Claim 7.
Because this/these claim limitation(s) is/are being interpreted under 35 U.S.C. 112(f) it/they is/are being interpreted to cover the corresponding structure described in the specification as performing the claimed function, and equivalents thereof.
“input interface” is shown on FIG. 7 (reference number 700); however, no structure for “input interface” was found.
No structure for “LLM decoder” was found.
No structure for “multimodal transformer” was found.
If applicant does not intend to have this/these limitation(s) interpreted under 35 U.S.C. 112(f), applicant may: (1) amend the claim limitation(s) to avoid it/them being interpreted under 35 U.S.C. 112(f) (e.g., by reciting sufficient structure to perform the claimed function); or (2) present a sufficient showing that the claim limitation(s) recite(s) sufficient structure to perform the claimed function so as to avoid it/them being interpreted under 35 U.S.C. 112(f).
Drawings
The drawings are objected to for the following informalities:
In FIG. 1A, there is a line (part of an arrow) over the word “actions” (box with reference number 103).
In FIG. 1B, FIG. 3, FIG. 4A, FIG. 4B, and FIG. 5 there is a header overlapping the figure.
Corrected drawing sheets in compliance with 37 CFR 1.121(d) are required in reply to the Office action to avoid abandonment of the application. Any amended replacement drawing sheet should include all of the figures appearing on the immediate prior version of the sheet, even if only one figure is being amended. The figure or figure number of an amended drawing should not be labeled as “amended.” If a drawing figure is to be canceled, the appropriate figure must be removed from the replacement sheet, and where necessary, the remaining figures must be renumbered and appropriate changes made to the brief description of the several views of the drawings for consistency. Additional replacement sheets may be necessary to show the renumbering of the remaining figures. Each drawing sheet submitted after the filing date of an application must be labeled in the top margin as either “Replacement Sheet” or “New Sheet” pursuant to 37 CFR 1.121(d). If the changes are not accepted by the examiner, the applicant will be notified and informed of any required corrective action in the next Office action. The objection to the drawings will not be held in abeyance.
Claim Objections
The claims are objected to because of the following informalities:
Claims 1, 10, and 16: “the multimodal instructions” lacks antecedent basis (was introduced as “receiving a plurality of multimodal inputs each specifying instructions in a different modality”).
Claims 3 and 11: “its output” should be “[[its]] an output”.
Claims 4 and 12: “whose probability of feasibility is maximum” should be reworded to remove the word “whose” (for example, “selecting an action candidate [[whose]] with a maximum probability of feasibility”).
Claim 4: “the most feasible action candidate” should be “[[the]] a most feasible action candidate”.
Claim 7: “the learnable tokens” should be “the trainable [[learnable]] tokens”.
Claims 9 and 15: “the modalities” lacks antecedent basis (was introduced as “a different modality”).
Claims 9 and 15: “the instructions specified by the multimodal inputs” lacks antecedent basis (was introduced as “receiving a plurality of multimodal inputs each specifying instructions in a different modality” and “the multimodal instructions”).
Claim 12: “one or more observations”, “the observation”, and “the observations” have inconsistent antecedent basis and should be corrected.
Claim 12: “and action candidates” should be “and the plurality of action candidates”.
Appropriate corrections are required.
Claim Rejections - 35 USC § 112
The following is a quotation of 35 U.S.C. 112(b):
(b) CONCLUSION.—The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the inventor or a joint inventor regards as the invention.
Claims 1-9 are rejected under 35 U.S.C. 112(b) as being indefinite for failing to particularly point out and distinctly claim the subject matter which the inventor or a joint inventor regards as the invention. These claims refer to “a multimodal large language model (LLM)”, “a multimodal LLM encoder”, “an action sequence decoder”, and “a query-transformer (Q-Former)”; although these limitations do not invoke 35 U.S.C. 112(f), one of ordinary skill in the art would not know what infringes these apparatus claims because it is unclear, under the broadest reasonable interpretation in view of the specification, whether any structure is required to infringe. One of ordinary skill would consider that these functions may be merely software (no rejection under 35 U.S.C. 101 for a signal per se has been made because the first line recites “including circuitry”, so the claim as a whole has minimal structure). Furthermore, Claim 4 introduces “an action evaluator module” stored on a memory and executed by one or more processors, which further implies that a memory or processors (i.e., any structure) are not necessarily required for the other recited functions. For the purposes of compact prosecution, the examiner will assume that any apparatus capable of performing the recited functions would read on the claims. Appropriate corrections are required.
Claim 2 is further rejected under 35 U.S.C. 112(b) as being indefinite for failing to particularly point out and distinctly claim the subject matter which the inventor or a joint inventor regards as the invention. Claim 1 recites “the LLM decoder is configured to decode the encodings into a sequence of actions”. Claim 2 recites “wherein to decode the encodings into a sequence of actions, the LLM decoder is configured to decode the encodings into a sequence of robotic instructions and wherein the robotic controller further comprises an action sequence decoder trained with machine learning to transform the sequence of robotic instructions generated by the LLM decoder into a sequence of actions based on a library of robotic skills”. It is indefinite how both “the LLM decoder is configured to decode the encodings into a sequence of actions” and “an action sequence decoder trained with machine learning to transform the sequence of robotic instructions generated by the LLM decoder into a sequence of actions based on a library of robotic skills” can simultaneously be true: in the independent claim, the LLM decoder decodes the encodings into the sequence of actions, whereas the dependent claim offers an alternative in which the action sequence decoder produces the actions. Furthermore, it is indefinite whether “transform the sequence of robotic instructions generated by the LLM decoder into a sequence of actions based on a library of robotic skills” should be “transform the sequence of robotic instructions generated by the LLM decoder into [[a]] the sequence of actions based on a library of robotic skills” or something different. For the purposes of compact prosecution, the examiner will assume that an LLM decoder or an action sequence decoder that can generate a sequence of actions based on a library of robotic skills would read on this claim. Appropriate corrections are required.
Claims 4-6 are further rejected under 35 U.S.C. 112(b) as being indefinite for failing to particularly point out and distinctly claim the subject matter which the inventor or a joint inventor regards as the invention. Claim 4 recites “collect a plurality of candidates for each action in the sequence of actions generated by the action sequence decoder”, but depends only from Claim 1. Accordingly, it is indefinite whether Claim 4 should depend from Claim 2 (and would therefore have the same rejection as Claim 2 above) or whether the limitation should instead be “collect a plurality of candidates for each action in the sequence of actions [[generated by the action sequence decoder]]” (i.e., the sequence of actions recited in Claim 1). Claims 5 and 6 depend from Claim 4 and are rejected for the same reason. For the purposes of compact prosecution, the examiner will interpret the limitation as referring to the sequence of actions recited in Claim 1. Appropriate corrections are required.
Claims 12-14 are rejected under 35 U.S.C. 112(b) as being indefinite for failing to particularly point out and distinctly claim the subject matter which the inventor or a joint inventor regards as the invention. Claim 12 depends from Claim 10 and is therefore directed to a method; however, its limitations are written as if Claim 10 were directed to an apparatus or a system (they further limit the multimodal LLM, which appears only in the preamble of Claim 10). The claim is indefinite because one of ordinary skill in the art would not know whether the system-directed limitations must be met in order to infringe the method of Claim 10. Claims 13 and 14 depend from Claim 12 and are rejected accordingly. For the purposes of compact prosecution, the examiner will treat these limitations as directed to the preamble / an intended use. Appropriate corrections are required.
Claims 1-9 are further rejected because claim limitations “input interface”, “LLM decoder”, and “multimodal transformer” invoke 35 U.S.C. 112(f). However, the written description fails to disclose the corresponding structure, material, or acts for performing the entire claimed function and to clearly link the structure, material, or acts to the function. The disclosure is devoid of any structure that performs the claimed functions. Therefore, the claims are indefinite and are rejected under 35 U.S.C. 112(b).
Applicant may:
(a) Amend the claim so that the claim limitation will no longer be interpreted as a limitation under 35 U.S.C. 112(f);
(b) Amend the written description of the specification such that it expressly recites what structure, material, or acts perform the entire claimed function, without introducing any new matter (35 U.S.C. 132(a)); or
(c) Amend the written description of the specification such that it clearly links the structure, material, or acts disclosed therein to the function recited in the claim, without introducing any new matter (35 U.S.C. 132(a)).
If applicant is of the opinion that the written description of the specification already implicitly or inherently discloses the corresponding structure, material, or acts and clearly links them to the function so that one of ordinary skill in the art would recognize what structure, material, or acts perform the claimed function, applicant should clarify the record by either:
(a) Amending the written description of the specification such that it expressly recites the corresponding structure, material, or acts for performing the claimed function and clearly links or associates the structure, material, or acts to the claimed function, without introducing any new matter (35 U.S.C. 132(a)); or
(b) Stating on the record what the corresponding structure, material, or acts, which are implicitly or inherently set forth in the written description of the specification, perform the claimed function. For more information, see 37 CFR 1.75(d) and MPEP §§ 608.01(o) and 2181.
Double Patenting
The nonstatutory double patenting rejection is based on a judicially created doctrine grounded in public policy (a policy reflected in the statute) so as to prevent the unjustified or improper timewise extension of the “right to exclude” granted by a patent and to prevent possible harassment by multiple assignees. A nonstatutory double patenting rejection is appropriate where the conflicting claims are not identical, but at least one examined application claim is not patentably distinct from the reference claim(s) because the examined application claim is either anticipated by, or would have been obvious over, the reference claim(s). See, e.g., In re Berg, 140 F.3d 1428, 46 USPQ2d 1226 (Fed. Cir. 1998); In re Goodman, 11 F.3d 1046, 29 USPQ2d 2010 (Fed. Cir. 1993); In re Longi, 759 F.2d 887, 225 USPQ 645 (Fed. Cir. 1985); In re Van Ornum, 686 F.2d 937, 214 USPQ 761 (CCPA 1982); In re Vogel, 422 F.2d 438, 164 USPQ 619 (CCPA 1970); In re Thorington, 418 F.2d 528, 163 USPQ 644 (CCPA 1969).
A timely filed terminal disclaimer in compliance with 37 CFR 1.321(c) or 1.321(d) may be used to overcome an actual or provisional rejection based on nonstatutory double patenting provided the reference application or patent either is shown to be commonly owned with the examined application, or claims an invention made as a result of activities undertaken within the scope of a joint research agreement. See MPEP § 717.02 for applications subject to examination under the first inventor to file provisions of the AIA as explained in MPEP § 2159. See MPEP § 2146 et seq. for applications not subject to examination under the first inventor to file provisions of the AIA . A terminal disclaimer must be signed in compliance with 37 CFR 1.321(b).
The filing of a terminal disclaimer by itself is not a complete reply to a nonstatutory double patenting (NSDP) rejection. A complete reply requires that the terminal disclaimer be accompanied by a reply requesting reconsideration of the prior Office action. Even where the NSDP rejection is provisional the reply must be complete. See MPEP § 804, subsection I.B.1. For a reply to a non-final Office action, see 37 CFR 1.111(a). For a reply to final Office action, see 37 CFR 1.113(c). A request for reconsideration while not provided for in 37 CFR 1.113(c) may be filed after final for consideration. See MPEP §§ 706.07(e) and 714.13.
The USPTO Internet website contains terminal disclaimer forms which may be used. Please visit www.uspto.gov/patent/patents-forms. The actual filing date of the application in which the form is filed determines what form (e.g., PTO/SB/25, PTO/SB/26, PTO/AIA /25, or PTO/AIA /26) should be used. A web-based eTerminal Disclaimer may be filled out completely online using web-screens. An eTerminal Disclaimer that meets all requirements is auto-processed and approved immediately upon submission. For more information about eTerminal Disclaimers, refer to www.uspto.gov/patents/apply/applying-online/eterminal-disclaimer.
Claims 1-3 and 7-10 are provisionally rejected on the ground of nonstatutory double patenting as being unpatentable over Claims 1-3, 5-7, and 10-11 of copending Application No. 19/069,490 (reference application). Below is a claim-by-claim comparison between the present application and reference application 19/069,490, wherein the differences have been bolded. This is a provisional nonstatutory double patenting rejection because the patentably indistinct claims have not in fact been patented.
Instant Application: 18/773,853
Reference Application: 19/069,490
1. A robotic controller including circuitry, comprising:
an input interface configured to receive a plurality of multimodal inputs each specifying instructions in a different modality;
a multimodal large language model (LLM) including a multimodal LLM encoder and an LLM decoder, wherein the multimodal LLM encoder is trained with machine learning to transform the multimodal instructions into encodings and the LLM decoder is configured to decode the encodings into a sequence of actions; and
a trajectory controller configured to control a robot according to the sequence of actions.
1. A robotic controller including circuitry, comprising:
at least one input interface configured to receive a plurality of multimodal inputs, each input of the plurality of multimodal inputs specifying instructions for a task in a different modality;
a memory configured to store computer executable instructions and a multimodal large language model (LLM) including a multimodal LLM encoder and an LLM decoder, a feedback encoder, and a first query-transformer (Q-Former);
a processor configured to execute the instructions to:
transform using the multimodal LLM encoder, the plurality of multimodal inputs into a plurality of encodings;
decode using the LLM decoder, the plurality of encodings into a first sequence of actions and a robot action description (natural language action description) aligned to the first sequence of actions;
receive a feedback input corresponding to at least one action in the first sequence of actions;
encode using the feedback encoder, the robot action description and the feedback input to generate encoded feedback data;
generate using the first Q-Former, multimodal features for the LLM decoder based on the encodings of the multimodal LLM encoder; and
generate, using the LLM decoder, a second sequence of actions based on the encoded feedback data and the multimodal features; and
a trajectory controller operatively coupled to the processor, the trajectory controller configured to control a robot according to the second sequence of actions.
2. The robotic controller of claim 1, wherein to decode the encodings into a sequence of actions, the LLM decoder is configured to decode the encodings into a sequence of robotic instructions and wherein the robotic controller further comprises an action sequence decoder trained with machine learning to transform the sequence of robotic instructions generated by the LLM decoder into a sequence of actions based on a library of robotic skills.
2. The robotic controller of claim 1,
wherein to decode the plurality of encodings into the first sequence of actions, the processor is configured to execute the LLM decoder to decode the plurality of encodings into a sequence of robotic instructions, and
wherein the robotic controller further comprises an action sequence decoder trained with machine learning to transform the sequence of robotic instructions into a sequence of robotic actions based on a library of robotic skills.
3. The robotic controller of claim 1, further comprising:
a query-transformer (Q-Former) trained with machine learning to translate the encodings of the multimodal LLM encoder into an instruction conditioning the LLM decoder to produce its output structured in a format compatible with the trajectory controller.
3. The robotic controller of claim 1, further comprising:
a second Q-Former trained with machine learning to translate the plurality of encodings of the multimodal LLM encoder into an instruction conditioning the LLM decoder to produce its output structured in a format compatible with the trajectory controller.
7. The robotic controller of claim 3, wherein the Q-Former comprises a multimodal transformer trained with trainable tokens and a text transformer that shares the same self-attention layers with the multimodal transformer, and wherein the multimodal transformer is configured to compute cross-attention between the learnable tokens and the encodings of the multimodal LLM encoder and output a latent vector of the encodings of the multimodal LLM encoder.
5. The robotic controller of claim 3, wherein the second Q-Former comprises a multimodal transformer trained with trainable tokens and a text transformer that shares a same self-attention layers with the multimodal transformer, and wherein the multimodal transformer is configured to compute cross-attention between learnable tokens and the plurality of encodings of the multimodal LLM encoder and output a latent vector of the plurality of encodings.
8. The robotic controller of claim 1, wherein the sequence of actions corresponds to a sequence of dynamic movement primitives (DMPs) to be executed by the robot.
6. The robotic controller of claim 1, wherein the second sequence of actions corresponds to a sequence of dynamic movement primitives (DMPs) to be executed by the robot.
9. The robotic controller of claim 1, wherein the modalities of the instructions specified by the multimodal inputs include a video modality, an audio modality, and a text modality.
7. The robotic controller of claim 1, wherein the modalities of the instructions specified by the multimodal inputs include a video modality, an audio modality, and a text modality.
10. A computer-implemented method for applying a robotic controller including a multimodal large language model (LLM), an action sequence decoder trained with machine learning, and a trajectory controller for controlling a robot according to a sequence of actions, the method comprising:
receiving a plurality of multimodal inputs each specifying instructions in a different modality;
transforming the multimodal instructions into encodings using a multimodal LLM encoder of the multimodal LLM that is trained with machine learning;
decoding the encodings into a sequence of robotic instructions using an LLM decoder of the multimodal LLM;
transforming the sequence of robotic instructions into a sequence of actions based on a library of robotic skills, using the action sequence decoder; and
controlling the robot according to the sequence of actions using the trajectory controller.
10. A computer-implemented method for controlling a robot, the method comprising:
receiving a plurality of multimodal inputs, each input of the plurality of multimodal inputs specifying instructions for a task in a different modality;
transforming the multimodal instructions into a plurality of encodings using a multimodal large language model (LLM) encoder of a multimodal LLM that is trained with machine learning;
decoding, using an LLM decoder of the multimodal LLM, the plurality of encodings into a first sequence of actions and a robot action description (natural language action description) aligned to the first sequence of actions;
receiving a feedback input corresponding to at least one action in the first sequence of actions;
encoding, using a feedback encoder, the robot action description and the feedback input to generate encoded feedback data;
generating, using a first query-transformer (Q-Former), multimodal features for the LLM decoder based on the encodings of the multimodal LLM encoder; and
generating, using the LLM decoder, a second sequence of actions based on the encoded feedback data and the multimodal features; and
controlling the robot according to the second sequence of actions.
11. The computer-implemented method of claim 10, wherein the decoding the plurality of encodings into the first sequence of actions comprises:
decoding the plurality of encodings into a sequence of robotic instructions; and
transforming the sequence of robotic instructions into a sequence of robotic actions based on a library of robotic skills.
As shown above, the primary difference between the present claims and the claims of the reference application is that the reference claims are narrower. The remaining differences are merely wording-related, and the Office takes Official Notice that one of ordinary skill in the art would understand those differences to be insignificant, such that the claims are not patentably distinct and nonstatutory double patenting results.
Claim Rejections - 35 USC § 102
The following is a quotation of the appropriate paragraphs of 35 U.S.C. 102 that form the basis for the rejections under this section made in this Office action:
A person shall be entitled to a patent unless –
(a)(1) the claimed invention was patented, described in a printed publication, or in public use, on sale, or otherwise available to the public before the effective filing date of the claimed invention.
(a)(2) the claimed invention was described in a patent issued under section 151, or in an application for patent published or deemed published under section 122(b), in which the patent or application, as the case may be, names another inventor and was effectively filed before the effective filing date of the claimed invention.
Claims 1-2, 4-6, 9-10, and 15-16 are rejected under 35 U.S.C. 102(a)(1)/(a)(2) as being anticipated by U.S. Pub. No. 2023/0311335 (Hausman et al., hereinafter, Hausman).
Regarding Claim 1, Hausman discloses A robotic controller including circuitry (see at least FIG. 5 and FIG. 6), comprising:
an input interface configured to receive a plurality of multimodal inputs each specifying instructions in a different modality (see at least [0060]-[0062] and [0112]: “The instruction can be a free-form natural language instruction generated based on user interface input that is provided by a user via one or more user interface input devices.”; “the LLM engine 130 generates an LLM prompt 205A based on the FF NL input 105”; “the LLM engine 130 can optionally generate the LLM prompt 205A further based on one or more of scene descriptor(s) 202A of a current environment of the robot 110, prompt example(s) 203A, and/or an explanation 204A”);
a multimodal large language model (LLM) (see at least FIG. 2A and FIG. 2B: LLM 150 is included in the process which makes the entire process a multimodal large language model) including a multimodal LLM encoder (see at least [0003]) and an LLM decoder (see at least [0003]), wherein the multimodal LLM encoder is trained with machine learning to transform the multimodal instructions into encodings (see at least [0003], [0040], [0062], FIG. 2A, and FIG. 2B: the LLM prompt, which came from instructions is input to the LLM; “a pre-trained large sentence encoder language model can be utilized”; “large language models (LLMs) have been developed that are trained on massive amounts of data”; “The scene descriptor(s) 202A can include NL descriptor(s) of object(s) currently or recently detected in the environment with the robot 110, such as descriptor(s) of object(s) determined based on processing image(s) or other vision data using object detection and classification machine learning model(s). For example, the scene descriptor(s) 202A can include “pear”, “keys”, “human”, “table”, “sink”, and “countertops” and the LLM engine 130 can generate the LLM prompt 205A to incorporate one or more of such descriptors. For instance, the LLM prompt 205A can be “A pear, keys, a human, a table, a sink, and countertops are nearby. How would you bring the human a snack from the table. I would 1.”.”) and the LLM decoder is configured to decode the encodings into a sequence of actions (see at least [0063], FIG. 2A, and FIG. 2B: “For example, the prior LLM prompt could be “explain how you would bring me a snack from the table”, and the explanation 204A can be generated based on the highest probability decoding from the prior LLM output. For instance, the explanation 204A can be “I would find the table, then find a snack on the table, then bring it to you”. The explanation 204A can be prepended to the LLM prompt 205A, replace term(s) of the FF NL input 105 in the LLM prompt 205A, or otherwise incorporated into the LLM prompt 205A.”); and
a trajectory controller configured to control a robot according to the sequence of actions (see at least [0071], [0112], FIG. 5, and FIG. 6: “ The method further includes, in response to determining to implement the robotic skill, causing the robot to implement the robotic skill in the current environment.”; “the implementation engine 136 controls the robot 110 based on the selected robotic skill A. For example, the implementation engine 136 can control the robot using a navigation policy with a navigation target of “table” (or of a location corresponding to a “table”)”).
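For illustration of the mapping above, the following is a minimal conceptual sketch of the pipeline recited in Claim 1 (input interface, multimodal LLM encoder, LLM decoder, trajectory controller); all class and function names and all data values are hypothetical placeholders and are not drawn from Hausman or from applicant's disclosure.

```python
# Illustrative sketch only; hypothetical names, not Hausman's implementation.
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class MultimodalInput:
    modality: str   # e.g., "text", "audio", "video"
    data: object    # raw instruction content in that modality


def encode_inputs(inputs: List[MultimodalInput]) -> List[Dict]:
    """Stand-in for the multimodal LLM encoder: instructions -> encodings."""
    return [{"modality": x.modality, "encoding": hash(str(x.data))} for x in inputs]


def decode_to_actions(encodings: List[Dict]) -> List[str]:
    """Stand-in for the LLM decoder: encodings -> sequence of actions.

    In Hausman, the LLM output models a probability distribution over
    candidate word compositions (e.g., "go to the table", "pick up a pear").
    """
    return ["go to the table", "pick up a pear", "bring it to the human"]


def control_robot(actions: List[str]) -> None:
    """Stand-in for the trajectory controller executing the action sequence."""
    for action in actions:
        print(f"executing: {action}")


inputs = [MultimodalInput("text", "bring me a snack from the table")]
control_robot(decode_to_actions(encode_inputs(inputs)))
```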
Regarding Claim 2, Hausman discloses the limitations of Claim 1. Furthermore, Hausman discloses wherein to decode the encodings into a sequence of actions, the LLM decoder is configured to decode the encodings into a sequence of robotic instructions (see at least [0076], FIG. 2A, and FIG. 2B: “the LLM output 206B can model a probability distribution, over candidate word compositions, and is dependent on the LLM prompt 205B”) and wherein the robotic controller further comprises an action sequence decoder trained with machine learning to transform the sequence of robotic instructions generated by the LLM decoder into a sequence of actions based on a library of robotic skills (see at least [0008], [0041], [0065], [0069], and [0117]: “The task grounding engine 132 generates task-grounding measures 208A and generates the task-grounding measures 208A based on the LLM output 206A and skill descriptions 207. Each of the skill descriptions 207 is descriptive of a corresponding skill that the robot 110 is configured to perform. For example, “go to the table” can be descriptive of a “navigate to table” skill that the robot can perform by utilizing a trained navigation policy with a navigation target of “table” (or of a location corresponding to a “table”). As another example, “go to the sink” can be descriptive of a “navigate to sink” skill that the robot can perform by utilizing the trained navigation policy with a navigation target of “sink” (or of a location corresponding to a “sink”). As yet another example, “pick up a bottle” can be descriptive of a “grasp a bottle” skill that the robot can perform utilizing grasping heuristics fine-tuned to a bottle and/or using a trained grasping network. As yet another example, “pick up keys” can be descriptive of a “grasp keys” skill that the robot can perform utilizing grasping heuristics fine-tuned to keys and/or using a trained grasping network. Although skill descriptions 207 includes descriptors for skills A-G in FIG. 2A, skill descriptions 207 can include descriptors for additional skills in various implementations (as indicated by the ellipsis). Such additional skills can correspond to alternative objects and/or can be for different types of robotic action(s) (e.g., “place”, “push”, “open”, “close”).”; “In some implementations, the world-grounding engine 134 can additionally or alternatively, for some robotic skill(s), generate world-grounding measures based on a corresponding one of the value function model(s) 152 that is a trained value function model.”; “In some implementations, generating, based on the robotic skill and the current environmental state data, the world-grounding measure, includes processing the robotic skill and the current environmental state data, using a trained value function, to generate value function output that comprises the world-grounding measure. In some versions of those implementations, the current environmental state data includes vision data (e.g., a multi-channel image), of the sensor data, the vision data being captured by one or more vision components of the one or more sensor components of the robot. In some additional or alternative versions of those implementations, the trained value function is a language-conditioned value function and processing the robotic skill using the trained value function comprises processing the skill description of the robotic skill. 
In some additional or alternative versions of those implementations, the trained value function is trained to correspond to an affordance function, and the value function output specifies whether the robotic skill is possible based on the current environmental state data. In some additional or alternative versions of those implementations, the value function is a machine learning model trained using reinforcement learning”).
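For illustration of the skill-library mapping discussed above, a minimal sketch follows; the dictionary entries paraphrase the skill descriptions quoted from Hausman [0065], while the function name, structure, and policy labels are hypothetical placeholders.

```python
# Illustrative sketch only; hypothetical names, not Hausman's implementation.
# Hausman [0065] describes skill descriptions such as "go to the table"
# corresponding to a navigation policy with the navigation target "table".
SKILL_LIBRARY = {
    "go to the table": ("navigation_policy", {"target": "table"}),
    "go to the sink": ("navigation_policy", {"target": "sink"}),
    "pick up a pear": ("grasping_policy", {"object": "pear"}),
    "pick up keys": ("grasping_policy", {"object": "keys"}),
}


def instructions_to_actions(robotic_instructions):
    """Stand-in for an action sequence decoder: map each decoded robotic
    instruction to an executable action drawn from a library of robotic skills."""
    actions = []
    for instruction in robotic_instructions:
        if instruction in SKILL_LIBRARY:
            actions.append(SKILL_LIBRARY[instruction])
    return actions


print(instructions_to_actions(["go to the table", "pick up a pear"]))
```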
Regarding Claim 4, Hausman discloses the limitations of Claim 1. Furthermore, Hausman discloses further comprising:
a memory configured to store an action evaluator module (see at least FIG. 6); and
one or more processors configured to execute the action evaluator module (see at least FIG. 6) to:
collect a plurality of action candidates for each action in the sequence of actions generated by the action sequence decoder (see at least FIG. 2A: skill descriptions 207);
collect one or more first observations of an environment of the robot and one or more second observations of the robot (see at least [0067]: “The world-grounding engine 134 generates world-grounding measures 211A for the robotic skills. In generating the world-grounding measures 211A for at least some of the robotic skills, the world-grounding engine 134 can generate the world-grounding measure based on environmental state data 209A and, optionally, further based on robot state data 210A and/or corresponding ones of the skill descriptions 207. Further, in generating the world-grounding measures 211A for at least some of the robotic skills, the world-grounding engine 134 can utilize one or more of the value function model(s) 152.”);
collect a text prompt associated with at least one of the one or more first observations or the one or more second observations and the plurality of action candidates (see at least [0026] and [0070]: “The world-grounding measures 211A are generated based on the state of the robot 110 as reflected in the birds-eye view of FIG. 1B. Namely, when the robot 110 is still quite a distance away from the pear 184A and the keys 184B. Accordingly, world-grounding measures F and G, for “pick up a pear” and “pick up keys” respectively, are both relatively low (“0.10”). This reflects the low probability of an attempted grasp of either item being successful, due to the large distance between the robot 110 and the pear 184A and the keys 184B. The world-grounding measure A, “0.80”, for “go to the table” reflects lower probability than does the world-grounding measure B, “0.85” for “go to the sink”. This can be based on the robot 111 being closer to the sink 193 than it is to the table 194.”; “Some of those implementations, instead of only prompting an LLM to simply interpret an FF NL high-level instruction, utilize the LLM output generated by the prompting to generate task-grounding measures that each quantify the likelihood that a corresponding robotic skill makes progress towards completing the high-level instruction. Further, a corresponding affordance function (e.g., a learned value function) for each of the robotic skills can be utilized to generate a world-grounding measure, for the robotic skill, that that quantifies how likely it is to succeed from the current stat.”);
compute a probability of feasibility for each action candidate of the plurality of action candidates, based on the one or more first observations and the one or more second observations and the text prompt (see at least [0070]-[0072], FIG. 2A, and FIG. 2B: “The world-grounding measures 211A are generated based on the state of the robot 110 as reflected in the birds-eye view of FIG. 1B”; “In FIG. 2A, the selection engine 136 generates overall measures 212A by multiplying the world-grounding measures 211A and the task-grounding measures 208A, and selects the robotic skill A based on it having the highest of the overall measures 212A. It is noted that robotic skill A has the highest overall measure, despite it not having the highest world-grounding measure. Although FIG. 2A illustrates the overall measures 212A being generated by multiplying the world-grounding measures 211A and the task-grounding measures 208A, other techniques can be utilized in generating the overall measures 212A. For example, different weightings could be applied to the world-grounding measures 211A and the task-grounding measures 208A in the multiplying. For example, the world-grounding measures 211A can be weighted at 90% and the task-grounding measures 208A weighted at 100%.”); and
select, an action candidate from among the plurality of action candidates whose probability of feasibility is maximum among the plurality of action candidates, as the most feasible action candidate (see at least [0072]: “In FIG. 2A, the selection engine 136 generates overall measures 212A by multiplying the world-grounding measures 211A and the task-grounding measures 208A, and selects the robotic skill A based on it having the highest of the overall measures 212A. It is noted that robotic skill A has the highest overall measure, despite it not having the highest world-grounding measure. Although FIG. 2A illustrates the overall measures 212A being generated by multiplying the world-grounding measures 211A and the task-grounding measures 208A, other techniques can be utilized in generating the overall measures 212A. For example, different weightings could be applied to the world-grounding measures 211A and the task-grounding measures 208A in the multiplying. For example, the world-grounding measures 211A can be weighted at 90% and the task-grounding measures 208A weighted at 100%.”).
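For illustration of the feasibility computation discussed above, a minimal sketch follows; the combination of measures (multiplication, with optional weighting, followed by selecting the maximum) tracks the passage quoted from Hausman [0072], the world-grounding values are those quoted from [0070], and the task-grounding values, weights, and function names are hypothetical placeholders.

```python
# Illustrative sketch only; hypothetical names and placeholder task-grounding
# values. World-grounding values for skills A ("go to the table"),
# B ("go to the sink"), F ("pick up a pear"), and G ("pick up keys") are the
# figures quoted from Hausman [0070].
world_grounding = {"A": 0.80, "B": 0.85, "F": 0.10, "G": 0.10}
task_grounding = {"A": 0.50, "B": 0.20, "F": 0.15, "G": 0.05}  # placeholders


def select_most_feasible(task_g, world_g, w_task=1.0, w_world=1.0):
    """Compute an overall feasibility measure for each action candidate and
    return the candidate with the maximum measure (cf. Claim 4's candidate
    "whose probability of feasibility is maximum")."""
    overall = {k: (w_world * world_g[k]) * (w_task * task_g[k]) for k in task_g}
    best = max(overall, key=overall.get)
    return best, overall


best, overall = select_most_feasible(task_grounding, world_grounding)
print(best, overall)  # skill "A" has the highest overall measure here,
                      # even though it does not have the highest world-grounding
                      # measure, mirroring the selection described in [0072]
```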
Regarding Claim 5, Hausman discloses the limitations of Claim 4. Furthermore, Hausman discloses wherein the one or more processors are further configured to generate a refined sequence of actions based on the most feasible action candidate corresponding to each action in the sequence of actions generated by the action sequence decoder (see at least [0072]-[0074]: “In FIG. 2B, the LLM engine 130 generates an LLM prompt 205B based on the FF NL input 105 (“bring me a snack from the table”) and further based on the selected skill descriptor 201B (“go to the table”), for robotic skill A, based on robotic skill A being selected and provided for implementation.”).
Regarding Claim 6, Hausman discloses the limitations of Claim 5. Furthermore, Hausman discloses wherein the trajectory controller is configured to generate control commands to control the robot in accordance with the refined sequence of actions (see at least [0081] and FIG. 2B: “the implementation engine 136 controls the robot 110 based on the selected robotic skill F. For example, the implementation engine 136 can control the robot using a grasping policy, optionally with parameters fine-tuned for grasping a pear”).
Regarding Claim 9, Hausman discloses the limitations of Claim 1. Furthermore, Hausman discloses wherein the modalities of the instructions specified by the multimodal inputs include a video modality, an audio modality, and a text modality (see at least [0054]-[0055], [0058], [0060]-[0062], and [0106]: “User interface input devices 622 may include a keyboard, pointing devices such as a mouse, trackball, touchpad, or graphics tablet, a scanner, a touchscreen incorporated into the display, audio input devices such as voice recognition systems, microphones, and/or other types of input devices. In general, use of the term “input device” is intended to include all possible types of devices and ways to input information into computing device 610 or onto a communication network.”; “Robot 110 also includes a vision component 111 that can generate vision data (e.g., images) related to shape, color, depth, and/or other features of object(s) that are in the line of sight of the vision component 111. The vision component 111 can be, for example, a monocular camera, a stereographic camera (active or passive), and/or a 3D laser scanner”; “Also illustrated in FIG. 1B is an example current vision data instance 180 that can be captured in the environment and utilized, for example, in generating a world-grounding measure for a robotic skill. For example, robot 110 can capture the current vision data instance 180, using vision component 111. The vision data instance 180 captures a pear 184A and keys 184B that are both present on the round table represented by feature 194. It is noted that, in the birds-eye view, the pear and the keys are illustrated as dots for the sake of simplicity.”; “The scene descriptor(s) 202A can include NL descriptor(s) of object(s) currently or recently detected in the environment with the robot 110, such as descriptor(s) of object(s) determined based on processing image(s) or other vision data using object detection and classification machine learning model(s)”).
Regarding Claim 10, Hausman discloses A computer-implemented method for applying a robotic controller including a multimodal large language model (LLM), an action sequence decoder trained with machine learning, and a trajectory controller for controlling a robot according to a sequence of actions (see at least FIG. 2A, FIG. 2B, FIG. 5, and FIG. 6), the method comprising:
receiving a plurality of multimodal inputs each specifying instructions in a different modality (see at least [0060]-[0063]: “the LLM engine 130 generates an LLM prompt 205A based on the FF NL input 105 (“bring me a snack from the table”)”; “In some implementations, the LLM engine 130 can optionally generate the LLM prompt 205A further based on one or more of scene descriptor(s) 202A of a current environment of the robot 110, prompt example(s) 203A, and/or an explanation 204A.”);
transforming the multimodal instructions into encodings using a multimodal LLM encoder of the multimodal LLM that is trained with machine learning (see at least [0003], [0040], [0062], FIG. 2A, and FIG. 2B: the LLM prompt, which came from instructions is input to the LLM; “a pre-trained large sentence encoder language model can be utilized”; “large language models (LLMs) have been developed that are trained on massive amounts of data”; “The scene descriptor(s) 202A can include NL descriptor(s) of object(s) currently or recently detected in the environment with the robot 110, such as descriptor(s) of object(s) determined based on processing image(s) or other vision data using object detection and classification machine learning model(s). For example, the scene descriptor(s) 202A can include “pear”, “keys”, “human”, “table”, “sink”, and “countertops” and the LLM engine 130 can generate the LLM prompt 205A to incorporate one or more of such descriptors. For instance, the LLM prompt 205A can be “A pear, keys, a human, a table, a sink, and countertops are nearby. How would you bring the human a snack from the table. I would 1.”.”);
decoding the encodings into a sequence of robotic instructions using an LLM decoder of the multimodal LLM (see at least [0008]: “The generated LLM prompt can be processed, using the LLM, to generate LLM output that models a probability distribution, over candidate word compositions, that is dependent on the instruction. Continuing with the working example, the highest probability decoding of the LLM output can be, for example, “use a vacuum”. However, implementations disclosed herein do not simply blindly utilize the probability distribution of the LLM output in determining how to control a robot. Rather, implementations leverage the probability distribution of the LLM output, while also considering robotic skills that are actually performable by the robot, such as tens of, hundreds of, or thousands of pre-trained robotic skills.”);
transforming the sequence of robotic instructions into a sequence of actions based on a library of robotic skills, using the action sequence decoder (see at least [0065]-[0066], [0069], [0077], FIG. 2A, and FIG. 2B: “The task grounding engine 132 generates task-grounding measures 208B and generates the task-grounding measures 208B based on the LLM output 206A and skill descriptions 207. Each of the task-grounding measures 208B is generated based on a probability, of the corresponding skill description, in the LLM output 206B. For example, task-grounding measure A, “0.00”, reflects the probability of the word sequence “go to the table” in the LLM output 206B. As another example, task-grounding measure B, “0.10”, reflects the probability of the word sequence “go to the sink” in the LLM output 206B.”; “In some implementations, the world-grounding engine 134 can additionally or alternatively, for some robotic skill(s), generate world-grounding measures based on a corresponding one of the value function model(s) 152 that is a trained value function model. In some of those implementations, the trained value function model can be a language-conditioned model. For example, the world-grounding engine 134 can, in generating a world-grounding measure for a robotic skill, process, using the language-conditioned model, a corresponding one of the skill descriptions 207 for the robotic skill, along with the environmental state data 209A and optionally along with the robot state data 210A, to generate a value that reflects a probability of the robotic skill being successful based on the current state data. The world-grounding engine can generate the world-grounding measure based on the generated value (e.g., to conform to the value).”); and
controlling the robot according to the sequence of actions using the trajectory controller (see at least [0071] and [0081]: “the implementation engine 136 controls the robot 110 based on the selected robotic skill A. For example, the implementation engine 136 can control the robot using a navigation policy with a navigation target of “table” (or of a location corresponding to a “table”).”; “the implementation engine 136 controls the robot 110 based on the selected robotic skill F. For example, the implementation engine 136 can control the robot using a grasping policy, optionally with parameters fine-tuned for grasping a pear”).
Regarding Claim 15, Hausman discloses the limitations of Claim 10. Furthermore, Hausman discloses wherein the modalities of the instructions specified by the multimodal inputs include a video modality, an audio modality, and a text modality (see at least [0054]-[0055], [0058], [0060]-[0062], and [0106]: “User interface input devices 622 may include a keyboard, pointing devices such as a mouse, trackball, touchpad, or graphics tablet, a scanner, a touchscreen incorporated into the display, audio input devices such as voice recognition systems, microphones, and/or other types of input devices. In general, use of the term “input device” is intended to include all possible types of devices and ways to input information into computing device 610 or onto a communication network.”; see also the passages quoted in the rejection of Claim 9 above).