Prosecution Insights
Last updated: April 19, 2026
Application No. 18/640,621

System and Method for Controlling Robotic Manipulator with Self-Attention Having Hierarchically Conditioned Output

Non-Final OA (§103, §112)
Filed: Apr 19, 2024
Examiner: GAMMON, MATTHEW CHRISTOPHER
Art Unit: 3657
Tech Center: 3600 — Transportation & Electronic Commerce
Assignee: Mitsubishi Electric Research Laboratories Inc.
OA Round: 1 (Non-Final)

Grant Probability: 65% (Moderate)
Expected OA Rounds: 1-2
Time to Grant: 2y 9m
Grant Probability With Interview: 88%

Examiner Intelligence

Career Allow Rate: 65% (66 granted / 102 resolved; +12.7% vs TC avg). Grants 65% of resolved cases.
Interview Lift: +23.4% (strong). Allow rate with vs. without an interview, among resolved cases with an interview.
Typical Timeline: 2y 9m avg prosecution; 32 applications currently pending.
Career History: 134 total applications across all art units.

Statute-Specific Performance

§101: 7.4% (-32.6% vs TC avg)
§103: 32.4% (-7.6% vs TC avg)
§102: 26.8% (-13.2% vs TC avg)
§112: 31.1% (-8.9% vs TC avg)

Deltas are measured against Tech Center average estimates • Based on career data from 102 resolved cases
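Each delta is measured against the Tech Center average, so the baseline can be backed out as rate minus delta. A minimal sketch in Python (the tool's exact methodology is not published; this only inverts the arithmetic shown above):

```python
# Recover the Tech Center baseline implied by each row: tc_avg = rate - delta.
rates = {"§101": (7.4, -32.6), "§103": (32.4, -7.6),
         "§102": (26.8, -13.2), "§112": (31.1, -8.9)}
for statute, (rate, delta) in rates.items():
    print(f"{statute}: TC avg ≈ {rate - delta:.1f}%")
# Every statute implies a ~40.0% baseline, suggesting the chart compares
# against a single overall TC-average estimate rather than per-statute figures.
```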

Office Action

§103, §112
DETAILED ACTION

The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Claim Objections

Claims 3 – 4, 10 – 11, and 17 – 18 are objected to because of the following informalities:

Regarding Claims 3, 10, and 17, the claims recite the limitation “state change”. It should read “a state change”.

Regarding Claims 4, 11, and 18, the claims recite the limitation “the multimodal observations”. It should read “the multi-modal observations”.

Appropriate correction is required.

Specification

The lengthy specification has not been checked to the extent necessary to determine the presence of all possible minor errors. Applicant’s cooperation is requested in correcting any errors of which applicant may become aware in the specification.

Claim Interpretation

The terms “skill” and “action” do not appear to receive a specific special definition within Applicant’s disclosure. Even within the prior art, these terms are explicitly indicated as having varying meaning/interpretation. Thus, these terms are broadly interpreted such that an “action” is merely more specific than a “skill”. See for example Page 17 of Homanga et al. (Bharadhwaj, Homanga, et al. "RoboAgent: Generalization and Efficiency in Robot Manipulation via Semantic Augmentations and Action Chunking." arXiv preprint arXiv:2309.01918 (2023)): “Different works in robotics often assign different meanings when they refer to ‘skills’”, and which uses terms such as skill, task, activity, action, and action chunks to describe their disclosure.

The phrase “wherein the neural network is trained” has been interpreted as meaning “wherein the neural network has been trained” (past tense), as appears indicated by Applicant’s claim construction. For example, Applicant in Claims 8 and 15 does not positively recite a step of “training” and instead uses the same phrasing indicating only particular structure. This therefore describes the structure of the network, but does not positively recite any particular item into a method/process or step thereof.

Claim Rejections - 35 USC § 112(b)

The following is a quotation of 35 U.S.C. 112(b):

(b) CONCLUSION.—The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the inventor or a joint inventor regards as the invention.

The following is a quotation of 35 U.S.C. 112 (pre-AIA), second paragraph:

The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the applicant regards as his invention.

Claims 2 – 3, 5 – 6, 9 – 10, 12 – 13, 16 – 17, and 19 – 20 are rejected under 35 U.S.C. 112(b) or 35 U.S.C. 112 (pre-AIA), second paragraph, as being indefinite for failing to particularly point out and distinctly claim the subject matter which the inventor or a joint inventor (or for applications subject to pre-AIA 35 U.S.C. 112, the applicant), regards as the invention.

Regarding Claims 2, 9, and 16, the claims recite the limitations of “update the sequence of actions with the current action and update the sequence of skills with the current skill” (Claim 2) or “updating the sequence of actions with the current action and updating the sequence of skills with the current skill” (Claims 9 and 16). However, there is no limitation in Claims 1, 8, and 15 of actually producing a “sequence of skills” or a “sequence of actions”. Furthermore, there is insufficient antecedent basis for “the current action” and “the current skill”. The independent claims merely describe the nature of the neural network (“is trained … to produce”); however, the claims do not actually specify producing these items. Thus there is no “sequence of actions” or “sequence of skills” to update, and it is unclear which skill is “the current skill” and “the current action”, particularly in light of these references to unproduced sequences. In the interest of compact prosecution, the limitations have been interpreted as reading: “update a sequence of actions with a current action and update a sequence of skills with a current skill” (Claim 2) or “updating a sequence of actions with a current action and updating a sequence of skills with a current skill” (Claims 9 and 16). It is Examiner’s understanding that the independent claims should potentially be rewritten such that it is clear the sequences are produced as part of the claimed function or process/method of processing the multi-modal observations, and that the claims should specify one or more actions as being “current” or otherwise provide a clear temporal relationship such that it is fully understood, particularly wherein a plurality of each might exist.

Regarding Claims 3, 5, 10, 12, 17, and 19, the claims recite “wherein the multi-modal observations are processed in an iterative manner”. This is not a clear and positively recited limitation. In the case of Claims 3, 5, 17, and 19, directed towards a product/system, this recitation does not further narrow an existing structure or function, and does not recite an additional structure or function, but appears to describe an activity presumably (but not actually specified) performed by the feedback controller. In the case of Claims 10 and 12, the claims are directed towards a method/process. This recitation does not recite a clear step of the method/process. In the interest of compact prosecution, the claims have instead been interpreted as reading: “wherein the circuitry is further configured to process the multi-modal observations in an iterative manner” (Claims 3 and 5) and “wherein the method further comprises processing the multi-modal observations in an iterative manner” (Claims 10, 12, 17, and 19).

Regarding Claims 6, 13, and 20, the claims depend from claim(s) rejected above and inherit the deficiencies of said claim(s) as described above. Therefore, Claims 6, 13, and 20 are rejected under the same logic presented above.

Claim Rejections - 35 USC § 103

The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:

A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

The factual inquiries for establishing a background for determining obviousness under 35 U.S.C. 103 are summarized as follows:

1. Determining the scope and contents of the prior art.
2. Ascertaining the differences between the prior art and the claims at issue.
3. Resolving the level of ordinary skill in the pertinent art.
4. Considering objective evidence present in the application indicating obviousness or nonobviousness.

This application currently names joint inventors. In considering patentability of the claims the examiner presumes that the subject matter of the various claims was commonly owned as of the effective filing date of the claimed invention(s) absent any evidence to the contrary. Applicant is advised of the obligation under 37 CFR 1.56 to point out the inventor and effective filing dates of each claim that was not commonly owned as of the effective filing date of the later invention in order for the examiner to consider the applicability of 35 U.S.C. 102(b)(2)(C) for any potential 35 U.S.C. 102(a)(2) prior art against the later invention.

Claims 1 – 20 are rejected under 35 U.S.C. 103 as being unpatentable over Florence et al. (US 20250144795 A1) in view of Hong (US 20240165830 A1).

In the interest of compact prosecution, Examiner notes that Florence incorporates by reference prior applications and several research papers. Florence furthermore makes clear reference to other research papers such that, while not incorporated by reference, the basis for combination under 35 USC § 103 is already present. For example, the cited and incorporated provisional application effectively includes the research paper Driess, Danny, et al. "PaLM-E: An Embodied Multimodal Language Model." (2023) within its disclosure as an appendix, which itself is prior art regardless, and other research papers disclose certain neural network architectures in great detail. Examiner further notes that none of the rejections below are presently believed to rely solely on any incorporated or otherwise cited reference of Florence.

Regarding Claim 1, Florence teaches:

A feedback controller (See at least control system 101 and 201) for (Examiner notes this use of “for” appears to merely indicate intent or purpose) controlling a robotic manipulator (See at least agent 102 and 202, in particular as illustrated in Figure 1) according to a task (See at least [0003] “Robotics planning refers to scheduling the physical operations of robots in order to perform tasks”), the robotic manipulator includes one or more actuators operatively coupled to one or more joints of the robotic manipulator for moving an end effector (See at least [0030] “The robot hardware can include actuators”, [0071] “joints of a robotic agent”, and [0085] “In FIG. 2, the agent 202 is a robot that has a base, one or more arms, and an end effector”), the feedback controller includes a circuitry configured to:

accept a feedback signal (See observations 106/206 in Figures 1 and 2, which flow into Policy System 100/200) including a sequence of multi-modal observations (See at least [0029] “In some cases, the one or more observations 106 include an observation image, or a plurality of observation images”, [0030] “In other cases, the one or more observations 106 include lower-dimensional data generated from data gathered from other types of sensors that makes observation as the robot interacts with the environment, or from robot hardware”, and [0057] “the sequence of input tokens 112 is generated from data from two or more modalities”) of a state of execution of the task (See at least Figure 1, in particular the box “environment 104” to “observations 106”), wherein the multi-modal observations include …, video frames of a camera observing the state of execution of the task (See at least [0007] “observation images” and [0029] “the observation images can include images captured by a visual sensor of the agent 102 or by a visual sensor located in the environment 104. The visual sensor can for example be a camera, e.g., a still camera or a video camera, a depth camera, or a lidar sensor”), and proprioceptive measurements of one or more actuators (See at least [0030] “observations 106 include … data gathered … from robot hardware”. Examiner furthermore notes that “of” is particularly broad and does not specify the relationship in any particular detail. Furthermore, and alternatively, at the high level of generality indicated (actuators in general) Examiner takes official notice that this is common knowledge);

process the multi-modal observations with a neural network having a self-attention module with a hierarchically conditioned output (The phrase “hierarchically conditioned” in light of Applicant’s disclosure appears to be open to broad interpretation. See [0042] of Applicant’s originally filed specification: “Hierarchical conditioning can be achieved in various ways, such as by incorporating hierarchical information into the input embeddings or by using hierarchical attention mechanisms to attend to different levels of abstraction in the input sequence”. The disclosure thus indicates any kind of use or effect of hierarchy, wherein what is considered “hierarchy” or “hierarchical” is not defined or described. As shown in Figures 1 and 2, the observations are first encoded using one or more neural networks 114 ([0046]). These are then sequenced with other encodings, such as those from the input text sequence processed by one or more other neural networks and other observations, into input tokens ([0057]). These are then processed by a language model neural network. All of the models are disclosed as potentially being Transformer neural networks, which are self-attention-based architectures. Furthermore, under the broad meaning of the term “hierarchically conditioned” presently provided, the disclosure presently appears to read on the limitations. For example, the level of abstraction inherently must vary in the process of being encoded and tokenized, as while the data types vary (text, image, etc.), the tokens do not ([0057]). As another example, more than one attention layer is disclosed in a given model, indicating a hierarchical relationship ([0048]). As another example, the sequence of tokens is of significance and may be considered as indicating hierarchy, particularly in the context of language, and furthermore tokens may be prepended, appended, or interleaved ([0058] – [0060]). Policy output 142 may also comprise actions having differing probabilities or numerical values by which further selection might be performed ([0069] – [0070]). Finally, the output is clearly hierarchical as it provides a sequence or order to the output command) to produce a skill of the robotic manipulator (See at least output text sequence 132/232. [0063] “In particular, the natural language instructions can define or otherwise specify a high-level plan that includes a sequence of two or more operations that should be executed by the agent 102 at the time step”) and an action conditioned on the skill (See at least [0065] “at each of the plurality of time steps, after generating the output text sequence 132 by using the language model neural network 130, the policy system 100 uses another learned module to map the output text sequence 132 to the one or more low-level actions (or “actions” for short) 144, and then control the agent based on the actions 144”), wherein each skill defines a combination of actions (See at least [0063] “Generally, the high-level plan defines the operations to be performed by the agent, and the sequential order in which the operations should be performed. Each of these operations can themselves include multiple low-level actions that can be performed by the agent 102”), and wherein the neural network is trained in a supervised manner (See at least [0113] “The process 300 can also be performed as part of selecting actions to be performed by an agent based on processing observations and natural language text sequences derived from a set of training dataset, e.g., observation images the actions in response to which that should be performed by the agent is known, in order to train the set of learnable modules to determine trained values for the parameters of the learnable modules” (emphasis added). Supervised merely means that for a given input, an associated output is already known) with demonstration data (See at least [0116] “The set of training data can include training data generated based on the interactions of the agent (or another agent) with an environment”. Examiner notes that the nature of “demonstration data” is not claimed with particularity such that it must be generated by an expert, human, etc. Therefore, any data based on historical interactions reads on this limitation) (Examiner furthermore, alternatively, and separately notes that this appears to describe variations of commonly understood artificial intelligence training techniques (See for example [0003] of Applicant’s specification, which indicates “behavioral cloning and other learning from demonstration (LfD) approaches” as “conventional”), and again alternatively that each of the models disclosed by reference in Florence must be considered) to produce a sequence of skills and a corresponding sequence of actions for the actuators of the robotic manipulator to perform the task (See again above, in particular [0063]);

determine one or more control commands for the one or more actuators based on the produced action (See at least [0074] “The control system 101 then causes the agent 102 to perform the determined actions 144 at the time step. For example, the control system 101 can do this by generating instructions for the agent 102 that when executed will cause the agent 102 to perform the determined actions 144, by submitting one or more control inputs directly to the appropriate controls of the agent, or by using another appropriate control technique” and [0076] “The actions 144 may be control inputs to control a robot, e.g., torques for the joints of the robot”); and

submit the one or more control commands to the one or more actuators causing a change of the state of execution of the task (See again [0074] and [0076]).

Florence does not explicitly disclose, but Hong explicitly discloses: … measurements of one or more visuo-tactile sensors attached to the end effector (See at least Figure 1A) …

It would have been obvious to one of ordinary skill in the art prior to the effective filing date of the claimed invention to utilize a visuo-tactile/optical tactile sensor as taught by Hong in the system of Florence with a reasonable expectation of success. Florence discusses using data gathered from a variety of sensors, including those within the environment and on/within the robot ([0029] – [0030]), including “data from different sensor modalities” ([0007]). Furthermore, the sensor of Hong utilizes images, which Florence is already disclosed in detail as handling. The use of such sensors on end-effectors is well-known and routine and would serve to provide additional actionable information during robotic manipulation tasks.

Regarding Claim 2, the combination of Florence and Hong teaches: The feedback controller of claim 1. Florence further discloses: wherein to perform the control step, the feedback controller is configured to: update the sequence of actions with the current action and update the sequence of skills with the current skill (See at least [0028] “In particular, at each of a plurality of time steps, the policy system 100 and the control system 101 control the agent based on a policy output 142 for the time step” and [0110] “The natural language instructions can define or otherwise specify a high-level plan that includes a sequence of two or more operations that should be executed by the agent at the current time step”).

Regarding Claim 3, the combination of Florence and Hong teaches: The feedback controller of claim 1. Florence further discloses: wherein the multi-modal observations are processed in an iterative manner (This appears to be open to various interpretations. See Figure 2, wherein observations are a sequence, or see various discussions of “time steps”, “control iterations”, and similar), and wherein the multi-modal observations in a current iteration correspond to state change of the robotic manipulator caused by the control commands executed in a previous iteration (See at least the Figure 1 and Figure 2 control loop and [0038] “More specifically, at each of the plurality of time steps, the policy system 100 obtains the one or more observations 106 and the natural language text sequence 108 for the time step”).
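To make the loop structure mapped in Claims 1 – 3 concrete: each round of observations reflects the state change caused by the previous round's commands, and both sequences grow as the loop runs. A minimal runnable sketch of that loop (toy stand-ins throughout; none of these names come from the application or from Florence/Hong):

```python
# Toy stand-ins for the claimed loop: observe -> produce a skill and an action
# conditioned on it -> update both sequences -> submit commands -> repeat.
def observe(state):
    # Stand-in multi-modal feedback signal (video / proprioceptive / tactile).
    return {"video": state, "proprio": state, "tactile": state}

def policy(obs):
    # Stand-in network: pick a skill, then an action conditioned on that skill.
    skill = "reach" if obs["proprio"] < 5.0 else "grasp"
    action = {"reach": 1.0, "grasp": 0.1}[skill]
    return skill, action

state, skills, actions = 0.0, [], []
while state < 5.5:                 # toy task-completion condition
    obs = observe(state)           # observations reflect the prior state change
    skill, action = policy(obs)
    skills.append(skill)           # update the sequence of skills (Claim 2)
    actions.append(action)         # update the sequence of actions (Claim 2)
    state += action                # submitting the command changes the state
print(skills, actions)
```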
Regarding Claim 4, the combination of Florence and Hong teaches: The feedback controller of claim 1. Florence further discloses: wherein the circuitry is further configured to encode each observation of the multimodal observations into an embedding of the observation in a latent space (See at least [0043] “For each of the one or more observations 106, the encoded observation can be an embedding of the observation 106. An ‘embedding’ as used in this specification is a sequence of one or more vectors of numeric values, e.g., floating point values or other values, each vector having a pre-determined dimensionality. The space of possible vectors having the pre-determined dimensionality is referred to as the ‘embedding space’ ”).

Regarding Claim 5, the combination of Florence and Hong teaches: The feedback controller of claim 1. Florence further discloses: wherein the multi-modal observations are processed in an iterative manner (This appears to be open to various interpretations. See Figure 2, wherein observations are a sequence, or see various discussions of “time steps”, “control iterations”, and similar), and the circuitry is configured to execute a reward function conditioned upon a goal (Examiner notes that while Applicant uses the term “reward”, the claim does not further define or describe the term. Furthermore, Claim 6, which provides a limitation to further define the reward function, appears to indicate that a “loss” might be considered as part of the reward function, as it uses the phrasing “negative distance to the goal”. Thus, what a “reward” or “reward function” is, is open to some interpretation and appears to indicate an objective function wherein the system is pushed towards a goal such that it may be considered as “rewarded”. Therefore, see [0119] “The system updates the values for the parameters of the neural networks based on using a machine learning training technique, e.g., a gradient descent with backpropagation training technique that uses a suitable optimizer, e.g., stochastic gradient descent, RMSprop, or Adam optimizer, to optimize an objective function, e.g., a cross-entropy objective function that is specific to a next token prediction task” or alternatively many of the other neural networks incorporated by reference or directly referenced), to (Examiner notes that this appears to merely indicate the intended purpose of the preceding limitation and does not appear to further limit the scope of the claim as presently constructed. Therefore, this recitation is not considered a positively recited limitation) terminate an iteration of the processing of the multi-modal observations marking (Examiner notes that this appears to merely indicate the expected result of the preceding recitation and does not appear to further limit the scope of the claim as presently constructed. Therefore, this recitation is not considered a positively recited limitation) completion of the task (See at least [0098] “The policy system 200 can repeat these operations at the frequency of 1 Hz, or above, until some termination condition has been met, e.g., until it generates a particular output text sequence that include a plan that includes a terminal operation”).

Regarding Claim 6, the combination of Florence and Hong teaches: The feedback controller of claim 5. Florence further discloses: wherein the reward function is modeled based on a negative distance to the goal and an indication function of reaching the goal (Examiner notes that Applicant only appears to use these terms once in the specification, without clear further description or definition thereof. Therefore, the terms are interpreted broadly under their plain English meaning. See again [0119] or alternatively many of the other neural networks incorporated by reference or directly referenced).
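Claim 6 recites a reward "modeled based on a negative distance to the goal and an indication function of reaching the goal" without defining either term. A common formalization consistent with that wording, as an illustrative sketch only (the Euclidean norm, tolerance eps, and bonus weight are all assumptions the claim does not supply):

```python
import numpy as np

def reward(state, goal, eps=0.01, bonus=1.0):
    # r(s, g) = -||s - g|| + bonus * 1[||s - g|| <= eps]
    dist = float(np.linalg.norm(np.asarray(state) - np.asarray(goal)))
    return -dist + (bonus if dist <= eps else 0.0)

print(reward([0.0, 0.0], [0.3, 0.4]))  # -0.5: not at the goal yet
print(reward([0.3, 0.4], [0.3, 0.4]))  # 1.0: the indicator fires at the goal
```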
Regarding Claim 7, the combination of Florence and Hong teaches: The feedback controller of claim 1. Florence further discloses: wherein the architecture of the neural network comprises a high-level planner configured to predict a skill based on the feedback signal (See at least Language Model Neural Network 130/230, which generates a “high-level plan that includes a sequence of two or more operations that should be executed by the agent” ([0063])) and a low-level goal reaching module configured to output an action conditioned upon the predicted skill (See at least Policy Neural Network 140, which “map[s] the output text sequence 132 to the one or more low-level actions” ([0065])).
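Claim 7's two-level structure, a high-level planner that predicts a skill from the feedback signal and a low-level module that outputs an action conditioned on that skill, can be sketched as below. This is an illustration only: the Transformer encoder, module names, and dimensions are assumptions, not the applicant's architecture and not Florence's language-model pipeline:

```python
import torch
import torch.nn as nn

class HierarchicalPolicy(nn.Module):
    def __init__(self, obs_dim=128, d_model=256, n_skills=16, action_dim=7):
        super().__init__()
        # Encode each multi-modal observation token into the model dimension.
        self.embed = nn.Linear(obs_dim, d_model)
        # Self-attention module over the sequence of observation embeddings.
        layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        # High-level planner: predicts a distribution over discrete skills.
        self.skill_head = nn.Linear(d_model, n_skills)
        # Low-level module: outputs an action conditioned on the predicted skill.
        self.skill_embed = nn.Embedding(n_skills, d_model)
        self.action_head = nn.Sequential(
            nn.Linear(2 * d_model, d_model), nn.ReLU(),
            nn.Linear(d_model, action_dim),
        )

    def forward(self, obs_seq):
        h = self.encoder(self.embed(obs_seq))       # (batch, seq, d_model)
        ctx = h.mean(dim=1)                         # pooled context
        skill = self.skill_head(ctx).argmax(dim=-1) # high-level skill prediction
        # Hierarchical conditioning: the action head sees the chosen skill.
        cond = torch.cat([ctx, self.skill_embed(skill)], dim=-1)
        return skill, self.action_head(cond)

policy = HierarchicalPolicy()
skill, action = policy(torch.randn(1, 10, 128))     # one batch of 10 tokens
```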
Regarding Claims 8 – 20, the claims are directed to effectively the same subject matter as Claims 1 – 7 with respect to the application of prior art. The claims are therefore rejected under the same logic as Claims 1 – 7 above.

Conclusion

The prior art made of record and not relied upon is considered pertinent to applicant's disclosure.

Driess et al. (Driess, Danny, et al. "PaLM-E: An Embodied Multimodal Language Model." arXiv preprint arXiv:2303.03378 (2023).), effectively the research-paper basis for Florence et al., provided as an appendix in the incorporated provisional application, which consequently trends towards more specific and technical terms as well as providing an extensive list of prior art references within its References section.

Homanga et al. (Bharadhwaj, Homanga, et al. "RoboAgent: Generalization and Efficiency in Robot Manipulation via Semantic Augmentations and Action Chunking." arXiv preprint arXiv:2309.01918 (2023).), which discloses skills associated with “action chunking”.

Kroemer et al. (Kroemer, Oliver, et al. "Towards learning hierarchical skills for multi-phase manipulation tasks." 2015 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2015.), which discloses learning hierarchical skills from human demonstration data.

Tavassoli et al. (Tavassoli, Mehrdad, et al. "Learning skills from demonstrations: A trend from motion primitives to experience abstraction." IEEE Transactions on Cognitive and Developmental Systems 16.1 (2023): 57-74.), which discloses learning skills and motion primitives from demonstration data, including related scheduled hierarchy.

Cai et al. (Cai, Mingyu, et al. "Hierarchical Deep Learning for Intention Estimation of Teleoperation Manipulation in Assembly Tasks." arXiv preprint arXiv:2403.19770 (2024).), which discloses “incorporating multi-scale hierarchical information in neural networks” and learning from demonstrations.

Jin et al. (Jin, Piaopiao, et al. "Multi-modal fusion in contact-rich precise tasks via hierarchical policy learning." arXiv preprint arXiv:2202.08401 (2022).), which discloses multimodal sensor feedback utilized in hierarchical policy learning.

Kolluri et al. (US 20210362328 A1), which discloses demonstration learning including skill templates and action sequences.

Oleynik (US 20190291277 A1), which discloses machine learning including imitation learning for action sequences.

Any inquiry concerning this communication or earlier communications from the examiner should be directed to MATTHEW C GAMMON whose telephone number is (571) 272-4919. The examiner can normally be reached M – F 10:00 – 6:00. Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO-supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.

If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, ADAM MOTT, can be reached at (571) 270-5376. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.

Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.

/MATTHEW C GAMMON/
Examiner, Art Unit 3657

/ADAM R MOTT/
Supervisory Patent Examiner, Art Unit 3657

Prosecution Timeline

Apr 19, 2024
Application Filed
Jan 23, 2026
Non-Final Rejection — §103, §112 (current)

Precedent Cases

Applications granted by this same examiner with similar technology

Patent 12594673: Method of Calibrating Manipulator, Control System and Robot System (granted Apr 07, 2026; 2y 5m to grant)
Patent 12588646: Milking System Comprising a Milking Robot (granted Mar 31, 2026; 2y 5m to grant)
Patent 12583110: Robot Control System (granted Mar 24, 2026; 2y 5m to grant)
Patent 12576523: Controlling Robots Using Multi-Modal Language Models (granted Mar 17, 2026; 2y 5m to grant)
Patent 12544926: Object Interference Check Method (granted Feb 10, 2026; 2y 5m to grant)
Study what changed to get past this examiner. Based on 5 most recent grants.


Prosecution Projections

Expected OA Rounds: 1-2
Grant Probability: 65%
With Interview: 88% (+23.4%)
Median Time to Grant: 2y 9m
PTA Risk: Low

Based on 102 resolved cases by this examiner. Grant probability derived from career allow rate.
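How the headline projections relate to the underlying counts, assuming (as the tiles imply) that the interview lift is applied additively to the career allow rate:

```python
granted, resolved, interview_lift = 66, 102, 23.4  # from the examiner stats above
base = 100 * granted / resolved                    # 64.7 -> shown as 65%
with_interview = base + interview_lift             # 88.1 -> shown as 88%
print(round(base), round(with_interview))
```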
