Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Response to Arguments
Applicant's arguments filed 1/5/2026 have been fully considered but they are not persuasive. The applicant argues on pages 9-10 of the remarks, filed 1/5/2026, that the amended portions of the claims are not taught by the prior art. The examiner respectfully disagrees; the specific portions of the prior art and supporting details can be found in the rejection below.
Claim Rejections - 35 USC § 102
The following is a quotation of the appropriate paragraphs of 35 U.S.C. 102 that form the basis for the rejections under this section made in this Office action:
A person shall be entitled to a patent unless –
(a)(1) the claimed invention was patented, described in a printed publication, or in public use, on sale, or otherwise available to the public before the effective filing date of the claimed invention.
(a)(2) the claimed invention was described in a patent issued under section 151, or in an application for patent published or deemed published under section 122(b), in which the patent or application, as the case may be, names another inventor and was effectively filed before the effective filing date of the claimed invention.
Claim(s) 1, 3, 4, 8-12, 15, 16, 19, and 20 is/are rejected under 35 U.S.C. 102(a)(1)/(a)(2) as being anticipated by Shah (WO 2019/183568, as listed on the IDS dated 6/1/2023).
In regards to claim 1, Shah teaches
A method implemented by one or more processors (¶5, fig. 4 processor(s) 402)
obtaining a first sensor data instance that is in a first modality and that is generated by one or more first sensors of a plurality of sensors of a robot; (Fig. 1, ¶37-40, 48 teaches semantic vision data 161, and one or more sensors. Also see fig. 4, sensors 408)
processing, using a first modality action machine learning (ML) model trained for controlling the robot to perform a robotic task based on sensor data in the first modality, the first sensor data instance to generate first predicted action output that can be used to control the robot, wherein processing the first sensor data instance comprises: (fig. 1, ¶5, 16, and 37-45 teaches that semantic vision data 161 (sensor data in a first modality) is processed by semantic branch 162 (ML model) to create embeddings 163, which are then further processed to generate attention weighted outputs 177. The attention weighted outputs 177 are then used to determine an action prediction output 181, which corresponds to a motion primitive for the robot (i.e. to control the robot) and can include motion primitives such as “move forward”, “move backward”, “turn right”, “turn left”, “move up”, “move down”, and/or other motion primitive(s) (including more or less granular motion primitives))
generating a first embedding by processing the first sensor data instance using first initial layers of the first modality action ML model; and processing the first embedding using first additional layers of the first modality action ML model to generate the first predicted action output that can be used in controlling the robot; (fig. 1, ¶37-45, semantic embedding 163 (first embedding) is created from the semantic vision data 161 using semantic branch 162 (first initial layers), and is then put through additional layers 175, 176 to generate attention weighted outputs 177. The attention weighted outputs 177 are then used to determine an action prediction output 181, which corresponds to a motion primitive for the robot (i.e. to control the robot) and can include motion primitives such as “move forward”, “move backward”, “turn right”, “turn left”, “move up”, “move down”, and/or other motion primitive(s) (including more or less granular motion primitives))
obtaining a second sensor data instance that is in a second modality that is distinct from the first modality and that is generated by one or more of the sensors of a robot; (fig. 1, ¶40-48, depth vision data 165 (second sensor data) which is different than semantic vision data 161)
processing, using a second modality action ML model trained for controlling the robot to perform the robotic task based on sensor data in the second modality, the second sensor data instance to generate second predicted action output that is disparate from the first action output and that can be used in controlling the robot, wherein processing the second sensor data instance comprises: generating a second embedding by processing the second sensor data instance using second initial layers of the second modality action ML model; and processing the second embedding using second additional layers of the second modality action ML model to generate the second predicted action output that can be used in controlling the robot; (fig. 1, ¶5, 16, 40-48 teaches depth branch 166 (second modality action ML model, initial layers), which processes depth vision data 165 to generate depth embedding 167 (second embedding), which is then put through additional layers 175, 176 to generate attention weighted outputs 177. The attention weighted outputs 177 are then used to determine an action prediction output 181, which corresponds to a motion primitive for the robot (i.e. to control the robot) and can include motion primitives such as “move forward”, “move backward”, “turn right”, “turn left”, “move up”, “move down”, and/or other motion primitive(s) (including more or less granular motion primitives). The output from the depth (second) embeddings 167 is different from the semantic (first) embeddings 163 (i.e. they are disparate), and each will modify the outputs of the action prediction layers differently in order to determine a preferred action prediction output 181 for controlling the robot).
determining, based on analysis of the first embedding and the second embedding, a first weight for the first predicted action output and a second weight for the second predicted action output; (fig. 1, ¶42 and 58 attention weighted outputs 177 are generated from the semantic and depth embeddings)
determining a final predicted action output using the first weight and the second weight; and controlling the robot, using the final predicted action output, in an iteration of attempting a performance of the robotic task. (fig. 1, ¶40-48, 58 teaches action prediction output 181 (final predicted action output))
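For illustration only, the two-branch fusion flow cited above from Shah (per-modality embeddings, weights derived from those embeddings, and a final action prediction over motion primitives) can be sketched as follows. All function names, scores, and weights below are hypothetical constructions for this illustration and are not drawn from Shah's disclosure:

```python
import math

# Illustrative sketch: each modality branch yields a per-primitive
# action distribution; branch weights derived from the embeddings are
# used to combine the two distributions into a final action prediction.

MOTION_PRIMITIVES = ["move forward", "move backward", "turn right",
                     "turn left", "move up", "move down"]

def softmax(scores):
    """Normalize raw scores into values between 0 and 1 that sum to 1."""
    exps = [math.exp(s) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def fuse(first_out, second_out, first_weight, second_weight):
    """Weighted combination of two per-primitive action distributions."""
    return [first_weight * a + second_weight * b
            for a, b in zip(first_out, second_out)]

# Toy per-primitive probabilities from each (hypothetical) branch.
semantic_out = softmax([2.0, 0.1, 0.1, 0.1, 0.1, 0.1])
depth_out = softmax([0.1, 0.1, 2.0, 0.1, 0.1, 0.1])

# Toy branch weights derived from hypothetical embedding scores.
w_semantic, w_depth = softmax([1.0, 0.0])

final_out = fuse(semantic_out, depth_out, w_semantic, w_depth)
selected = MOTION_PRIMITIVES[final_out.index(max(final_out))]
```

Because each branch distribution and the pair of weights each sum to one, the fused output remains a valid probability distribution over the motion primitives.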
In regards to claim 3, Shah further teaches
wherein determining, based on analysis of the first embedding and the second embedding, the first weight and the second weight comprises: determining the first weight as a first value between zero and one and the second weight as a second value between zero and one; (¶36 scores can be normalized using softmax, which limits outputs between 0 and 1)
wherein determining the final predicted action output comprises: determining the final predicted action output as a function of: the first predicted action output and the first weight, and the second predicted action output and the second weight. (¶45 action prediction can be a corresponding probability for each of a set of motion primitives)
In regards to claim 4, Shah further teaches
wherein determining the final predicted output as the function of the first predicted action output and the first weight, and the second predicted action output and the second weight comprises: determining the final predicted action output as a weighted average of the first predicted action output and the second predicted action output, the weighted average weighting the first predicted action output based on the first weight and weighting the second predicted action output based on the second weight. (¶45 action prediction can be a corresponding probability for each of a set of motion primitives)
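For illustration only, the softmax normalization cited from ¶36 and the weighted-average combination addressed in claims 3-4 can be sketched as follows; the scores and outputs below are hypothetical:

```python
import math

# Illustrative sketch: softmax-normalized weights necessarily fall
# between zero and one and sum to one, so the weighted sum of the two
# predicted outputs is a weighted average of them.

def softmax(scores):
    exps = [math.exp(s) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

first_weight, second_weight = softmax([0.7, -0.3])

# Toy per-primitive probabilities from each branch.
first_predicted = [0.8, 0.1, 0.1]
second_predicted = [0.2, 0.5, 0.3]

# Weighted average weighting the first output by the first weight and
# the second output by the second weight.
final_predicted = [first_weight * a + second_weight * b
                   for a, b in zip(first_predicted, second_predicted)]
```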
In regards to claim 8, Shah further teaches
Fig. 2 shows that the process is repeated (i.e. additional data is received, and additional weights, embeddings, and decisions are generated as per claim 1) if the goal is not reached (see 218, decision No).
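For illustration only, the repeat-until-goal flow of Shah fig. 2 can be sketched as a simple loop; the goal test and state update below are hypothetical stand-ins, not Shah's disclosure:

```python
# Illustrative sketch: the per-iteration steps (obtain data, generate
# embeddings, determine weights, control the robot) repeat until the
# goal is reached (a stand-in for Shah's decision 218).

def goal_reached(state):
    """Toy goal test; hypothetical stand-in for the goal check."""
    return state >= 3

def run_episode(max_iterations=10):
    state = 0
    iterations = 0
    while not goal_reached(state) and iterations < max_iterations:
        state += 1  # stand-in for executing the selected motion primitive
        iterations += 1
    return iterations

steps = run_episode()
```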
In regards to claim 9, Shah further teaches
wherein the first weight is zero, the second weight is one, the additional first weight is one, and the additional second weight is zero. (¶36 and 45, as the weights can be normalized with softmax at each step, which limits the total of the weights at each step to equal 1. Therefore, it is possible at the first step for the first weight to be 0 and the second weight to be 1, and at the subsequent step for the additional first weight to be 1 and the additional second weight to be 0)
In regards to claim 10, Shah further teaches
wherein the first sensor data instance is an RGB image that includes red, green, and blue channels, but lacks any depth channel, and wherein the first modality is an RGB modality. (¶39 semantic vision data (first sensor data) can be 2D RGB (i.e. no depth channel))
In regards to claim 11, Shah further teaches
wherein the second sensor data instance is a depth image that includes a depth channel and wherein the second modality is a depth modality. (¶40 the depth vision data (second sensor data) can use 2.5D RGBD data (i.e. red, green, blue, and depth channels))
In regards to claim 12, Shah further teaches
wherein the first modality is a first vision modality that includes one or more color channels and wherein the second modality is a second vision modality that lacks the one or more color channels. (¶39 semantic vision data (first sensor data) can be 2D RGB (i.e. color channels). ¶40 teaches the depth vision data can be generated using the depth values (i.e. only using the depth channel of the 2.5D RGBD data and not the RGB color channels))
In regards to claim 15, Shah further teaches
wherein the first modality is a first vision modality that includes one or more depth channels and wherein the second modality is a second vision modality that lacks the one or more depth channels. (¶39 semantic vision data (second sensor data) can be 2D RGB (i.e. no depth channel). ¶40 teaches the depth vision data (first sensor data) can be generated using the depth values (i.e. only using the depth channel of the 2.5D RGBD data and not the RGB color channels))
In regards to claim 16, Shah further teaches
wherein the one or more of the sensors that generate the second sensor data instance exclude any of the one or more first sensors. (¶40 the depth vision data and semantic vision data can be generated based on vision data from separate components)
In regards to claim 19, Shah teaches
A method implemented by one or more processors of a robot, the method comprising: (fig. 1, ¶28-29)
at each of a plurality of iterations of controlling the robot in attempting performance of a robotic task: (fig. 2, shows that steps 206-218 are done iteratively)
processing a corresponding current first sensor data instance, using a first modality action machine learning (ML) model, to generate corresponding first predicted action output, wherein the corresponding first sensor data instance is in a first modality and is generated by one or more first sensors of a plurality of sensors of a robot that can be used in controlling the robot; (fig. 1, ¶5, 16, and 37-45 teaches that semantic vision data 161 (sensor data in a first modality) is processed by semantic branch 162 (ML model) to create embeddings 163, which are then further processed to create action prediction output 181 for controlling a robot. The semantic embeddings are used to generate attention weighted outputs 177, which are then used to determine an action prediction output 181, which corresponds to a motion primitive for the robot (i.e. to control the robot) and can include motion primitives such as “move forward”, “move backward”, “turn right”, “turn left”, “move up”, “move down”, and/or other motion primitive(s) (including more or less granular motion primitives))
processing a corresponding current second sensor data instance, using a second modality action ML model, to generate corresponding second predicted action output that is disparate from the first action output and that can be used in controlling the robot, wherein the corresponding second sensor data instance is in a second modality that is distinct from the first modality and that is generated by one or more of the sensors of the robot; (fig. 1, ¶40-48 teaches depth vision data 165 (second sensor data), which is different than semantic vision data 161, and depth branch 166 (second modality action ML model, initial layers), which processes depth vision data 165 to generate depth embedding 167 (second embedding), which is then put through additional layers 175, 176 to generate attention weighted outputs 177. The attention weighted outputs 177 are then used to determine an action prediction output 181, which corresponds to a motion primitive for the robot (i.e. to control the robot) and can include motion primitives such as “move forward”, “move backward”, “turn right”, “turn left”, “move up”, “move down”, and/or other motion primitive(s) (including more or less granular motion primitives). The output from the depth (second) embeddings 167 is different from the semantic (first) embeddings 163 (i.e. they are disparate), and each will modify the outputs of the action prediction layers differently in order to determine a preferred action prediction output 181 for controlling the robot).
comparing: a corresponding first intermediate embedding generated using the first modality action ML model in generating the corresponding first predicted action output, and a corresponding second intermediate embedding generated using the second modality action ML model in generating the corresponding second predicted action output; selecting, based on the comparing, either the corresponding first predicted action output or the corresponding second predicted action output; and controlling the robot using the selected one of the corresponding first predicted action output or the corresponding second predicted action output. (fig. 1, ¶40-48, 58 teaches action prediction output 181 (final predicted action output), which can be a probability distribution, where the action with the highest probability (i.e. the probabilities are compared) is selected as the final action and used to issue commands to the robot)
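For illustration only, the selection of the highest-probability motion primitive cited above can be sketched as follows; the primitive list and probabilities below are hypothetical:

```python
# Illustrative sketch: given a probability distribution over motion
# primitives, the primitive with the highest probability is selected
# and used to control the robot.

MOTION_PRIMITIVES = ["move forward", "move backward", "turn right", "turn left"]

def select_action(probabilities):
    """Return the motion primitive with the highest probability."""
    best_index = max(range(len(probabilities)), key=probabilities.__getitem__)
    return MOTION_PRIMITIVES[best_index]

selected = select_action([0.1, 0.2, 0.6, 0.1])
```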
In regards to claim 20, Shah further teaches
wherein the corresponding first predicted action outputs are utilized in a first subset of the iterations and wherein the corresponding second predicted action outputs are utilized in a second subset of the iterations. (see fig. 2; as the process is iterative and the inputs change during each iteration, it is possible that the first predicted action outputs are used during one set of iterations and the second predicted action outputs are chosen during another set of iterations)
Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.
Claim(s) 2, 17, and 18 is/are rejected under 35 U.S.C. 103 as being unpatentable over Shah (WO 2019/183568, as listed on the IDS dated 6/1/2023).
In regards to claim 2, Shah teaches
wherein determining, based on analysis of the first embedding and the second embedding, the first weight and the second weight comprises: determining the first weight as one and the second weight as zero; (¶45 action prediction can be a corresponding probability for each of a set of motion primitives)
Shah may not explicitly teach
wherein determining the final predicted action output comprises: using the second predicted action output as the final predicted action output responsive to determining the first weight as one and the second weight as zero.
However, as the weights can indicate a corresponding probability, this is merely a design choice, and it would have been obvious to one of ordinary skill in the art prior to the effective filing date of the claimed invention to define zero as the highest probability and one as the lowest probability. The motivation for such a modification is that it may enhance training and model effectiveness over other probability definitions.
In regards to claim 17, Shah may not explicitly teach
wherein the first additional layers comprise a first first component control head and a first second component control head, wherein the first predicted action output comprises a first first component set of values for controlling a first robotic component of the robot, and wherein the second predicted action output comprises a first second component set of values for controlling a second robotic component of the robot wherein the second additional layers comprise a second first component control head and second first second component control head, wherein the first predicted action output comprises a second first component set of values for controlling the first robotic component, and wherein the second predicted action output comprises a first second component set of values for controlling the second robotic component.
However, Shah does teach in ¶45 that the output corresponds to a motion primitive used to control the robot, and ¶49 teaches that the control commands provided at a given instance are based on an action determined utilizing the neural network model and can be provided to actuators associated with the wheels 107A and/or 107B, the robot arm 105, and/or the end effector 106.
Therefore, it would have been obvious to one of ordinary skill in the art prior to the effective filing date to modify the system such that the actions include values for controlling the different actuators of the robot. The motivation for such a modification is that it allows the neural network better control over all aspects of the robot in order to complete the desired functions.
In regards to claim 18, Shah further teaches
wherein the first robotic component is one of a robot arm, a robot end effector, a robot base, or a robot head; and wherein the second robotic component is another one of the robot arm, the robot end effector, the robot base, or the robot head. (¶49 teaches the robot can provide controls for a robot arm, end effector, and/or the wheels (i.e. robot base))
Claim(s) 13 and 14 is/are rejected under 35 U.S.C. 103 as being unpatentable over Shah (WO 2019/183568, as listed on the IDS dated 6/1/2023) in view of Official Notice.
In regards to claim 13, Shah may not explicitly teach
wherein the first modality is a first vision modality that includes one or more hyperspectral channels and wherein the second modality is a second vision modality that lacks the one or more hyperspectral channels.
Shah does teach in ¶40 that the semantic vision and depth vision (first and second vision modalities) can use data from different components. ¶80-81 teaches a plurality of different sensors that can be used, but does not explicitly list one with “hyperspectral channels”.
The examiner is taking official notice that sensors with hyperspectral channels are known and have existed prior to the effective filing date of the claimed invention. Therefore, it would have been obvious to one of ordinary skill in the art prior to the effective filing date of the claimed invention to incorporate such hyperspectral vision data into the ML model and achieve predictable results. The rationale to support a conclusion that the claim would have been obvious is that the substitution of one known element for another yields predictable results to one of ordinary skill in the art.
In regards to claim 14, Shah may not explicitly teach
wherein the first modality is a first vision modality that includes one or more thermal channels and wherein the second modality is a second vision modality that lacks the one or more thermal channels.
Shah does teach in ¶40 that the semantic vision and depth vision (first and second vision modalities) can use data from different components. ¶80-81 teaches a plurality of different sensors that can be used, but does not explicitly list one with “thermal channels”.
The examiner is taking official notice that sensors with thermal channels are known and have existed prior to the effective filing date of the claimed invention, for example FLIR (forward looking infrared). Therefore, it would have been obvious to one of ordinary skill in the art prior to the effective filing date of the claimed invention to incorporate such vision data that includes thermal channels into the ML model and achieve predictable results. The rationale to support a conclusion that the claim would have been obvious is that the substitution of one known element for another yields predictable results to one of ordinary skill in the art.
Claim(s) 5-7 is/are rejected under 35 U.S.C. 103 as being unpatentable over Shah (WO 2019/183568, as listed on the IDS dated 6/1/2023) in view of Teig (US 12,112,254).
In regards to claim 5, Shah doesn’t explicitly teach
wherein the first embedding is a first stochastic embedding parameterizing a first multivariate distribution and wherein the second embedding is a second stochastic embedding parameterizing a second multivariate distribution.
Teig C21:1-23 teaches using probabilistic Gaussian (multivariate distribution) noise (stochastic) in the training of layers. Therefore, it would have been obvious to one of ordinary skill in the art prior to the effective filing date of the claimed invention to incorporate this technique into the training of the ML model of Shah, such that after training is completed the embeddings incorporate these aspects. The motivation for making this modification is that it helps identify portions of networks that are not passing important information (see C21:1-23) so that they can be removed, thereby improving the efficiency of the model and reducing inference costs.
In regards to claim 6, Shah and Teig further make obvious
wherein determining, based on analysis of the first embedding and the second embedding, the first weight and the second weight comprises: determining the first weight as a function of first covariances of the first multivariate distribution and determining the second weight as a function of second covariances of the second multivariate distribution. (Shah, fig. 1, ¶42 and 58 attention weighted outputs 177 are generated from the semantic and depth embeddings)
In regards to claim 7, Shah and Teig further make obvious
wherein determining, based on analysis of the first embedding and the second embedding, the first weight and the second weight comprises: determining the first weight as one and the second weight as zero responsive to the first covariances indicating a greater extent of variance than the second covariances. (Shah, fig. 1, ¶42 and 58, attention weighted outputs 177 are generated from the semantic and depth embeddings; ¶36 and 45, as the weights can be normalized with softmax at each step, which limits the total of the weights at each step to equal 1 and is representative of a probability distribution)
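For illustration only, the hard weight selection recited in claim 7 (first weight one and second weight zero responsive to the first covariances indicating a greater extent of variance) can be sketched as follows; the function and values below are hypothetical and illustrate only the claim language, not a specific teaching of Shah or Teig:

```python
# Illustrative sketch: a hard (one/zero) weight selection driven by a
# comparison of the variance indicated by each embedding's covariances.

def select_weights(first_variance, second_variance):
    """Return (first_weight, second_weight) per the claim 7 recitation:
    first weight one and second weight zero when the first covariances
    indicate a greater extent of variance than the second."""
    if first_variance > second_variance:
        return 1.0, 0.0
    return 0.0, 1.0

w1, w2 = select_weights(2.5, 0.4)
```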
Conclusion
THIS ACTION IS MADE FINAL. Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a).
A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action. In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any nonprovisional extension fee (37 CFR 1.17(a)) pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action. In no event, however, will the statutory period for reply expire later than SIX MONTHS from the mailing date of this final action.
Any inquiry concerning this communication or earlier communications from the examiner should be directed to JASON W BLUST whose telephone number is (571)272-6302. The examiner can normally be reached 12-8:30 EST.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Hosain Alam can be reached at (571) 272-3978. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.
/JASON W BLUST/Primary Examiner, Art Unit 2132