Prosecution Insights
Last updated: April 19, 2026
Application No. 18/663,491

BRIDGING LANGUAGE AND ENVIRONMENTS WITH RENDERING FUNCTIONS AND VISION-LANGUAGE MODELS

Final Rejection §103

Filed: May 14, 2024
Examiner: KASPER, BYRON XAVIER
Art Unit: 3657
Tech Center: 3600 — Transportation & Electronic Commerce
Assignee: Naver Labs Corporation
OA Round: 2 (Final)

Grant Probability: 70% — Favorable
OA Rounds: 3-4
To Grant: 3y 0m
With Interview: 88%

Examiner Intelligence

Career Allow Rate: 70% (72 granted / 103 resolved), +17.9% vs TC avg — above average
Interview Lift: +18.4% for resolved cases with an interview
Avg Prosecution: 3y 0m (36 applications currently pending)
Career History: 139 total applications across all art units

Statute-Specific Performance

§101: 10.9% (-29.1% vs TC avg)
§103: 56.3% (+16.3% vs TC avg)
§102: 11.9% (-28.1% vs TC avg)
§112: 16.4% (-23.6% vs TC avg)

Comparisons are against an estimated Tech Center average; based on career data from 103 resolved cases.
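For readers who want to sanity-check the figures above, the short sketch below shows how the headline metrics relate arithmetically. Only the 72/103 counts, the 70% allow rate, the +17.9% delta, and the +18.4% lift come from this report; the Tech Center average is back-derived from those figures, and the per-group interview rates are placeholders.

```python
# Hedged sketch: reproducing the headline examiner metrics from the counts in
# this report. Not the analytics provider's code; placeholder values are noted.
granted, resolved = 72, 103                       # "72 granted / 103 resolved"
career_allow_rate = granted / resolved            # 0.699 -> shown as "70%"

# Tech Center average implied by the reported "+17.9% vs TC avg"
tc_avg_allow_rate = career_allow_rate - 0.179     # ~0.52 (derived, not reported)

# Interview lift = allowance-rate difference between resolved cases with and
# without an examiner interview. The two rates are placeholders; only their
# difference (+18.4%) is taken from the report.
rate_with_interview, rate_without_interview = 0.800, 0.616
interview_lift = rate_with_interview - rate_without_interview

print(f"career allow rate: {career_allow_rate:.1%}")                       # 69.9%
print(f"delta vs TC avg:   {career_allow_rate - tc_avg_allow_rate:+.1%}")  # +17.9%
print(f"interview lift:    {interview_lift:+.1%}")                         # +18.4%
```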

Office Action

§103
Notice of Pre-AIA or AIA Status 1. The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA . 2. This communication is responsive to Application No. 18/663,491 and the amendments filed on 12/18/2025. 3. Claims 1-21 are presented for examination. Information Disclosure Statement 4. The information disclosure statement (IDS) submitted on 8/20/2024 has been fully considered by the Examiner. Response to Arguments 5. Applicant’s arguments, see pages 9-10, filed 12/18/2025, with respect to the objection to Claim 17 for minor informalities have been fully considered and are persuasive. The objection of 10/14/2025 has been withdrawn. 6. Applicant’s arguments, see page 10, filed 12/18/2025, with respect to the interpretation of claims 1, 2, 3, 6, 7, 8, 9, 10, 18, and 19 under 35 U.S.C. 112(f), the interpretation of these claims remains the same as found in the Non-Final Rejection mailed 10/14/2025, unless otherwise noted. 7. Applicant’s arguments with respect to the rejection of claim(s) 1-7 and 10-20 under 35 U.S.C. 103 have been considered but are moot because the new ground of rejection does not rely on any reference applied in the prior rejection of record for any teaching or matter specifically challenged in the argument. Regarding independent claim 1, the Examiner agrees that the combination of US 20240311617 A1 Di Palo and US 20210053216 A1 to Diankov fails to teach all of the amendments of the claim. However, in light of the amendments and the Applicant’s remarks, an updated search was conducted, and a new ground of rejection concerning claim 1 has been determined, in which will be described later. Regarding dependent claims 2-7 and 10-18, as all of these claims depend from claim 1, are still rejected, in which will be described later. Regarding independent claims 19 and 20, as these claims contain similar limitations as claim 1, are still rejected for similar reasons as claim 1 is, in which will be described later. 8. The Examiner notes that dependent claims 8 and 9 were previously indicated as containing allowable subject matter in the Non-Final Rejection mailed 10/14/2025, but were objected to for depending upon a rejected claim. This objection to claims 8 and 9 remains the same, in which will also be described later. Claim Rejections - 35 USC § 103 9. In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis (i.e., changing from AIA to pre-AIA ) for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status. 10. The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action: A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made. 11. Claim(s) 1, 3, 6, and 10-20 is/are rejected under 35 U.S.C. 103 as being unpatentable over Wang et al. 
(US 20250319593 A1 hereinafter Wang) in view of Aparicio et al. (US 20240198526 A1 hereinafter Aparicio) and Di Palo et al. (US 20240311617 A1 hereinafter Di Palo). Regarding Claim 1, Wang teaches a robot system comprising: image encodings generated based on image renderings of configurations, respectively of a robot in an environment ([0033] via “Using the code generated at the box 350, the operational task is executed in simulation at box 360, which includes simulating the articulation of the robot arm to move the workpiece according to the input task instruction. The simulated motions of the robot arm and workpiece from the box 360 are provided to a box 370, where RRT and/or optimization-based techniques are used to generate a collision-free path at box 380.”), ([0041] via “FIG. 5 is an illustration of the simulation and path generation steps of the path dataset generation pipeline of FIG. 3, … The complete robot motion program, including motions of all joints in the articulated robot, enables a simulation of robot and workpiece movement as shown in box 510.”); an encoding module configured to receive input descriptive of an action to be performed by the robot ([0038] via “The final input to the large language model 440 is a task instruction 430. A non-limiting illustrative example of a task instruction is as follows: “Move Object W from Point S (start point) out of Bin B and place the Object W at Point G (goal point) in Fixture F”. The task instruction 430 is defined in plain language (e.g., English) as it would be spoken. It can be observed that the example task instruction listed above requires the large language model 440 to identify objects (workpiece W, bin B, fixture F), understand geometric information (start and goal points), and figure out how to perform the overall task using the tasks in the task library and in view of implied constraints (moving the workpiece up and out of the bin and then to the goal point).”), (Note: The Examiner interprets the large language model of Wang as the encoding module.); a scoring module configured to generate scores for the configurations based on comparisons of (a) the text encoding with (b) the image encoding of the respective configuration ([0033] via “A large language model block 340 takes the task descriptions and object names from the box 330 and generates code at box 350 (in a programming language such as Python) to simulate the operation. The large language model in the block 340 is the same as the large language model in the block 320, essentially performing a second operation (code generation) after the previous operation (task breakdown). Using the code generated at the box 350, the operational task is executed in simulation at box 360, which includes simulating the articulation of the robot arm to move the workpiece according to the input task instruction.”), ([0034] via “A final validation of generated paths is performed at box 390, including ensuring that the generated path is collision-free, is feasible from a robot reachability standpoint, and that the workpiece or object being moved reaches and is stable at the goal position. At decision diamond 392, it is determined whether all validation criteria are met.”), (Note: See Di Palo below, which teaches the inputted description of an action to be performed by the robot is in the form of a text description. 
Further, the Examiner interprets the validation step of Wang as the scoring.); a selection module configured to select k of the configurations based on the scores, where k is an integer greater than or equal to 1 ([0034] via “At decision diamond 392, it is determined whether all validation criteria are met. … Paths which pass all aspects of validation are added to the motion program database 230—for use in training the encoder/decoder neural network as shown in FIG. 2.”), ([0049] via “The path 520 is validated at 530, and validated paths are added to the database 230, as also shown in FIGS. 4 and 6.”), (Note: See Figures 3 and 6 of Wang as well.); and an actuation module configured to actuate the robot based on the selected k of the configurations based on actuating the robot to achieve the action described in the text ([0050] via “In production robotic operations, inputs 210 for a robot task are provided to the encoder/decoder 220 running on the controller 620, which computes a robot path at 250. The controller 620 provides the computed robot path to a robot running in the robotic workcell 630. The robot in the workcell 630 performs the operation using the motion program provided by the controller 620.”), (Note: See Figure 6 of Wang as well.). Wang is silent on each of the configurations including positions and orientations of portions of the robot; and the encoding module configured to receive text descriptive of an action to be performed by the robot and to encode the text into a text encoding. However, Aparicio teaches wherein each of the configurations include positions and orientations of portions of the robot ([0025] via “A simulation module 301 of the computing system 300 can be configured to perform simulations within the simulation environment that is generated. For example, at 304, the simulation module 301 can generate different grasp poses for a given end-effector to grasp the target object. Using the generated grasp poses, at 306, the simulation module 301 can execute each of the generated grasp poses on the target object.”). Further, Di Palo teaches an encoding module configured to receive text descriptive of an action to be performed by the robot and to encode the text into a text encoding ([0101] via “The system receives a natural language description of the task to be performed by the agent (step 202). For example, the system can receive the description as a text input from a user or an external system.”), ([0102] via “The system processes, using a language model neural network, an input sequence derived from the natural language description of the task to generate an output text sequence (step 204).”), (Note: See Figure 2 of Di Palo as well.). It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to incorporate the teachings of Aparicio wherein each of the configurations include positions and orientations of portions of the robot. Doing so renders parameters of the robot that would be required for the robot to perform the action, as stated by Aparicio ([0025] via “A simulation module 301 of the computing system 300 can be configured to perform simulations within the simulation environment that is generated. For example, at 304, the simulation module 301 can generate different grasp poses for a given end-effector to grasp the target object. Using the generated grasp poses, at 306, the simulation module 301 can execute each of the generated grasp poses on the target object. 
If the object is successfully grasped by a given grasp pose, the simulation can proceed to 308, where the given grasp pose is simulated along one or more trajectories. … If the object is not successfully grasped during the simulation at 304, the simulation can return to 302, where one or more additional grasp poses can be generated.”). In addition, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to incorporate the teachings of Di Palo wherein the encoding module is configured to receive text descriptive of an action to be performed by the robot and to encode the text into a text encoding. The courts have determined under the case KSR Int’l Co. v. Teleflex Inc., 550 U.S. 398, 415-421, 82 USPQ2d 1385, 1395-07 (2007), a number of rationales in which obviousness is concluded. The rationale that pertains to the present invention is rationale B: Simple Substitution of One Known Element for Another to Obtain Predictable Results. Specifically, in this case item 3 of rationale B is satisfied: a finding that one of ordinary skill in the art could have substituted one known element for another, and the results of the substitution would have been predictable. Text encoding input to describe an action to be performed by a robot is well known within robot control. While the invention of Wang teaches spoken language input to instruct/command a robot to perform an action, despite the lack of mention that the type of input includes a text input, the functionalities of the invention would still produce the same outcomes when swapping the spoken language input for the text input. Therefore, the simple substitution of a text input for the spoken language input would have been obvious to implement. Regarding Claim 3, modified reference Wang teaches the robot system of claim 1 wherein the selection module is configured to select k of the configurations with the k highest scores ([0034] via “A final validation of generated paths is performed at box 390, including ensuring that the generated path is collision-free, is feasible from a robot reachability standpoint, and that the workpiece or object being moved reaches and is stable at the goal position. At decision diamond 392, it is determined whether all validation criteria are met. … Paths which pass all aspects of validation are added to the motion program database 230—for use in training the encoder/decoder neural network as shown in FIG. 2.”). Regarding Claim 6, modified reference Wang teaches the robot system of claim 1, wherein the actuation module is configured to actuate the robot based on the k finetuned selected configurations ([0050] via “In production robotic operations, inputs 210 for a robot task are provided to the encoder/decoder 220 running on the controller 620, which computes a robot path at 250. The controller 620 provides the computed robot path to a robot running in the robotic workcell 630. The robot in the workcell 630 performs the operation using the motion program provided by the controller 620.”), (Note: See Figure 6 of Wang as well.). Wang is silent on the robot system further comprising a vision-language model (VLM) module and a projection module configured to finetune the selected k of the configurations. 
However, Di Palo teaches a vision-language model (VLM) module and a projection module configured to finetune the selected k of the configurations ([0045] via “When the system 100 is controlling the agent 104, the system 100 selects an action 108 to be performed by the agent in response to the current observation using a language-conditioned policy neural network 130.”), ([0125] via “In some other implementations, the system further trains (“fine-tunes”) the VLM to improve the accuracy of determining sub-goal completion using the VLM. For example, the system can fine-tune the VLM on a data set that includes images of the environment and corresponding text descriptions. For example, the data set can include images of the environment with various configurations of the agent and possible objects in the environment. This “in-domain” training data can make similarity scores generated using the VLM more accurately reflect similarities between sub-goals and observation images. In particular, the system can achieve this benefit with relatively few “in-domain” images.”). It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to incorporate the teachings of Di Palo wherein the robot system further comprises a vision-language model (VLM) module and a projection module configured to finetune the selected k of the configurations. Doing so improves the accuracy in determining the most optimal configuration out of the possible selectable configurations, as stated above by Di Palo in paragraph [0125]. Regarding Claim 10, modified reference Wang teaches the robot system of claim 1, but is silent on wherein the encoding module is configured to encode the text using a vision-language model (VLM) text encoding algorithm. However, Di Palo teaches wherein the encoding module is configured to encode the text using a vision-language model (VLM) text encoding algorithm ([0117] via “Returning to the description of FIG. 2, the system generates a respective text embedding for each of the sub-goals in the sequence by processing the natural language description of the sub-goal using a vision-language model (VLM) neural network (step 206).”). It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to incorporate the teachings of Di Palo wherein the encoding module is configured to encode the text using a vision-language model (VLM) text encoding algorithm. Doing so improves the way the robot interacts with its environment, as stated by Di Palo ([0034] via “The action selection system 100 uses a language model neural network 120 and a vision-language model (VLM) neural network 140 to (i) improve controlling an agent 104 interacting with an environment 106 to perform a task in the environment 106 or (ii) assist an agent 104 interacting with an environment 106 to perform a task in the environment 106.”) by taking in both text and image descriptions of the environment to train the robot on how to optimally perform a task, as stated by Di Palo ([0061] via “In either of these implementations, the system 100 can use descriptions generated by the language model neural network 120 and embeddings of sub-goals and observation images generated by the vision-language model (VLM) neural network 140 to improve the training of the policy neural network 130.”). 
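For context on the text-encoding/image-encoding comparison that claims 1, 10, and 16 recite, the following is a minimal illustrative sketch, assuming an off-the-shelf CLIP-style VLM loaded through Hugging Face transformers; it is not the applicant's implementation nor anything disclosed by Wang, Aparicio, or Di Palo. It encodes a text instruction and a set of rendered configuration images, scores each rendering by cosine similarity against the text encoding, and keeps the top-k configurations. The model name, file paths, and helper function are assumptions.

```python
# Illustrative sketch only (assumed CLIP-style VLM via Hugging Face transformers);
# not the claimed system or any cited reference's implementation.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def select_top_k_configurations(instruction, rendering_paths, k=3):
    """Score each rendered robot configuration against the text instruction by
    cosine similarity of VLM encodings and return the indices of the top k."""
    images = [Image.open(p) for p in rendering_paths]
    inputs = processor(text=[instruction], images=images,
                       return_tensors="pt", padding=True)
    with torch.no_grad():
        text_emb = model.get_text_features(input_ids=inputs["input_ids"],
                                           attention_mask=inputs["attention_mask"])
        image_embs = model.get_image_features(pixel_values=inputs["pixel_values"])
    # Normalize so the dot product of encodings equals cosine similarity.
    text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)
    image_embs = image_embs / image_embs.norm(dim=-1, keepdim=True)
    scores = (image_embs @ text_emb.T).squeeze(-1)      # one score per rendering
    top = torch.topk(scores, k=min(k, len(rendering_paths)))
    return top.indices.tolist(), top.values.tolist()

# Example usage (file names are hypothetical):
# idx, vals = select_top_k_configurations(
#     "place the workpiece in the fixture", ["cfg_0.png", "cfg_1.png", "cfg_2.png"], k=2)
```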
Regarding Claim 11, modified reference Wang teaches the robot system of claim 1, but is silent on wherein the encoding module includes a neural network configured to encode the text. However, Di Palo teaches wherein the encoding module includes a neural network configured to encode the text ([0048] via “In particular, the policy neural network 130 is a neural network that is configured to receive an input that includes text and data characterizing the state of the environment and to generate (depending on the input) a policy output 122 that defines an action to be performed by the agent.”). It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to incorporate the teachings of Di Palo wherein the encoding module includes a neural network configured to encode the text. Doing so improves the way the robot interacts with its environment, as stated by Di Palo ([0034] via “The action selection system 100 uses a language model neural network 120 and a vision-language model (VLM) neural network 140 to (i) improve controlling an agent 104 interacting with an environment 106 to perform a task in the environment 106 or (ii) assist an agent 104 interacting with an environment 106 to perform a task in the environment 106.”) by taking in both text and image descriptions of the environment to train the robot on how to optimally perform a task, as stated by Di Palo ([0061] via “In either of these implementations, the system 100 can use descriptions generated by the language model neural network 120 and embeddings of sub-goals and observation images generated by the vision-language model (VLM) neural network 140 to improve the training of the policy neural network 130.”). Regarding Claim 12, modified reference Wang teaches the robot system of claim 1 wherein each of the configurations includes three-dimensional coordinates of a portion of the robot in the environment ([0042] via “The simulation at the box 510 was created based on gripper and workpiece motions, which in turn were synthesized by the large language model to complete the defined task. As such, the simulated robot motions at the box 510 are not guaranteed to be collision-free, particularly when considering all elements of the robot arm along with the workpiece, and even more so when the obstacle environment is complex. Thus, the simulated robot motions from the box 510 are used as input to a path generation box 520, where RRT and/or optimization-based techniques are used to generate a collision-free path.”), (Note: See Figure 5 of Wang as well.). Regarding Claim 13, modified reference Wang teaches the robot system of claim 1, but is silent on wherein each of the configurations includes angles of a joint of the robot in the environment. However, Aparicio teaches wherein each of the configurations includes angles of a joint of the robot in the environment ([0018] via “Robot poses for grasping the object can define the position and orientation of the end effector 116 when the robot grasps and moves the target object.”), ([0025] via “A simulation module 301 of the computing system 300 can be configured to perform simulations within the simulation environment that is generated. For example, at 304, the simulation module 301 can generate different grasp poses for a given end-effector to grasp the target object. Using the generated grasp poses, at 306, the simulation module 301 can execute each of the generated grasp poses on the target object.”). 
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to incorporate the teachings of Aparicio wherein each of the configurations includes angles of a joint of the robot in the environment. Doing so renders parameters of the robot that would be required for the robot to perform the action, as stated by Aparicio ([0025] via “A simulation module 301 of the computing system 300 can be configured to perform simulations within the simulation environment that is generated. For example, at 304, the simulation module 301 can generate different grasp poses for a given end-effector to grasp the target object. Using the generated grasp poses, at 306, the simulation module 301 can execute each of the generated grasp poses on the target object. If the object is successfully grasped by a given grasp pose, the simulation can proceed to 308, where the given grasp pose is simulated along one or more trajectories. … If the object is not successfully grasped during the simulation at 304, the simulation can return to 302, where one or more additional grasp poses can be generated.”). Regarding Claim 14, modified reference Wang teaches the robot system of claim 1, but is silent on wherein each of the configurations includes three-dimensional coordinates of an object to be acted upon by the robot in the environment. However, Di Palo teaches wherein each of the configurations includes three-dimensional coordinates of an object to be acted upon by the robot in the environment ([0075] via “In these implementations, the observations may include, e.g., one or more of: images, object position data, and sensor data to capture observations as the agent interacts with the environment, for example sensor data from an image, distance, or position sensor or from an actuator. For example in the case of a robot, the observations may include data characterizing the current state of the robot, e.g., one or more of: joint position, joint velocity, joint force, torque or acceleration, e.g., gravity-compensated torque feedback, and global or relative pose of an item held by the robot. … The observations may be defined in 1, 2 or 3 dimensions, and may be absolute and/or relative observations.”). It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to incorporate the teachings of Di Palo wherein each of the configurations includes three-dimensional coordinates of an object to be acted upon by the robot in the environment. Doing so improves the image renderings of the environment by having more defined and absolute dimensions of the object in the environment, as stated above by Di Palo. Regarding Claim 15, modified reference Wang teaches the robot system of claim 1 wherein each of the configurations includes at least one dimension describing the orientation of an object to be acted upon by the robot in the environment ([0030] via “The asset library is essentially a database of computer-aided design (CAD) models of objects which may be used in a task simulation. The objects in the asset library may include geometry primitives (e.g., cubes, spheres), workpiece designs, fixtures and other obstacles known to exist in a robotic workspace, etc.”). Regarding Claim 16, modified reference Wang teaches the robot system of claim 1, but is silent on wherein the image encodings are generated using a vision-language model (VLM) image encoding algorithm based on the image renderings of configurations. 
However, Di Palo teaches wherein the image encodings are generated using a vision-language model (VLM) image encoding algorithm based on the image renderings of configurations ([0075] via “In these implementations, the observations may include, e.g., one or more of: images, object position data, and sensor data to capture observations as the agent interacts with the environment, for example sensor data from an image, distance, or position sensor or from an actuator. For example in the case of a robot, the observations may include data characterizing the current state of the robot, e.g., one or more of: joint position, joint velocity, joint force, torque or acceleration, e.g., gravity-compensated torque feedback, and global or relative pose of an item held by the robot.”), ([0131] via “The system generates, using the VLM neural network and from the current observation image, an observation embedding of the current observation image (step 212). That is, the system processes the observation image using the VLM to generate the embedding of the observation image. For example, the system can process the observation image using the image encoder of the VLM to generate the embedding.”). It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to incorporate the teachings of Di Palo wherein the image encodings are generated using a vision-language model (VLM) image encoding algorithm based on the image renderings of configurations. Doing so improves the way the robot interacts with its environment, as stated by Di Palo ([0034] via “The action selection system 100 uses a language model neural network 120 and a vision-language model (VLM) neural network 140 to (i) improve controlling an agent 104 interacting with an environment 106 to perform a task in the environment 106 or (ii) assist an agent 104 interacting with an environment 106 to perform a task in the environment 106.”) by taking in both text and image descriptions of the environment to train the robot on how to optimally perform a task, as stated by Di Palo ([0061] via “In either of these implementations, the system 100 can use descriptions generated by the language model neural network 120 and embeddings of sub-goals and observation images generated by the vision-language model (VLM) neural network 140 to improve the training of the policy neural network 130.”). Regarding Claim 17, modified reference Wang teaches the robot system of claim 1, but is silent on wherein the image renderings are generated using a MuJoCo rendering algorithm. However, Aparicio teaches wherein the image renderings are generated using a MuJoCo rendering algorithm ([0025] via “With continuing reference to FIG. 3, at 302, based on the robot models 310 and the object data 312, the computing system 200 can generate a simulation environment. In various examples, the simulation environment generated at 302 can define a multi-joint dynamics with contact (MuJoCo) environment or bullet physics-based (PyBullet) environment. … A robot represented by one of the robot models 310 can be spawned in the simulation environment at a predefined 6D coordinate pose.”). It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to incorporate the teachings of Aparicio wherein the image renderings are generated using a MuJoCo rendering algorithm. 
Doing so incorporates a known simulation environment that simulates the pose of the robot in multiple dimensions, as stated above by Aparicio. Regarding Claim 18, modified reference Wang teaches a training system comprising: the robot system of claim 1 (See the rejection of claim 1 under 35 U.S.C. 103 above.). Wang is silent on a rendering module configured to generate the image renderings based on the configurations, respectively; and a second encoding module configured to encode the image renderings into the image encodings, respectively. However, Aparicio teaches a rendering module configured to generate the image renderings based on the configurations, respectively ([0025] via “With continuing reference to FIG. 3, at 302, based on the robot models 310 and the object data 312, the computing system 200 can generate a simulation environment. ... A simulation module 301 of the computing system 300 can be configured to perform simulations within the simulation environment that is generated. For example, at 304, the simulation module 301 can generate different grasp poses for a given end-effector to grasp the target object. Using the generated grasp poses, at 306, the simulation module 301 can execute each of the generated grasp poses on the target object. If the object is successfully grasped by a given grasp pose, the simulation can proceed to 308, where the given grasp pose is simulated along one or more trajectories”). Further, Di Palo teaches a second encoding module configured to encode the image renderings into the image encodings, respectively ([0120] via “For example, the VLM can include an image encoder neural network, e.g., a Vision Transformer or a convolutional neural network, that processes an image to generate an embedding of the image and a text encoder neural network, e.g., a Transformer-based decoder neural network, that receives text as input and generates an embedding of the text. The image encoder and the text encoder can have been jointly trained, e.g., through contrastive learning.”). It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to incorporate the teachings of Aparicio wherein the training system comprises: a rendering module configured to generate the image renderings based on the configurations, respectively. Doing so generates different configurations of the robot to determine which configuration(s) results in a successful execution of the task, as stated above by Aparicio. In addition, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to incorporate the teachings of Di Palo wherein the training system comprises: a second encoding module configured to encode the image renderings into the image encodings, respectively. 
Doing so improves the way the robot interacts with its environment, as stated by Di Palo ([0034] via “The action selection system 100 uses a language model neural network 120 and a vision-language model (VLM) neural network 140 to (i) improve controlling an agent 104 interacting with an environment 106 to perform a task in the environment 106 or (ii) assist an agent 104 interacting with an environment 106 to perform a task in the environment 106.”) by taking in both text and image descriptions of the environment to train the robot on how to optimally perform a task, as stated by Di Palo ([0061] via “In either of these implementations, the system 100 can use descriptions generated by the language model neural network 120 and embeddings of sub-goals and observation images generated by the vision-language model (VLM) neural network 140 to improve the training of the policy neural network 130.”). Regarding Claim 19, Wang teaches a robot system comprising: image encodings generated based on image renderings of configurations, respectively of a robot in an environment ([0033] via “Using the code generated at the box 350, the operational task is executed in simulation at box 360, which includes simulating the articulation of the robot arm to move the workpiece according to the input task instruction. The simulated motions of the robot arm and workpiece from the box 360 are provided to a box 370, where RRT and/or optimization-based techniques are used to generate a collision-free path at box 380.”), ([0041] via “FIG. 5 is an illustration of the simulation and path generation steps of the path dataset generation pipeline of FIG. 3, … The complete robot motion program, including motions of all joints in the articulated robot, enables a simulation of robot and workpiece movement as shown in box 510.”); an encoding module configured to receive input descriptive of an action to be performed by the robot ([0038] via “The final input to the large language model 440 is a task instruction 430. A non-limiting illustrative example of a task instruction is as follows: “Move Object W from Point S (start point) out of Bin B and place the Object W at Point G (goal point) in Fixture F”. The task instruction 430 is defined in plain language (e.g., English) as it would be spoken. It can be observed that the example task instruction listed above requires the large language model 440 to identify objects (workpiece W, bin B, fixture F), understand geometric information (start and goal points), and figure out how to perform the overall task using the tasks in the task library and in view of implied constraints (moving the workpiece up and out of the bin and then to the goal point).”), (Note: The Examiner interprets the large language model of Wang as the encoding module.); a scoring module configured to generate scores for the configurations based on comparisons of (a) the text encoding with (b) the image encoding of the respective configuration ([0033] via “A large language model block 340 takes the task descriptions and object names from the box 330 and generates code at box 350 (in a programming language such as Python) to simulate the operation. The large language model in the block 340 is the same as the large language model in the block 320, essentially performing a second operation (code generation) after the previous operation (task breakdown). 
Using the code generated at the box 350, the operational task is executed in simulation at box 360, which includes simulating the articulation of the robot arm to move the workpiece according to the input task instruction.”), ([0034] via “A final validation of generated paths is performed at box 390, including ensuring that the generated path is collision-free, is feasible from a robot reachability standpoint, and that the workpiece or object being moved reaches and is stable at the goal position. At decision diamond 392, it is determined whether all validation criteria are met.”), (Note: See Di Palo below, which teaches the inputted description of an action to be performed by the robot is in the form of a text description. Further, the Examiner interprets the validation step of Wang as the scoring.); a selection module configured to select k of the configurations based on the scores, where k is an integer greater than or equal to 1 ([0034] via “At decision diamond 392, it is determined whether all validation criteria are met. … Paths which pass all aspects of validation are added to the motion program database 230—for use in training the encoder/decoder neural network as shown in FIG. 2.”), ([0049] via “The path 520 is validated at 530, and validated paths are added to the database 230, as also shown in FIGS. 4 and 6.”), (Note: See Figures 3 and 6 of Wang as well.); and an actuation module configured to actuate the robot based on a dot product of the k image encodings of the selected k of the configurations and actuating the robot to achieve the action described in the text ([0050] via “In production robotic operations, inputs 210 for a robot task are provided to the encoder/decoder 220 running on the controller 620, which computes a robot path at 250. The controller 620 provides the computed robot path to a robot running in the robotic workcell 630. The robot in the workcell 630 performs the operation using the motion program provided by the controller 620.”), (Note: See Figure 6 of Wang as well.). Wang is silent on each of the configurations including positions and orientations of portions of the robot; and the encoding module configured to receive text descriptive of an action to be performed by the robot and to encode the text into a text encoding. However, Aparicio teaches each of the configurations including positions and orientations of portions of the robot ([0025] via “A simulation module 301 of the computing system 300 can be configured to perform simulations within the simulation environment that is generated. For example, at 304, the simulation module 301 can generate different grasp poses for a given end-effector to grasp the target object. Using the generated grasp poses, at 306, the simulation module 301 can execute each of the generated grasp poses on the target object.”). Further, Di Palo teaches an encoding module configured to receive text descriptive of an action to be performed by the robot and to encode the text into a text encoding ([0101] via “The system receives a natural language description of the task to be performed by the agent (step 202). For example, the system can receive the description as a text input from a user or an external system.”), ([0102] via “The system processes, using a language model neural network, an input sequence derived from the natural language description of the task to generate an output text sequence (step 204).”), (Note: See Figure 2 of Di Palo as well.). 
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to incorporate the teachings of Aparicio wherein each of the configurations includes positions and orientations of portions of the robot. Doing so renders parameters of the robot that would be required for the robot to perform the action, as stated by Aparicio ([0025] via “A simulation module 301 of the computing system 300 can be configured to perform simulations within the simulation environment that is generated. For example, at 304, the simulation module 301 can generate different grasp poses for a given end-effector to grasp the target object. Using the generated grasp poses, at 306, the simulation module 301 can execute each of the generated grasp poses on the target object. If the object is successfully grasped by a given grasp pose, the simulation can proceed to 308, where the given grasp pose is simulated along one or more trajectories. … If the object is not successfully grasped during the simulation at 304, the simulation can return to 302, where one or more additional grasp poses can be generated.”). In addition, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to incorporate the teachings of Di Palo wherein the encoding module is configured to receive text descriptive of an action to be performed by the robot and to encode the text into a text encoding. The courts have determined under the case KSR Int’l Co. v. Teleflex Inc., 550 U.S. 398, 415-421, 82 USPQ2d 1385, 1395-07 (2007), a number of rationales in which obviousness is concluded. The rationale that pertains to the present invention is rationale B: Simple Substitution of One Known Element for Another to Obtain Predictable Results. Specifically, in this case item 3 of rationale B is satisfied: a finding that one of ordinary skill in the art could have substituted one known element for another, and the results of the substitution would have been predictable. Text encoding input to describe an action to be performed by a robot is well known within robot control. While the invention of Wang teaches spoken language input to instruct/command a robot to perform an action, despite the lack of mention that the type of input includes a text input, the functionalities of the invention would still produce the same outcomes when swapping the spoken language input for the text input. Therefore, the simple substitution of a text input for the spoken language input would have been obvious to implement. Regarding Claim 20, Wang teaches a method comprising: receiving image encodings generated based on image renderings of configurations, respectively of a robot in an environment ([0033] via “Using the code generated at the box 350, the operational task is executed in simulation at box 360, which includes simulating the articulation of the robot arm to move the workpiece according to the input task instruction. The simulated motions of the robot arm and workpiece from the box 360 are provided to a box 370, where RRT and/or optimization-based techniques are used to generate a collision-free path at box 380.”), ([0041] via “FIG. 5 is an illustration of the simulation and path generation steps of the path dataset generation pipeline of FIG. 
3, … The complete robot motion program, including motions of all joints in the articulated robot, enables a simulation of robot and workpiece movement as shown in box 510.”); receiving input descriptive of an action to be performed by the robot ([0038] via “The final input to the large language model 440 is a task instruction 430. A non-limiting illustrative example of a task instruction is as follows: “Move Object W from Point S (start point) out of Bin B and place the Object W at Point G (goal point) in Fixture F”. The task instruction 430 is defined in plain language (e.g., English) as it would be spoken. It can be observed that the example task instruction listed above requires the large language model 440 to identify objects (workpiece W, bin B, fixture F), understand geometric information (start and goal points), and figure out how to perform the overall task using the tasks in the task library and in view of implied constraints (moving the workpiece up and out of the bin and then to the goal point).”), (Note: The Examiner interprets the large language model of Wang as the encoding module.); generating scores for the configurations based on comparisons of (a) the text encoding with (b) the image encoding of the respective configuration ([0033] via “A large language model block 340 takes the task descriptions and object names from the box 330 and generates code at box 350 (in a programming language such as Python) to simulate the operation. The large language model in the block 340 is the same as the large language model in the block 320, essentially performing a second operation (code generation) after the previous operation (task breakdown). Using the code generated at the box 350, the operational task is executed in simulation at box 360, which includes simulating the articulation of the robot arm to move the workpiece according to the input task instruction.”), ([0034] via “A final validation of generated paths is performed at box 390, including ensuring that the generated path is collision-free, is feasible from a robot reachability standpoint, and that the workpiece or object being moved reaches and is stable at the goal position. At decision diamond 392, it is determined whether all validation criteria are met.”), (Note: See Di Palo below, which teaches the inputted description of an action to be performed by the robot is in the form of a text description. Further, the Examiner interprets the validation step of Wang as the scoring.); selecting k of the configurations based on the scores, where k is an integer greater than or equal to 1 ([0034] via “At decision diamond 392, it is determined whether all validation criteria are met. … Paths which pass all aspects of validation are added to the motion program database 230—for use in training the encoder/decoder neural network as shown in FIG. 2.”), ([0049] via “The path 520 is validated at 530, and validated paths are added to the database 230, as also shown in FIGS. 4 and 6.”), (Note: See Figures 3 and 6 of Wang as well.); and actuating the robot based on the selected k of the configurations based on actuating the robot to achieve the action described in the text ([0050] via “In production robotic operations, inputs 210 for a robot task are provided to the encoder/decoder 220 running on the controller 620, which computes a robot path at 250. The controller 620 provides the computed robot path to a robot running in the robotic workcell 630. 
The robot in the workcell 630 performs the operation using the motion program provided by the controller 620.”), (Note: See Figure 6 of Wang as well.). Wang is silent on each of the configurations including positions and orientations of portions of the robot; and receiving text descriptive of an action to be performed by the robot and to encode the text into a text encoding. However, Aparicio teaches each of the configurations including positions and orientations of portions of the robot ([0025] via “A simulation module 301 of the computing system 300 can be configured to perform simulations within the simulation environment that is generated. For example, at 304, the simulation module 301 can generate different grasp poses for a given end-effector to grasp the target object. Using the generated grasp poses, at 306, the simulation module 301 can execute each of the generated grasp poses on the target object.”). Further, Di Palo teaches receiving text descriptive of an action to be performed by the robot and to encode the text into a text encoding ([0101] via “The system receives a natural language description of the task to be performed by the agent (step 202). For example, the system can receive the description as a text input from a user or an external system.”), ([0102] via “The system processes, using a language model neural network, an input sequence derived from the natural language description of the task to generate an output text sequence (step 204).”), (Note: See Figure 2 of Di Palo as well.). It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to incorporate the teachings of Aparicio wherein each of the configurations includes positions and orientations of portions of the robot. Doing so renders parameters of the robot that would be required for the robot to perform the action, as stated by Aparicio ([0025] via “A simulation module 301 of the computing system 300 can be configured to perform simulations within the simulation environment that is generated. For example, at 304, the simulation module 301 can generate different grasp poses for a given end-effector to grasp the target object. Using the generated grasp poses, at 306, the simulation module 301 can execute each of the generated grasp poses on the target object. If the object is successfully grasped by a given grasp pose, the simulation can proceed to 308, where the given grasp pose is simulated along one or more trajectories. … If the object is not successfully grasped during the simulation at 304, the simulation can return to 302, where one or more additional grasp poses can be generated.”). In addition, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to incorporate the teachings of Di Palo of receiving text descriptive of an action to be performed by the robot and to encode the text into a text encoding. The courts have determined under the case KSR Int’l Co. v. Teleflex Inc., 550 U.S. 398, 415-421, 82 USPQ2d 1385, 1395-07 (2007), a number of rationales in which obviousness is concluded. The rationale that pertains to the present invention is rationale B: Simple Substitution of One Known Element for Another to Obtain Predictable Results. Specifically, in this case item 3 of rationale B is satisfied: a finding that one of ordinary skill in the art could have substituted one known element for another, and the results of the substitution would have been predictable. 
Text encoding input to describe an action to be performed by a robot is well known within robot control. While the invention of Wang teaches spoken language input to instruct/command a robot to perform an action, despite the lack of mention that the type of input includes a text input, the functionalities of the invention would still produce the same outcomes when swapping the spoken language input for the text input. Therefore, the simple substitution of a text input for the spoken language input would have been obvious to implement. 12. Claim(s) 2 is/are rejected under 35 U.S.C. 103 as being unpatentable over Wang et al. (US 20250319593 A1 hereinafter Wang) in view of Aparicio et al. (US 20240198526 A1 hereinafter Aparicio) and Di Palo et al. (US 20240311617 A1 hereinafter Di Palo), and further in view of Khaldi et al. ("Flexible and Efficient Topological Approaches for a Reliable Robots Swarm Aggregation" hereinafter Khaldi). Regarding Claim 2, modified reference Wang teaches the robot system of claim 1, but is silent on wherein the scoring module is configured to generate the scores using cosine similarity. However, Khaldi teaches wherein the scoring module is configured to generate the scores using cosine similarity (Section IV A paragraphs 1-4 via “Prior to introducing how each agent selects its k-NN that will interact with, we shall first briefly provide insight into the different k-NN based association distance metrics that are usually used in machine learning and from which we take inspiration to develop our newly proposed aggregation methods. k-NN is considered as one of the simplest lazy classification/regression algorithm that predicts the output for a new query point according to a similarity measure among its k closest neighbors in the training dataset [22]. The similarity between points is commonly determined using a distance function; the most used one is Euclidean distance. In literature and based on the properties of data, other sorts of distance measures exist. To this end, we use in our study the cosine distance (CosDist), the angular distance (AngDist), and the Minkowski distance (MinkDist) metrics. … Definition 1: For a given feature space, S, of dimensionality m, and having A = (x1, x2, …, xm) and B = (y1, y2, …, ym) as two points (feature vectors) in S. Then, the cosine similarity (CosSim) between A and B is defined as: C o s S i m A , B ( m ) = A ( m ) ∙   B ( m ) A ( m ) B ( m ) ,   ( 4 ) . Their CosDist is: C o s D i s t A ,   B ( m ) = 1 - C o s S i m A , B ( m ) … The cosine and angular distance metrics belong to the family of the cosine similarity, which just measures the cosine of the angle between two feature vectors. This family of metrics is mostly used in text mining and information retrieval. The Minkowski distance of order r is considered as a generalization form of the standard Euclidean metric (case r = 2), it is commonly applied in clustering methods.”). It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to incorporate the teachings of Khaldi wherein the scoring module is configured to generate the scores using cosine similarity. Doing so improves the quality of the classification of the selection method, as stated by Khaldi (Section VII paragraph 2 via “On the other hand, in machine learning, distance metrics such as Euclidean, Cosine, Angular, Minkowski, and others can be used based on the property of data to improve k-NN classification/regression algorithms. 
So far, such metrics have not yet been applied to calculate distances between neighboring robots at real-time. Therefore, in this paper, three topological aggregation methods based on the Cosine, Angular, and Minkowski distance functions have been proposed to specifically improve the performance quality of the DW-KNN aggregation method.”). 13. Claim(s) 4 and 5 is/are rejected under 35 U.S.C. 103 as being unpatentable over Wang et al. (US 20250319593 A1 hereinafter Wang) in view of Aparicio et al. (US 20240198526 A1 hereinafter Aparicio) and Di Palo et al. (US 20240311617 A1 hereinafter Di Palo), and further in view of Clark et al. (US 20230015238 A1 hereinafter Clark). Regarding Claim 4, modified reference Wang teaches the robot system of claim 1, but is silent on wherein the renderings include at least two different image renderings of each configuration from different points of view. However, Clark teaches wherein the renderings include at least two different image renderings of each configuration from different points of view ([0044] via “Each of these approaches utilizes a set of images, and over time creates an accurate representation of the tool center point. In one embodiment, the images may be taken by different cameras, providing different vantage points of a stationary robotic arm with the tool.”). It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to incorporate the teachings of Clark wherein the renderings include at least two different image renderings of each configuration from different points of view. Doing so more accurately captures the configuration of the robot, as stated by Clark ([0047] via “If the data set evaluator 365 determines that the selections are not sufficiently accurate, in one embodiment vantage selection 370 selects a new vantage point for additional images. … In one embodiment, the data set evaluator 365 may alternatively trigger a reevaluation of an existing image, with additional helper data from helper data system 335. For example, if the first set of images are inaccurate, but with the additional helper data the system and/or the user may be able to select the tool contact area more accurately, the images may be reprocessed.”). Regarding Claim 5, modified reference Wang teaches the robot system of claim 4, but is silent on wherein the different points of view are on a same horizontal plane. However, Clark teaches wherein the different points of view are on a same horizontal plane ([0066] via “At block 530, another image from a different vantage point is displayed. In one embodiment, helper data based on prior selections is shown, to assist the user in selecting the tool contact area. FIGS. 5C and 5D illustrate one example of helper data. The helper data shown are the lines of sight from the prior images, showing the tool contact area previously selected.”), (Note: See Figures 5B-5D of Clark as well.). It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to incorporate the teachings of Clark wherein the different points of view are on a same horizontal plane. Collecting images of the configuration of the robot from different viewpoints more accurately captures the configuration of the robot, as stated by Clark ([0047] via “If the data set evaluator 365 determines that the selections are not sufficiently accurate, in one embodiment vantage selection 370 selects a new vantage point for additional images. 
… In one embodiment, the data set evaluator 365 may alternatively trigger a reevaluation of an existing image, with additional helper data from helper data system 335. For example, if the first set of images are inaccurate, but with the additional helper data the system and/or the user may be able to select the tool contact area more accurately, the images may be reprocessed.”). 14. Claim(s) 7 is/are rejected under 35 U.S.C. 103 as being unpatentable over Wang et al. (US 20250319593 A1 hereinafter Wang) in view of Aparicio et al. (US 20240198526 A1 hereinafter Aparicio) and Di Palo et al. (US 20240311617 A1 hereinafter Di Palo), and further in view of Thrun ("A Probabilistic On-Line Mapping Algorithm for Teams of Mobile Robots" hereinafter Thrun). Regarding Claim 7, modified reference Wang teaches the robot system of claim 1, but is silent on wherein the projection module is configured to finetune the selected k configurations based on one of gradient ascent and projected gradient ascent. However, Thrun teaches wherein the projection module is configured to finetune the selected k configurations based on one of gradient ascent and projected gradient ascent (Section 3.2 paragraph 1 via “We now discuss how to find the most likely pose s t ^ , that is, how to compute (18). The basic problem is the continuous nature of the space of all robot poses, which makes it impossible to search this space exhaustively. In our approach, s t ^ is found using gradient ascent in log likelihood space. It is common practice to maximize the log likelihood instead of the likelihood, since the log likelihood is usually mathematically easier to handle. Note that maximizing the log likelihood is justified by the fact that the logarithm is a strictly monotonic function.”), (Section 3.2 paragraph 4 via “Gradient ascent then maximizes the log likelihood by successively changing the pose st in the direction of the gradient: s t ← s t + α ∇ s t L . (23) Here, α > 0 is a step size commonly used in gradient ascent.”). It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to incorporate the teachings of Thrun wherein the projection module is configured to finetune the selected k configurations based on one of gradient ascent and projected gradient ascent. Doing so incorporates a mathematical approach that maximizes the incredibly large search space of the robot poses in a way that is easier to compute, as stated above by Thrun in section 3.2 paragraph 1. Examiner’s Note 15. The Examiner has cited particular paragraphs or columns and line numbers in the references applied to the claims above for the convenience of the Applicant. Although the specified citations are representative of the teachings of the art and are applied to specific limitations within the individual claim, other passages and figures may apply as well. It is respectfully requested of the Applicant in preparing responses, to fully consider the references in their entirety as potentially teaching all or part of the claimed invention, as well as the context of the passage as taught by the prior art or disclosed by the Examiner. See MPEP 2141.02 [R-07.2015] VI. A prior art reference must be considered in its entirety, i.e., as a whole, including portions that would lead away from the claimed Invention. W.L. Gore & Associates, Inc. v. Garlock, Inc., 721 F.2d 1540, 220 USPQ 303 (Fed. Cir. 1983), cert, denied, 469 U.S. 851 (1984). See also MPEP §2123. Allowable Subject Matter 16. 
Allowable Subject Matter

16. Claims 8 and 9 are objected to as being dependent upon a rejected base claim, but would be allowable if rewritten in independent form including all of the limitations of the base claim and any intervening claims.

17. Claim 21 is allowed.

18. The following is a statement of reasons for the indication of allowable subject matter. Regarding independent Claim 21, the prior art fails to disclose the limitation of: “wherein the scoring module is configured to generate a score for one of the configurations based on (a) a first score for the one of the configurations generated based on a first comparison of the text encoding with a first image encoding of the one of the configurations generated based on a first point of view and (b) a second score for the one of the configurations generated based on a second comparison of the text encoding with a second image encoding of the one of the configurations generated based on a second point of view that is different than the first point of view,” recited in lines 15-21 of claim 21. These specific limitations in combination with the other limitations are therefore novel over the prior art. JP 2022009697 A was deemed to be the closest prior art to this section of the claims. However, JP 2022009697 A fails to disclose multiple concepts that are integral to the scope of the present application, including generating a score by comparing a text encoding to image encodings, and using different viewpoints of the same configuration to develop the total score of each configuration. The invention of JP 2022009697 A simulates a robot performing a task with a plurality of workpieces, and scores the simulations as successful or unsuccessful based on certain criteria. While images of the simulated environment of JP 2022009697 A can be captured from multiple viewpoints, each viewpoint is not itself compared against the other viewpoints to aggregate a score for the full configuration. Further, while the simulated environment of JP 2022009697 A receives input from a user, this input is only in the form of model types and positions, not of an action to be performed by the robot. Thus, it would not be obvious to one of ordinary skill in the art to incorporate the teachings of JP 2022009697 A for these reasons, nor is there any obvious combination of JP 2022009697 A with the previously presented prior art (US 20250319593 A1 to Wang, US 20240198526 A1 to Aparicio, and US 20240311617 A1 to Di Palo), nor any other reference, that would teach such concepts. As such, the claims have been deemed allowable over the prior art of record.
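To illustrate the structure of the allowed limitation (a per-configuration score built from comparisons of one text encoding against image encodings of the same configuration rendered from two different points of view), here is a minimal, hypothetical sketch using cosine similarity and mean aggregation. The embeddings are random stand-ins; a real system would obtain them from a vision-language model, and nothing below is asserted to be the claimed scoring module.

```python
import numpy as np

def cosine_similarity(u, v):
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def configuration_score(text_embedding, view_embeddings):
    """Score one configuration against a text embedding.

    Each entry of view_embeddings is an image embedding of the same
    configuration rendered from a different point of view; the per-view
    similarities are averaged into a single score for the configuration.
    """
    per_view = [cosine_similarity(text_embedding, v) for v in view_embeddings]
    return float(np.mean(per_view))

# Random stand-ins; a real system would obtain these from a vision-language model.
rng = np.random.default_rng(0)
text_embedding = rng.normal(size=512)
view_embeddings = [rng.normal(size=512), rng.normal(size=512)]  # two viewpoints
score = configuration_score(text_embedding, view_embeddings)
```

The key point the examiner identified as missing from the prior art is that both per-view comparisons contribute to a single score for the same configuration, rather than each viewpoint being evaluated in isolation.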
Conclusion

19. Applicant’s amendment necessitated the new ground(s) of rejection presented in this Office action. Accordingly, THIS ACTION IS MADE FINAL. See MPEP § 706.07(a). Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a). A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action. In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any nonprovisional extension fee (37 CFR 1.17(a)) pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action. In no event, however, will the statutory period for reply expire later than SIX MONTHS from the mailing date of this final action.

20. Any inquiry concerning this communication or earlier communications from the examiner should be directed to BYRON X KASPER, whose telephone number is (571) 272-3895. The examiner can normally be reached Monday through Friday, 8 am to 5 pm EST. Examiner interviews are available via telephone, in person, and video conferencing using a USPTO-supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice. If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Adam Mott, can be reached at (571) 270-5376. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300. Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.

/BYRON XAVIER KASPER/
Examiner, Art Unit 3657

/ADAM R MOTT/
Supervisory Patent Examiner, Art Unit 3657

Prosecution Timeline

May 14, 2024
Application Filed
Oct 02, 2025
Non-Final Rejection — §103
Dec 09, 2025
Examiner Interview Summary
Dec 09, 2025
Applicant Interview (Telephonic)
Dec 18, 2025
Response Filed
Mar 04, 2026
Final Rejection — §103 (current)

Precedent Cases

Applications granted by this same examiner with similar technology

Patent 12594964
METHOD OF AND SYSTEM FOR GENERATING REFERENCE PATH OF SELF DRIVING CAR (SDC)
2y 5m to grant · Granted Apr 07, 2026
Patent 12594137
HARD STOP PROTECTION SYSTEM AND METHOD
2y 5m to grant · Granted Apr 07, 2026
Patent 12583101
METHOD FOR OPERATING A MODULAR ROBOT, MODULAR ROBOT, COLLISION AVOIDANCE SYSTEM, AND COMPUTER PROGRAM PRODUCT
2y 5m to grant · Granted Mar 24, 2026
Patent 12576529
ROBOT SIMULATION DEVICE
2y 5m to grant · Granted Mar 17, 2026
Patent 12564962
ROBOT REMOTE OPERATION CONTROL DEVICE, ROBOT REMOTE OPERATION CONTROL SYSTEM, ROBOT REMOTE OPERATION CONTROL METHOD, AND NON-TRANSITORY COMPUTER READABLE MEDIUM
2y 5m to grant · Granted Mar 03, 2026
Study what changed to get past this examiner. Based on 5 most recent grants.


Prosecution Projections

3-4
Expected OA Rounds
70%
Grant Probability
88%
With Interview (+18.4%)
3y 0m
Median Time to Grant
Moderate
PTA Risk
Based on 103 resolved cases by this examiner. Grant probability derived from career allow rate.
