Last updated: May 29, 2026
Application No. 18/168,482
TECHNIQUES FOR CONTROLLING ROBOTS WITHIN ENVIRONMENTS MODELED BASED ON IMAGES

Final Rejection §103
Filed
Feb 13, 2023
Priority
Aug 29, 2022 — provisional 63/373,846
Examiner
WATTS III, JAMES MILLER
Art Unit
3657
Tech Center
3600 — Transportation & Electronic Commerce
Assignee
Nvidia Corporation
OA Round
4 (Final)
Interview Optional

— +15.5% interview lift. Interview already conducted in this application's prosecution history. This examiner has a 74% grant rate with +15.5% interview lift. Since an interview has already been tried, recommend written response with narrowed claims based on precedent claim evolution patterns.
Based on 46 resolved cases, 2023–2026
Examiner Intelligence

WATTS III, JAMES MILLER View full profile →
Grants 74% — above average
Career Allowance Rate
34 granted / 46 resolved
+21.9% vs TC avg
Strong +16% interview lift
Without
With
+15.5%
Interview Lift
resolved cases with interview
Typical timeline
2y 8m
Avg Prosecution
12 currently pending
Career history
Total Applications
across all art units
Statute-Specific Performance

§101
3.4%
-36.6% vs TC avg
§103
92.4%
+52.4% vs TC avg
§102
1.7%
-38.3% vs TC avg
§112
2.5%
-37.5% vs TC avg
Black line = Tech Center average estimate • Based on career data from 46 resolved cases
Office Action

§103
DETAILED ACTION
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA .

Response to Arguments 
Applicant’s arguments with respect to claim(s) 1-20 have been considered but are moot. Applicant’s amendments have changed the scope of the claims, and thus necessitated new grounds of rejection in view of Mildenhall and Park. In the proposed combination, Mildenhall and Park together suggest the benefits of sampling virtual images from a neural radiance field in order to generate signed distance functions representing an environment.

Claim Rejections - 35 USC § 103
The text of those sections of Title 35, U.S. Code not included in this action can be found in a prior Office action.

Claim(s) 1, 5, 9-11, 15, 18-20 is/are rejected under 35 U.S.C. 103 as being unpatentable over Schoessler (US 20220288781 A1) in view of Rajkumar (US 10417781 B1), Park ("Deepsdf: Learning continuous signed distance functions for shape representation."), and Mildenhall ("Nerf: Representing scenes as neural radiance fields for view synthesis.")

Claim 1
Schoessler teaches
generating a representation of spatial occupancy within the environment based on the plurality of RGB images of the environment and the one or more camera poses, 
(Schoessler - [0033] FIG. 4 illustrates an example multimodal sensor setup 400. In particular embodiments, the robotic system 100 may use sensor data captured by multimodal sensors for task-based extrapolation for robot-human collision avoidance. … The multimodal sensor may be a camera 420, … As an example and not by way of limitation, the camera 420 may be based on RGB signal or kinetic signal.
[0034] In particular embodiments, the robotic system 100 may perform human pose estimation (e.g., how is a human moving) to determine the pose of each person in the environment based on sensor data captured by the multimodal sensors. As an example and not by way of limitation, human pose estimation may be based on RGB data captured by RGB cameras. 
[0035] In particular embodiments, the robotic system 100 may perform object detection (what objects are present in the environment), object segmentation, object localization, and object tracking (where are the objects) based on sensor data captured by the multimodal sensors.
determining one or more actions for the robot based on the representation of spatial occupancy and a goal; 
(Schoessler - [0049] After generating the occupancy map, the trajectory planning/adjustment module 728 may use different existing trajectory planning algorithms to figure out the best path that would avoid the volume that is defined by the occupancy map. Depending on the trajectory plan, the trajectory planning/adjustment module 728 may decide to delay or reschedule a task once the forecaster shows a “window of opportunity” (e.g. time picking up a tool by knowledge of future human actions). After a trajectory has been planned and the robot system 100 is in motion, the pose estimation and human motion prediction may keep updating in parallel. If a potential collision is predicted, the robot system 100 may adjust the current trajectory. 
[0023] In particular embodiments, the onboard computing system 152 may instruct the robotic limb 102 to achieve a desired pose. The onboard computing system 152 may access sensor data representing a scene from one or more sensors.)
and causing the robot to perform at least a portion of a movement based on the one or more actions.
(Schoessler - [0020] … As depicted, the one or more processor(s) 154 may be operably coupled with the memory 156 to perform various algorithms for instructing the robotic limb 102 to perform different operations. Such programs or instructions executed by the processor(s) 154 may be stored in any suitable article of manufacture that includes one or more tangible, computer-readable media at least collectively storing the instructions or routines, such as the memory 156. 
[0030] Certain embodiments disclosed herein may provide one or more technical advantages. A technical advantage of the embodiments may include improving safety of operations by robots when collaborating with human as the robotic system 100 may plan movement trajectories and tasks in a way that minimizes robot-human interference.)
EXAMINER NOTE: The above sections are included to promote clarity. The processor causes the robot to perform different operations, including movement tasks. 

Schoessler broadly discloses the generation of the occupancy representation. Schoessler does not explicitly teach the determination of camera poses from the RGB images. However, Rajkumar teaches 3D scene reconstruction applicable to Schoessler’s teachings, which involves
determining, based on image data from a plurality of red, green, blue (RGB) images of an environment, one or more camera poses associated with the plurality of RGB images;
(Rajkumar - [col 8, ln 49-61] In some implementations, the 3D scene reconstruction engine 109 applies estimation techniques such as structure from motion, … to generate the 3D representation of the scene or determine the relative pose of the object of interest. Structure from motion is a technique that allows for reconstructing a scene in three dimensions by inferring the geometrical features of the scene from camera motion. Visual odometry is a process of determining the position and orientation of an object by analyzing the associated camera images. For example, when odometry data 105 is not available, the differences in camera positions can be inferred based on the differences in image content
[col 5, ln 60-63] … For example, a RGB-D (red, green, blue, and depth) sensing camera can be used. )
generating a representation of spatial occupancy within the environment based on the plurality of images of the environment and the one or more camera poses;
(Rajkumar - [col 8, ln 18-33] For example, the automated annotation system 100 uses a 3D scene reconstruction engine 109 to create a 3D representation of the scene that includes the object of interest 203. This 3D representation of the scene is created by aligning multiple 2D images 103 and their corresponding depth information. Each of the images 103 can represent a view from a different position relative to the scene, and the odometry data 105 can indicate the differences in the camera positions. The odometry data 105 can thus be used to register the different camera positions in a 3D coordinate system. The depth information 104 can include a depth map showing distances of objects from the camera 130 at some or all regions of captured images 130. Using the known offsets between camera positions, the depth information 104 can be used to specify the shapes and contours of different sides of the objects to create the 3D representation.
[col 8 ln 34-48] The 3D scene reconstruction engine 109 uses the 3D representation of the scene to determine the absolute pose of the object of interest 203 within the scene. For example, the 3D scene reconstruction engine 109 can use multiple images 103 and their depth information 104 to set or refine the 3D pose and 3D bounding region for an object in the scene. For example, if CAD model 155 data is not available for an object, the overall shape of the object may not be known ahead of time. Nevertheless, depth information acquired from different camera positions can be used to determine different surfaces of the object, which can be pieced together to indicate the overall shape or volume of the object. Similarly, the relationship between reference positions on the object and a reference position, e.g., coordinate system, for the 3D representation of the scene can be determined.)

Note that Rajkumar utilizes depth images information (see at least [col 8, ln 34-48]), and Schoesler uses voxel maps (see [0048]) for spatial occupancy representation. While these representations can work, Park highlights the shortcomings of each of these methods and provides a better solution to represent spatial occupancy with signed distance functions. 
(Park – [Abstract] In this work, we introduce DeepSDF, a learned continuous Signed Distance Function (SDF) representation of a class of shapes that enables high quality shape representation, interpolation and completion from partial and noisy 3D input data.
[p.166, col 1, last line thru col 2, ln 10] Point-based. A point cloud is a lightweight 3D representation that closely matches the raw data that many sensors (i.e. LiDARs, depth cameras) provide, and hence is a natural fit for applying 3D learning. … A primary limitation, however, of learning with point clouds is that they do not describe topology and are not suitable for producing watertight surfaces.
[p.166, col 2, ln 34-46] Voxel-based. Voxels, which non-parametrically describe volumes with 3D grids of values, are perhaps the most natural extension into the 3D domain of the well-known learning paradigms (i.e., convolution) that have excelled in the 2D image domain. The most straightforward variant of voxel based learning is to use a dense occupancy grid (occupied / not occupied). Due to the cubically growing compute and memory requirements, however, current methods are only able to handle low resolutions (1283 or below). As such, voxel-based approaches do not preserve fine shape details [54, 13], and additionally voxels visually appear significantly different than high-fidelity shapes, since when rendered their normals are not smooth.)
EXAMINER NOTE: depth maps from Lidar or depth cameras (used by Rajkumar) do not capture the entire geometry of the object and can leave holes (not watertight). Voxel-based methods are straightforward, but do not preserve fine shape details.

In presenting the improved signed distance function-based representation of occupancy, Park teaches
wherein the representation of spatial occupancy is generated by: …
generating the representation of spatial occupancy based on the one or more virtual depth images … 
wherein the representation of spatial occupancy comprises a signed distance function
(Park - [p. 169-170, 5. Data Preparation] To train our continuous SDF model, we prepare the SDF samples X (Eq. 2) for each mesh, which consists of 3D points and their SDF values. While SDF can be computed through a distance transform for any watertight shapes from real or synthetic data, we train with synthetic objects, (e.g. ShapeNet [10]), for which we are provided complete 3D shape meshes. To prepare data, we start by normalizing each mesh to a unit sphere and sampling 500,000 spatial points x’s: … To obtain the shell of a mesh with proper orientation, we set up equally spaced virtual cameras around the object, and densely sample surface points, denoted Ps, with surface normals oriented towards the camera.)

Park’s signed distance functions are trained on virtual images obtained from the ShapeNet dataset. While this method of generating a signed distance function proves effective, Mildenhall highlights the limitations associated with virtual images from synthetic datasets.
(Mildenhall – [p. 100, col 1-2, 2.1 Neural 3D shape representations] Recent work has investigated the implicit representation of continuous 3D shapes as level sets by optimizing deep net works that map xyz coordinates to signed distance functions15 or occupancy fields.11 However, these models are limited by their requirement of access to ground truth 3D geometry, typically obtained from synthetic 3D shape datasets such as ShapeNet.2  …
…Though these techniques can potentially represent complicated and high-resolution geometry, they have so far been limited to simple shapes with low geometric complexity, resulting in oversmoothed renderings. We show that an alternate strategy of optimizing networks to encode 5D radiance fields (3D volumes with 2D view dependent appearance) can represent higher resolution geometry and appearance to render photorealistic novel views of complex scenes.

Mildenhall proposes the use of neural radiance fields to improve model fidelity, and thus teaches
wherein the representation of spatial occupancy is generated by:
generating a reconstruction of the environment using a neural radiance field (NeRF) model;
(Mildenhall - [p.101, 3. NEURAL RADIANCE FIELD SCENE REPRESENTATION] We represent a continuous scene as a 5D vector-valued func tion whose input is a 3D location x = (x, y, z) and 2D viewing direction (θ, φ), and whose output is an emitted color c = (r, g, b) and volume density σ. In practice, we express direction as a 3D Cartesian unit vector d. We approximate this continuous 5D scene representation with an MLP network FΘ : (x, d) → (c, σ) and optimize its weights Θ to map from each input 5D coordinate to its corresponding volume density and directional emitted color.
[p.101, 4. VOLUME RENDERING WITH RADIANCE FIELDS] Our 5D neural radiance field represents a scene as the volume density and directional emitted radiance at any point in space.)
generating the representation of spatial occupancy based on … the plurality of RGB images, 
(Mildenhall - [p. 1025.2. Implementation details] We optimize a separate neural continuous volume representation network for each scene. This requires only a dataset of captured RGB images of the scene, the corresponding camera poses and intrinsic parameters, and scene bounds (we use ground truth camera poses, intrinsics, and bounds for synthetic data, and use the COLMAP structure-from-motion package18 to estimate these parameters for real data). At each optimization iteration, we randomly sample a batch of camera rays from the set of all pixels in the dataset. We query the network at N random points along each ray and then use the volume rendering procedure described in Section 4 to render the color of each ray using these samples.)
EXAMINER NOTE: Because the RGB camera images are used to generate the neural radiance field reconstruction, and the reconstruction is used to generate the SDF, the RGB camera images are used to generate the representation of spatial occupancy.

It would have been obvious to one of ordinary skill in the art before the effective filing date of the invention to modify Schoessler’s robotic control scheme with Rajkumar’s suggestion to further base spatial occupancy on camera pose in order to better determine reference positions of objects relative to a common coordinate system. To further improve the fidelity and memory requirements, it would have been obvious to incorporate Park’s suggestions to use signed distance functions to represent occupancy, and Mildenhall’s suggestion to utilize Neural Radiance fields in order to generate quality signed distance functions.

Claim 11
Schoessler teaches
One or more non-transitory computer-readable media storing instructions that, when executed by at least one processor, cause the at least one processor to perform the steps of:
(Schoessler - [0020] For example, in some embodiments, the onboard computing system 152 may include, among other things, one or more processor(s) 154, memory 156, ...  As depicted, the one or more processor(s) 154 may be operably coupled with the memory 156 to perform various algorithms for instructing the robotic limb 102 to perform different operations.)
generating a representation of spatial occupancy within the environment based on the plurality of RGB images of the environment and the one or more camera poses, 
(Schoessler - [0033] FIG. 4 illustrates an example multimodal sensor setup 400. In particular embodiments, the robotic system 100 may use sensor data captured by multimodal sensors for task-based extrapolation for robot-human collision avoidance. … The multimodal sensor may be a camera 420, … As an example and not by way of limitation, the camera 420 may be based on RGB signal or kinetic signal.
[0034] In particular embodiments, the robotic system 100 may perform human pose estimation (e.g., how is a human moving) to determine the pose of each person in the environment based on sensor data captured by the multimodal sensors. As an example and not by way of limitation, human pose estimation may be based on RGB data captured by RGB cameras. 
[0035] In particular embodiments, the robotic system 100 may perform object detection (what objects are present in the environment), object segmentation, object localization, and object tracking (where are the objects) based on sensor data captured by the multimodal sensors.
determining one or more actions for the robot based on the representation of spatial occupancy and a goal; 
(Schoessler - [0049] After generating the occupancy map, the trajectory planning/adjustment module 728 may use different existing trajectory planning algorithms to figure out the best path that would avoid the volume that is defined by the occupancy map. Depending on the trajectory plan, the trajectory planning/adjustment module 728 may decide to delay or reschedule a task once the forecaster shows a “window of opportunity” (e.g. time picking up a tool by knowledge of future human actions). After a trajectory has been planned and the robot system 100 is in motion, the pose estimation and human motion prediction may keep updating in parallel. If a potential collision is predicted, the robot system 100 may adjust the current trajectory. 
[0023] In particular embodiments, the onboard computing system 152 may instruct the robotic limb 102 to achieve a desired pose. The onboard computing system 152 may access sensor data representing a scene from one or more sensors.)
and causing the robot to perform at least a portion of a movement based on the one or more actions.
(Schoessler - [0020] … As depicted, the one or more processor(s) 154 may be operably coupled with the memory 156 to perform various algorithms for instructing the robotic limb 102 to perform different operations. Such programs or instructions executed by the processor(s) 154 may be stored in any suitable article of manufacture that includes one or more tangible, computer-readable media at least collectively storing the instructions or routines, such as the memory 156. 
[0030] Certain embodiments disclosed herein may provide one or more technical advantages. A technical advantage of the embodiments may include improving safety of operations by robots when collaborating with human as the robotic system 100 may plan movement trajectories and tasks in a way that minimizes robot-human interference.)
EXAMINER NOTE: The above sections are included to promote clarity. The processor causes the robot to perform different operations, including movement tasks. 

Schoessler broadly discloses the generation of the occupancy representation. Schoessler does not explicitly teach the determination of camera poses from the RGB images. However, Rajkumar teaches 3D scene reconstruction applicable to Schoessler’s teachings, which involves
determining, based on image data from a plurality of red, green, blue (RGB) images of an environment, one or more camera poses associated with the plurality of RGB images;
(Rajkumar - [col 8, ln 49-61] In some implementations, the 3D scene reconstruction engine 109 applies estimation techniques such as structure from motion, … to generate the 3D representation of the scene or determine the relative pose of the object of interest. Structure from motion is a technique that allows for reconstructing a scene in three dimensions by inferring the geometrical features of the scene from camera motion. Visual odometry is a process of determining the position and orientation of an object by analyzing the associated camera images. For example, when odometry data 105 is not available, the differences in camera positions can be inferred based on the differences in image content
[col 5, ln 60-63] … For example, a RGB-D (red, green, blue, and depth) sensing camera can be used. )
generating a representation of spatial occupancy within the environment based on the plurality of images of the environment and the one or more camera poses;
(Rajkumar - [col 8, ln 18-33] For example, the automated annotation system 100 uses a 3D scene reconstruction engine 109 to create a 3D representation of the scene that includes the object of interest 203. This 3D representation of the scene is created by aligning multiple 2D images 103 and their corresponding depth information. Each of the images 103 can represent a view from a different position relative to the scene, and the odometry data 105 can indicate the differences in the camera positions. The odometry data 105 can thus be used to register the different camera positions in a 3D coordinate system. The depth information 104 can include a depth map showing distances of objects from the camera 130 at some or all regions of captured images 130. Using the known offsets between camera positions, the depth information 104 can be used to specify the shapes and contours of different sides of the objects to create the 3D representation.
[col 8 ln 34-48] The 3D scene reconstruction engine 109 uses the 3D representation of the scene to determine the absolute pose of the object of interest 203 within the scene. For example, the 3D scene reconstruction engine 109 can use multiple images 103 and their depth information 104 to set or refine the 3D pose and 3D bounding region for an object in the scene. For example, if CAD model 155 data is not available for an object, the overall shape of the object may not be known ahead of time. Nevertheless, depth information acquired from different camera positions can be used to determine different surfaces of the object, which can be pieced together to indicate the overall shape or volume of the object. Similarly, the relationship between reference positions on the object and a reference position, e.g., coordinate system, for the 3D representation of the scene can be determined.)

Note that Rajkumar utilizes depth images information (see at least [col 8, ln 34-48]), and Schoesler uses voxel maps (see [0048]) for spatial occupancy representation. While these representations can work, Park highlights the shortcomings of each of these methods and provides a better solution to represent spatial occupancy with signed distance functions. 
(Park – [Abstract] In this work, we introduce DeepSDF, a learned continuous Signed Distance Function (SDF) representation of a class of shapes that enables high quality shape representation, interpolation and completion from partial and noisy 3D input data.
[p.166, col 1, last line thru col 2, ln 10] Point-based. A point cloud is a lightweight 3D representation that closely matches the raw data that many sensors (i.e. LiDARs, depth cameras) provide, and hence is a natural fit for applying 3D learning. … A primary limitation, however, of learning with point clouds is that they do not describe topology and are not suitable for producing watertight surfaces.
[p.166, col 2, ln 34-46] Voxel-based. Voxels, which non-parametrically describe volumes with 3D grids of values, are perhaps the most natural extension into the 3D domain of the well-known learning paradigms (i.e., convolution) that have excelled in the 2D image domain. The most straightforward variant of voxel based learning is to use a dense occupancy grid (occupied / not occupied). Due to the cubically growing compute and memory requirements, however, current methods are only able to handle low resolutions (1283 or below). As such, voxel-based approaches do not preserve fine shape details [54, 13], and additionally voxels visually appear significantly different than high-fidelity shapes, since when rendered their normals are not smooth.)
EXAMINER NOTE: depth maps from Lidar or depth cameras (used by Rajkumar) do not capture the entire geometry of the object and can leave holes (not watertight). Voxel-based methods are straightforward, but do not preserve fine shape details.


In presenting the improved signed distance function-based representation of occupancy, Park teaches
wherein the representation of spatial occupancy is generated by: …
generating the representation of spatial occupancy based on the one or more virtual depth images … 
wherein the representation of spatial occupancy comprises a signed distance function
(Park - [p. 169-170, 5. Data Preparation] To train our continuous SDF model, we prepare the SDF samples X (Eq. 2) for each mesh, which consists of 3D points and their SDF values. While SDF can be computed through a distance transform for any watertight shapes from real or synthetic data, we train with synthetic objects, (e.g. ShapeNet [10]), for which we are provided complete 3D shape meshes. To prepare data, we start by normalizing each mesh to a unit sphere and sampling 500,000 spatial points x’s: … To obtain the shell of a mesh with proper orientation, we set up equally spaced virtual cameras around the object, and densely sample surface points, denoted Ps, with surface normals oriented towards the camera.)

Park’s signed distance functions are trained on virtual images obtained from the ShapeNet dataset. While this method of generating a signed distance function proves effective, Mildenhall highlights the limitations associated with virtual images from synthetic datasets.
(Mildenhall – [p. 100, col 1-2, 2.1 Neural 3D shape representations] Recent work has investigated the implicit representation of continuous 3D shapes as level sets by optimizing deep net works that map xyz coordinates to signed distance functions15 or occupancy fields.11 However, these models are limited by their requirement of access to ground truth 3D geometry, typically obtained from synthetic 3D shape datasets such as ShapeNet.2  …
…Though these techniques can potentially represent complicated and high-resolution geometry, they have so far been limited to simple shapes with low geometric complexity, resulting in oversmoothed renderings. We show that an alternate strategy of optimizing networks to encode 5D radiance fields (3D volumes with 2D view dependent appearance) can represent higher resolution geometry and appearance to render photorealistic novel views of complex scenes.)

Mildenhall proposes the use of neural radiance fields to improve model fidelity, and thus teaches
wherein the representation of spatial occupancy is generated by:
generating a reconstruction of the environment using a neural radiance field (NeRF) model;
(Mildenhall - [p.101, 3. NEURAL RADIANCE FIELD SCENE REPRESENTATION] We represent a continuous scene as a 5D vector-valued func tion whose input is a 3D location x = (x, y, z) and 2D viewing direction (θ, φ), and whose output is an emitted color c = (r, g, b) and volume density σ. In practice, we express direction as a 3D Cartesian unit vector d. We approximate this continuous 5D scene representation with an MLP network FΘ : (x, d) → (c, σ) and optimize its weights Θ to map from each input 5D coordinate to its corresponding volume density and directional emitted color.
[p.101, 4. VOLUME RENDERING WITH RADIANCE FIELDS] Our 5D neural radiance field represents a scene as the volume density and directional emitted radiance at any point in space.)
generating the representation of spatial occupancy based on … the plurality of RGB images, 
(Mildenhall - [p. 1025.2. Implementation details] We optimize a separate neural continuous volume representation network for each scene. This requires only a dataset of captured RGB images of the scene, the corresponding camera poses and intrinsic parameters, and scene bounds (we use ground truth camera poses, intrinsics, and bounds for synthetic data, and use the COLMAP structure-from-motion package18 to estimate these parameters for real data). At each optimization iteration, we randomly sample a batch of camera rays from the set of all pixels in the dataset. We query the network at N random points along each ray and then use the volume rendering procedure described in Section 4 to render the color of each ray using these samples.)
EXAMINER NOTE: Because the RGB camera images are used to generate the neural radiance field reconstruction, and the reconstruction is used to generate the SDF, the RGB camera images are used to generate the representation of spatial occupancy.

It would have been obvious to one of ordinary skill in the art before the effective filing date of the invention to modify Schoessler’s robotic control scheme with Rajkumar’s suggestion to further base spatial occupancy on camera pose in order to better determine reference positions of objects relative to a common coordinate system. To further improve the fidelity and memory requirements, it would have been obvious to incorporate Park’s suggestions to use signed distance functions to represent occupancy, and Mildenhall’s suggestion to utilize Neural Radiance fields in order to generate quality signed distance functions.






Claim 20 
Schoessler teaches
a robot; and
(Schoessler - [0019] In particular embodiments, a robotic system 100 may include a robotic limb that may perform operations to provide services to one or more users in different tasks such as cooking, gardening, painting, etc.)

a computing system that comprises: one or more memories storing instructions, and
one or more processors that are coupled to the one or more memories and, when executing the instructions, are configured to:
(Schoessler - [0020] For example, in some embodiments, the onboard computing system 152 may include, among other things, one or more processor(s) 154, memory 156, ...  As depicted, the one or more processor(s) 154 may be operably coupled with the memory 156 to perform various algorithms for instructing the robotic limb 102 to perform different operations.)
generate a representation of spatial occupancy within the environment based on the plurality of RGB images of the environment and the one or more camera poses, 
(Schoessler - [0033] FIG. 4 illustrates an example multimodal sensor setup 400. In particular embodiments, the robotic system 100 may use sensor data captured by multimodal sensors for task-based extrapolation for robot-human collision avoidance. … The multimodal sensor may be a camera 420, … As an example and not by way of limitation, the camera 420 may be based on RGB signal or kinetic signal.
[0034] In particular embodiments, the robotic system 100 may perform human pose estimation (e.g., how is a human moving) to determine the pose of each person in the environment based on sensor data captured by the multimodal sensors. As an example and not by way of limitation, human pose estimation may be based on RGB data captured by RGB cameras. 
[0035] In particular embodiments, the robotic system 100 may perform object detection (what objects are present in the environment), object segmentation, object localization, and object tracking (where are the objects) based on sensor data captured by the multimodal sensors.
determine one or more actions for the robot based on the representation of spatial occupancy and a goal; 
(Schoessler - [0049] After generating the occupancy map, the trajectory planning/adjustment module 728 may use different existing trajectory planning algorithms to figure out the best path that would avoid the volume that is defined by the occupancy map. Depending on the trajectory plan, the trajectory planning/adjustment module 728 may decide to delay or reschedule a task once the forecaster shows a “window of opportunity” (e.g. time picking up a tool by knowledge of future human actions). After a trajectory has been planned and the robot system 100 is in motion, the pose estimation and human motion prediction may keep updating in parallel. If a potential collision is predicted, the robot system 100 may adjust the current trajectory. 
[0023] In particular embodiments, the onboard computing system 152 may instruct the robotic limb 102 to achieve a desired pose. The onboard computing system 152 may access sensor data representing a scene from one or more sensors.)

and cause the robot to perform at least a portion of a movement based on the one or more actions.
(Schoessler - [0020] … As depicted, the one or more processor(s) 154 may be operably coupled with the memory 156 to perform various algorithms for instructing the robotic limb 102 to perform different operations. Such programs or instructions executed by the processor(s) 154 may be stored in any suitable article of manufacture that includes one or more tangible, computer-readable media at least collectively storing the instructions or routines, such as the memory 156. 
[0030] Certain embodiments disclosed herein may provide one or more technical advantages. A technical advantage of the embodiments may include improving safety of operations by robots when collaborating with human as the robotic system 100 may plan movement trajectories and tasks in a way that minimizes robot-human interference.)
EXAMINER NOTE: The above sections are included to promote clarity. The processor causes the robot to perform different operations, including movement tasks. 

Schoessler broadly discloses the generation of the occupancy representation. Schoessler does not explicitly teach the determination of camera poses from the RGB images. However, Rajkumar teaches 3D scene reconstruction applicable to Schoessler’s teachings, which involves
determine, based on image data from a plurality of red, green, blue (RGB) images of an environment, one or more camera poses associated with the plurality of RGB images;
(Rajkumar - [col 8, ln 49-61] In some implementations, the 3D scene reconstruction engine 109 applies estimation techniques such as structure from motion, … to generate the 3D representation of the scene or determine the relative pose of the object of interest. Structure from motion is a technique that allows for reconstructing a scene in three dimensions by inferring the geometrical features of the scene from camera motion. Visual odometry is a process of determining the position and orientation of an object by analyzing the associated camera images. For example, when odometry data 105 is not available, the differences in camera positions can be inferred based on the differences in image content
[col 5, ln 60-63] … For example, a RGB-D (red, green, blue, and depth) sensing camera can be used. )
generate a representation of spatial occupancy within the environment based on the plurality of images of the environment and the one or more camera poses;
(Rajkumar - [col 8, ln 18-33] For example, the automated annotation system 100 uses a 3D scene reconstruction engine 109 to create a 3D representation of the scene that includes the object of interest 203. This 3D representation of the scene is created by aligning multiple 2D images 103 and their corresponding depth information. Each of the images 103 can represent a view from a different position relative to the scene, and the odometry data 105 can indicate the differences in the camera positions. The odometry data 105 can thus be used to register the different camera positions in a 3D coordinate system. The depth information 104 can include a depth map showing distances of objects from the camera 130 at some or all regions of captured images 130. Using the known offsets between camera positions, the depth information 104 can be used to specify the shapes and contours of different sides of the objects to create the 3D representation.
[col 8 ln 34-48] The 3D scene reconstruction engine 109 uses the 3D representation of the scene to determine the absolute pose of the object of interest 203 within the scene. For example, the 3D scene reconstruction engine 109 can use multiple images 103 and their depth information 104 to set or refine the 3D pose and 3D bounding region for an object in the scene. For example, if CAD model 155 data is not available for an object, the overall shape of the object may not be known ahead of time. Nevertheless, depth information acquired from different camera positions can be used to determine different surfaces of the object, which can be pieced together to indicate the overall shape or volume of the object. Similarly, the relationship between reference positions on the object and a reference position, e.g., coordinate system, for the 3D representation of the scene can be determined.)

Note that Rajkumar utilizes depth images information (see at least [col 8, ln 34-48]), and Schoesler uses voxel maps (see [0048]) for spatial occupancy representation. While these representations can work, Park highlights the shortcomings of each of these methods and provides a better solution to represent spatial occupancy with signed distance functions. 
(Park – [Abstract] In this work, we introduce DeepSDF, a learned continuous Signed Distance Function (SDF) representation of a class of shapes that enables high quality shape representation, interpolation and completion from partial and noisy 3D input data.
[p.166, col 1, last line thru col 2, ln 10] Point-based. A point cloud is a lightweight 3D representation that closely matches the raw data that many sensors (i.e. LiDARs, depth cameras) provide, and hence is a natural fit for applying 3D learning. … A primary limitation, however, of learning with point clouds is that they do not describe topology and are not suitable for producing watertight surfaces.
[p.166, col 2, ln 34-46] Voxel-based. Voxels, which non-parametrically describe volumes with 3D grids of values, are perhaps the most natural extension into the 3D domain of the well-known learning paradigms (i.e., convolution) that have excelled in the 2D image domain. The most straightforward variant of voxel based learning is to use a dense occupancy grid (occupied / not occupied). Due to the cubically growing compute and memory requirements, however, current methods are only able to handle low resolutions (1283 or below). As such, voxel-based approaches do not preserve fine shape details [54, 13], and additionally voxels visually appear significantly different than high-fidelity shapes, since when rendered their normals are not smooth.)
EXAMINER NOTE: depth maps from Lidar or depth cameras (used by Rajkumar) do not capture the entire geometry of the object and can leave holes (not watertight). Voxel-based methods are straightforward, but do not preserve fine shape details.

In presenting the improved signed distance function-based representation of occupancy, Park teaches
wherein the representation of spatial occupancy is generated by: …
generating the representation of spatial occupancy based on the one or more virtual depth images … 
wherein the representation of spatial occupancy comprises a signed distance function
(Park - [p. 169-170, 5. Data Preparation] To train our continuous SDF model, we prepare the SDF samples X (Eq. 2) for each mesh, which consists of 3D points and their SDF values. While SDF can be computed through a distance transform for any watertight shapes from real or synthetic data, we train with synthetic objects, (e.g. ShapeNet [10]), for which we are provided complete 3D shape meshes. To prepare data, we start by normalizing each mesh to a unit sphere and sampling 500,000 spatial points x’s: … To obtain the shell of a mesh with proper orientation, we set up equally spaced virtual cameras around the object, and densely sample surface points, denoted Ps, with surface normals oriented towards the camera.)

Park’s signed distance functions are trained on virtual images obtained from the ShapeNet dataset. While this method of generating a signed distance function proves effective, Mildenhall highlights the limitations associated with virtual images from synthetic datasets.
(Mildenhall – [p. 100, col 1-2, 2.1 Neural 3D shape representations] Recent work has investigated the implicit representation of continuous 3D shapes as level sets by optimizing deep net works that map xyz coordinates to signed distance functions15 or occupancy fields.11 However, these models are limited by their requirement of access to ground truth 3D geometry, typically obtained from synthetic 3D shape datasets such as ShapeNet.2  …
…Though these techniques can potentially represent complicated and high-resolution geometry, they have so far been limited to simple shapes with low geometric complexity, resulting in oversmoothed renderings. We show that an alternate strategy of optimizing networks to encode 5D radiance fields (3D volumes with 2D view dependent appearance) can represent higher resolution geometry and appearance to render photorealistic novel views of complex scenes.

Mildenhall proposes the use of neural radiance fields to improve model fidelity, and thus teaches
wherein the representation of spatial occupancy is generated by:
generating a reconstruction of the environment using a neural radiance field (NeRF) model;
(Mildenhall - [p.101, 3. NEURAL RADIANCE FIELD SCENE REPRESENTATION] We represent a continuous scene as a 5D vector-valued func tion whose input is a 3D location x = (x, y, z) and 2D viewing direction (θ, φ), and whose output is an emitted color c = (r, g, b) and volume density σ. In practice, we express direction as a 3D Cartesian unit vector d. We approximate this continuous 5D scene representation with an MLP network FΘ : (x, d) → (c, σ) and optimize its weights Θ to map from each input 5D coordinate to its corresponding volume density and directional emitted color.
[p.101, 4. VOLUME RENDERING WITH RADIANCE FIELDS] Our 5D neural radiance field represents a scene as the volume density and directional emitted radiance at any point in space.)
generating the representation of spatial occupancy based on … the plurality of RGB images, 
(Mildenhall - [p. 1025.2. Implementation details] We optimize a separate neural continuous volume representation network for each scene. This requires only a dataset of captured RGB images of the scene, the corresponding camera poses and intrinsic parameters, and scene bounds (we use ground truth camera poses, intrinsics, and bounds for synthetic data, and use the COLMAP structure-from-motion package18 to estimate these parameters for real data). At each optimization iteration, we randomly sample a batch of camera rays from the set of all pixels in the dataset. We query the network at N random points along each ray and then use the volume rendering procedure described in Section 4 to render the color of each ray using these samples.)
EXAMINER NOTE: Because the RGB camera images are used to generate the neural radiance field reconstruction, and the reconstruction is used to generate the SDF, the RGB camera images are used to generate the representation of spatial occupancy.

It would have been obvious to one of ordinary skill in the art before the effective filing date of the invention to modify Schoessler’s robotic control scheme with Rajkumar’s suggestion to further base spatial occupancy on camera pose in order to better determine reference positions of objects relative to a common coordinate system. To further improve the fidelity and memory requirements, it would have been obvious to incorporate Park’s suggestions to use signed distance functions to represent occupancy, and Mildenhall’s suggestion to utilize Neural Radiance fields in order to generate quality signed distance functions.

Claims 5 and 15
The combination of Schoessler, Rajkumar, Park, and Mildenhall teaches the limitations of claim 1 as outlined above. The cited combination also teaches
wherein the representation of spatial occupancy further comprises at least one of a second signed distance function, a voxel representation of occupancy, or a point cloud.  
EXAMINER NOTE: Park’s disclosure deals with creating an SDF for individual objects. Park’s contribution to the currently proposed combination encompasses creating a signed distance function for each obstacle, which results in multiple signed distance functions. 

Claim 9
The combination of Schoessler, Rajkumar, Park, and Mildenhall teaches the limitations of claim 1 as outlined above. Schoessler further teaches
further comprising capturing the plurality of RGB images via a camera mounted on the robot.  
(Schoessler - [0022] … This disclosure contemplates that the one or more sensors can be located on the robotic limb 102 or external to the robotic limb 102, or both. Other sensors for sensing the pose of the robotic limb 102 may be built into the robotic system 100 of which the limb 102 is a part, and may include joint encoders, computation encoders, limit switches, motor current sensors, or any suitable combination thereof.)

Claim 10
The combination of Schoessler, Rajkumar, Park, and Mildenhall teaches the limitations of claim 1 as outlined above. Schoessler does not explicitly teach the following limitations in combination. However, Rajkumar teaches
further comprising capturing the plurality of RGB images via a camera that is moved across a portion of the environment.
(Rajkumar - [col 6, ln 4-9] In some implementations, odometry data 105 is acquired with the images 103. The odometry data 105 can indicate the position, e.g., location and orientation, of the camera 130 when each image 103 is acquired. When the camera 103 is moved by a machine, a position tracking device 140 can track the movement of the camera 130. For example, sensors in a robotic arm or a wheel of a robot can track movement as a robot moves the camera 130 to different positions for acquiring images 103. 
[col 8, ln 18-33] For example, the automated annotation system 100 uses a 3D scene reconstruction engine 109 to create a 3D representation of the scene that includes the object of interest 203. This 3D representation of the scene is created by aligning multiple 2D images 103 and their corresponding depth information. Each of the images 103 can represent a view from a different position relative to the scene, and the odometry data 105 can indicate the differences in the camera positions. )

It would have been obvious to one of ordinary skill in the art before the effective filing date of the invention to further modify Schoessler’s robot control scheme with Rajkumar’s suggestion to capture images via a moving camera in order to obtain odometry data needed for scene creation (map generation)

Claim 18
The combination of Schoessler, Rajkumar, Park, and Mildenhall teaches the limitations of claim 11 as outlined above. Schoessler further teaches
wherein the instructions, when executed by the at least one processor, further cause the at least one processor to perform the steps of: causing the plurality of RGB images to be captured via a camera that at least one of is mounted on the robot or is moved across a portion of the environment.
(Schoessler - [0022] … This disclosure contemplates that the one or more sensors can be located on the robotic limb 102 or external to the robotic limb 102, or both. Other sensors for sensing the pose of the robotic limb 102 may be built into the robotic system 100 of which the limb 102 is a part, and may include joint encoders, computation encoders, limit switches, motor current sensors, or any suitable combination thereof.
[0034] In particular embodiments, the robotic system 100 may perform human pose estimation (e.g., how is a human moving) to determine the pose of each person in the environment based on sensor data captured by the multimodal sensors. As an example and not by way of limitation, human pose estimation may be based on RGB data captured by RGB cameras.)


Claim 19
The combination of Schoessler, Rajkumar, Park, and Mildenhall teaches the limitations of claim 11 as outlined above. Schoessler further teaches
generating a three-dimensional (3D) reconstruction of the environment based on the plurality of RGB images
(Schoessler - [0033] … In particular embodiments, the multimodal sensor setup may offer the possibility to create a full three-dimensional (3D) scene representation of workspace and the sphere of influence of the robotic system. … The multimodal sensor may be a camera ... As an example and not by way of limitation, the camera 420 may be based on RGB signal.)
	and generating depth data based on the 3D reconstruction of the environment,
(Schoessler - [0037] In particular embodiments, the robotic system 100 may estimate a pose of an object based on attributes of the object. As an example and not by way of limitation, part segmentation of an object may provide information about where the spout of a teapot is. The attributes of the teapot may further provide information that the object can pour and needs to be at a certain angle to pour. )
EXAMINER NOTE: Depth data is necessarily part of the 3D scene representation of the workspace - particularly when tracking poses of objects.
wherein the representation of spatial occupancy is further generated based on the depth data.  
(Schoessler -[0050] … In FIG. 9, there may be an initially determined human location 905. There may be also an initially planned trajectory 910 for the robot arm 915. The robotic system 100 may have a pose forecast 920 for the human. Based on the above information, the robotic system 100 may further generate an occupancy map. The occupancy map may comprise occupancy grid 925, which indicates the inferred area to avoid.)


Claim(s) 6 and 16-17 is/are rejected under 35 U.S.C. 103 as being unpatentable over Schoessler, Rajkumar, Park, and MIldenhall , and further in view of Kobilarov (US 10671076 B1).

Claims 6 and 16
The combination of Schoessler, Rajkumar, Park, and MIldenhall teaches the limitations of claims 2 and 12 as outlined above. Schoessler may not explicitly teach the following limitations in combination. However, Kobilarov teaches
wherein determining the one or more actions for the robot comprises, 
for each of one or more iterations:
(Kobilarov - [col 26, ln 18-25] In some instances, actions by other independent objects (e.g., the autonomous vehicle or other third-party objects) can be considered while determining the predictive routes and/or trajectories. As can be understood, in this configuration, the modules can provide feedback and can iteratively determine predictive routes and/or trajectories based on other predictive or speculative routes or trajectories.)
sampling a plurality of trajectories of the robot; computing a cost associated with each trajectory based on the representation of spatial occupancy;
(Kobilarov - [col 3, ln 43-53] In general, determining a trajectory for an autonomous vehicle can include utilizing a tree search algorithm such as Monte Carlo Tree Search (MCTS) to organize and search through possible trajectories, … and determining various costs and constraints associated with possible trajectories (e.g., longitudinal acceleration, lateral acceleration, power consumption, travel time, distance from obstacles or a centerline of a road, performance, comfort, etc.) to select a trajectory to optimize performance.)
and determining an action for the robot based on a first trajectory included in the plurality of trajectories that is associated with a lowest cost.
(Kobilarov - [col 4, ln 59 thru col 5, ln 6] Starting with an initial state (e.g., the context or automata), candidate trajectories can be evaluated using one or more tree search algorithms, such as a Monte Carlo Tree Search (MCTS) algorithm. For example, various possible trajectories can be modeled and stored as part of a MCTS search, and compared against the LTL formulas and/or evaluated to determine costs associated with various actions.  … If multiple trajectories are determined not to violate the LTL formula(s), a trajectory with a lowest cost (or a highest performance, comfort, etc.) can be selected)

It would have been obvious to one of ordinary skill in the art before the effective filing date of the invention to further modify Schoessler's robotic control scheme by implementing Kobilarov’s suggestion to determine costs associated with trajectories in order to increase accuracy and optimize performance of trajectory planning. 
(Kobilarov -  [col 5, ln 38-42] The prediction operations and systems described herein can improve a functioning of a computing device implemented in an autonomous vehicle by providing a robust framework to increase the accuracy of planning models to optimize performance of the autonomous vehicle.)

Claim 17
The combination of Schoessler, Rajkumar, Park, Mildenhall and Kobilarov teaches the limitations teaches the limitations of claim 16 as outlined above. The cited combination also teaches
wherein the cost is further computed based on a goal for the robot to achieve.  
(Kobilarov - [col 3, ln 43-53] In general, determining a trajectory for an autonomous vehicle can include utilizing a tree search algorithm such as Monte Carlo Tree Search (MCTS) to organize and search through possible trajectories, … and determining various costs and constraints associated with possible trajectories (e.g., longitudinal acceleration, lateral acceleration, power consumption, travel time, distance from obstacles or a centerline of a road, performance, comfort, etc.) to select a trajectory to optimize performance.
[col 31, ln 49 thru col 32, ln 15] As illustrated, the search tree 800 can represent different states at different times and/or based on different potential actions. … In some instances, each action modeled in the search tree 800 has one or more associated termination conditions. When the search tree 800 reaches a termination condition associated with an action (e.g., completion of a lane change, traversing a section of road, passage of a period of time, movement above a threshold distance, threshold velocity, threshold acceleration, etc.), the search tree 800 may branch and choose a new action to follow. In some instances, a termination condition can be a logical combination of at least two termination conditions. The search can continue until a termination condition is reached for a trajectory or route as they related to a goal, such as a destination.)
EXAMINER NOTE: The search is carried out (and therefore, the cost is computed) until a goal condition is reached.
It would have been obvious to one of ordinary skill in the art before the effective filing date of the invention to further modify Schoessler's robotic control scheme by implementing Kobilarov’s suggestion to determine costs associated with trajectories in order to increase accuracy and optimize performance of trajectory planning. 
(Kobilarov -  [col 5, ln 38-42] The prediction operations and systems described herein can improve a functioning of a computing device implemented in an autonomous vehicle by providing a robust framework to increase the accuracy of planning models to optimize performance of the autonomous vehicle.)


Claim(s) 7-8 is/are rejected under 35 U.S.C. 103 as being unpatentable over Schoessler, Rajkumar, Park, Mildenhall and Kobilarov as applied to claim 6 above, and further in view of Murray (US 20230286156 A1 - claiming priority to provisional application 63/318933 filed 3/11/2022)..

Claim 7
The combination of Schoessler, Rajkumar, Park, Mildenhall and Kobilarov teaches the limitations of claim 6 as outlined above. The combination cited in claim 6 does not explicitly teach the following limitations in combination. However, Murray teaches

wherein computing the cost associated with each trajectory comprises determining, based on the representation of spatial occupancy, whether one or more spheres bounding one or more links of the robot collide or intersect with one or more objects in the environment.  
(Murray - [0064] In some implementations, collision detector 252 implements software based collision detection or assessment, for example performing a bounding box-bounding box collision assessment or assessing based on a hierarchy of geometric (e.g., spheres) representation of the volume swept by the robots 202, 202b or portions thereof during movement.  
[0069] The path analyzer 256 may determine a path (e.g., optimal or optimized) using the motion planning graph with the cost values. …The path analyzer 256 may use or execute any variety of path finding algorithms, for example lowest cost path finding algorithms, taking into account cost values associated with each edge which represent likelihood of collision and optionally represent one or more of: a severity of collision, ...
[0127] … for example collision detection using swept volumes representing a volume swept by the given robot and/or a volume swept by other robots or other obstacles in the environment; or collision detection employing assessment of whether two spheres intersect.
[0143] At 606, the processor-based system sets cost values or cost functions of the edges in the motion planning graph based at least in part on the collision detection or assessment for the given robot (e.g., first robot R.sub.1). Cost values or cost functions can be representative of a collision assessment or risk (e.g., probability or occurrence) of collision.)

It would have been obvious to one of ordinary skill in the art before the effective filing date of the invention to further modify Schoessler with Murray’s suggestion to base cost on whether one or more bounding spheres collides or intersects with objects in order to prevent or reduce risk that robots collide with other objects in the environment. Murray states that this implementation also reduces programming effort on behalf of the operator, and that using spheres as bounding volumes is computationally inexpensive.
(Murray - [0117] … Using spheres as bounding volumes facilitates fast comparisons (i.e., it is computationally easy to determine if spheres overlap each other).
[0007] The structures and algorithms described herein advantageously employ staging poses to facilitate the operation of two or more robots operating in a shared workspace or workcell, preventing or at least reducing the risk that robots or robotic appendages of robots will collide with one another while operating to efficiently move one or more robots respectively to one or more goals to perform respective tasks in the shared workspace
[0010] The structures and algorithms described herein may advantageously reduce programming effort for multi-robot workspaces, by performing autonomous planning, for example during a runtime of one or more robots. In at least some implementations, operators do not need to program any safety zones, …)

Claim 8
The combination of Schoessler, Rajkumar, Park, Mildenhall and Kobilarov teaches the limitations of claim 6 as outlined above. The combination cited in claim 6 does not explicitly teach the following limitations in combination. However, Murray teaches
wherein the cost associated with each trajectory is computed based on a cost function that penalizes collisions between the robot and one or more objects in the environment when the robot moves according to the trajectory.  
(Murray - [0143] At 606, the processor-based system sets cost values or cost functions of the edges in the motion planning graph based at least in part on the collision detection or assessment for the given robot (e.g., first robot R.sub.1). Cost values or cost functions can be representative of a collision assessment or risk (e.g., probability or occurrence) of collision. Cost values can additionally be representative of a severity (e.g., magnitude of damage that would result) of a collision should a collision occur. For instance, a collision with a human or other animate object may be considered as more severe, and hence more costly, than collision with a wall or table or other inanimate object.)

It would have been obvious to one of ordinary skill in the art before the effective filing date of the invention to further modify Schoessler with Murray’s suggestion to use cost functions penalizing collisions in order to prevent or reduce risk that robots collide with other objects in the environment. Murray states that this implementation also reduces programming effort on behalf of the operator.
(Murray - [0007] The structures and algorithms described herein advantageously employ staging poses to facilitate the operation of two or more robots operating in a shared workspace or workcell, preventing or at least reducing the risk that robots or robotic appendages of robots will collide with one another while operating to efficiently move one or more robots respectively to one or more goals to perform respective tasks in the shared workspace
[0010] The structures and algorithms described herein may advantageously reduce programming effort for multi-robot workspaces, by performing autonomous planning, for example during a runtime of one or more robots. In at least some implementations, operators do not need to program any safety zones, …)





Allowable Subject Matter
Claims 2 and 12 are objected to as being dependent upon a rejected base claim, but would be allowable if rewritten in independent form including all of the limitations of the base claim and any intervening claims. The combination of Park and Mildenhall teaches the aspects of neural radiance fields and how and why to combine the two. However, MIldenhall explicitly teaches away from the use of NeRFs to generate meshes. Examiner is unable to find any other reasonable teaching or suggestion in the prior art which would lead one of ordinary skill to arrive at the claimed invention. The use of Neural Radiance Fields is relatively new, and rapidly evolving. Examiner is unable to find an explicit teaching or suggestion to use this model in the claimed combination. Claims 3-4 and 13-14 depend from the potentially allowable claims and are objected for the same reasons.

Conclusion
THIS ACTION IS MADE FINAL. Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a).
A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action. In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any nonprovisional extension fee (37 CFR 1.17(a)) pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action. In no event, however, will the statutory period for reply expire later than SIX MONTHS from the mailing date of this final action.
Any inquiry concerning this communication or earlier communications from the examiner should be directed to JAMES MILLER WATTS whose telephone number is (703)756-1249. The examiner can normally be reached 7:30-5:30 M-TH.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Adam Mott can be reached at 571-270-5376. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.





/JAMES MILLER WATTS III/Examiner, Art Unit 3657                                                                                                                                                                                                        
/ADAM R MOTT/Supervisory Patent Examiner, Art Unit 3657
Read full office action
Prosecution Timeline

Show 3 earlier events
Jul 16, 2025
Final Rejection mailed — §103
Sep 16, 2025
Response after Non-Final Action
Oct 27, 2025
Request for Continued Examination
Nov 03, 2025
Response after Non-Final Action
Dec 05, 2025
Non-Final Rejection mailed — §103
Feb 19, 2026
Response Filed
Apr 09, 2026
Final Rejection mailed — §103
May 06, 2026
Applicant Interview (Telephonic)
Precedent Cases

Applications granted by this same examiner with similar technology

18/366,688
Patent 12623346
PROCESSING PATH PLANNING SIMULATION DEVICE AND METHOD
2y 9m to grant Granted May 12, 2026
18/746,382
Patent 12611772
ROBOT CONTROL DEVICE AND ROBOT CONTROL METHOD
1y 10m to grant Granted Apr 28, 2026
18/273,824
Patent 12605829
ROBOT SYSTEM AND WORKPIECE SUPPLY METHOD
2y 9m to grant Granted Apr 21, 2026
18/301,774
Patent 12608005
SMART MOWER AND SMART MOWING SYSTEM
3y 0m to grant Granted Apr 21, 2026
18/261,111
Patent 12600040
SIMULATION DEVICE USING THREE-DIMENSIONAL POSITION INFORMATION OBTAINED FROM OUTPUT FROM VISION SENSOR
2y 9m to grant Granted Apr 14, 2026
Study what changed to get past this examiner. Based on 5 most recent grants.
Strategy Recommendation AI-generated — please review before filing

Get a prosecution strategy drawn from examiner precedents, rejection analysis, and claim mapping.
Typically takes 5-10 seconds — AI-generated, attorney review required before filing
Prosecution Projections

5-6
Expected OA Rounds
74%
Grant Probability
89%
With Interview (+15.5%)
2y 8m (~0m remaining)
Median Time to Grant
High
PTA Risk
Based on 46 resolved cases by this examiner. Grant probability derived from career allowance rate.