Prosecution Insights
Last updated: May 29, 2026
Application No. 18/758,024

METHODS AND APPARATUS FOR DETERMINING POSE AND SIZE OF OBJECTS USING THREE-DIMENSIONAL MACHINE LEARNING

Final Rejection §103
Filed: Jun 28, 2024
Examiner: LE, TIEN MINH
Art Unit: 3656
Tech Center: 3600 (Transportation & Electronic Commerce)
Assignee: Boston Dynamics Inc.
OA Round: 2 (Final)
Grant Probability: 68% (Favorable)
OA Rounds: 3-4
Est. Remaining: 11m
Grant Probability With Interview: 92%

Examiner Intelligence

Career Allowance Rate: 68% (58 granted / 85 resolved), above average, +16.2% vs TC avg
Interview Lift: +23.8% (strong), measured on resolved cases with interview
Avg Prosecution (typical timeline): 2y 10m; 17 currently pending
Total Applications (career history): 115 across all art units

Statute-Specific Performance

§101: 0.8% (-39.2% vs TC avg)
§103: 93.9% (+53.9% vs TC avg)
§102: 3.7% (-36.3% vs TC avg)
§112: 0.4% (-39.6% vs TC avg)
Based on career data from 85 resolved cases. Tech Center averages are estimates.

Office Action

§103
DETAILED ACTION

Notice of Pre-AIA or AIA Status

The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

This is a Final Office Action on the merits. Claims 1-22 are currently pending and are addressed below.

Response to Amendment

1. The amendment filed 01/26/2026 has been entered. Claims 1-22 remain pending in the application.

Response to Arguments

2. Regarding the rejection made under 35 USC 102, the Applicant’s amendments and arguments have been fully considered but are moot because the new ground of rejection does not rely on any reference applied in the prior rejection of record for any teaching or matter specifically challenged in the argument.

Claim Rejections - 35 USC § 103

3. In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.

The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:

A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

4. Claims 1-8 and 15-22 are rejected under 35 U.S.C. 103 as being unpatentable over Turpin et al. (US 20220305680, hereinafter Turpin) in view of Malisiewicz et al.
(US 20180137642, hereinafter Malisiewicz). Regarding claim 1, Turpin teaches a method, comprising: receiving, by at least one computing device associated with a mobile robot, first sensor data and second sensor data (see at least Fig. 1A-4 and [0048]: “FIG. 4 illustrates a process 400 for determining one or more characteristics of objects in an environment using a plurality of perception modules arranged on a perception mast of a mobile manipulator robot designed in accordance with some embodiments. In act 410, a first color image and first depth information is captured by a first 2D camera (e.g., upper camera 244A) and a first depth sensor (e.g., upper depth sensor 250A) of a first perception module (e.g., upper perception module 244). For instance, in an example of picking boxes from a stack, the first color image and first distance information may represent information about boxes in the top portion of the stack. The process then proceeds to act 412, where a second color image and second depth information is captured by a second 2D camera (e.g., lower camera 244B) and a second depth sensor (e.g., lower depth sensor 250B) of a second perception module (e.g., the lower perception module). Continuing with the box picking example, the second color image and the second depth information may include information about boxes in the bottom portion of the stack, such that a combination of the information captured by the first perception module and the second perception module provides information for a vertical slice of the stack of boxes. Although shown as being performed sequentially in the process 400 of FIG. 
4, it should be appreciated that acts 410 and 412 may be performed sequentially or at least partially in parallel using any suitable control strategy, examples of which are described herein.”); providing as input to at least one machine learning model, the first sensor data, the second sensor data, and camera intrinsics associated with at least one camera configured to sense the first sensor data and/or the second sensor data (see at least [0042]: “In some embodiments, one or both of the 2D camera and the depth sensor included within a perception module may have a fixed orientation (e.g., they may not actively pan and/or tilt). Additionally, the sensors within the upper and lower perception modules may be oriented at the same angle relative to the perception mast 240 or may be oriented at different angles relative to the perception mast to capture a desired field of view. For instance, the sensors of the upper perception module may be oriented to capture information about the environment at an angle of 90° relative to the vertical axis of the perception mast 240, whereas the sensors of the lower perception module may be oriented to capture information about the environment at an angle of 70° relative to the vertical axis of the perception mast 240 (i.e., facing downward toward the mobile base) to enable capture of information located near the mobile base.”; [0051]: “For example, these factors include the intrinsic properties of the cameras (e.g., focal lengths, principal points of the cameras) and the extrinsic properties of the cameras (e.g., the precise position and orientations of the RGB camera and the TOF depth sensor camera with respect to each other). 
A calibration sequence executed for each set of sensors in a perception module may be performed to determine these intrinsic and extrinsic properties for use in registering the RGB image and the depth information to generate an RGBD image in act 512 of process 500.”; [0052]: “Process 500 then proceeds to act 514, where one or more characteristics of objects in the environment are determined based on the RGBD image generated in act 512. In some embodiments, the RGBD image is provided as input to a trained statistical model (e.g., a machine learning model) that has been trained to identify the one or more characteristics. For instance, in the box picking example, the statistical model may be trained to recognize surfaces (e.g., faces) of boxes arranged in a stack. In another example, the statistical model may be trained to recognize other object characteristics such as the shape of signs, a category or type of object in the path of motion of the robot, or any other characteristic of one or more objects in the environment. Any suitable type of trained statistical model may be used to process an RGBD image and output one or more characteristics of object(s) in the environment.”), wherein the at least one machine learning model is trained to output polyhedron information representing a set of objects in an environment of the mobile robot (see at least [0056]: “In act 612, the sensor manager receives the request and generates a request to trigger one or multiple of the 2D cameras and depth sensors included in one or more perception modules arranged on the perception mast. 
In act 614, the RGBD camera software receives the request generated in act 612 and interfaces with the appropriate camera(s) and depth sensor(s) to begin capture of the corresponding information…In act 618, the trained statistical model (e.g., BoxDetector) outputs one or more characteristics of objects in the environment (e.g., identified surfaces of boxes), and information about the characteristic(s) is provided to the control circuity to perform one or more actions based, at least in part, on the identified characteristic(s).”); and controlling the mobile robot to perform an action based, at least in part, on the polyhedron information output from the at least one machine learning model (see at least [0049]: “After capturing 2D color and depth information from each of the plurality of perception modules, process 400 proceeds to act 414, where one or more characteristics of one or more objects in the environment are determined based on the captured information. Continuing with the box picking example, the characteristics may include faces of boxes in the stack using a box detection model trained to identify the faces of boxes in a stack based on the captured information…After determining the one or more characteristics of objects in the environment, process 400 proceeds to act 416 where one or more actions are performed based on the determined characteristic(s). Returning to the box picking example, after box faces in a stack are identified in act 414, the action performed in act 416 may include one or more of determining a next box in the stack to pick, updating a trajectory plan for the manipulator arm of the robot to pick a next box in the stack, determining whether to pick the next box in the stack using a top pick or a face pick, or controlling the manipulator arm of the robot to pick the next box in the stack. Of course, additional or alternative actions may also be performed depending on the task the robot is currently performing or will perform next. 
For instance, the object with which the manipulator arm may interact with next may not be arranged in a stack, but may be located in any configuration in the environment of the robot.”; [0056]: “In act 618, the trained statistical model (e.g., BoxDetector) outputs one or more characteristics of objects in the environment (e.g., identified surfaces of boxes), and information about the characteristic(s) is provided to the control circuity to perform one or more actions based, at least in part, on the identified characteristic(s).”). Turpin fails to explicitly teach that the polyhedron information is a three-dimensional (3D) representation of one or more polyhedrons representing a set of objects in an environment. However, Malisiewicz teaches a system and method for cuboid detection that outputs a three-dimensional (3D) representation of one or more polyhedrons representing a set of objects in an environment (see at least Figs. 2, 4-5, and [0020]: “For example, a cuboid detector (such as a deep cuboid detector) can process a consumer-quality Red-Green-Blue (RGB) image of a cluttered scene and localize some or all three-dimensional (3D) cuboids in the image. A cuboid can comprise a boxy or a box-like object and can include a polyhedron (which may be convex) with, e.g., 4, 5, 6, 7, 8, 10, 12, or more faces. For example, cuboids can include pyramids, cubes, prisms, parallelepipeds, etc. Cuboids are not limited to such polyhedral shapes from geometry and can include box-like structures such as, e.g., appliances (e.g., television sets, computer monitors, toasters, washing machines, refrigerators), furniture (e.g., sofas, chairs, beds, cribs, tables, book cases, cabinets), vehicles (e.g., automobiles, buses), etc.”). 
Therefore, it would have been obvious to someone of ordinary skill in the art before the effective filing date of the claimed invention to have modified Turpin to incorporate the teachings of Malisiewicz and provide a means to output a three-dimensional (3D) representation of one or more polyhedrons representing a set of objects in an environment, with a reasonable expectation of success, in order to produce a 3D interpretation of the object and improve the accuracy of the detected keypoints [0023].

Regarding claim 2, modified Turpin teaches the limitations of claim 1. Turpin further teaches wherein the camera intrinsics include one or more coordinates of the at least one camera and/or a viewing angle of the at least one camera (see at least [0042]: “In some embodiments, one or both of the 2D camera and the depth sensor included within a perception module may have a fixed orientation (e.g., they may not actively pan and/or tilt). Additionally, the sensors within the upper and lower perception modules may be oriented at the same angle relative to the perception mast 240 or may be oriented at different angles relative to the perception mast to capture a desired field of view. For instance, the sensors of the upper perception module may be oriented to capture information about the environment at an angle of 90° relative to the vertical axis of the perception mast 240, whereas the sensors of the lower perception module may be oriented to capture information about the environment at an angle of 70° relative to the vertical axis of the perception mast 240 (i.e., facing downward toward the mobile base) to enable capture of information located near the mobile base. 
As shown, in some embodiments, the lower perception module may be arranged along the perception mast 240 at a location above actuator 255 that enables capture of information near the mobile base, but without including the mobile base itself (or including only limited portions of the mobile base) in the captured information.”; [0051]: “As part of the registration process distortion in one or both of the color image and the depth information caused, for example, by motion of the mobile robot or objects in the environment, may be corrected. Several other factors may additionally or alternatively be taken into account to properly register the RGB image and the depth information. For example, these factors include the intrinsic properties of the cameras (e.g., focal lengths, principal points of the cameras) and the extrinsic properties of the cameras (e.g., the precise position and orientations of the RGB camera and the TOF depth sensor camera with respect to each other). A calibration sequence executed for each set of sensors in a perception module may be performed to determine these intrinsic and extrinsic properties for use in registering the RGB image and the depth information to generate an RGBD image in act 512 of process 500.”). Regarding claim 3, modified Turpin teaches the limitations of claim 1. Turpin further teaches wherein the camera intrinsics includes first camera intrinsics for a first camera configured to sense the first sensor data and second camera intrinsics for a second camera configured to sense the second sensor data (see at least [0042]: “In some embodiments, one or both of the 2D camera and the depth sensor included within a perception module may have a fixed orientation (e.g., they may not actively pan and/or tilt). 
Additionally, the sensors within the upper and lower perception modules may be oriented at the same angle relative to the perception mast 240 or may be oriented at different angles relative to the perception mast to capture a desired field of view. For instance, the sensors of the upper perception module may be oriented to capture information about the environment at an angle of 90° relative to the vertical axis of the perception mast 240, whereas the sensors of the lower perception module may be oriented to capture information about the environment at an angle of 70° relative to the vertical axis of the perception mast 240 (i.e., facing downward toward the mobile base) to enable capture of information located near the mobile base. As shown, in some embodiments, the lower perception module may be arranged along the perception mast 240 at a location above actuator 255 that enables capture of information near the mobile base, but without including the mobile base itself (or including only limited portions of the mobile base) in the captured information.”; [0051]: “As part of the registration process distortion in one or both of the color image and the depth information caused, for example, by motion of the mobile robot or objects in the environment, may be corrected. Several other factors may additionally or alternatively be taken into account to properly register the RGB image and the depth information. For example, these factors include the intrinsic properties of the cameras (e.g., focal lengths, principal points of the cameras) and the extrinsic properties of the cameras (e.g., the precise position and orientations of the RGB camera and the TOF depth sensor camera with respect to each other). A calibration sequence executed for each set of sensors in a perception module may be performed to determine these intrinsic and extrinsic properties for use in registering the RGB image and the depth information to generate an RGBD image in act 512 of process 500.”). 
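For readers less familiar with the terms in the cited passage [0051], the following is a minimal sketch of how intrinsic properties (focal lengths, principal point) and extrinsic properties (relative rotation and translation) are typically used to register a depth pixel to a color image under a standard pinhole model. It is an illustration only: it is not taken from Turpin, Malisiewicz, or the application, it ignores lens distortion, and every name in it (`Intrinsics`, `backproject`, `register_depth_pixel`, the numeric values) is hypothetical.

```python
from dataclasses import dataclass
from typing import Sequence, Tuple

Vec3 = Tuple[float, float, float]


@dataclass(frozen=True)
class Intrinsics:
    """Hypothetical pinhole intrinsics: focal lengths and principal point, in pixels."""
    fx: float
    fy: float
    cx: float
    cy: float


def backproject(u: float, v: float, depth: float, k: Intrinsics) -> Vec3:
    """Lift a depth pixel (u, v) with metric depth into a 3D point in the depth camera frame."""
    return ((u - k.cx) * depth / k.fx, (v - k.cy) * depth / k.fy, depth)


def transform(p: Vec3, rotation: Sequence[Sequence[float]], translation: Vec3) -> Vec3:
    """Apply the extrinsics (3x3 rotation, then translation) to move a point between camera frames."""
    x, y, z = (
        sum(rotation[i][j] * p[j] for j in range(3)) + translation[i]
        for i in range(3)
    )
    return (x, y, z)


def project(p: Vec3, k: Intrinsics) -> Tuple[float, float]:
    """Project a 3D point in a camera frame onto that camera's image plane."""
    x, y, z = p
    if z <= 0:
        raise ValueError("point is behind the camera")
    return (k.fx * x / z + k.cx, k.fy * y / z + k.cy)


def register_depth_pixel(
    u: float,
    v: float,
    depth: float,
    depth_k: Intrinsics,
    rgb_k: Intrinsics,
    rotation: Sequence[Sequence[float]],
    translation: Vec3,
) -> Tuple[float, float]:
    """Find where a depth pixel lands in the color image (the core step of RGBD registration)."""
    p_depth = backproject(u, v, depth, depth_k)
    p_rgb = transform(p_depth, rotation, translation)
    return project(p_rgb, rgb_k)


# Assumed example values: aligned sensors with a 5 cm horizontal offset between them.
depth_k = Intrinsics(fx=500.0, fy=500.0, cx=320.0, cy=240.0)
rgb_k = Intrinsics(fx=600.0, fy=600.0, cx=320.0, cy=240.0)
identity = ((1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0))
u_rgb, v_rgb = register_depth_pixel(320, 240, 2.0, depth_k, rgb_k, identity, (0.05, 0.0, 0.0))
```

In this sketch the intrinsics describe each sensor's internal projection, while the extrinsics describe where one sensor sits relative to the other. Mounting angles relative to a mast, such as those in the cited [0042], would enter a model like this through the rotation rather than through `Intrinsics`.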
Regarding claim 4, modified Turpin teaches the limitations of claim 1. Turpin further teaches wherein the first sensor data is image data received from a color camera and the second sensor data is depth data received from a depth sensor (see at least [0048]: “FIG. 4 illustrates a process 400 for determining one or more characteristics of objects in an environment using a plurality of perception modules arranged on a perception mast of a mobile manipulator robot designed in accordance with some embodiments. In act 410, a first color image and first depth information is captured by a first 2D camera (e.g., upper camera 244A) and a first depth sensor (e.g., upper depth sensor 250A) of a first perception module (e.g., upper perception module 244). For instance, in an example of picking boxes from a stack, the first color image and first distance information may represent information about boxes in the top portion of the stack. The process then proceeds to act 412, where a second color image and second depth information is captured by a second 2D camera (e.g., lower camera 244B) and a second depth sensor (e.g., lower depth sensor 250B) of a second perception module (e.g., the lower perception module). Continuing with the box picking example, the second color image and the second depth information may include information about boxes in the bottom portion of the stack, such that a combination of the information captured by the first perception module and the second perception module provides information for a vertical slice of the stack of boxes. Although shown as being performed sequentially in the process 400 of FIG. 4, it should be appreciated that acts 410 and 412 may be performed sequentially or at least partially in parallel using any suitable control strategy, examples of which are described herein.”). Regarding claim 5, modified Turpin teaches the limitations of claim 4. 
Turpin further teaches wherein the depth sensor is a time-of-flight sensor (see at least [0045]: “Perception module 242 also includes depth sensor 330 configured to capture depth information related to objects in the environment. Examples of depth sensor 330 include, but are not limited to, a stereoscopic camera, a time-of-flight camera, LiDAR, or any other depth sensor configured to capture depth information about the environment. In one embodiment, perception module 242 includes two LED-based light sources 310, an RGB monocular camera 320 and a time-of-flight camera 330. As noted above, the arrangement of the particular components within perception module 240 is not limiting, and the components may be arranged in any suitable manner. Preferably the 2D camera 320 and the depth sensor 330 are arranged to provide a similar field of view, which facilitates registration of the information captured by the 2D camera and the depth sensor, as discussed in more detail below.”). Regarding claim 6, modified Turpin teaches the limitations of claim 1. Turpin further teaches wherein the first sensor data is first image data received from a first color camera and the second sensor data is second image data received from a second color camera, wherein the first color camera and the second color camera have different fields of view (see at least Fig. 2A and [0040]: “As shown, the perception mast 240 includes a plurality of perception modules 242 arranged vertically along the perception mast.”; [0041]: “As shown, perception mast 240 also includes a lower perception module including lower 2D camera 244B and lower depth sensor 250A. The lower perception module is arranged along the same side of the perception mast 240 as the upper perception module and is located between the upper perception module and the actuator 255. 
The inventors have recognized that having multiple perception modules located on the perception mast 240 at different locations (e.g., near the top and bottom of the perception mast) provides the robot 200 with imaging capabilities not possible when only a single perception module is included. For instance, the sensors within the upper perception module may have a different field of view that is non-overlapping (or partially overlapping) with the field of view of the sensors within the lower perception module such that the combined field of view of both perception modules is larger than each individual perception module's field of view. Such an expanded field of view may be useful to image a tall stack of boxes or other objects in the environment with which the robot is to interact.”; [0043]: “Examples of 2D camera 320 include, but are not limited to, red-green-blue (RGB) cameras, monochrome cameras, prism cameras, or any other type of 2D camera configured to capture a 2D image of an environment.”). Regarding claim 7, modified Turpin teaches the limitations of claim 6. Turpin further teaches wherein the first color camera and the second color camera have at least partially overlapping fields of view (see at least Fig. 2A and [0040]: “As shown, the perception mast 240 includes a plurality of perception modules 242 arranged vertically along the perception mast.”; [0041]: “As shown, perception mast 240 also includes a lower perception module including lower 2D camera 244B and lower depth sensor 250A. The lower perception module is arranged along the same side of the perception mast 240 as the upper perception module and is located between the upper perception module and the actuator 255. 
The inventors have recognized that having multiple perception modules located on the perception mast 240 at different locations (e.g., near the top and bottom of the perception mast) provides the robot 200 with imaging capabilities not possible when only a single perception module is included. For instance, the sensors within the upper perception module may have a different field of view that is non-overlapping (or partially overlapping) with the field of view of the sensors within the lower perception module such that the combined field of view of both perception modules is larger than each individual perception module's field of view. Such an expanded field of view may be useful to image a tall stack of boxes or other objects in the environment with which the robot is to interact.”; [0043]: “Examples of 2D camera 320 include, but are not limited to, red-green-blue (RGB) cameras, monochrome cameras, prism cameras, or any other type of 2D camera configured to capture a 2D image of an environment.”). Regarding claim 8, modified Turpin teaches the limitations of claim 6. Turpin further teaches wherein the camera intrinsics includes first camera intrinsics for the first color camera and second camera intrinsics for the second color camera (see at least [0042]: “In some embodiments, one or both of the 2D camera and the depth sensor included within a perception module may have a fixed orientation (e.g., they may not actively pan and/or tilt). Additionally, the sensors within the upper and lower perception modules may be oriented at the same angle relative to the perception mast 240 or may be oriented at different angles relative to the perception mast to capture a desired field of view. 
For instance, the sensors of the upper perception module may be oriented to capture information about the environment at an angle of 90° relative to the vertical axis of the perception mast 240, whereas the sensors of the lower perception module may be oriented to capture information about the environment at an angle of 70° relative to the vertical axis of the perception mast 240 (i.e., facing downward toward the mobile base) to enable capture of information located near the mobile base. As shown, in some embodiments, the lower perception module may be arranged along the perception mast 240 at a location above actuator 255 that enables capture of information near the mobile base, but without including the mobile base itself (or including only limited portions of the mobile base) in the captured information.”; [0051]: “As part of the registration process distortion in one or both of the color image and the depth information caused, for example, by motion of the mobile robot or objects in the environment, may be corrected. Several other factors may additionally or alternatively be taken into account to properly register the RGB image and the depth information. For example, these factors include the intrinsic properties of the cameras (e.g., focal lengths, principal points of the cameras) and the extrinsic properties of the cameras (e.g., the precise position and orientations of the RGB camera and the TOF depth sensor camera with respect to each other). A calibration sequence executed for each set of sensors in a perception module may be performed to determine these intrinsic and extrinsic properties for use in registering the RGB image and the depth information to generate an RGBD image in act 512 of process 500.”). Regarding claim 15, modified Turpin teaches the limitations of claim 1. 
Turpin further teaches wherein the at least one machine learning model is configured to determine a first hypothesis and a second hypothesis for a polyhedron of the one or more polyhedrons, each of the first hypothesis and the second hypothesis being a pose and/or size estimate for the polyhedron of the one or more polyhedrons (see at least [0049]: “After capturing 2D color and depth information from each of the plurality of perception modules, process 400 proceeds to act 414, where one or more characteristics of one or more objects in the environment are determined based on the captured information”; [0052]: “Process 500 then proceeds to act 514, where one or more characteristics of objects in the environment are determined based on the RGBD image generated in act 512. In some embodiments, the RGBD image is provided as input to a trained statistical model (e.g., a machine learning model) that has been trained to identify the one or more characteristics. For instance, in the box picking example, the statistical model may be trained to recognize surfaces (e.g., faces) of boxes arranged in a stack. In another example, the statistical model may be trained to recognize other object characteristics such as the shape of signs, a category or type of object in the path of motion of the robot, or any other characteristic of one or more objects in the environment. 
Any suitable type of trained statistical model may be used to process an RGBD image and output one or more characteristics of object(s) in the environment.”; [0056]: “In act 618, the trained statistical model (e.g., BoxDetector) outputs one or more characteristics of objects in the environment (e.g., identified surfaces of boxes), and information about the characteristic(s) is provided to the control circuity to perform one or more actions based, at least in part, on the identified characteristic(s).”), and the polyhedron information is based on the first pose and/or size hypothesis or the second pose and/or size hypothesis (see at least [0049]: “After capturing 2D color and depth information from each of the plurality of perception modules, process 400 proceeds to act 414, where one or more characteristics of one or more objects in the environment are determined based on the captured information”; [0051]: “Process 500 then proceeds to act 512, where the RGB image and the depth information is combined to generate an RGBD image. The RGBD image may be conceptualized as a high-fidelity colorized 3D point cloud, which includes both color appearance as well as depth data and 3D geometric structure of objects in the environment.”). Turpin fails to explicitly teach that the polyhedron information is a three-dimensional (3D) representation of one or more polyhedrons. However, Malisiewicz teaches a system and method for cuboid detection that outputs a three-dimensional (3D) representation of one or more polyhedrons (see at least Figs. 2, 4-5, and [0020]: “For example, a cuboid detector (such as a deep cuboid detector) can process a consumer-quality Red-Green-Blue (RGB) image of a cluttered scene and localize some or all three-dimensional (3D) cuboids in the image. A cuboid can comprise a boxy or a box-like object and can include a polyhedron (which may be convex) with, e.g., 4, 5, 6, 7, 8, 10, 12, or more faces. 
For example, cuboids can include pyramids, cubes, prisms, parallelepipeds, etc. Cuboids are not limited to such polyhedral shapes from geometry and can include box-like structures such as, e.g., appliances (e.g., television sets, computer monitors, toasters, washing machines, refrigerators), furniture (e.g., sofas, chairs, beds, cribs, tables, book cases, cabinets), vehicles (e.g., automobiles, buses), etc.”). Therefore, it would have been obvious to someone of ordinary skill in the art before the effective filing date of the claimed invention to have modified Turpin to incorporate the teachings of Malisiewicz and provide a three-dimensional (3D) representation of one or more polyhedrons, with a reasonable expectation of success, in order to produce a 3D interpretation of the object and improve the accuracy of the detected keypoints [0023].

Regarding claim 16, modified Turpin teaches the limitations of claim 1. Turpin further teaches wherein controlling the mobile robot to perform an action based, at least in part, on the polyhedron information comprises: controlling the mobile robot to grasp a first object of the set of objects based, at least in part, on the polyhedron information; and/or controlling the mobile robot to orient an end effector of the mobile robot based, at least in part, on the polyhedron information (see at least [0049]: “After capturing 2D color and depth information from each of the plurality of perception modules, process 400 proceeds to act 414, where one or more characteristics of one or more objects in the environment are determined based on the captured information. Continuing with the box picking example, the characteristics may include faces of boxes in the stack using a box detection model trained to identify the faces of boxes in a stack based on the captured information. For other tasks or scenarios, the characteristic(s) determined in act 414 may be different. 
For instance, when the mobile manipulator robot is driving down an aisle of a warehouse, the perception modules may be configured to capture information, and the captured information may be used to detect obstructions in the robot's path, visual identifiers (e.g., barcodes located in the environment), or any other suitable characteristics of objects in the environment. Illustrative examples of how the captured information is combined to determine object characteristics is described in further detail below. After determining the one or more characteristics of objects in the environment, process 400 proceeds to act 416 where one or more actions are performed based on the determined characteristic(s). Returning to the box picking example, after box faces in a stack are identified in act 414, the action performed in act 416 may include one or more of determining a next box in the stack to pick, updating a trajectory plan for the manipulator arm of the robot to pick a next box in the stack, determining whether to pick the next box in the stack using a top pick or a face pick, or controlling the manipulator arm of the robot to pick the next box in the stack. Of course, additional or alternative actions may also be performed depending on the task the robot is currently performing or will perform next. For instance, the object with which the manipulator arm may interact with next may not be arranged in a stack, but may be located in any configuration in the environment of the robot.”). Turpin fails to explicitly teach that the polyhedron information is a 3D representation of the one or more polyhedrons. However, Malisiewicz teaches a system and method for cuboid detection that outputs a 3D representation of one or more polyhedrons (see at least Figs. 
2, 4-5, and [0020]: “For example, a cuboid detector (such as a deep cuboid detector) can process a consumer-quality Red-Green-Blue (RGB) image of a cluttered scene and localize some or all three-dimensional (3D) cuboids in the image. A cuboid can comprise a boxy or a box-like object and can include a polyhedron (which may be convex) with, e.g., 4, 5, 6, 7, 8, 10, 12, or more faces. For example, cuboids can include pyramids, cubes, prisms, parallelepipeds, etc. Cuboids are not limited to such polyhedral shapes from geometry and can include box-like structures such as, e.g., appliances (e.g., television sets, computer monitors, toasters, washing machines, refrigerators), furniture (e.g., sofas, chairs, beds, cribs, tables, book cases, cabinets), vehicles (e.g., automobiles, buses), etc.”). Therefore, it would have been obvious to someone of ordinary skill in the art before the effective filing date of the claimed invention to have modified Turpin to incorporate the teachings of Malisiewicz and provide a 3D representation of one or more polyhedrons, with a reasonable expectation of success, in order to produce a 3D interpretation of an object and improve the accuracy of keypoints detected (Malisiewicz, [0023]). Regarding claim 17, modified Turpin teaches the limitations of claim 1. Turpin further teaches wherein the set of objects includes a set of boxes, and the at least one machine learning model includes a box detection model (see at least Figs. 8A-8C and [0050]: “To ensure that the entire width of the entire stack is considered when identifying boxes in the stack to, for example, determine a next box to pick, the perception mast may be rotated from left to right (or right to left), and while the perception mast is moving (or during short pauses between movements) the perception modules may capture information for multiple points in space that collectively cover the entire width of the stack of boxes.
In some embodiments, the captured information may be stitched together into a single image that is provided to a trained box detection model (or other trained model depending on the particular task being performed by the robot). In other embodiments, each captured image may be provided separately to the box detection model and the results of the output for the model for each image may be considered together to perform box detection. Capturing images during movement of the perception mast and/or the mobile base may also be advantageous for other tasks, such as capturing perception information as the robot drives down an aisle of a warehouse to facilitate navigation of the robot and/or to detect markers located on physical surfaces in the warehouse to provide the robot with information that may inform its operation.”). Regarding claim 18, modified Turpin teaches the limitations of claim 1. Turpin further teaches wherein at least one object in the set of objects is represented by at least two polyhedrons (see at least Figs. 8A-8C and [0033]: “FIG. 8C depicts a robot 30a performing an order building task, in which the robot 30a places boxes 31 onto a pallet 33.”; [0050]: “To ensure that the entire width of the entire stack is considered when identifying boxes in the stack to, for example, determine a next box to pick, the perception mast may be rotated from left to right (or right to left), and while the perception mast is moving (or during short pauses between movements) the perception modules may capture information for multiple points in space that collectively cover the entire width of the stack of boxes. In some embodiments, the captured information may be stitched together into a single image that is provided to a trained box detection model (or other trained model depending on the particular task being performed by the robot). 
In other embodiments, each captured image may be provided separately to the box detection model and the results of the output for the model for each image may be considered together to perform box detection. Capturing images during movement of the perception mast and/or the mobile base may also be advantageous for other tasks, such as capturing perception information as the robot drives down an aisle of a warehouse to facilitate navigation of the robot and/or to detect markers located on physical surfaces in the warehouse to provide the robot with information that may inform its operation.”). Regarding claim 19, Turpin teaches a mobile robot, comprising: at least one first sensor module configured to sense first sensor data (see at least Fig. 1A-4 and [0048]: “FIG. 4 illustrates a process 400 for determining one or more characteristics of objects in an environment using a plurality of perception modules arranged on a perception mast of a mobile manipulator robot designed in accordance with some embodiments. In act 410, a first color image and first depth information is captured by a first 2D camera (e.g., upper camera 244A) and a first depth sensor (e.g., upper depth sensor 250A) of a first perception module (e.g., upper perception module 244). For instance, in an example of picking boxes from a stack, the first color image and first distance information may represent information about boxes in the top portion of the stack.”); at least one second sensor module configured to sense second sensor data (see at least Fig. 1A-4 and [0048]: “The process then proceeds to act 412, where a second color image and second depth information is captured by a second 2D camera (e.g., lower camera 244B) and a second depth sensor (e.g., lower depth sensor 250B) of a second perception module (e.g., the lower perception module). 
Continuing with the box picking example, the second color image and the second depth information may include information about boxes in the bottom portion of the stack, such that a combination of the information captured by the first perception module and the second perception module provides information for a vertical slice of the stack of boxes. Although shown as being performed sequentially in the process 400 of FIG. 4, it should be appreciated that acts 410 and 412 may be performed sequentially or at least partially in parallel using any suitable control strategy, examples of which are described herein.”); a processor (see at least [0059]: “An illustrative implementation of a computing system that may be used in connection with any of the embodiments of the disclosure provided herein is shown in FIG. 9. For example, any of the computing devices described above may be implemented as computing system 900. The computer system 900 may include one or more computer hardware processors 902 and one or more articles of manufacture that comprise non-transitory computer-readable storage media (e.g., memory 904 and one or more non-volatile storage devices 906).”) configured to: receive the first sensor data from the first sensor module and the second sensor data from the second sensor module (see at least Fig. 1A-4 and [0048]: “FIG. 4 illustrates a process 400 for determining one or more characteristics of objects in an environment using a plurality of perception modules arranged on a perception mast of a mobile manipulator robot designed in accordance with some embodiments. In act 410, a first color image and first depth information is captured by a first 2D camera (e.g., upper camera 244A) and a first depth sensor (e.g., upper depth sensor 250A) of a first perception module (e.g., upper perception module 244). 
For instance, in an example of picking boxes from a stack, the first color image and first distance information may represent information about boxes in the top portion of the stack. The process then proceeds to act 412, where a second color image and second depth information is captured by a second 2D camera (e.g., lower camera 244B) and a second depth sensor (e.g., lower depth sensor 250B) of a second perception module (e.g., the lower perception module). Continuing with the box picking example, the second color image and the second depth information may include information about boxes in the bottom portion of the stack, such that a combination of the information captured by the first perception module and the second perception module provides information for a vertical slice of the stack of boxes. Although shown as being performed sequentially in the process 400 of FIG. 4, it should be appreciated that acts 410 and 412 may be performed sequentially or at least partially in parallel using any suitable control strategy, examples of which are described herein.”); and provide as input to at least one machine learning model, the first sensor data, the second sensor data, and camera intrinsics associated with at least one camera configured to sense the first sensor data and/or the second sensor data (see at least [0042]: “In some embodiments, one or both of the 2D camera and the depth sensor included within a perception module may have a fixed orientation (e.g., they may not actively pan and/or tilt). Additionally, the sensors within the upper and lower perception modules may be oriented at the same angle relative to the perception mast 240 or may be oriented at different angles relative to the perception mast to capture a desired field of view. 
For instance, the sensors of the upper perception module may be oriented to capture information about the environment at an angle of 90° relative to the vertical axis of the perception mast 240, whereas the sensors of the lower perception module may be oriented to capture information about the environment at an angle of 70° relative to the vertical axis of the perception mast 240 (i.e., facing downward toward the mobile base) to enable capture of information located near the mobile base.”; [0051]: “For example, these factors include the intrinsic properties of the cameras (e.g., focal lengths, principal points of the cameras) and the extrinsic properties of the cameras (e.g., the precise position and orientations of the RGB camera and the TOF depth sensor camera with respect to each other). A calibration sequence executed for each set of sensors in a perception module may be performed to determine these intrinsic and extrinsic properties for use in registering the RGB image and the depth information to generate an RGBD image in act 512 of process 500.”; [0052]: “Process 500 then proceeds to act 514, where one or more characteristics of objects in the environment are determined based on the RGBD image generated in act 512. In some embodiments, the RGBD image is provided as input to a trained statistical model (e.g., a machine learning model) that has been trained to identify the one or more characteristics. For instance, in the box picking example, the statistical model may be trained to recognize surfaces (e.g., faces) of boxes arranged in a stack. In another example, the statistical model may be trained to recognize other object characteristics such as the shape of signs, a category or type of object in the path of motion of the robot, or any other characteristic of one or more objects in the environment. 
Any suitable type of trained statistical model may be used to process an RGBD image and output one or more characteristics of object(s) in the environment.”), wherein the at least one machine learning model is trained to output polyhedron information representing a set of objects in an environment of the mobile robot (see at least [0056]: “In act 612, the sensor manager receives the request and generates a request to trigger one or multiple of the 2D cameras and depth sensors included in one or more perception modules arranged on the perception mast. In act 614, the RGBD camera software receives the request generated in act 612 and interfaces with the appropriate camera(s) and depth sensor(s) to begin capture of the corresponding information…In act 618, the trained statistical model (e.g., BoxDetector) outputs one or more characteristics of objects in the environment (e.g., identified surfaces of boxes), and information about the characteristic(s) is provided to the control circuity to perform one or more actions based, at least in part, on the identified characteristic(s).”); and a controller configured to control the mobile robot to perform an action based, at least in part, on the polyhedron information output from the at least one machine learning model (see at least [0049]: “After capturing 2D color and depth information from each of the plurality of perception modules, process 400 proceeds to act 414, where one or more characteristics of one or more objects in the environment are determined based on the captured information. Continuing with the box picking example, the characteristics may include faces of boxes in the stack using a box detection model trained to identify the faces of boxes in a stack based on the captured information…After determining the one or more characteristics of objects in the environment, process 400 proceeds to act 416 where one or more actions are performed based on the determined characteristic(s). 
Returning to the box picking example, after box faces in a stack are identified in act 414, the action performed in act 416 may include one or more of determining a next box in the stack to pick, updating a trajectory plan for the manipulator arm of the robot to pick a next box in the stack, determining whether to pick the next box in the stack using a top pick or a face pick, or controlling the manipulator arm of the robot to pick the next box in the stack. Of course, additional or alternative actions may also be performed depending on the task the robot is currently performing or will perform next. For instance, the object with which the manipulator arm may interact with next may not be arranged in a stack, but may be located in any configuration in the environment of the robot.”; [0056]: “In act 618, the trained statistical model (e.g., BoxDetector) outputs one or more characteristics of objects in the environment (e.g., identified surfaces of boxes), and information about the characteristic(s) is provided to the control circuity to perform one or more actions based, at least in part, on the identified characteristic(s).”). Turpin fails to explicitly teach that the polyhedron information is a three-dimensional (3D) representation of one or more polyhedrons representing a set of objects in an environment. However, Malisiewicz teaches a system and method for cuboid detection that outputs a three-dimensional (3D) representation of one or more polyhedrons representing a set of objects in an environment (see at least Figs. 2, 4-5, and [0020]: “For example, a cuboid detector (such as a deep cuboid detector) can process a consumer-quality Red-Green-Blue (RGB) image of a cluttered scene and localize some or all three-dimensional (3D) cuboids in the image. A cuboid can comprise a boxy or a box-like object and can include a polyhedron (which may be convex) with, e.g., 4, 5, 6, 7, 8, 10, 12, or more faces. 
For example, cuboids can include pyramids, cubes, prisms, parallelepipeds, etc. Cuboids are not limited to such polyhedral shapes from geometry and can include box-like structures such as, e.g., appliances (e.g., television sets, computer monitors, toasters, washing machines, refrigerators), furniture (e.g., sofas, chairs, beds, cribs, tables, book cases, cabinets), vehicles (e.g., automobiles, buses), etc.”). Therefore, it would have been obvious to someone of ordinary skill in the art before the effective filing date of the claimed invention to have modified Turpin to incorporate the teachings of Malisiewicz and provide a means to output a three-dimensional (3D) representation of one or more polyhedrons representing a set of objects in an environment, with a reasonable expectation of success, in order to produce a 3D interpretation of an object and improve the accuracy of keypoints detected (Malisiewicz, [0023]). Regarding claim 20, Turpin teaches a non-transitory computer readable medium including a plurality of processor executable instructions stored thereon that, when executed by a processor (see at least [0059]: “An illustrative implementation of a computing system that may be used in connection with any of the embodiments of the disclosure provided herein is shown in FIG. 9. For example, any of the computing devices described above may be implemented as computing system 900.
The computer system 900 may include one or more computer hardware processors 902 and one or more articles of manufacture that comprise non-transitory computer-readable storage media (e.g., memory 904 and one or more non-volatile storage devices 906).”), perform a method of: providing as input to at least one machine learning model, first sensor data, second sensor data, and camera intrinsics associated with at least one camera configured to sense the first sensor data and/or the second sensor data (see at least [0042]: “In some embodiments, one or both of the 2D camera and the depth sensor included within a perception module may have a fixed orientation (e.g., they may not actively pan and/or tilt). Additionally, the sensors within the upper and lower perception modules may be oriented at the same angle relative to the perception mast 240 or may be oriented at different angles relative to the perception mast to capture a desired field of view. For instance, the sensors of the upper perception module may be oriented to capture information about the environment at an angle of 90° relative to the vertical axis of the perception mast 240, whereas the sensors of the lower perception module may be oriented to capture information about the environment at an angle of 70° relative to the vertical axis of the perception mast 240 (i.e., facing downward toward the mobile base) to enable capture of information located near the mobile base.”; [0051]: “For example, these factors include the intrinsic properties of the cameras (e.g., focal lengths, principal points of the cameras) and the extrinsic properties of the cameras (e.g., the precise position and orientations of the RGB camera and the TOF depth sensor camera with respect to each other). 
A calibration sequence executed for each set of sensors in a perception module may be performed to determine these intrinsic and extrinsic properties for use in registering the RGB image and the depth information to generate an RGBD image in act 512 of process 500.”; [0052]: “Process 500 then proceeds to act 514, where one or more characteristics of objects in the environment are determined based on the RGBD image generated in act 512. In some embodiments, the RGBD image is provided as input to a trained statistical model (e.g., a machine learning model) that has been trained to identify the one or more characteristics. For instance, in the box picking example, the statistical model may be trained to recognize surfaces (e.g., faces) of boxes arranged in a stack. In another example, the statistical model may be trained to recognize other object characteristics such as the shape of signs, a category or type of object in the path of motion of the robot, or any other characteristic of one or more objects in the environment. Any suitable type of trained statistical model may be used to process an RGBD image and output one or more characteristics of object(s) in the environment.”), wherein the at least one machine learning model is trained to output polyhedron information representing a set of objects in an environment of a mobile robot (see at least [0056]: “In act 612, the sensor manager receives the request and generates a request to trigger one or multiple of the 2D cameras and depth sensors included in one or more perception modules arranged on the perception mast. 
In act 614, the RGBD camera software receives the request generated in act 612 and interfaces with the appropriate camera(s) and depth sensor(s) to begin capture of the corresponding information…In act 618, the trained statistical model (e.g., BoxDetector) outputs one or more characteristics of objects in the environment (e.g., identified surfaces of boxes), and information about the characteristic(s) is provided to the control circuity to perform one or more actions based, at least in part, on the identified characteristic(s).”); and controlling a mobile robot to perform an action based, at least in part, on the polyhedron information output from the at least one machine learning model (see at least [0049]: “After capturing 2D color and depth information from each of the plurality of perception modules, process 400 proceeds to act 414, where one or more characteristics of one or more objects in the environment are determined based on the captured information. Continuing with the box picking example, the characteristics may include faces of boxes in the stack using a box detection model trained to identify the faces of boxes in a stack based on the captured information…After determining the one or more characteristics of objects in the environment, process 400 proceeds to act 416 where one or more actions are performed based on the determined characteristic(s). Returning to the box picking example, after box faces in a stack are identified in act 414, the action performed in act 416 may include one or more of determining a next box in the stack to pick, updating a trajectory plan for the manipulator arm of the robot to pick a next box in the stack, determining whether to pick the next box in the stack using a top pick or a face pick, or controlling the manipulator arm of the robot to pick the next box in the stack. Of course, additional or alternative actions may also be performed depending on the task the robot is currently performing or will perform next. 
For instance, the object with which the manipulator arm may interact with next may not be arranged in a stack, but may be located in any configuration in the environment of the robot.”; [0056]: “In act 618, the trained statistical model (e.g., BoxDetector) outputs one or more characteristics of objects in the environment (e.g., identified surfaces of boxes), and information about the characteristic(s) is provided to the control circuity to perform one or more actions based, at least in part, on the identified characteristic(s).”). Turpin fails to explicitly teach that the polyhedron information is a three-dimensional (3D) representation of one or more polyhedrons representing a set of objects in an environment. However, Malisiewicz teaches a system and method for cuboid detection that outputs a three-dimensional (3D) representation of one or more polyhedrons representing a set of objects in an environment (see at least Figs. 2, 4-5, and [0020]: “For example, a cuboid detector (such as a deep cuboid detector) can process a consumer-quality Red-Green-Blue (RGB) image of a cluttered scene and localize some or all three-dimensional (3D) cuboids in the image. A cuboid can comprise a boxy or a box-like object and can include a polyhedron (which may be convex) with, e.g., 4, 5, 6, 7, 8, 10, 12, or more faces. For example, cuboids can include pyramids, cubes, prisms, parallelepipeds, etc. Cuboids are not limited to such polyhedral shapes from geometry and can include box-like structures such as, e.g., appliances (e.g., television sets, computer monitors, toasters, washing machines, refrigerators), furniture (e.g., sofas, chairs, beds, cribs, tables, book cases, cabinets), vehicles (e.g., automobiles, buses), etc.”). 
Therefore, it would have been obvious to someone of ordinary skill in the art before the effective filing date of the claimed invention to have modified Turpin to incorporate the teachings of Malisiewicz and provide a means to output a three-dimensional (3D) representation of one or more polyhedrons representing a set of objects in an environment, with a reasonable expectation of success, in order to produce a 3D interpretation of an object and improve the accuracy of keypoints detected (Malisiewicz, [0023]). Regarding claim 21, modified Turpin teaches the limitations of claim 1. Turpin further teaches wherein the polyhedron information includes, for each polyhedron of the one or more polyhedrons, a pose and/or size of the polyhedron (see at least [0049]: “After capturing 2D color and depth information from each of the plurality of perception modules, process 400 proceeds to act 414, where one or more characteristics of one or more objects in the environment are determined based on the captured information”; [0051]: “FIG. 5 illustrates a process 500 for combining information captured from a perception module that includes an RGB monocular camera and a time-of-flight (TOF) depth sensor to determine one or more characteristics of objects in the environment. In act 510, an RGB image is captured from the RGB monocular camera and depth information is captured by the TOF depth sensor in the perception module. Process 500 then proceeds to act 512, where the RGB image and the depth information is combined to generate an RGBD image. The RGBD image may be conceptualized as a high-fidelity colorized 3D point cloud, which includes both color appearance as well as depth data and 3D geometric structure of objects in the environment.”). Turpin fails to explicitly teach that the polyhedron information is a three-dimensional (3D) representation of one or more polyhedrons.
However, Malisiewicz teaches a system and method for cuboid detection that outputs a three-dimensional (3D) representation of one or more polyhedrons (see at least Figs. 2, 4-5, and [0020]: “For example, a cuboid detector (such as a deep cuboid detector) can process a consumer-quality Red-Green-Blue (RGB) image of a cluttered scene and localize some or all three-dimensional (3D) cuboids in the image. A cuboid can comprise a boxy or a box-like object and can include a polyhedron (which may be convex) with, e.g., 4, 5, 6, 7, 8, 10, 12, or more faces. For example, cuboids can include pyramids, cubes, prisms, parallelepipeds, etc. Cuboids are not limited to such polyhedral shapes from geometry and can include box-like structures such as, e.g., appliances (e.g., television sets, computer monitors, toasters, washing machines, refrigerators), furniture (e.g., sofas, chairs, beds, cribs, tables, book cases, cabinets), vehicles (e.g., automobiles, buses), etc.”). Therefore, it would have been obvious to someone of ordinary skill in the art before the effective filing date of the claimed invention to have modified Turpin to incorporate the teachings of Malisiewicz and provide a three-dimensional (3D) representation of one or more polyhedrons, with a reasonable expectation of success, in order to produce a 3D interpretation of an object and improve the accuracy of keypoints detected (Malisiewicz, [0023]). Regarding claim 22, modified Turpin teaches the limitations of claim 1. Turpin further teaches wherein the pose and/or size of the polyhedron is specified as a size relative of the polyhedron (see at least [0051]: “FIG. 5 illustrates a process 500 for combining information captured from a perception module that includes an RGB monocular camera and a time-of-flight (TOF) depth sensor to determine one or more characteristics of objects in the environment.
In act 510, an RGB image is captured from the RGB monocular camera and depth information is captured by the TOF depth sensor in the perception module. Process 500 then proceeds to act 512, where the RGB image and the depth information is combined to generate an RGBD image. The RGBD image may be conceptualized as a high-fidelity colorized 3D point cloud, which includes both color appearance as well as depth data and 3D geometric structure of objects in the environment.”). Turpin fails to explicitly teach that the pose and/or size of the polyhedron is specified as a distance, rotation and/or size relative to a predicted center point of the polyhedron. However, Malisiewicz teaches a system and method for cuboid detection that outputs a pose and/or size of a polyhedron that is specified as a distance, rotation and/or size relative to a predicted center point of the polyhedron (see at least [0035]: “The cuboid detector 200 can implement a deep cuboid detection pipeline. The first action of the deep cuboid detection pipeline can be determining Regions of Interest (RoIs) 220a1, 220b, in an image 202a where a cuboid might be present…In some implementations, instead of just producing a 2D bounding box, the cuboid detector 200 can output the normalized offsets of the vertices from the center of the RoI 220a1, 220b. The cuboid detector 200 can refine the predictions by performing iterative feature pooling. The dashed lines in FIG. 2 show the regions 224a, 224b of the convolutional feature map 228, corresponding to the RoI 220a1 in the image 202b and a refined RoI 220a2 in the image 202c, from which features can be pooled. 
The two fully connected layers 216 can process the region 224b of the convolutional feature map 228 corresponding to the refined RoI 220a2 to determine a further refined RoI and/or a representation of a cuboid 232 in the image 202d.”; [0039]: “The bounding box regression values (Δx, Δy, Δw, Δh) can be used to fit the initial object proposal tightly around the object. The keypoint locations can be encoded as offsets from the center of the RoI and can be normalized by the proposal width/height as shown in FIG. 3.”). Therefore, it would have been obvious to someone of ordinary skill in the art before the effective filing date of the claimed invention to have modified Turpin to incorporate the teachings of Malisiewicz and provide a pose and/or size of a polyhedron that is specified as a distance, rotation and/or size relative to a predicted center point of the polyhedron, with a reasonable expectation of success, in order to refine a representation of the polyhedron (Malisiewicz, [0035]). Claim Rejections - 35 USC § 103 5. Claims 9-10 are rejected under 35 U.S.C. 103 as being unpatentable over Turpin et al. (US 20220305680, hereinafter Turpin) in view of Sagong et al. (US 20240290007, hereinafter Sagong) and in further view of Malisiewicz et al. (US 20180137642, hereinafter Malisiewicz). Regarding claim 9, modified Turpin teaches the limitations of claim 6. Turpin fails to explicitly teach wherein the at least one machine learning model is configured to determine a joint feature map based on the first image data and the second image data, wherein the polyhedron information is based on the joint feature map. However, Sagong teaches a method and apparatus for image generation based on neural scene representation that comprises a machine learning model configured to determine a joint feature map based on first image data and second image data, wherein polyhedron information is based on the joint feature map (see at least Fig.
1 and [0006]: “In one or more general aspects, a processor-implemented method includes: extracting pyramid level color feature maps from two or more images; extracting pyramid level density feature maps based on a cost volume generated based on the color feature maps; generating neural scene representation (NSR) cube information representing a three-dimensional (3D) space based on the color feature maps and the density feature maps; and generating a two-dimensional (2D) scene of a field of view (FOV) different from a FOV of the two or more images based on the NSR cube information.”; [0067]: “The NSR data (e.g., an NSR statistical value) estimated by the encoder neural network may be stored as the NSR cube 208. The electronic device may generate a color feature map and a density feature map for each viewpoint based on the encoder neural network, and construct the NSR cube 208. A result of the color transformation operation on the color feature map and the density transformation operation on the density feature map may be data having the same dimension as the NSR cube 208. For example, when the NSR cube 208 has N voxels (or grid cells), a result of each transformation operation for color and density may have N mean values and N density values. In this example, N may be an integer greater than or equal to 1.”; [0109]: “The density transformer 850 may include a well-known transformer network. However, examples are not limited to the density transformer 850, but other machine learning models designed and trained to extract NSR features from a representative density feature map (e.g., the representative density feature map 807-1) may also be used.” Sagong teaches utilizing machine learning models to determine a joint feature map (color feature map and density feature map) to generate neural scene representation (NSR) cube information (polyhedron information) representing a three-dimensional (3D) space of an object).
Therefore, it would have been obvious to someone of ordinary skill in the art before the effective filing date of the claimed invention to have modified Turpin to incorporate the teachings of Sagong and provide a machine learning model configured to determine a joint feature map based on a first image data and a second image data, wherein a polyhedron information is based on the joint feature map, with a reasonable expectation of success, in order to reconstruct the polyhedron information with joint features from the images. The combination of Turpin and Sagong fails to explicitly teach that the polyhedron information is a 3D representation of the one or more polyhedrons. However, Malisiewicz teaches a system and method for cuboid detection that outputs a 3D representation of one or more polyhedrons (see at least Figs. 2, 4-5, and [0020]: “For example, a cuboid detector (such as a deep cuboid detector) can process a consumer-quality Red-Green-Blue (RGB) image of a cluttered scene and localize some or all three-dimensional (3D) cuboids in the image. A cuboid can comprise a boxy or a box-like object and can include a polyhedron (which may be convex) with, e.g., 4, 5, 6, 7, 8, 10, 12, or more faces. For example, cuboids can include pyramids, cubes, prisms, parallelepipeds, etc. Cuboids are not limited to such polyhedral shapes from geometry and can include box-like structures such as, e.g., appliances (e.g., television sets, computer monitors, toasters, washing machines, refrigerators), furniture (e.g., sofas, chairs, beds, cribs, tables, book cases, cabinets), vehicles (e.g., automobiles, buses), etc.”). 
Therefore, it would have been obvious to someone of ordinary skill in the art before the effective filing date of the claimed invention to have modified Turpin and Sagong to incorporate the teachings of Malisiewicz and provide a 3D representation of one or more polyhedrons, with a reasonable expectation of success, in order to produce a 3D interpretation of the object and improve the accuracy of keypoints detected [0023].

Regarding claim 10, modified Turpin teaches the limitations of claim 6. Turpin fails to explicitly teach wherein the at least one machine learning model is configured to: determine a first feature map based on the first image data; determine a second feature map based on the second image data; and perform feature matching based on the first feature map and the second feature map to generate a correlation volume, wherein the polyhedron information is based on the correlation volume. However, Sagong teaches a method and apparatus for image generation based on neural scene representation that comprises a machine learning model configured to: determine a first feature map based on the first image data (see at least Fig. 1 and [0062]: “The encoder neural network may estimate NSR data for representing a 3D scene based on input images. For example, the electronic device may extract a feature map (e.g., a 2D color feature map) from a pair of input images (e.g., a first image 201-1 and a second image 201-2) through the stereo-specific encoder 210. The electronic device may extract pyramid level color feature maps 202 using the stereo-specific encoder 210. The stereo-specific encoder 210 may include a 2D convolutional neural network (CNN). An example of the structure of the stereo-specific encoder 210 will be described below with reference to FIG. 4. 
The electronic device may generate pyramid level density feature maps 204 using the 3D encoder 230 from a cost volume 203 obtained based on the pyramid level color feature maps 202.”); determine a second feature map based on the second image data (see at least [0066]: “The density transformer 270 of the electronic device may perform a density transformation operation on a representative density feature map 207 of the pyramid level density feature maps 204.”); and perform feature matching based on the first feature map and the second feature map to generate a correlation volume, wherein the polyhedron information is based on the correlation volume (see at least [0013]: “The extracting of the density feature maps may include: generating a cost volume based on a correlation for each pyramid level between color feature maps extracted from a first image and color feature maps extracted from a second image; and generating a density feature map for a corresponding pyramid level based on the cost volume.”; [0067]: “The NSR data (e.g., an NSR statistical value) estimated by the encoder neural network may be stored as the NSR cube 208. The electronic device may generate a color feature map and a density feature map for each viewpoint based on the encoder neural network, and construct the NSR cube 208. A result of the color transformation operation on the color feature map and the density transformation operation on the density feature map may be data having the same dimension as the NSR cube 208. For example, when the NSR cube 208 has N voxels (or grid cells), a result of each transformation operation for color and density may have N mean values and N density values. 
In this example, N may be an integer greater than or equal to 1.” Sagong teaches generating a cost volume based on a correlation for each pyramid level between color feature maps extracted from a first image and color feature maps extracted from a second image (first feature map and second feature map) in order to construct an NSR cube (polyhedron information).). Therefore, it would have been obvious to someone of ordinary skill in the art before the effective filing date of the claimed invention to have modified Turpin to incorporate the teachings of Sagong and provide a machine learning model configured to: determine a first feature map based on the first image data; determine a second feature map based on the second image data; and perform feature matching based on the first feature map and the second feature map to generate a correlation volume, wherein the polyhedron information is based on the correlation volume, with a reasonable expectation of success, in order to reconstruct the polyhedron information with joint features from the images. The combination of Turpin and Sagong fails to explicitly teach that the polyhedron information is a 3D representation of the one or more polyhedrons. However, Malisiewicz teaches a system and method for cuboid detection that outputs a 3D representation of one or more polyhedrons (see at least Figs. 2, 4-5, and [0020]: “For example, a cuboid detector (such as a deep cuboid detector) can process a consumer-quality Red-Green-Blue (RGB) image of a cluttered scene and localize some or all three-dimensional (3D) cuboids in the image. A cuboid can comprise a boxy or a box-like object and can include a polyhedron (which may be convex) with, e.g., 4, 5, 6, 7, 8, 10, 12, or more faces. For example, cuboids can include pyramids, cubes, prisms, parallelepipeds, etc. 
Cuboids are not limited to such polyhedral shapes from geometry and can include box-like structures such as, e.g., appliances (e.g., television sets, computer monitors, toasters, washing machines, refrigerators), furniture (e.g., sofas, chairs, beds, cribs, tables, book cases, cabinets), vehicles (e.g., automobiles, buses), etc.”). Therefore, it would have been obvious to someone of ordinary skill in the art before the effective filing date of the claimed invention to have modified Turpin and Sagong to incorporate the teachings of Malisiewicz and provide a 3D representation of one or more polyhedrons, with a reasonable expectation of success, in order to produce a 3D interpretation of the object and improve the accuracy of keypoints detected [0023].

Claim Rejections - 35 USC § 103

6. Claims 11-14 is/are rejected under 35 U.S.C. 103 as being unpatentable over Turpin et al. (US 20220305680, hereinafter Turpin) and Malisiewicz et al. (US 20180137642, hereinafter Malisiewicz) in view of Birchfield et al. (US 20220277472, hereinafter Birchfield).

Regarding claim 11, modified Turpin teaches the limitations of claim 1. Turpin fails to explicitly teach wherein the 3D representation of the one or more polyhedrons is based on a pose estimate and size estimate for each polyhedron of the one or more polyhedrons. However, Birchfield teaches a method and system for determining a pose and relative dimensions of an object from an image wherein a 3D representation of one or more polyhedrons is based on a pose estimate and size estimate for each polyhedron of the one or more polyhedrons (see at least Fig. 3 and [0060]: “Techniques and systems described herein relate to techniques for determining a six degrees of freedom (6-DoF and/or 6DOF) pose and relative dimensions of an object from an image using one or more neural networks. A 6-DoF pose may refer to a three-dimensional (3D) position and orientation of an object. 
Relative dimensions of an object may refer to relative dimensions of a 3D bounding cuboid of the object, and can be indicated by a ratio of width to height to length of the 3D bounding cuboid. A system for object pose estimation may calculate a 6-DoF pose and relative dimensions of an object from an image depicting the object.”; [0061]: “In an embodiment, a system for object pose estimation obtains an RGB (red-green-blue) image depicting an object of a particular category…The system may utilize a neural network to extract features from the image, and calculate various outputs based at least in part on the extracted features. In some embodiments, the outputs include indications of a center of a bounding box of the object, indications of a size of the bounding box, indications of vertices of a bounding cuboid of the object, and relative dimensions of the bounding cuboid. The system may decode one or more of the outputs, and utilize a perspective-n-point (PnP) algorithm to calculate a 6-DoF pose and relative dimensions of the object. The system can be trained using various training images depicting objects with annotations indicating 6-DoF poses and relative dimensions of the objects.”; [0063]: “In various embodiments, the system performs category-level pose estimation, which comprises inferring poses and relative sizes of all objects within a specific category using an RGB image processed by a single neural network.”). 
Therefore, it would have been obvious to someone of ordinary skill in the art before the effective filing date of the claimed invention to have modified Turpin to incorporate the teachings of Birchfield and provide a means wherein a 3D representation of one or more polyhedrons is based on a pose estimate and size estimate for each polyhedron of the one or more polyhedrons, with a reasonable expectation of success, in order to take into consideration the pose and size of an object when creating information to depict the object to provide a more complete picture.

Regarding claim 12, modified Turpin teaches the limitations of claim 11. Turpin fails to explicitly teach wherein the pose estimate is a six degree of freedom pose estimate. However, Birchfield teaches a method and system for determining a pose and relative dimensions of an object from an image wherein a pose estimate is a six degree of freedom pose estimate (see at least [0060]: “Techniques and systems described herein relate to techniques for determining a six degrees of freedom (6-DoF and/or 6DOF) pose and relative dimensions of an object from an image using one or more neural networks. A 6-DoF pose may refer to a three-dimensional (3D) position and orientation of an object. Relative dimensions of an object may refer to relative dimensions of a 3D bounding cuboid of the object, and can be indicated by a ratio of width to height to length of the 3D bounding cuboid. A system for object pose estimation may calculate a 6-DoF pose and relative dimensions of an object from an image depicting the object.”). 
Therefore, it would have been obvious to someone of ordinary skill in the art before the effective filing date of the claimed invention to have modified Turpin to incorporate the teachings of Birchfield and provide a means wherein a pose estimate is a six degree of freedom pose estimate, with a reasonable expectation of success, in order to take into consideration the degrees of freedom of the pose when creating information to depict the object to provide a more complete picture.

Regarding claim 13, modified Turpin teaches the limitations of claim 11. Turpin further teaches wherein each polyhedron in the set of polyhedrons is a cuboid (see at least Figs. 8A-8C and [0033]: “FIG. 8C depicts a robot 30a performing an order building task, in which the robot 30a places boxes 31 onto a pallet 33.”; [0050]: “To ensure that the entire width of the entire stack is considered when identifying boxes in the stack to, for example, determine a next box to pick, the perception mast may be rotated from left to right (or right to left), and while the perception mast is moving (or during short pauses between movements) the perception modules may capture information for multiple points in space that collectively cover the entire width of the stack of boxes. In some embodiments, the captured information may be stitched together into a single image that is provided to a trained box detection model (or other trained model depending on the particular task being performed by the robot”; [0033]: “FIG. 8C depicts a robot 30a performing an order building task, in which the robot 30a places boxes 31 onto a pallet 33. In FIG. 8C, the pallet 33 is disposed on top of an autonomous mobile robot (AMR) 34, but it should be appreciated that the capabilities of the robot 30a described in this example apply to building pallets not associated with an AMR. In this task, the robot 30a picks boxes 31 disposed above, below, or within shelving 35 of the warehouse and places the boxes on the pallet 33. 
Certain box positions and orientations relative to the shelving may suggest different box picking strategies. For example, a box located on a low shelf may simply be picked by the robot by grasping a top surface of the box with the end effector of the robotic arm (thereby executing a “top pick”). However, if the box to be picked is on top of a stack of boxes, and there is limited clearance between the top of the box and the bottom of a horizontal divider of the shelving, the robot may opt to pick the box by grasping a side surface (thereby executing a “face pick”).”).

Regarding claim 14, modified Turpin teaches the limitations of claim 13. Turpin further teaches wherein the size information includes a depth dimension of the cuboid (see at least [0045]: “Perception module 242 also includes depth sensor 330 configured to capture depth information related to objects in the environment. Examples of depth sensor 330 include, but are not limited to, a stereoscopic camera, a time-of-flight camera, LiDAR, or any other depth sensor configured to capture depth information about the environment. In one embodiment, perception module 242 includes two LED-based light sources 310, an RGB monocular camera 320 and a time-of-flight camera 330.”; [0051]: “FIG. 5 illustrates a process 500 for combining information captured from a perception module that includes an RGB monocular camera and a time-of-flight (TOF) depth sensor to determine one or more characteristics of objects in the environment. In act 510, an RGB image is captured from the RGB monocular camera and depth information is captured by the TOF depth sensor in the perception module. Process 500 then proceeds to act 512, where the RGB image and the depth information is combined to generate an RGBD image. 
The RGBD image may be conceptualized as a high-fidelity colorized 3D point cloud, which includes both color appearance as well as depth data and 3D geometric structure of objects in the environment.”; [0052]: “Process 500 then proceeds to act 514, where one or more characteristics of objects in the environment are determined based on the RGBD image generated in act 512. In some embodiments, the RGBD image is provided as input to a trained statistical model (e.g., a machine learning model) that has been trained to identify the one or more characteristics. For instance, in the box picking example, the statistical model may be trained to recognize surfaces (e.g., faces) of boxes arranged in a stack. In another example, the statistical model may be trained to recognize other object characteristics such as the shape of signs, a category or type of object in the path of motion of the robot, or any other characteristic of one or more objects in the environment. Any suitable type of trained statistical model may be used to process an RGBD image and output one or more characteristics of object(s) in the environment.”). Turpin fails to explicitly teach wherein the size estimate includes a depth dimension, a width dimension, and a height dimension of the cuboid. However, Birchfield teaches a method and system for determining a pose and relative dimensions of an object from an image wherein a size estimate includes a depth dimension, a width dimension, and a height dimension of the cuboid (see at least Fig. 3 and [0082]: “A 2D bounding box size 114 may be a collection of data that indicates width and height values of an object bounding box. A 2D bounding box size 114 may comprise two sets of data, in which each set of data comprises a value for each pixel of an input image 104. A first set of data may comprise width values and a second set of data may comprise height values. 
A system for object pose estimation 102 may utilize a 2D bounding box size 114 to determine a size of a bounding box of an object depicted in an input image 104. In some embodiments, a system for object for pose estimation 102 (e.g., via a 2D keypoint output decoding 128) determines pixel coordinates of an input image 104 that correspond to a center of an object, and utilizes width and height values from a 2D bounding box size 114 corresponding to the pixel coordinates to determine a size of a bounding box of the object.”; [0133]: “The 6DOF pose may be indicated by one or more values that correspond to an x-axis position, a y-axis position, a z-axis position, a roll-axis angle, a pitch-axis angle, and/or a yaw-axis angle of the object. The 6DOF pose may be indicated by a bounding cuboid (e.g., coordinates of vertices of the bounding cuboid within the image). The 6DOF pose may be relative to a position and/or orientation of a camera (e.g., a camera that captured the image), or any suitable reference point or plane. The one or more relative dimension values may be indicated by one or more values that correspond to a ratio of width to height to length of a bounding cuboid of the object.”; [0134]: “In some embodiments, a system (e.g., a system for object pose estimation) calculates an absolute scale based at least in part on the one or more relative dimension values. Absolute scale of an object, also referred to as absolute dimensions, may refer to dimensions of the object in the real-world. In some examples, an absolute scale of an object corresponds to dimensions of a bounding cuboid of the object in the real world. 
The system may obtain depth information to calculate an absolute scale of the object based on the one or more relative dimension values.”; [0135]: “For example, the system obtains depth information from an image and/or video capturing device that captured the image, or other system (e.g., a depth sensor) associated with the device, and uses the depth information to scale the one or more relative dimension values to calculate an absolute scale of the object. In various embodiments, an image and/or video capturing device that captured the image is a stereo device (e.g., a stereo camera), in which the system utilizes depth information determined based on one or more images captured by the device to scale the one or more relative dimension values to calculate an absolute scale of the object. The system may calculate an absolute scale of the object in any suitable manner.”). Therefore, it would have been obvious to someone of ordinary skill in the art before the effective filing date of the claimed invention to have modified Turpin to incorporate the teachings of Birchfield and provide a means wherein the size estimate includes a depth dimension, a width dimension, and a height dimension of the cuboid, with a reasonable expectation of success, in order to take into consideration the depth, width, and height of an object when creating information to depict the object to provide a more complete picture.

Conclusion

THIS ACTION IS MADE FINAL. Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a). A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action. 
In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any extension fee pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action. In no event, however, will the statutory period for reply expire later than SIX MONTHS from the mailing date of this final action. Any inquiry concerning this communication or earlier communications from the examiner should be directed to TIEN MINH LE whose telephone number is (571)272-3903. The examiner can normally be reached Monday to Friday (8:30am-5:30pm eastern time). Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice. If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Khoi Tran can be reached on (571)272-6919. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300. Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). 
If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.

/T.M.L./
Examiner, Art Unit 3656

/KHOI H TRAN/
Supervisory Patent Examiner, Art Unit 3656

Prosecution Timeline

Jun 28, 2024: Application Filed
Oct 24, 2025: Non-Final Rejection mailed (§103)
Jan 22, 2026: Applicant Interview (Telephonic)
Jan 22, 2026: Examiner Interview Summary
Jan 26, 2026: Response Filed
Apr 16, 2026: Final Rejection mailed (§103, current)

Precedent Cases

Applications granted by this same examiner with similar technology

Patent 12629833: TRAJECTORY PLANNING SYSTEMS AND METHODS (3y 4m to grant; granted May 19, 2026)
Patent 12617439: Impact Load Detection (4y 4m to grant; granted May 05, 2026)
Patent 12608023: Article Transport Facility (1y 6m to grant; granted Apr 21, 2026)
Patent 12566070: DETERMINATION APPARATUS AND DETERMINATION METHOD (2y 3m to grant; granted Mar 03, 2026)
Patent 12528325: A CONTROL SYSTEM FOR A VEHICLE (4y 9m to grant; granted Jan 20, 2026)
Study what changed to get past this examiner. Based on 5 most recent grants.

Prosecution Projections

Expected OA Rounds: 3-4
Grant Probability: 68%
With Interview: 92% (+23.8%)
Median Time to Grant: 2y 10m (~11m remaining)
PTA Risk: Moderate
Based on 85 resolved cases by this examiner. Grant probability derived from career allowance rate.