Last updated: May 29, 2026
Application No. 19/008,102
SYSTEM AND METHOD FOR PICK POSE ESTIMATION FOR ROBOTIC PICKING WITH ARBITRARILY SIZED END EFFECTORS

Non-Final OA §103
Filed
Jan 02, 2025
Priority
Jan 29, 2024 — EU 24154465.9
Examiner
NELESKI, ELIZABETH ROSE
Art Unit
3658
Tech Center
3600 — Transportation & Electronic Commerce
Assignee
Siemens Aktiengesellschaft
OA Round
1 (Non-Final)
Interview Optional

Examiner has a relatively high allowance rate (74%) and a +16.3% interview lift. A written response may suffice.
Based on 99 resolved cases, 2023–2026
Examiner Intelligence

NELESKI, ELIZABETH ROSE
Grants 74% — above average
Career Allowance Rate
73 granted / 99 resolved
+21.7% vs TC avg
Strong +16% interview lift
Interview Lift: +16.3% (resolved cases with interview vs. without)
Typical timeline
3y 0m
Avg Prosecution
18 currently pending
Career history
120
Total Applications
across all art units
Statute-Specific Performance

§103
90.0%
+50.0% vs TC avg
§102
9.2%
-30.8% vs TC avg
§112
0.8%
-39.2% vs TC avg
Based on career data from 99 resolved cases
Office Action

§103
DETAILED ACTION
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis (i.e., changing from AIA to pre-AIA) for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
	
Joint Inventors
This application currently names joint inventors. In considering patentability of the claims the examiner presumes that the subject matter of the various claims was commonly owned as of the effective filing date of the claimed invention(s) absent any evidence to the contrary. Applicant is advised of the obligation under 37 CFR 1.56 to point out the inventor and effective filing dates of each claim that was not commonly owned as of the effective filing date of the later invention in order for the examiner to consider the applicability of 35 U.S.C. 102(b)(2)(C) for any potential 35 U.S.C. 102(a)(2) prior art against the later invention.

Priority
Acknowledgement is made of applicant’s claim for foreign priority under 35 USC 119 (a)-(d) to application EP24154465.9 filed 01/29/2024. Receipt is acknowledged of certified copies of papers required by 37 CFR 1.55. As such, the effective filing date of the application is 01/29/2024.
	
Status of Claims
Claims 1-15 are now pending.
	
Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

The factual inquiries for establishing a background for determining obviousness under 35 U.S.C. 103 are summarized as follows:
1. Determining the scope and contents of the prior art.
2. Ascertaining the differences between the prior art and the claims at issue.
3. Resolving the level of ordinary skill in the pertinent art.
4. Considering objective evidence present in the application indicating obviousness or nonobviousness.
Claims 1-5 and 8-15 are rejected under 35 U.S.C. 103 as being unpatentable over Aiglstorfer et al. (US 20210047115 A1), hereinafter Aiglstorfer, in view of Stoppi et al. (US 20220410381 A1), hereinafter Stoppi.
Regarding claim 1, Aiglstorfer discloses:
A method for robotic picking of objects, comprising: 
acquiring, via an imaging system, one or more images of a scene, the scene including one or more objects (see at least [0061]: “Accordingly, the cameras 8a, 8b detect a three-dimensional image of the inside of at least the containers 9a and 9b and the goods stored therein.”)
performing, by a computing system comprising one or more processors: 
estimating a pick point on an object… from the one or more objects, selected to be picked by an end effector of a robot (see at least [0061]: “sensor system 8a, 8b, which serves for the detection of a gripping surface pose and a gripping surface size of a gripping surface of a good…”)
estimating a pick pose for the end effector (see at least [0070]: “Accordingly, the controller 11 calculates a gripping pose for the movable robot head 4 in which the suction gripper 5a is in contact with the gripping surface 24f.”)
wherein the end effector defines an oblong footprint of contact, which is modeled as a 2D shape with specified dimensions (see at least Figures 4-7.)
the estimation comprising: 
determining a pick surface utilizing neighboring points around the pick point (see at least [0070]: “a gripping surface pose and a gripping surface size of a gripping surface of a good…”)
reprojecting a set of points, which define an extent of the pick surface, with respect to a normal of the pick surface, to create a planar representation of the pick surface (see at least Figures 8 and 9 which depict planar object surfaces.)
and computing a yaw-orientation based on alignment of a longer dimension of the end effector model with a longer dimension of the planar representation of the pick surface (see at least Figures 8 and 9, wherein one side of the end-effector, out of three equal sides, is aligned with the longer dimension of the object. See further [0062]: “…calculating from the determined gripping surface pose and gripping surface size a gripping pose for the movable robot head 4 in which a maximum number of suction grippers 5 is in contact with the gripping surface of the mentioned goods…”)
and outputting the estimated pick pose to a controller configured to control the end effector to pick the selected object (see at least [0062]: “Moreover, the controller 11 serves for moving the robot head 5 into the calculated gripping pose and for activating the suction grippers 5 that are in contact with or are coming into contact with the gripping surface of the mentioned goods, so as to collect the goods.”)
Aiglstorfer does not disclose, but Stoppi, in an analogous field of endeavor teaches: 
An object mask, the object mask produced by performing instance segmentation based on the one or more images, the object mask corresponding to an object, from the one or more objects (See at least [0015]: “In more detail, continuing the above example where the main camera 10 is a stereo array camera, the instance segmentation mask and the depth map may be computed based on images (e.g., color images, polarization raw frames, and the like) captured by the same main camera 10, and therefore are captured from the same viewpoint. As such, the instance segmentation mask may be used to directly identify the pixels (or 3-D points) of the depth map that correspond to the same instances of objects. For example, an instance segmentation mask may identify three different boxes and assign arbitrary example labels ‘A’, ‘B’, and ‘C’ to the pixels of the instance segmentation mask that correspond to the three different boxes in a captured 2-D image (e.g., a 2-D color intensity image). Each pixel of a depth map encodes a distance between the camera and a depicted surface of scene 1, but does not encode information about which points of the depth map correspond to different objects (e.g., the depth map may encode the shapes of the different boxes in the scene, but does not encode semantic information such as whether the boxes are parts of a same object or multiple different objects). When the depth map and the instance segmentation mask depict the same scene from the same viewpoint, then each portion of the instance segmentation mask assigns an object instance to corresponding portions of the depth map, thereby labeling the points of the depth map with semantic information about the scene, such as the class of object and an instance identifier (e.g., labeling depth points corresponding to the shapes of three boxes in the scene with different labels such as ‘A’, ‘B’, and ‘C’).”)
It would have been prima facie obvious for one of ordinary skill in the art before the effective filing date of the claimed invention, with a reasonable expectation for success, to combine the invention of Aiglstorfer with the object masks as taught by Stoppi. This is because both Aiglstorfer and Stoppi are directed to object picking. Paragraph [0004] of Stoppi’s disclosure states: “Generalizing to different objects and different environments is especially hard for vision systems that are often used to guide the bin-picking process. These vision systems typically capture images of a scene using cameras and analyze the captured images to estimate the location and orientation of the objects to pick. This is a challenging task, especially in the case of unknown environments (including varying lighting conditions) and irregular conditions and orientations of objects in the bin.” Stoppi’s invention is directed to solving this issue. One of ordinary skill in the art would be reasonably drawn to the improvements provided.
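
For orientation, the yaw-orientation limitation of claim 1 (aligning the longer dimension of the end effector model with the longer dimension of the planar representation of the pick surface) can be pictured with a minimal sketch. It assumes the pick surface has already been reprojected into 2D plane coordinates and that the end effector's longer dimension lies along its own local x-axis. The function name `yaw_from_planar_points` and the choice of principal component analysis are illustrative assumptions; neither the application nor the cited references are quoted here as using this technique.

```python
import math

def yaw_from_planar_points(points_2d):
    """Estimate the direction of the longer dimension of a 2D point set.

    Hypothetical helper: returns the angle (radians, in [-pi/2, pi/2]) of the
    principal axis of the points, which can serve as a yaw for an oblong
    end effector whose long side lies along its local x-axis.
    """
    n = len(points_2d)
    mean_x = sum(p[0] for p in points_2d) / n
    mean_y = sum(p[1] for p in points_2d) / n
    # Entries of the 2x2 covariance matrix of the planar points.
    sxx = sum((p[0] - mean_x) ** 2 for p in points_2d) / n
    syy = sum((p[1] - mean_y) ** 2 for p in points_2d) / n
    sxy = sum((p[0] - mean_x) * (p[1] - mean_y) for p in points_2d) / n
    # Closed-form orientation of the dominant eigenvector.
    return 0.5 * math.atan2(2.0 * sxy, sxx - syy)
```

A surface that is nearly square has no well-defined longer dimension, so a practical implementation would likely need a tie-breaking rule that this sketch omits.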

	Regarding claim 2, the combination of Aiglstorfer and Stoppi teaches the method according to claim 1. Aiglstorfer further discloses:
	wherein the end effector comprises an array of gripping elements modeled as a rectangular shape of specified length and width (see at least Figure 6.)

	Regarding claim 3, the combination of Aiglstorfer and Stoppi teaches the method according to claim 2.
	Aiglstorfer does not explicitly disclose, but Stoppi, in an analogous field of endeavor teaches:
	wherein the estimated pick point is computed using a grasp neural network based on the one or more images to determine an optimal grasping location on the object mask for a single gripping element (see at least [0131]: “In some embodiments where a pickability score is computed based on 2-D images (e.g., an instance segmentation mask) alone, the object selector 150 may further include an object pickability network (or object pickability neural network) that is trained to compute object pickability scores based on a supplied instance segmentation mask. For example, in some embodiments the object pickability network is trained to classify each instance of an object into one of a plurality of different pickability scores (e.g., “low pickability,” “moderate pickability,” and “high pickability”) or may be configured to compute a numerical pickability score (e.g., a normalized value from 0.0 to 1.0, where 0.0 represents a “difficult” pickability score and 1.0 represents an “easy” pickability score).”)
It would have been prima facie obvious for one of ordinary skill in the art before the effective filing date of the claimed invention, with a reasonable expectation for success, to combine the invention of Aiglstorfer with the object masks as taught by Stoppi. This is because both Aiglstorfer and Stoppi are directed to object picking. Paragraph [0004] of Stoppi’s disclosure states: “Generalizing to different objects and different environments is especially hard for vision systems that are often used to guide the bin-picking process. These vision systems typically capture images of a scene using cameras and analyze the captured images to estimate the location and orientation of the objects to pick. This is a challenging task, especially in the case of unknown environments (including varying lighting conditions) and irregular conditions and orientations of objects in the bin.” Stoppi’s invention is directed to solving this issue. One of ordinary skill in the art would be reasonably drawn to the improvements provided.

	Regarding claim 4, the combination of Aiglstorfer and Stoppi teaches the method according to claim 1. 
	Aiglstorfer does not explicitly disclose, but Stoppi, in an analogous field of endeavor teaches:
	wherein the object mask is produced by: 
computing one or more instance segmentation masks detecting the one or more objects in the scene based on the one or more images, wherein each instance segmentation mask comprises a set of pixels that denote a particular object (see at least [0115]: “In more detail, continuing the above example where the main camera 10 is a stereo array camera, the instance segmentation mask and the depth map may be computed based on images (e.g., color images, polarization raw frames, and the like) captured by the same main camera 10, and therefore are captured from the same viewpoint. As such, the instance segmentation mask may be used to directly identify the pixels (or 3-D points) of the depth map that correspond to the same instances of objects.”)
using the one or more instance segmentation masks for segmenting a depth map of the scene obtained from the one or more images, to therefrom produce a point cloud representation of the selected object (See at least [0120]: “For the sake of discussion, all of the objects x detected in a scene (e.g., detected by the instance segmentation module 110 in operation 330) may be represented by a set X. For each object x in the set, x.sub.mask is the set of pixels that belong to object x, as identified by the instance segmentation mask. In addition, x.sub.cloud is a point cloud of 3-D points (e.g., having x, y, and z coordinates) selected from the points in a depth map, as filtered or extracted from the depth map by the portion of the instance segmentation mask x.sub.mask corresponding to that object x. The current location (e.g., 3-D coordinates) of the end effector 26 or gripper of the robot arm 24 may be represented by R.”)
It would have been prima facie obvious for one of ordinary skill in the art before the effective filing date of the claimed invention, with a reasonable expectation for success, to combine the invention of Aiglstorfer with the object masks as taught by Stoppi. This is because both Aiglstorfer and Stoppi are directed to object picking. Paragraph [0004] of Stoppi’s disclosure states: “Generalizing to different objects and different environments is especially hard for vision systems that are often used to guide the bin-picking process. These vision systems typically capture images of a scene using cameras and analyze the captured images to estimate the location and orientation of the objects to pick. This is a challenging task, especially in the case of unknown environments (including varying lighting conditions) and irregular conditions and orientations of objects in the bin.” Stoppi’s invention is directed to solving this issue. One of ordinary skill in the art would be reasonably drawn to the improvements provided.
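
The claim 4 step of using an instance segmentation mask to segment a depth map into a point cloud of the selected object can be sketched as follows. This is a minimal illustration that assumes a pinhole camera model with known intrinsics and a mask registered to the same viewpoint as the depth map (the same-viewpoint case described in the quoted Stoppi passage). The function name `masked_point_cloud` and the parameters `fx`, `fy`, `cx`, `cy` are hypothetical and do not come from the application or from Stoppi.

```python
def masked_point_cloud(depth, mask, fx, fy, cx, cy):
    """Back-project the masked pixels of a depth map into 3D camera coordinates.

    depth: 2D list of depth values, indexed [row][col].
    mask:  2D list of booleans of the same shape; True marks the object.
    fx, fy, cx, cy: assumed pinhole intrinsics (focal lengths, principal point).
    """
    cloud = []
    for v, (depth_row, mask_row) in enumerate(zip(depth, mask)):
        for u, (z, inside) in enumerate(zip(depth_row, mask_row)):
            if inside and z > 0:  # skip pixels outside the mask or without depth
                x = (u - cx) * z / fx
                y = (v - cy) * z / fy
                cloud.append((x, y, z))
    return cloud
```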

Regarding claim 5, the combination of Aiglstorfer and Stoppi teaches the method according to claim 1.
	Aiglstorfer does not explicitly disclose, but Stoppi, in an analogous field of endeavor teaches: 
wherein the scene includes multiple objects, and wherein the method comprises selecting the object, from the multiple objects, by determining a pickability measure of the object masks corresponding to each of the multiple objects to ensure that the selected object to be picked is not occluded (see at least [0118]: “In operation 370, the object selector 150 of the object pick estimator 100 selects an object to be picked based on object pickability scores computed for the one or more objects detected through operation 330 and/or operation 350. In particular, the object pickability scores represent an estimated ease of picking up an object using a particular robotic arm 24 with a particular end effector 26 (e.g., an end effector with gripping fingers or an end effector with a suction gripper). In the discussion herein, a high pickability score indicates an object that is more easily pickable, while a low pickability score indicates an object that is more difficult to pick. For example, objects that are underneath or blocked by objects above them may be effectively unpickable in the current state of the bin and would have very low pickability scores, whereas objects that are on top of a pile and that are laterally spaced from other objects may have high pickability scores.”)
It would have been prima facie obvious for one of ordinary skill in the art before the effective filing date of the claimed invention, with a reasonable expectation for success, to combine the invention of Aiglstorfer with the object masks as taught by Stoppi. This is because both Aiglstorfer and Stoppi are directed to object picking. Paragraph [0004] of Stoppi’s disclosure states: “Generalizing to different objects and different environments is especially hard for vision systems that are often used to guide the bin-picking process. These vision systems typically capture images of a scene using cameras and analyze the captured images to estimate the location and orientation of the objects to pick. This is a challenging task, especially in the case of unknown environments (including varying lighting conditions) and irregular conditions and orientations of objects in the bin.” Stoppi’s invention is directed to solving this issue. One of ordinary skill in the art would be reasonably drawn to the improvements provided.
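
The selection step discussed for claim 5 can be sketched in a few lines. The example assumes each detected object already carries a normalized pickability score between 0.0 and 1.0, as in the numerical score described in the Stoppi passage quoted for claim 3. The function name `select_object` and the `min_score` threshold are hypothetical; how the score itself is computed is not modeled.

```python
def select_object(pickability_by_object, min_score=0.5):
    """Return the id of the most pickable object, or None if none qualifies.

    pickability_by_object: dict mapping an object id to a score in [0.0, 1.0].
    min_score: assumed cut-off below which an object is treated as occluded
               or otherwise unpickable in the current state of the scene.
    """
    candidates = {
        obj: score
        for obj, score in pickability_by_object.items()
        if score >= min_score
    }
    if not candidates:
        return None
    return max(candidates, key=candidates.get)
```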

Regarding claim 8, the combination of Aiglstorfer and Stoppi teaches the method according to claim 1.
	Aiglstorfer does not explicitly disclose, but Stoppi, in an analogous field of endeavor teaches: 
wherein creating the planar representation of the pick surface comprises: projecting the set of points in the object mask into a depth map, and rotating the points in the depth map with respect to the normal of the pick surface to produce a 2D image with a viewing direction perpendicular to the pick surface (see at least [0157]: “In one embodiment, instance segmentation masks delineating the various object instances are also be generated. Each instance segmentation mask may be a 2-D image having the same dimensions as the input image, where the value of each pixel may correspond to a label (e.g. a particular instance of the object depicted by the pixel). A different instance segmentation mask may be generated for different images depicting different viewpoints of the objects of interest. For example, a first segmentation mask may be generated to depict object instances in a first image captured by a first camera, and a second segmentation mask may be generated to depict object instances in a second image captured by a second camera. As convolutional neural network such as, for example, Mask R-CNN, may be employed for generating the instance segmentation masks.”)
It would have been prima facie obvious for one of ordinary skill in the art before the effective filing date of the claimed invention, with a reasonable expectation for success, to combine the invention of Aiglstorfer with the object masks as taught by Stoppi. This is because both Aiglstorfer and Stoppi are directed to object picking. Paragraph [0004] of Stoppi’s disclosure states: “Generalizing to different objects and different environments is especially hard for vision systems that are often used to guide the bin-picking process. These vision systems typically capture images of a scene using cameras and analyze the captured images to estimate the location and orientation of the objects to pick. This is a challenging task, especially in the case of unknown environments (including varying lighting conditions) and irregular conditions and orientations of objects in the bin.” Stoppi’s invention is directed to solving this issue. One of ordinary skill in the art would be reasonably drawn to the improvements provided.
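
The claim 8 limitation of rotating points with respect to the surface normal, so that the viewing direction is perpendicular to the pick surface, can be illustrated with Rodrigues' rotation formula. This is a sketch under stated assumptions: the points are given in camera coordinates, the viewing axis is +z, and the result is returned as 2D coordinates rather than a rasterized image. The function name `rotate_to_face_view` is hypothetical and the technique is not attributed to the application or to either reference.

```python
import math

def _cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def _dot(a, b):
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]

def rotate_to_face_view(points, normal):
    """Rotate 3D points so `normal` maps onto the +z axis, then drop z."""
    length = math.sqrt(_dot(normal, normal))
    n = tuple(c / length for c in normal)
    z_axis = (0.0, 0.0, 1.0)
    axis = _cross(n, z_axis)
    sin_t = math.sqrt(_dot(axis, axis))
    cos_t = _dot(n, z_axis)
    if sin_t < 1e-12:
        # Normal is parallel or anti-parallel to z: no unique rotation axis.
        if cos_t > 0:
            return [(p[0], p[1]) for p in points]
        return [(p[0], -p[1]) for p in points]  # half turn about the x-axis
    k = tuple(c / sin_t for c in axis)  # unit rotation axis
    flat = []
    for p in points:
        k_cross_p = _cross(k, p)
        k_dot_p = _dot(k, p)
        # Rodrigues: p*cos + (k x p)*sin + k*(k.p)*(1 - cos)
        rotated = tuple(
            p[i] * cos_t + k_cross_p[i] * sin_t + k[i] * k_dot_p * (1.0 - cos_t)
            for i in range(3)
        )
        flat.append((rotated[0], rotated[1]))
    return flat
```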

Regarding claim 9, the combination of Aiglstorfer and Stoppi teaches the method according to claim 8.
Aiglstorfer does not explicitly disclose, but Stoppi teaches wherein creating the planar representation of the pick surface further comprises processing the 2D image to generate a contour representing an outline of the pick surface (see at least [0164]: “At block 430, the matching algorithm identifies features of a first object instance in a first segmentation mask. The identified features for the first object instance may include a shape of the region of the object instance, a feature vector in the region, and/or keypoint predictions in the region. The shape of the region for the first object instance may be represented via a set of points sampled along the contours of the region. Where a feature vector in the region is used as the feature descriptor, the feature vector may be an average deep learning feature vector extracted via a convolutional neural network.”)
It would have been prima facie obvious for one of ordinary skill in the art before the effective filing date of the claimed invention, with a reasonable expectation for success, to combine the invention of Aiglstorfer with the object masks as taught by Stoppi. This is because both Aiglstorfer and Stoppi are directed to object picking. Paragraph [0004] of Stoppi’s disclosure states: “Generalizing to different objects and different environments is especially hard for vision systems that are often used to guide the bin-picking process. These vision systems typically capture images of a scene using cameras and analyze the captured images to estimate the location and orientation of the objects to pick. This is a challenging task, especially in the case of unknown environments (including varying lighting conditions) and irregular conditions and orientations of objects in the bin.” Stoppi’s invention is directed to solving this issue. One of ordinary skill in the art would be reasonably drawn to the improvements provided.

Regarding claim 10, the combination of Aiglstorfer and Stoppi teaches the method according to claim 9.
Aiglstorfer does not explicitly disclose, but Stoppi teaches: 
wherein the contour is generated from the reprojected points by infilling, or inpainting, or opening operation, or combinations thereof (see at least [0116]: “In addition, in embodiments where the depth map and the instance segmentation mask are captured from different viewpoints, the object pick estimator 100 re-projects the points in the point cloud of the depth map to the viewpoint of the images used to compute the instance segmentation mask (in accordance with extrinsic camera parameters of the cameras used to capture the images used to capture the instance segmentation mask and the images used to generate the depth map), thereby enabling the instance segmentation mask to identify which points of the depth map correspond to the different detected instances of objects.”)
It would have been prima facie obvious for one of ordinary skill in the art before the effective filing date of the claimed invention, with a reasonable expectation for success, to combine the invention of Aiglstorfer with the object masks as taught by Stoppi. This is because both Aiglstorfer and Stoppi are directed to object picking. Paragraph [0004] of Stoppi’s disclosure states: “Generalizing to different objects and different environments is especially hard for vision systems that are often used to guide the bin-picking process. These vision systems typically capture images of a scene using cameras and analyze the captured images to estimate the location and orientation of the objects to pick. This is a challenging task, especially in the case of unknown environments (including varying lighting conditions) and irregular conditions and orientations of objects in the bin.” Stoppi’s invention is directed to solving this issue. One of ordinary skill in the art would be reasonably drawn to the improvements provided.
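
Of the operations recited in claim 10, the opening operation is the easiest to show concretely. A morphological opening is an erosion followed by a dilation, which removes specks smaller than the structuring element while roughly preserving larger regions. The sketch below assumes a binary image stored as a list of rows and a fixed 3x3 structuring element; the function names are illustrative and are not taken from the application or from Stoppi.

```python
_OFFSETS = [(dr, dc) for dr in (-1, 0, 1) for dc in (-1, 0, 1)]

def _erode(img):
    """Keep a pixel only if its whole 3x3 neighbourhood is set (border counts as 0)."""
    h, w = len(img), len(img[0])
    return [
        [
            int(all(
                0 <= r + dr < h and 0 <= c + dc < w and img[r + dr][c + dc]
                for dr, dc in _OFFSETS
            ))
            for c in range(w)
        ]
        for r in range(h)
    ]

def _dilate(img):
    """Set a pixel if any pixel in its 3x3 neighbourhood is set."""
    h, w = len(img), len(img[0])
    return [
        [
            int(any(
                0 <= r + dr < h and 0 <= c + dc < w and img[r + dr][c + dc]
                for dr, dc in _OFFSETS
            ))
            for c in range(w)
        ]
        for r in range(h)
    ]

def opening(img):
    """Morphological opening: erosion followed by dilation."""
    return _dilate(_erode(img))
```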

Regarding claim 11, the combination of Aiglstorfer and Stoppi teaches the method according to claim 9.
Aiglstorfer does not explicitly disclose, but Stoppi teaches: 
wherein creating the planar representation of the pick surface further comprises fitting a primitive shape of minimum area that includes all points in the contour and therefrom estimating planar dimensions of the pick surface (see at least [0164]: “At block 430, the matching algorithm identifies features of a first object instance in a first segmentation mask. The identified features for the first object instance may include a shape of the region of the object instance, a feature vector in the region, and/or keypoint predictions in the region. The shape of the region for the first object instance may be represented via a set of points sampled along the contours of the region. Where a feature vector in the region is used as the feature descriptor, the feature vector may be an average deep learning feature vector extracted via a convolutional neural network.”)
It would have been prima facie obvious for one of ordinary skill in the art before the effective filing date of the claimed invention, with a reasonable expectation for success, to combine the invention of Aiglstorfer with the object masks as taught by Stoppi. This is because both Aiglstorfer and Stoppi are directed to object picking. Paragraph [0004] of Stoppi’s disclosure states: “Generalizing to different objects and different environments is especially hard for vision systems that are often used to guide the bin-picking process. These vision systems typically capture images of a scene using cameras and analyze the captured images to estimate the location and orientation of the objects to pick. This is a challenging task, especially in the case of unknown environments (including varying lighting conditions) and irregular conditions and orientations of objects in the bin.” Stoppi’s invention is directed to solving this issue. One of ordinary skill in the art would be reasonably drawn to the improvements provided.
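
The claim 11 limitation of fitting a primitive shape of minimum area that includes all contour points can be made concrete with a minimum-area enclosing rectangle. The sketch below assumes the primitive is a rectangle and uses the standard observation that a minimum-area enclosing rectangle has a side collinear with an edge of the convex hull. The function names and the return convention are hypothetical; the application may use a different primitive or algorithm.

```python
import math

def convex_hull(points):
    """Andrew's monotone chain; points are (x, y) tuples."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts

    def cross(o, a, b):
        return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

    lower, upper = [], []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    return lower[:-1] + upper[:-1]

def min_area_rectangle(points):
    """Return (length, width, angle) of the smallest enclosing rectangle.

    length >= width; angle (radians) is the direction of the length side.
    Assumes at least one input point.
    """
    hull = convex_hull(points)
    best = None
    for i in range(len(hull)):
        x0, y0 = hull[i]
        x1, y1 = hull[(i + 1) % len(hull)]
        theta = math.atan2(y1 - y0, x1 - x0)
        c, s = math.cos(theta), math.sin(theta)
        # Hull coordinates in a frame whose u-axis runs along this edge.
        us = [x * c + y * s for x, y in hull]
        vs = [-x * s + y * c for x, y in hull]
        du, dv = max(us) - min(us), max(vs) - min(vs)
        if best is None or du * dv < best[0]:
            if du >= dv:
                best = (du * dv, du, dv, theta)
            else:
                best = (du * dv, dv, du, theta + math.pi / 2)
    return best[1], best[2], best[3]
```

The returned length and width correspond to the estimated planar dimensions of the pick surface, and the angle is only meaningful modulo a half turn.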

Regarding claim 12, the combination of Aiglstorfer and Stoppi teaches the method according to claim 1.
Aiglstorfer further discloses wherein the method comprises outputting the estimated pick pose to the controller subject to determining a complete overlap between the aligned end effector model and the planar representation of the pick surface (see at least Figures 6-9.)
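
The complete-overlap condition of claim 12 can be reduced to a simple comparison in one special case. The sketch assumes that both the end effector model and the planar representation of the pick surface are rectangles, that they share a center, and that their longer dimensions are already aligned. The function name `fits_completely` and the optional `margin` parameter are hypothetical and not drawn from the application or from Aiglstorfer.

```python
def fits_completely(effector_length, effector_width,
                    surface_length, surface_width, margin=0.0):
    """True if the aligned, centered effector rectangle lies within the surface.

    margin: assumed clearance required on every side, in the same units.
    """
    eff_long = max(effector_length, effector_width)
    eff_short = min(effector_length, effector_width)
    surf_long = max(surface_length, surface_width)
    surf_short = min(surface_length, surface_width)
    return (eff_long + 2 * margin <= surf_long
            and eff_short + 2 * margin <= surf_short)
```

For non-rectangular surfaces or an off-center placement, a polygon containment test would be needed instead.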

Regarding claim 13, the combination of Aiglstorfer and Stoppi teaches the method according to claim 1.
Aiglstorfer further discloses wherein the pick pose outputted to the controller is defined by: position coordinates defining a center of the end effector determined based on said alignment of the end effector model, a normal vector of the pick surface and the yaw-orientation defining an angular orientation of the end effector in the plane of the pick surface (see at least [0062]: “FIG. 3 shows a storage and order-picking system 12 comprising a building 13 as well as a gate at an incoming goods department 14 and a gate at an outgoing goods department 15. In the region of the incoming goods department 14, there is a first conveying path 16 and two second conveying paths 17a and 17b. The first conveying path 16 connects the incoming goods department 14 to the robot system 2c. The two second conveying paths 17a and 17b connect the robot system 2c to a storage area 18, which comprises multiple storage racks 19 as well as storage-and-retrieval units 20a and 20b, which move in rack aisles running between the storage racks 19. At that end of the rack aisles, which is opposite to the second conveying paths 17a and 17b, there is a third conveying path 21, which is designed annularly and leads to a further robot system 2d in the present example. A fourth conveying path 22, which connects the robot system 2d to the outgoing goods department 15 in terms of conveyor technology, is arranged in the operating range of the robot system 2d as well.”)
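
Claim 13 defines the output pick pose by three quantities: position coordinates of the end effector center, a normal vector of the pick surface, and a yaw-orientation in the plane of the pick surface. A minimal data structure for such a pose might look as follows. The class name `PickPose`, the helper `make_pick_pose`, and the wrapping of yaw into a half-turn range (on the assumption that an oblong footprint is symmetric under a 180 degree rotation) are illustrative choices, not taken from the application.

```python
import math
from dataclasses import dataclass

@dataclass(frozen=True)
class PickPose:
    position: tuple  # (x, y, z) center of the end effector
    normal: tuple    # unit normal of the pick surface
    yaw: float       # radians, rotation within the pick-surface plane

def make_pick_pose(center, normal, yaw):
    """Build a PickPose with a normalized normal and a wrapped yaw."""
    length = math.sqrt(sum(c * c for c in normal))
    unit_normal = tuple(c / length for c in normal)
    # Assumption: the footprint looks the same after a half turn, so yaw
    # is wrapped into [-pi/2, pi/2).
    wrapped = (yaw + math.pi / 2) % math.pi - math.pi / 2
    return PickPose(tuple(center), unit_normal, wrapped)
```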

Regarding claim 14, the combination of Aiglstorfer and Stoppi teaches the method according to claim 1.
Aiglstorfer further discloses:
A non-transitory computer-readable storage medium including instructions that, when processed by one or more processors, configure the one or more processors to perform the method according to claim 1 (see at least Figures 1-3, which depict a robotic system. The use of processors and non-transitory computer-readable storage mediums is implied and would be known to one of ordinary skill in the art before the effective filing date of the claimed invention.)

Regarding claim 15, Aiglstorfer discloses:
An autonomous system for robotic picking, comprising: 
an imaging system configured to acquire one or more images of a scene, the scene including one or more objects (see at least [0061]: “Accordingly, the cameras 8a, 8b detect a three-dimensional image of the inside of at least the containers 9a and 9b and the goods stored therein.”)
a robot comprising an end effector controllable by a controller, one or more processors, and memory storing instructions executable by the one or more processors (see at least Fig. 1) to:
estimate a pick point on an object… from the one or more objects, selected to be picked by an end effector of a robot (see at least [0061]: “sensor system 8a, 8b, which serves for the detection of a gripping surface pose and a gripping surface size of a gripping surface of a good…”)
estimate a pick pose for the end effector (see at least [0070]: “Accordingly, the controller 11 calculates a gripping pose for the movable robot head 4 in which the suction gripper 5a is in contact with the gripping surface 24f.”)
wherein the end effector defines an oblong footprint of contact, which is modeled as a 2D shape with specified dimensions (see at least Figures 4-7.)
the estimation comprising: 
determine a pick surface utilizing neighboring points around the pick point (see at least [0070]: “a gripping surface pose and a gripping surface size of a gripping surface of a good…”)
reproject a set of points, which define an extent of the pick surface, with respect to a normal of the pick surface, to create a planar representation of the pick surface (see at least Figures 8 and 9 which depict planar object surfaces.)
and compute a yaw-orientation based on alignment of a longer dimension of the end effector model with a longer dimension of the planar representation of the pick surface (see at least Figures 8 and 9, wherein one side of the end-effector, out of three equal sides, is aligned with the longer dimension of the object. See further [0062]: “…calculating from the determined gripping surface pose and gripping surface size a gripping pose for the movable robot head 4 in which a maximum number of suction grippers 5 is in contact with the gripping surface of the mentioned goods…”)
and output the estimated pick pose to a controller configured to control the end effector to pick the selected object (see at least [0062]: “Moreover, the controller 11 serves for moving the robot head 5 into the calculated gripping pose and for activating the suction grippers 5 that are in contact with or are coming into contact with the gripping surface of the mentioned goods, so as to collect the goods.”)
Aiglstorfer does not disclose, but Stoppi, in an analogous field of endeavor teaches: 
An object mask, the object mask produced by performing instance segmentation based on the one or more images, the object mask corresponding to an object, from the one or more objects (See at least [0015]: “In more detail, continuing the above example where the main camera 10 is a stereo array camera, the instance segmentation mask and the depth map may be computed based on images (e.g., color images, polarization raw frames, and the like) captured by the same main camera 10, and therefore are captured from the same viewpoint. As such, the instance segmentation mask may be used to directly identify the pixels (or 3-D points) of the depth map that correspond to the same instances of objects. For example, an instance segmentation mask may identify three different boxes and assign arbitrary example labels ‘A’, ‘B’, and ‘C’ to the pixels of the instance segmentation mask that correspond to the three different boxes in a captured 2-D image (e.g., a 2-D color intensity image). Each pixel of a depth map encodes a distance between the camera and a depicted surface of scene 1, but does not encode information about which points of the depth map correspond to different objects (e.g., the depth map may encode the shapes of the different boxes in the scene, but does not encode semantic information such as whether the boxes are parts of a same object or multiple different objects). When the depth map and the instance segmentation mask depict the same scene from the same viewpoint, then each portion of the instance segmentation mask assigns an object instance to corresponding portions of the depth map, thereby labeling the points of the depth map with semantic information about the scene, such as the class of object and an instance identifier (e.g., labeling depth points corresponding to the shapes of three boxes in the scene with different labels such as ‘A’, ‘B’, and ‘C’).”)
It would have been prima facie obvious for one of ordinary skill in the art before the effective filing date of the claimed invention, with a reasonable expectation of success, to combine the invention of Aiglstorfer with the object masks as taught by Stoppi. This is because both Aiglstorfer and Stoppi are directed to object picking. Paragraph [0004] of Stoppi’s disclosure states: “Generalizing to different objects and different environments is especially hard for vision systems that are often used to guide the bin-picking process. These vision systems typically capture images of a scene using cameras and analyze the captured images to estimate the location and orientation of the objects to pick. This is a challenging task, especially in the case of unknown environments (including varying lighting conditions) and irregular conditions and orientations of objects in the bin.” Stoppi’s invention is directed to solving this issue. One of ordinary skill in the art would be reasonably drawn to the improvements provided.
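For illustration only, the labeling described in the quoted Stoppi passage [0015] (an instance segmentation mask assigning an object instance to each depth-map pixel seen from the same viewpoint) can be sketched as follows. This is a minimal sketch and is not drawn from the record or from either reference: the function name `label_depth_points`, the nested-list data layout, and the sample depth values are all assumptions.

```python
from collections import defaultdict


def label_depth_points(depth_map, instance_mask, background=None):
    """Group depth pixels by the instance label found at the same pixel.

    Hypothetical helper: both inputs are row-major nested lists of equal
    shape, captured from the same viewpoint. Pixels whose label equals
    `background` are skipped.
    """
    if len(depth_map) != len(instance_mask):
        raise ValueError("depth map and mask must have the same number of rows")
    points_by_instance = defaultdict(list)
    for row, (depth_row, mask_row) in enumerate(zip(depth_map, instance_mask)):
        if len(depth_row) != len(mask_row):
            raise ValueError("depth map and mask rows must have the same length")
        for col, (depth, label) in enumerate(zip(depth_row, mask_row)):
            if label == background:
                continue
            # Each depth sample inherits the instance label of its pixel.
            points_by_instance[label].append((row, col, depth))
    return dict(points_by_instance)


# Made-up 2x3 scene with two boxes labeled 'A' and 'B'.
depth_map = [
    [1.20, 1.21, 0.95],
    [1.19, 0.96, 0.94],
]
instance_mask = [
    ["A", "A", "B"],
    ["A", "B", "B"],
]
labeled = label_depth_points(depth_map, instance_mask)
```

Here `labeled["A"]` holds the three depth samples belonging to box A, which is the sense in which the mask adds per-object information that the depth map alone does not encode.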

Claim 6 is rejected under 35 U.S.C. 103 as being unpatentable over Aiglstorfer and Stoppi in view of Fan (US 20220072707 A1), hereinafter Fan.
Regarding claim 6, the combination of Aiglstorfer and Stoppi teaches the method of claim 2. 
Aiglstorfer does not explicitly disclose, but Fan, in an analogous field of endeavor, teaches:
wherein the number or reach of the neighboring points around the pick point in the object mask is determined based on a dimension of a single gripping element (see at least [0024]: “At box 150, the point cloud, the grasp pose, and the success rate from the physical environment simulation at the box 140 are recorded to form the grasping database. A point cloud depth image 152 depicts the pile of objects from the box 140 from a particular perspective or point of view. In a preferred embodiment, the depth image 152 is as viewed from the approach direction computed for a best grasp. From the image 152, the computations at the box 140 determined several candidate grasps which could be employed by the robot gripper. Each of the candidate grasps is represented by a grasp pose and point map 154—indicating points which could be used as a grasp target, with the angle of approach defined by the point of view of the image 152, and using a gripper angle and grip width computed at the box 140. Thus, the data stored at the box 150 includes the depth map from the desired approach angle, the point map 154 indicating the x/y/z coordinates of grasps including the best grasp, the gripper rotation angle and gripper width, and the grasp success rate from the physical environment simulation. The points in the point map 154 would be rank ordered in terms of grasp quality, and should result in a successful grasp of an object from the pile of objects in a bin.”)
It would have been prima facie obvious for one of ordinary skill in the art before the effective filing date of the claimed invention, with a reasonable expectation of success, to combine the invention of Aiglstorfer with the method as taught by Fan. This is because, as stated in [0005] of Fan: “In light of the circumstances described above, there is a need for a robot grasp learning technique which generates high quality grasp candidates without manual teaching, is computationally efficient, and provides grasp scenarios which are applicable to real-world situations involving jumbled-together parts and collision avoidance between the robot arm and bin sides.”

Claim 7 is rejected under 35 U.S.C. 103 as being unpatentable over Aiglstorfer and Stoppi in view of Ren et al. (US 20240112376 A1), hereinafter Ren.
Regarding claim 7, the combination of Aiglstorfer and Stoppi teaches the method of claim 1. 
Aiglstorfer does not explicitly disclose, but Ren, in an analogous field of endeavor, teaches:
wherein the set of points that are reprojected are obtained by removing points in the object mask that do not belong to the pick surface based on a clustering method (see at least [0034]: “The region of overlapping image data between the ground projections of the reference and target frames may be identified, and a determination whether or not any pixels in this overlapping region belong to a detected object may be made. For example, object detection may be performed using two-dimensional (2D) object detection (e.g., from images) or 3D object detection (e.g., from images, a 3D point cloud of LiDAR or RADAR detections), a representation of the detected objects (e.g., an object or segmentation mask) may be projected onto the ground plane, and each pixel in the overlapping region may be compared against a corresponding pixel of the projected representation of the detected objects (e.g., projected object and/or segmentation mask(s) generated from the reference and/or target frames) to determine whether the pixel belongs to a detected object. If a determination is made that there are no detected objects (or less than or equal to a threshold number of detected object pixels, such as zero) in the overlapping region, color statistics may be transferred from the ground projection of the reference frame to the target frame. If a determination is made that there is more than some designated threshold number or percentage of pixels in the overlapping region (e.g., 50%) that belong to a detected object, the ground projection of the reference frame from the preceding time slice may be used to transfer color statistics to the target frame. 
If there is less than the designated threshold number or percentage of points or pixels, those points or pixels may be removed from the ground projection for the target and source frames, the remaining pixels from each ground projection may be clustered (e.g., using k-means clustering), and the majority cluster from each ground projection may be used to transfer the color statistics from the majority cluster from the ground projection of the reference frame to the target image.”)
It would have been prima facie obvious for one of ordinary skill in the art before the effective filing date of the claimed invention, with a reasonable expectation of success, to combine the invention of Aiglstorfer with the method of Ren because, as stated in [0005] of Ren’s disclosure: “One existing technique for harmonizing color among different images of the same environment chooses one of the images as a reference image, determines global color statistics for the reference image, and transfers the global color statistics to the other images to match their global color statistics to those of the reference image. However, this technique is often ineffective at harmonizing color; for example, when different cameras view different objects or different parts of the same object. Applying a stitching algorithm using unharmonized camera images as input results in an unharmonized stitched image that has noticeable color mismatches at the seams where two images were stitched together, which may be perceived as artifacts. As such, conventional techniques generate distracting artifacts in regions of a stitched image that are potentially important for a driver or autonomous system to safely maneuver a vehicle. As a result, there is a need for improved color harmonization techniques that improve the visual quality of stitched images.”
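For illustration only, the "cluster, then keep the majority cluster" step named in the quoted Ren passage [0034] can be sketched on one-dimensional values. This is a minimal sketch under stated assumptions and does not reproduce Ren's color-statistics pipeline or the applicant's method: the function name `majority_cluster_1d`, the choice of two clusters, the min/max initialization, and the sample values are all hypothetical.

```python
def majority_cluster_1d(values, iterations=10):
    """Two-cluster k-means on scalar values; return the larger cluster.

    Hypothetical helper: centers start at the minimum and maximum value,
    so the result is deterministic. `values` must be non-empty.
    """
    centers = [min(values), max(values)]
    groups = [list(values), []]
    for _ in range(iterations):
        groups = [[], []]
        for v in values:
            # Assign each value to the nearer center (ties go to the first).
            idx = 0 if abs(v - centers[0]) <= abs(v - centers[1]) else 1
            groups[idx].append(v)
        # Recompute each center as its group's mean; keep it if the group is empty.
        centers = [sum(g) / len(g) if g else c for g, c in zip(groups, centers)]
    return max(groups, key=len)


# Made-up depth samples: four on one surface, one outlier.
samples = [0.95, 0.96, 0.94, 0.95, 1.40]
kept = majority_cluster_1d(samples)
```

With these samples the outlier at 1.40 forms its own cluster and is discarded, leaving the four values that lie on the common surface.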

Conclusion

Any inquiry concerning this communication or earlier communications from the examiner should be directed to ELIZABETH NELESKI whose telephone number is (571)272-6064. The examiner can normally be reached 10 - 6.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, THOMAS WORDEN, can be reached at (571) 272-4876. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.
/E.R.N./Examiner, Art Unit 3658
/JASON HOLLOWAY/Primary Examiner, Art Unit 3658
Prosecution Timeline

Jan 02, 2025: Application Filed
Apr 08, 2026: Non-Final Rejection mailed (§103)
May 14, 2026: Response Filed
Precedent Cases

Applications granted by this same examiner with similar technology

18/318,328 (Patent 12629843): ROBOT AND METHOD FOR CONTROLLING THEREOF. 3y 0m to grant; granted May 19, 2026.
18/897,430 (Patent 12600044): GUIDE DOG ROBOT FOR THE VISUALLY IMPAIRED PERSONS AND CONTROL METHOD THEREOF. 1y 6m to grant; granted Apr 14, 2026.
17/733,704 (Patent 12560222): METHOD FOR PERFORMING ROTATIONAL SPEED SYNCHRONISATION. 3y 10m to grant; granted Feb 24, 2026.
17/725,717 (Patent 12545410): POSITION-SENSITIVE CONTROLLER FOR AIRCRAFT SEATING. 3y 9m to grant; granted Feb 10, 2026.
18/188,104 (Patent 12515346): ROBOT AND CONTROL METHOD THEREFOR. 2y 9m to grant; granted Jan 06, 2026.

Based on the 5 most recent grants.
Prosecution Projections

Expected OA Rounds: 1-2
Grant Probability: 74%
With Interview: 90% (+16.3%)
Median Time to Grant: 3y 0m (~1y 7m remaining)
PTA Risk: Low

Based on 99 resolved cases by this examiner. Grant probability derived from career allowance rate.