DETAILED ACTION
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Information Disclosure Statement
The information disclosure statements (IDS) submitted on 1/24/24, 3/25/25, and 6/24/25 are in compliance with the provisions of 37 CFR 1.97. Accordingly, the information disclosure statements are being considered by the examiner.
Claim Rejections - 35 USC § 102
The following is a quotation of the appropriate paragraphs of 35 U.S.C. 102 that form the basis for the rejections under this section made in this Office action:
A person shall be entitled to a patent unless –
(a)(1) the claimed invention was patented, described in a printed publication, or in public use, on sale, or otherwise available to the public before the effective filing date of the claimed invention.
Claim(s) 1-2, 4-9, and 11-14 is/are rejected under 35 U.S.C. 102(a)(1) as being anticipated by LIU YUAN ET AL: "Gen6D: Generalizable Model-Free 6-DoF Object Pose Estimation from RGB Images", 11 November 2022 (2022-11-11), SPRINGER INTERNATIONAL PUBLISHING, XP047639542, ISBN: 978-3-031-19824-3, vol. 13692, pages 298-315, DOI: https://doi.org/10.1007/978-3-031-19824-3_18.
Regarding claim 1. LIU discloses a method for determining a pose of a target object, the method to be implemented by a computing device that stores a database related to a specific type to which the target object belongs, the database including a plurality of template images each containing a reference object that belongs to the specific type (page 299, [2]: "Given input reference images of an arbitrary object with known poses, Gen6D is able to directly predict its object pose in any query images"), the template images corresponding respectively to different deflection angles that are relative to a reference angle in which the reference object is captured (page 302, [4]: "Given Nr images of an object with known camera poses, called reference images, our target is to predict the pose of the object in a query image"; page 304, [4]: "To account for in-plane rotations, every reference image is rotated by Na predefined angles and all rotated versions are used in the element-wise product with the query image"), the method comprising:
obtaining an input image that contains the target object belonging to the specific type (page 302, [4]: "our target is to predict the pose of the object in a query image");
selecting a matching image that best matches with the input image from among the template images in the database based on an appearance of the target object in the input image (page 300, [3]: "compare the query image with every reference image to produce similarity scores and select the reference image with highest similarity score"),
wherein the matching image is one of the template images in which an angle of the reference object shown in the template image is closest to an angle of the target object shown in the input image (page 304, [2]: "Viewpoint selection aims to select a reference image whose viewpoint is the nearest to the query image");
performing a keypoint matching procedure based on the input image and the matching image, so as to identify a plurality of first feature points that are shown in the input image and that are related to the appearance of the target object, and a plurality of second feature points that are shown in the matching image and that respectively match with the first feature points (page 305, [5]: "to construct the features on these vertices, we first select Nn = 6 reference images that are near to the input pose. We extract feature maps on these selected reference images by a 2D CNN. Then, these feature maps are unprojected into the 3D volume and we compute the mean and variance of features among all reference images as features for volume vertices. For the query image, we also extract its feature map by the same 2D CNN, unproject feature map into the 3D volume using the input pose and concatenate the unprojected query features with the mean and variance of reference image features"); and
generating a pose-determination result that indicates the pose of the target object based on relationships among the first feature points and the second feature points (page 305, [5]: "Finally, we apply a 3D CNN on the concatenated features of the volume to predict a pose residual to update the input pose").
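For illustration only, the claim-1 mapping above (obtain a query image, select the best-matching template, match keypoints, and derive a pose) can be read as the following minimal sketch; all function names and signatures here are hypothetical stand-ins, not LIU's actual implementation:

```python
import numpy as np

def estimate_pose(query_img, template_imgs, template_angles,
                  similarity_fn, match_keypoints_fn, solve_pose_fn):
    """Hypothetical sketch of the claimed pipeline as mapped to LIU."""
    # 1. Select the template whose appearance best matches the query image.
    scores = [similarity_fn(query_img, t) for t in template_imgs]
    best = int(np.argmax(scores))
    matching_img = template_imgs[best]
    # 2. Identify corresponding feature points between the two images.
    first_pts, second_pts = match_keypoints_fn(query_img, matching_img)
    # 3. Derive the pose-determination result from the correspondences
    #    and the known angle of the selected template.
    return solve_pose_fn(first_pts, second_pts, template_angles[best])
```

The callables are placeholders for the selector, matcher, and refiner stages described in the quoted passages.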
Regarding claim 2. LIU discloses the database further including a plurality of reference feature datasets that correspond respectively to the template images, each of the reference feature datasets indicating an appearance feature of the reference object at the angle shown in the corresponding one of the template images, wherein selecting a matching image includes: generating a target feature dataset that corresponds to the input image based on a plurality of feature parts of the target object shown in the input image, where the target feature dataset indicates an appearance feature of the target object at the angle shown in the input image; calculating, for each of the reference feature datasets, a degree of matching between the target feature dataset and the reference feature dataset; and selecting, as the matching image, one of the template images that corresponds to the reference feature dataset having a highest degree of matching with the target feature dataset among the template images (page 300, [3] “use neural networks to pixel-wisely compare the query image with every reference image to produce similarity scores and select the reference image with highest similarity score. This pixel-wise comparison enables our selector to concentrate on object regions and reduces the influence of cluttered background. Furthermore, we add global normalization layers and self-attention layers to share similarity information cross different reference images. These two kinds of layers enable every reference images to commute with each other, which provides context information for the selector to select the most similar reference image”; page 304, [2,3,4] “Viewpoint selection aims to select a reference image whose viewpoint is the nearest to the query image. Meanwhile, we will estimate an in-plane rotation between the query image and the selected reference image. We approximately regard the viewpoint of the selected reference image as the viewpoint of the query image, which along with the estimated in-plane rotation forms an initial rotation for the object pose. As shown in Fig. 5, we design a viewpoint selector to compare the query image with every reference image to compute similarity scores. Specifically, we first extract feature maps by applying a VGG-11 [52] on reference images and the query image. Then, for every feature map of reference images, we compute its element-wise product with the feature map of the query image to produce a correlation score map. Finally, the correlation score map is processed by a similarity network to produce a similarity score and a relative in-plane rotation to align the query image with the reference image. In our viewpoint selector, we have three special designs. In-Plane Rotation. To account for in-plane rotations, every reference image is rotated by Na predefined angles and all rotated versions are used in the element-wise product with the query image”).
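For illustration only, the quoted viewpoint-selection step (element-wise product of the query feature map with rotated versions of each reference feature map, keeping the highest score) can be sketched as follows; the similarity network is replaced here by a plain sum, and 90-degree rotations stand in for LIU's Na predefined angles:

```python
import numpy as np

def select_viewpoint(query_feat, ref_feats, n_rot=4):
    """Hedged sketch of pixel-wise viewpoint selection as quoted from LIU.
    Returns (reference index, in-plane rotation in degrees, score)."""
    best = (None, None, -np.inf)
    for i, ref in enumerate(ref_feats):
        for k in range(n_rot):
            rotated = np.rot90(ref, k)  # one of the predefined in-plane rotations
            # Element-wise product gives a correlation score map; summing it
            # stands in for the learned similarity network.
            score = float((rotated * query_feat).sum())
            if score > best[2]:
                best = (i, k * 90.0, score)
    return best
```

The winning reference's viewpoint, together with the estimated in-plane rotation, forms the initial rotation described in the quote.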
Regarding claim 4. LIU discloses the template images including an original template image and a plurality of produced template images, the method further comprising, before obtaining an input image and selecting a matching image: obtaining the original template image; generating the reference feature dataset that corresponds to the original template image based on the original template image; generating the produced template images by rotating the original template image multiple times, respectively; and generating the reference feature datasets that correspond respectively to the produced template images based on the produced template images (page 302, [4] “Given Nr images of an object with known camera poses, called reference images, our target is to predict the pose of the object in a query image. The object pose here means a translation t and a rotation R that transform the object coordinate xobj to the camera coordinate xcam = Rxobj + t. All the intrinsics parameters of images are already known”; page 304, [2,3,4] “Viewpoint selection aims to select a reference image whose viewpoint is the nearest to the query image. Meanwhile, we will estimate an in-plane rotation between the query image and the selected reference image. We approximately regard the viewpoint of the selected reference image as the viewpoint of the query image, which along with the estimated in-plane rotation forms an initial rotation for the object pose. As shown in Fig. 5, we design a viewpoint selector to compare the query image with every reference image to compute similarity scores. Specifically, we first extract feature maps by applying a VGG-11 [52] on reference images and the query image. Then, for every feature map of reference images, we compute its element-wise product with the feature map of the query image to produce a correlation score map. Finally, the correlation score map is processed by a similarity network to produce a similarity score and a relative in-plane rotation to align the query image with the reference image. In our viewpoint selector, we have three special designs. In-Plane Rotation. To account for in-plane rotations, every reference image is rotated by Na predefined angles and all rotated versions are used in the element-wise product with the query image”).
Regarding claim 5. LIU discloses wherein generating the produced template images includes generating the produced template images each by rotating the original template image at the corresponding one of the deflection angles that corresponds to the produced template images (page 302, [4] “Given Nr images of an object with known camera poses, called reference images, our target is to predict the pose of the object in a query image. The object pose here means a translation t and a rotation R that transform the object coordinate xobj to the camera coordinate xcam = Rxobj + t. All the intrinsics parameters of images are already known”; page 304, [2,3,4] “Viewpoint selection aims to select a reference image whose viewpoint is the nearest to the query image. Meanwhile, we will estimate an in-plane rotation between the query image and the selected reference image. We approximately regard the viewpoint of the selected reference image as the viewpoint of the query image, which along with the estimated in-plane rotation forms an initial rotation for the object pose. As shown in Fig. 5, we design a viewpoint selector to compare the query image with every reference image to compute similarity scores. Specifically, we first extract feature maps by applying a VGG-11 [52] on reference images and the query image. Then, for every feature map of reference images, we compute its element-wise product with the feature map of the query image to produce a correlation score map. Finally, the correlation score map is processed by a similarity network to produce a similarity score and a relative in-plane rotation to align the query image with the reference image. In our viewpoint selector, we have three special designs. In-Plane Rotation. To account for in-plane rotations, every reference image is rotated by Na predefined angles and all rotated versions are used in the element-wise product with the query image”).
Regarding claim 6. LIU discloses the database further including a plurality of reference pose datasets that correspond respectively to the template images, each of the reference pose datasets indicating a pose of the reference object shown in the corresponding one of the template images, wherein: performing a keypoint matching procedure further includes generating a calibration dataset based on the first feature points and the second feature points, where the calibration dataset indicates the relationships among the first feature points and the second feature points, and the pose-determination result is generated based on the reference pose dataset that corresponds to the matching image, and on the calibration dataset (page 305, [4,5] “Specifically, since the objects are already normalized inside an unit sphere at the origin, we build a volume within the unit cube at the origin with Sv^3 = 32^3 vertices. As shown in Fig. 6 (a), to construct the features on these vertices, we first select Nn = 6 reference images that are near to the input pose. We extract feature maps on these selected reference images by a 2D CNN. Then, these feature maps are unprojected into the 3D volume and we compute the mean and variance of features among all reference images as features for volume vertices. For the query image, we also extract its feature map by the same 2D CNN, unproject feature map into the 3D volume using the input pose and concatenate the unprojected query features with the mean and variance of reference image features. Finally, we apply a 3D CNN on the concatenated features of the volume to predict a pose residual to update the input pose. Similarity Approximation. Instead of regressing the rigid pose residual directly, we approximate it with a similarity transformation, as shown in Fig. 6 (b). The approximate similarity transformation consists of a 2D in-plane offset, a scale factor and a residual 3D rotation. The reason of using this approximation is that it avoids direct regression of the 3D translation from the red circle to the solid green circle in Fig. 6, which is out of the scope of the feature volume. Instead, we regress a similarity transformation from red circle to dotted green circle, which can be easily inferred from the features defined in the volume. More details can be found in the supplementary materials. In our implementation, we apply the refiner iteratively 3 times by default.”).
Regarding claim 7. LIU discloses wherein each one of the first feature points is homogeneous with one of the second feature points in a one-to-one relationship, and the calibration dataset indicates, for each one of the first feature points, the relationship between the first feature point and the respective one of the second feature points using six degrees of freedom in three-dimensional space (page 305, [4,5] “Specifically, since the objects are already normalized inside an unit sphere at the origin, we build a volume within the unit cube at the origin with Sv^3 = 32^3 vertices. As shown in Fig. 6 (a), to construct the features on these vertices, we first select Nn = 6 reference images that are near to the input pose. We extract feature maps on these selected reference images by a 2D CNN. Then, these feature maps are unprojected into the 3D volume and we compute the mean and variance of features among all reference images as features for volume vertices. For the query image, we also extract its feature map by the same 2D CNN, unproject feature map into the 3D volume using the input pose and concatenate the unprojected query features with the mean and variance of reference image features. Finally, we apply a 3D CNN on the concatenated features of the volume to predict a pose residual to update the input pose. Similarity Approximation. Instead of regressing the rigid pose residual directly, we approximate it with a similarity transformation, as shown in Fig. 6 (b). The approximate similarity transformation consists of a 2D in-plane offset, a scale factor and a residual 3D rotation. The reason of using this approximation is that it avoids direct regression of the 3D translation from the red circle to the solid green circle in Fig. 6, which is out of the scope of the feature volume. Instead, we regress a similarity transformation from red circle to dotted green circle, which can be easily inferred from the features defined in the volume. More details can be found in the supplementary materials. In our implementation, we apply the refiner iteratively 3 times by default.”; page 306, [1] “Discussion. The key difference between our volume-based refiner and other pose refiners [29,56,73] is that our pose refiner does not require rendering an image on the input pose, which thus is more suitable for the model-free pose estimation. Meanwhile, since the 3D volume is constructed by multiple reference images with different poses, our volume-based refiner is able to know the image features under different poses and infer how pose changes affect the image features for unseen objects. In comparison, previous pose refiners [29,56,73] only compare a rendered image with the input query image to compute a pose residual. Such a 2D image does not provide enough 3D structure information to infer how pose changes affect image patterns, especially for unseen objects. Thus, it is hard for these methods to predict correct pose residuals for unseen objects”; page 312, [1] “we propose an easy-to-use 6-DoF pose estimator Gen6D for unseen objects. To predict poses for unseen objects, Gen6D does not require the object model but only needs some posed images of the object to predict its pose in arbitrary environments. In Gen6D, we design a novel viewpoint selector and a novel volume-based pose refiner. Experiments demonstrate the superior performance of Gen6D estimator in predicting poses for unseen objects in the model-free setting”).
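For illustration only, the quoted mean/variance feature aggregation in LIU's volume-based refiner can be sketched as follows; array shapes and names are hypothetical, and the subsequent 3D CNN that regresses the pose residual is omitted:

```python
import numpy as np

def build_volume_features(ref_vol_feats, query_vol_feat):
    """Hedged sketch of the feature-volume construction quoted from LIU.
    ref_vol_feats: (N_refs, V, C) features unprojected from reference images
                   onto V volume vertices with C channels each.
    query_vol_feat: (V, C) features unprojected from the query image.
    Returns (V, 3C): query features concatenated with per-vertex mean and
    variance of the reference features (input to a 3D CNN in LIU)."""
    mean = ref_vol_feats.mean(axis=0)  # per-vertex mean over references
    var = ref_vol_feats.var(axis=0)    # per-vertex variance over references
    return np.concatenate([query_vol_feat, mean, var], axis=-1)
```

The mean/variance pair summarizes agreement among reference views at each vertex, which is what lets the refiner reason about unseen objects without a rendered model.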
Regarding claim 8. Claim 8 is rejected for the same reasons and rationale as provided above for claim 1.
Regarding claim 9. Claim 9 is rejected for the same reasons and rationale as provided above for claim 2.
Regarding claim 11. Claim 11 is rejected for the same reasons and rationale as provided above for claim 4.
Regarding claim 12. Claim 12 is rejected for the same reasons and rationale as provided above for claim 5.
Regarding claim 13. Claim 13 is rejected for the same reasons and rationale as provided above for claim 6.
Regarding claim 14. Claim 14 is rejected for the same reasons and rationale as provided above for claim 7.
Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.
The factual inquiries for establishing a background for determining obviousness under 35 U.S.C. 103 are summarized as follows:
1. Determining the scope and contents of the prior art.
2. Ascertaining the differences between the prior art and the claims at issue.
3. Resolving the level of ordinary skill in the pertinent art.
4. Considering objective evidence present in the application indicating obviousness or nonobviousness.
Claim(s) 3 & 10 is/are rejected under 35 U.S.C. 103 as being unpatentable over LIU as applied to claim 1 above, and further in view of US PG Pub 2022/0189049 to Watson et al.
Regarding claim 3. LIU does not disclose calculating a Minkowski distance between the target feature dataset and the reference feature dataset. However, Watson, in the same art of pose calculation/determination, discloses calculating a Minkowski distance between the target feature dataset and the reference feature dataset (“other distance metrics may be implemented (e.g., l.sub.2 distance, Manhattan distance, Minkowski distance, Hamming distance, etc.)”, paragraph 61).
Therefore, it would have been obvious to a person with ordinary skill in the art before the effective filing date of the claimed invention to have modified LIU's pose estimation to include calculating a Minkowski distance between the target feature dataset and the reference feature dataset.
It would have been obvious to a person with ordinary skill in the art before the effective filing date of the claimed invention to have modified LIU's pose estimation by the teaching of Watson for the following reasons: (a) during training of the model, a secondary depth estimation network that takes a single image input may be used to guide the training process to reduce overfitting (paragraph 5, Watson); and (b) for one of ordinary skill in the art it would be a simple substitution of one known element, Minkowski distance, for another distance metric to obtain predictable results.
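For illustration only, the Minkowski distance taught by Watson generalizes the Manhattan distance (p=1) and the Euclidean distance (p=2) between two feature vectors; variable names in this sketch are hypothetical:

```python
import numpy as np

def minkowski_distance(target_feats, ref_feats, p=3):
    """Minkowski distance of order p between a target feature dataset and a
    reference feature dataset, per Watson's list of interchangeable metrics:
    d = (sum_i |a_i - b_i|^p)^(1/p)."""
    diff = np.abs(np.asarray(target_feats, dtype=float)
                  - np.asarray(ref_feats, dtype=float))
    return float(np.sum(diff ** p) ** (1.0 / p))
```

Swapping p changes which known metric is computed, which is why substituting one distance metric for another yields predictable results.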
Regarding claim 10. Claim 10 is rejected for the same reasons and rationale as provided above for claim 3.
Conclusion
The prior art made of record and not relied upon is considered pertinent to applicant's disclosure. US PG Pub 2013/0322767 to CHAO et al. discloses a method for estimating camera pose that includes: obtaining an image of a location captured via a camera, where the image includes a target object and edge line features outside of the target object; and calculating a pose of the camera with respect to the target object based on the edge line features.
Any inquiry concerning this communication or earlier communications from the examiner should be directed to CHRISTOPHER D. WAIT, Esq. whose telephone number is (571) 270-5976. The examiner can normally be reached Monday-Friday, 9:30-6:00.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Abderrahim Merouan, can be reached at (571) 270-5254. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.
CHRISTOPHER D. WAIT, Esq.
Primary Examiner
Art Unit 2683
/CHRISTOPHER WAIT/Primary Examiner, Art Unit 2683