Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Information Disclosure Statement
The information disclosure statement (IDS) submitted on 02/14/2024 is in compliance with the provisions of 37 CFR 1.97. Accordingly, the information disclosure statement is being considered by the examiner.
Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.
Claim(s) 1-4, 7, 9, 11-14, 17, and 19 is/are rejected under 35 U.S.C. 103 as being unpatentable over the non-patent literature publication titled “CenterSnap: Single-Shot Multi-Object 3D Shape Reconstruction and Categorical 6D Pose and Size Estimation” by Irshad et al. (hereinafter Irshad), dated 03/03/2022, available at https://arxiv.org/abs/2203.01929, in view of the non-patent literature publication titled “Self-supervised Neural Articulated Shape and Appearance Models” by Wei et al. (hereinafter Wei), dated 05/17/2022, available at https://arxiv.org/abs/2205.08525v1.
For claim 1, Irshad as applied teaches a method, comprising:
obtaining one or more images of an environment having one or more objects (see, e.g., abstract, 2nd and 3rd full pars. of sec. I, 1st full par. of sec. II, and 1st full par. of sec. III and FIGS. 1 and 2, which teach using an RGB-D image as input);
generating, using a trained artificial intelligence (AI) encoder, first information associated with the one or more images based at least in part on the one or more images (see, e.g., 3rd full par. of sec. I, 1st full par. of sec. III, 1st full par. of sec. III(B) and FIG. 2, which teach predicting joint shape, pose and size codes using a 3D auto-encoder), the first information comprising a plurality of joint codes and a plurality of shape codes associated with the one or more images (see, e.g., 3rd full par. of sec. I, 1st full par. of sec. III, 1st full par. of sec. III(B) and FIG. 2, which teach predicting joint shape, pose and size codes using a 3D auto-encoder);
generating, using a trained AI decoder, second information associated with the one or more objects based at least in part on the plurality of joint codes and the plurality of shape codes associated with the one or more images (see, e.g., 3rd full par. of sec. I, 1st full par. of sec. III(B), 1st full par. of sec. III(C), 2nd full par. of sec. III(D) and FIGS. 1-3, which teach using a trained decoder network to optimize and reconstruct the input point cloud coupled with 6D pose and scales), the second information comprising shape information, one or more joint types, and one or more joint states corresponding to at least one of the one or more objects (see, e.g., 3rd full par. of sec. I, 1st full par. of sec. III(B), 1st full par. of sec. III(C), 2nd full par. of sec. III(D) and FIGS. 1-3, which teach using a decoder network to reconstruct the input point cloud coupled with 6D pose and scale/size, wherein the 6D pose is denoted by a 3D rotation and translation; the examiner interprets the reconstructed pose as the claimed joint states); and
storing the second information in memory (see, e.g., 4th full par. of sec. I and 8th full par. of sec. IV, which teach outputting the reconstructed point cloud, which suggests that the reconstructed point cloud is stored).
Irshad as applied does not explicitly teach that the second information includes one or more joint types. Wei in the analogous art teaches recognizing and handling multiple joint types while reconstructing (see, e.g., abstract, 6th full par. of sec. 1, 4th full par. of sec. 4.2, and 1st full par. of sec. 4.3 and FIGS. 3 and 4 of Wei).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify Irshad to recognize and handle multiple joint types as taught by Wei because doing so would yield the predictable result of reconstructed objects that more closely resemble real-world objects (see 1st full par. of sec. 5 and FIG. 3 of Wei and MPEP 2143(I)(D)).
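Purely as an illustrative aid to the mapping above, the sketch below shows one way the claimed encoder/decoder arrangement could be organized: an encoder predicting shape codes and joint codes from an RGB-D image, and a decoder turning those codes into a point cloud, a joint type, and a joint state. The code is hypothetical; it is not taken from Irshad, Wei, or the Application, and all module names, channel counts, and dimensions are assumptions.

```python
# Hypothetical sketch -- not code from Irshad, Wei, or the Application.
import torch
import torch.nn as nn

class Encoder(nn.Module):
    """Predicts shape and joint codes (the claimed "first information")
    from a 4-channel RGB-D input."""
    def __init__(self, code_dim=64):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(4, 32, 3, stride=2, padding=1), nn.ReLU(),   # 4 ch = RGB-D
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten())
        self.shape_head = nn.Linear(64, code_dim)
        self.joint_head = nn.Linear(64, code_dim)

    def forward(self, rgbd):                       # rgbd: (B, 4, H, W)
        feat = self.backbone(rgbd)                 # (B, 64)
        return self.shape_head(feat), self.joint_head(feat)

class Decoder(nn.Module):
    """Decodes the codes into a point cloud plus a joint type and a joint
    state (the claimed "second information")."""
    def __init__(self, code_dim=64, num_points=1024, num_joint_types=2):
        super().__init__()
        self.num_points = num_points
        self.geometry = nn.Linear(2 * code_dim, num_points * 3)  # uses both codes
        self.joint_type = nn.Linear(code_dim, num_joint_types)   # e.g., revolute/prismatic
        self.joint_state = nn.Linear(code_dim, 1)                # articulation amount

    def forward(self, shape_code, joint_code):
        points = self.geometry(torch.cat([shape_code, joint_code], dim=-1))
        return (points.view(-1, self.num_points, 3),
                self.joint_type(joint_code),
                self.joint_state(joint_code))
```

Under these assumptions, a forward pass such as Decoder()(*Encoder()(torch.randn(1, 4, 128, 128))) would produce the three outputs mapped above: shape information, joint-type logits, and a joint state.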
For claim 11, Irshad as applied teaches a system, comprising:
one or more memories (see, e.g., 8th full par. of sec. IV, which teaches using a desktop computer); and
one or more processors coupled to the one or more memories (see, e.g., 8th full par. of sec. IV, which teaches using a desktop computer), the one or more processors being configured to cause the system to:
obtain one or more images of an environment having one or more objects (see, e.g., abstract, 2nd and 3rd full pars. of sec. I, 1st full par. of sec. II, and 1st full par. of sec. III and FIGS. 1 and 2, which teach using an RGB-D image as input);
generate, using a trained artificial intelligence (AI) encoder, first information associated with the one or more images based at least in part on the one or more images (see, e.g., 3rd full par. of sec. I, 1st full par. of sec. III, 1st full par. of sec. III(B) and FIG. 2, which teach predicting joint shape, pose and size codes using a 3D auto-encoder), the first information comprising a plurality of joint codes and a plurality of shape codes associated with the one or more images (see, e.g., 3rd full par. of sec. I, 1st full par. of sec. III, 1st full par. of sec. III(B) and FIG. 2, which teach predicting joint shape, pose and size codes using a 3D auto-encoder);
generate, using a trained AI decoder, second information associated with the one or more objects based at least in part on the plurality of joint codes and the plurality of shape codes associated with the one or more images (see, e.g., 3rd full par. of sec. I, 1st full par. of sec. III(B), 1st full par. of sec. III(C), 2nd full par. of sec. III(D) and FIGS. 1-3, which teach using a trained decoder network to optimize and reconstruct the input point cloud coupled with 6D pose and scales), the second information comprising shape information, one or more joint types, and one or more joint states corresponding to at least one of the one or more objects (see, e.g., 3rd full par. of sec. I, 1st full par. of sec. III(B), 1st full par. of sec. III(C), 2nd full par. of sec. III(D) and FIGS. 1-3, which teach using a decoder network to reconstruct the input point cloud coupled with 6D pose and scale/size, wherein the 6D pose is denoted by a 3D rotation and translation; the examiner interprets the reconstructed pose as the claimed joint states); and
store the second information in memory (see, e.g., 4th full par. of sec. I and 8th full par. of sec. IV, which teach outputting the reconstructed point cloud, which requires storing the reconstructed point cloud).
Irshad as applied does not explicitly teach that the second information includes one or more joint types. Wei in the analogous art teaches recognizing and handling multiple joint types while reconstructing (see, e.g., abstract, 6th full par. of sec. 1, 4th full par. of sec. 4.2, and 1st full par. of sec. 4.3 and FIGS. 3 and 4 of Wei).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify Irshad to recognize and handle multiple joint types as taught by Wei because doing so would yield the predictable result of reconstructed objects that more closely resemble real-world objects (see 1st full par. of sec. 5 and FIG. 3 of Wei and MPEP 2143(I)(D)).
For claims 2 and 12, Irshad in view of Wei teaches that generating the first information comprises:
generating a plurality of feature maps associated with the one or more images based at least in part on the one or more images (see, e.g., 3rd full par. of sec. I and 1st full par. of sec. III(A), which teach generating a feature pyramid backbone from extracted features of the image); and
inferring the first information based at least in part on the plurality of feature maps using the trained AI encoder (see, e.g., 3rd full par. of sec. I, 1st full par. of sec. III and 1st full par. of sec. III(B) and FIG. 2, which teach predicting joint shape, pose and size codes using the 3D auto-encoder and the feature-based heatmaps).
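As a further illustrative aid only, the following hypothetical sketch shows the general pattern of inferring latent codes from feature-based heatmaps, as the mapping above describes: object centers are detected as peaks on a predicted heatmap, and a code is read out at each peak location. The function and variable names are assumptions, not Irshad's API.

```python
# Hypothetical sketch -- names and shapes are illustrative assumptions.
import torch

def infer_codes_from_heatmap(heatmap, code_map, threshold=0.5):
    """heatmap: (H, W) center-likelihood map; code_map: (D, H, W) map of
    per-pixel latent codes. Returns one D-dim code per detected peak."""
    peaks = (heatmap > threshold).nonzero(as_tuple=False)  # (N, 2) pixel coords
    return [code_map[:, y, x] for y, x in peaks]           # N codes of dim D
```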
For claims 3 and 13, Irshad in view of Wei teaches that the first information further comprises:
a segmentation mask, one or more three-dimensional (3D) bounding boxes associated with the one or more objects, one or more poses associated with the one or more objects, a depth map, a heatmap, or any combination thereof (see, e.g., 1st full par. of sec. III, 1st full par. of sec. III(A), 1st full par. of sec. III(B), 1st full par. of sec. III(C), and 1st and 2nd pars. of sec. III(D), and FIG. 2, which teach generating poses, heatmaps and masks).
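For illustration only, a container for the claimed "first information" with the optional fields recited in claims 3 and 13 might look like the hypothetical sketch below; the field names and array shapes are assumptions, not the Application's data structures.

```python
# Hypothetical sketch of the claimed "first information" container.
from dataclasses import dataclass
from typing import List, Optional
import numpy as np

@dataclass
class FirstInformation:
    joint_codes: List[np.ndarray]                    # one latent code per object
    shape_codes: List[np.ndarray]                    # one latent code per object
    segmentation_mask: Optional[np.ndarray] = None   # (H, W) instance ids
    bounding_boxes_3d: Optional[np.ndarray] = None   # (N, 8, 3) box corners
    poses: Optional[np.ndarray] = None               # (N, 4, 4) object poses
    depth_map: Optional[np.ndarray] = None           # (H, W) depths
    heatmap: Optional[np.ndarray] = None             # (H, W) center likelihoods
```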
For claims 4 and 14, Irshad as applied teaches that generating the second information comprises:
inferring the shape information for each of the one or more objects using a trained AI geometry decoder based at least in part on the plurality of joint codes and the plurality of shape codes (see, e.g., 3rd full par. of sec. I, 1st full par. of sec. III(B), 1st full par. of sec. III(C), 2nd full par. of sec. III(D) and FIGS. 1-3, which teach using a trained decoder network to optimize and reconstruct the input point cloud, which includes joint shape, pose and size codes); and
inferring the one or more joint types and the one or more joint states for each of the one or more objects using a trained AI joint decoder based at least in part on the plurality of joint codes (see, e.g., 3rd full par. of sec. I, 1st full par. of sec. III(B), 1st full par. of sec. III(C), 2nd full par. of sec. III(D) and FIGS. 1-3, which teach using a decoder network to reconstruct the input point cloud coupled with 6D pose and scale/size, wherein the 6D pose is denoted by a 3D rotation and translation; the examiner interprets the reconstructed pose as the claimed joint states).
Irshad does not explicitly teach inferring one or more joint types based on the joint codes. Wei in the analogous art teaches recognizing and handling multiple joint types based on the articulation codes (see, e.g., abstract, 6th full par. of sec. 1, 4th full par. of sec. 4.2, and 1st full par. of sec. 4.3 and FIGS. 3 and 4 of Wei).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify Irshad to infer and handle multiple joint types as taught by Wei because doing so would yield the predictable result of reconstructed objects that more closely resemble real-world objects (see 1st full par. of sec. 5 and FIG. 3 of Wei and MPEP 2143(I)(D)).
For claims 7 and 17, Irshad as applied teaches training the AI decoder based at least in part on a plurality of object categories and a plurality of joint types, wherein the one or more objects correspond to at least two or more of the plurality of object categories and at least one of the joint types (see, e.g., 1st full par. of sec. III(B) and 2nd full par. of sec. IV and FIG. 3, which teach training the decoder model based on multiple categories).
While Irshad does not explicitly teach this feature, Wei in the analogous art teaches training its model on multiple joint types (see, e.g., 1st full par. of sec. 4.1 and 1st full par. of sec. 4.3 of Wei).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify Irshad to train on multiple joint types as taught by Wei because doing so would yield the predictable result of recognizing and handling different types of joints (see, e.g., abstract, 6th full par. of sec. 1, 4th full par. of sec. 4.2, and 1st full par. of sec. 4.3 and FIGS. 3 and 4 of Wei and MPEP 2143(I)(D)).
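For illustration only, training over multiple object categories and joint types, as discussed above, is commonly expressed as a combined objective: a reconstruction term on the decoded shape, a classification term on the joint type, and a regression term on the joint state. The sketch below is a hypothetical stand-in, not the training code of Irshad or Wei.

```python
# Hypothetical sketch of a combined training objective.
import torch
import torch.nn.functional as F

def decoder_loss(pred_points, gt_points, joint_type_logits, gt_joint_type,
                 pred_state, gt_state):
    recon = F.mse_loss(pred_points, gt_points)                 # shape reconstruction
    jtype = F.cross_entropy(joint_type_logits, gt_joint_type)  # e.g., revolute vs. prismatic
    jstate = F.l1_loss(pred_state, gt_state)                   # articulation amount
    return recon + jtype + jstate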
For claims 9 and 19, Irshad in view of Wei teaches that the one or more images comprises a pair of stereo images, a red-green-blue-depth (RGB-D) image, or a combination thereof (see, e.g., abstract, 2nd and 3rd full pars. of sec. I, 1st full par. of sec. II, 1st full par. of sec. III and FIGS. 1 and 2 of Irshad, which teach using an RGB-D image as input).
Claim(s) 5 and 15 is/are rejected under 35 U.S.C. 103 as being unpatentable over Irshad in view of Wei and further in view of the non-patent literature publication titled “ShAPO: Implicit Representations for Multi-Object Shape, Appearance, and Pose Optimization” by Irshad et al. (hereinafter Irshad2), dated 07/27/2022, available at https://arxiv.org/abs/2207.13691v1.
For claims 5 and 15, Irshad as applied teaches that:
the one or more joint states comprises an amount of articulation associated with a particular joint (see, e.g., 1st full par. of sec. III, 1st full par. of sec. III(C), 2nd full par. of sec. III(D), and 3rd full par. of sec. IV, which teach that the pose includes a 3D rotation and translation, which are evaluated in degrees of rotation and centimeters of translation).
While Irshad does not explicitly teach this feature, Wei in the analogous art teaches that the one or more joint types comprises a prismatic joint or a revolute joint (see, e.g., abstract, 6th full par. of sec. 1, 4th full par. of sec. 4.2, and 1st full par. of sec. 4.3 and FIGS. 3 and 4 of Wei, which teach recognizing and handling multiple joint types, e.g., revolute and prismatic joints).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify Irshad to recognize and handle multiple joint types as taught by Wei because doing so would yield the predictable result of reconstructed objects that more closely resemble real-world objects (see 1st full par. of sec. 5 and FIG. 3 of Wei and MPEP 2143(I)(D)).
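For context on the distinction drawn above, the sketch below illustrates the standard kinematics of the two joint types: a prismatic joint translates part points along an axis by the joint state, and a revolute joint rotates them about an axis by the joint state (via the Rodrigues rotation formula). The helper is hypothetical and illustrative only; it is not taken from Wei.

```python
# Hypothetical sketch of prismatic vs. revolute joint kinematics.
import numpy as np

def apply_joint(points, joint_type, state, axis, pivot=np.zeros(3)):
    """points: (N, 3) part points; state: articulation amount
    (meters for prismatic, radians for revolute)."""
    axis = axis / np.linalg.norm(axis)
    if joint_type == "prismatic":                 # translate along the axis
        return points + state * axis
    if joint_type == "revolute":                  # rotate about the axis through pivot
        K = np.array([[0, -axis[2], axis[1]],
                      [axis[2], 0, -axis[0]],
                      [-axis[1], axis[0], 0]])
        R = np.eye(3) + np.sin(state) * K + (1 - np.cos(state)) * (K @ K)  # Rodrigues
        return (points - pivot) @ R.T + pivot
    raise ValueError(f"unknown joint type: {joint_type}")
```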
Irshad in view of Wei does not explicitly teach that the shape information comprises one or more signed distance functions for each of the one or more objects. Irshad2 in the analogous art teaches representing shape information as an implicit signed distance field (SDF) (see, e.g., 1st and 2nd full pars. of sec. 3 and 1st full par. of sec. 3.3 of Irshad2).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify Irshad in view of Wei to represent the shape information as SDFs as taught by Irshad2 because doing so would allow inferring 3D shapes along with 6D pose, size and appearances in an RGB-D observation (see 1st and 2nd full pars. of sec. 3 of Irshad2).
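For context, a signed distance function returns negative values inside a surface, zero on the surface, and positive values outside it. The sketch below uses an analytic sphere as a hypothetical stand-in for the learned SDF representation that Irshad2 describes; all names are assumptions.

```python
# Hypothetical sketch of a signed distance function (SDF) representation.
import numpy as np

def sphere_sdf(points, center=np.zeros(3), radius=0.5):
    """Negative inside, zero on the surface, positive outside."""
    return np.linalg.norm(points - center, axis=-1) - radius

# Surface points are where the SDF is (approximately) zero:
grid = np.stack(np.meshgrid(*[np.linspace(-1, 1, 32)] * 3, indexing="ij"), axis=-1)
near_surface = np.abs(sphere_sdf(grid.reshape(-1, 3))) < 0.05
```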
Claim(s) 10 and 20 is/are rejected under 35 U.S.C. 103 as being unpatentable over Irshad in view of Wei and further in view of U.S. Patent Application Publication No. 2023/0077856 to Irshad et al. (hereinafter Irshad3).
For claim 10, while Irshad in view of Wei does not explicitly teach these features, Irshad3 in the analogous art teaches capturing the one or more images using a camera of a robotic device (see, e.g., pars. 32 and 70 and FIGS. 1 and 4 of Irshad3); and controlling the robotic device based at least in part on the stored second information (see, e.g., pars. 6, 32 and 72 and FIGS. 1 and 4 of Irshad3).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify Irshad in view of Wei to control a robotic device as taught by Irshad3 because doing so would yield the predictable result of allowing the robotic device to navigate about an environment (see, e.g., pars. 67-68 of Irshad3 and MPEP 2143(I)(D)).
For claim 20, while Irshad in view of Wei does not explicitly teach the following limitations, Irshad3 in the analogous art teaches:
a robot coupled to the one or more processors (see, e.g., pars. 18 and 20 and FIG. 1);
a camera communicably coupled to the one or more memories and the one or more processors (see, e.g., pars. 32, 70 and 78 and FIG. 1), wherein the one or more processors is configured to cause the system to:
capture the one or more images using the camera (see, e.g., pars. 32 and 70 and FIGS. 1 and 4 of Irshad3), and
control the robot based at least in part on the stored second information (see, e.g., pars. 6, 32 and 72 and FIGS. 1 and 4 of Irshad3).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify Irshad in view of Wei to control a robotic device as taught by Irshad3 because doing so would yield the predictable result of allowing the robotic device to navigate about an environment (see, e.g., pars. 67-68 of Irshad3 and MPEP 2143(I)(D)).
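As a final illustrative aid, the toy routine below shows how stored joint-type and joint-state information could, in principle, inform a robot motion command of the kind contemplated by claims 10 and 20. It is hypothetical and is not taken from Irshad3; the action names are assumptions.

```python
# Hypothetical sketch: using stored "second information" to plan a motion.
def plan_motion(joint_type, joint_state, target_state):
    """Return a toy motion command that moves a joint toward a target state."""
    delta = target_state - joint_state
    if joint_type == "revolute":
        return {"action": "rotate_about_joint_axis", "radians": delta}
    if joint_type == "prismatic":
        return {"action": "translate_along_joint_axis", "meters": delta}
    return {"action": "stop"}
```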
Allowable Subject Matter
Claims 6, 8, 16 and 18 are objected to as being dependent upon a rejected base claim, but would be allowable if rewritten in independent form including all of the limitations of the base claim and any intervening claims.
In regard to claims 6 and 16, when considered as a whole, the prior art of record fails to disclose or render obvious, alone or in combination:
“training the AI decoder based at least in part on a joint space regularization among a plurality of articulated objects, the joint space regularization indicating joint space similarities among the articulated objects.”
In regard to claims 8 and 18, when considered as a whole, the prior art of record fails to disclose or render obvious, alone or in combination:
“training the AI encoder based at least in part shape and joint code labels obtained from training the AI decoder.”
Additional Citations
The following table lists several references that are relevant to the subject matter claimed and disclosed in this Application. The references are not relied on by the Examiner, but are provided to assist the Applicant in responding to this Office action.
Citation: Goforth et al. (U.S. Pat. Pub. 2021/0150228)
Relevance: Describes methods and systems for jointly estimating a pose and a shape of an object perceived by an autonomous vehicle. The system includes data and program code collectively defining a neural network that has been trained to jointly estimate a pose and a shape of a plurality of objects from incomplete point cloud data. The neural network includes a trained shared encoder neural network, a trained pose decoder neural network, and a trained shape decoder neural network. The method includes receiving an incomplete point cloud representation of an object, inputting the point cloud data into the trained shared encoder, and outputting a code representative of the point cloud data. The method also includes generating an estimated pose and shape of the object based on the code. The pose includes at least a heading or a translation, and the shape includes a denser point cloud representation of the object.

Citation: Pollefeys et al. (U.S. Pat. Pub. 2020/0302634)
Relevance: Describes a data processing system. One embodiment includes a processor having associated memory, the processor being configured to execute instructions using portions of the memory to cause the processor to, at classification time, receive an input image frame from an image source. The input image frame includes an articulated object and a target object. The processor is further caused to process the input image frame using a trained neural network configured to, for each input cell of a plurality of input cells in the input image frame, predict a three-dimensional articulated object pose of the articulated object and a three-dimensional target object pose of the target object relative to the input cell. The processor is further caused to output the three-dimensional articulated object pose and the three-dimensional target object pose from the neural network.

Citation: Kocabas et al. (U.S. Pat. Pub. 2024/0070874)
Relevance: Estimating the motion of a human or other object in video is a common computer vision task with applications in robotics, sports, mixed reality, etc. However, motion estimation becomes difficult when the camera capturing the video is moving, because the observed object and camera motions are entangled. The disclosure provides for joint estimation of the motion of a camera and the motion of articulated objects captured in video by the camera.

Table 1
Conclusion
The prior art made of record and not relied upon is considered pertinent to applicant's disclosure. See attached form 892 and Table 1.
Any inquiry concerning this communication or earlier communications from the examiner should be directed to WOO RHIM, whose telephone number is (571) 272-6560. The examiner can normally be reached Monday-Friday, 9:30 am-6:00 pm ET.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner's supervisor, Henok Shiferaw, can be reached at 571-272-4637. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.
/WOO C RHIM/Examiner, Art Unit 2676