Last updated: May 29, 2026
Application No. 18/534,478
MACHINE LEARNING FOR POSE ESTIMATION OF ROBOTIC SYSTEMS

Final Rejection §103
Filed: Dec 08, 2023
Priority: Dec 30, 2022 (provisional 63/436,372)
Examiner: TRUONG, KARL DUC
Art Unit: 2614
Tech Center: 2600 (Communications)
Assignee: Intrinsic Innovation LLC
OA Round: 2 (Final)
Interview Optional

An interview has already been conducted in this application's prosecution history. This examiner has a 52% grant rate with a +33.5% interview lift. Since an interview has already been tried, a written response with narrowed claims, based on precedent claim evolution patterns, is recommended.
Based on 33 resolved cases, 2023–2026
Examiner Intelligence

TRUONG, KARL DUC
Career Allowance Rate: 52% of resolved cases (17 granted / 33 resolved), -10.5% vs TC avg
Interview Lift: +33.5% (strong), comparing resolved cases with and without an interview
Avg Prosecution (typical timeline): 2y 7m
25 currently pending
Statute-Specific Performance

§101: 1.0% (-39.0% vs TC avg)
§103: 98.0% (+58.0% vs TC avg)
§102: 1.0% (-39.0% vs TC avg)
Tech Center averages are estimates. Based on career data from 33 resolved cases
Office Action

§103
DETAILED ACTION
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Response to Amendment
This action is in response to the amendment filed on 18th March, 2026. Claims 1, 4-5, 7, 9, 14-15, and 18 have been amended. Claim 8 has been cancelled. Claims 21-23 have been added. Claims 1-7 and 9-23 remain rejected in the application.

Response to Arguments
Applicant's arguments with respect to Claims 1 and 22-23, filed on 18th March, 2026, concern the rejection under 35 U.S.C. § 103 and assert that the prior art does not teach "determining a viewpoint dependent symmetry for the object based on the mesh using a feature-based sampling technique". The arguments and the proposed amended claim limitations have been fully considered, but are not persuasive.

Applicant's arguments fail to comply with 37 CFR 1.111(b) because they amount to a general allegation that the claims define a patentable invention without specifically pointing out how the language of the claims patentably distinguishes them from the references. Therefore, applicant’s remark cannot be considered persuasive.




Applicant's arguments do not comply with 37 CFR 1.111(c) because they do not clearly point out the patentable novelty which he or she thinks the claims present in view of the state of the art disclosed by the references cited or the objections made. Further, they do not show how the amendments avoid such references or objections. Therefore, applicant’s remark cannot be considered persuasive.

In response to applicant's argument that the prior art does not teach "determining a viewpoint dependent symmetry for the object based on the mesh using a feature-based sampling technique" as recited in Claim 1, these limitations are taught by the combination of Fu and Jin. In particular, Fu teaches the following:
Paragraph [0074]: discloses a Module GeoReS (Geometry-constrained Reflection Symmetry) that reasons symmetry properties <read on determining symmetry> of objects <read on mesh> for further 3D shape understanding, better perspective context decomposing, and dimension information learning;
Paragraph [0076]: discloses the GeoReS module learning a function $h_\gamma(\cdot)$ to enforce that a predicted reflection-paired $h_\gamma(O(R, t_n \mid c)) \equiv O'(R, t_n \mid c)$, which implicitly decouples the $(R; t_n)$ from the perspective-specific observation $O(R, t_n \mid c)$ <read on viewpoint dependent symmetry>; and
Paragraph [0086]: discloses an example of a viewpoint dependent symmetry object, such as a mug being treated as a symmetric instance only under specific perspectives, where the handle is occluded.
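For illustration only, the mug example cited from Fu can be pictured with a toy point set. The sketch below is not Fu's GeoReS module; the function names, the mirror plane x = 0, and the point coordinates are all hypothetical assumptions chosen to show how a reflection symmetry can hold for the visible points from one viewpoint and fail from another.

```python
def reflect_x(p):
    """Mirror a 3D point across the plane x = 0 (assumed symmetry plane)."""
    x, y, z = p
    return (-x, y, z)


def is_reflection_symmetric(points, tol=1e-9):
    """True if every point has a mirror partner across x = 0 in the set."""
    def close(a, b):
        return all(abs(u - v) <= tol for u, v in zip(a, b))
    return all(any(close(reflect_x(p), q) for q in points) for p in points)


# Hypothetical toy "mug": a symmetric body plus one handle point at +x.
body = [(-1.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.5), (0.0, -1.0, 0.5)]
handle = [(1.5, 0.0, 0.2)]

full_view = body + handle   # viewpoint from which the handle is visible
occluded_view = body        # viewpoint from which the handle is occluded

print(is_reflection_symmetric(full_view))      # False
print(is_reflection_symmetric(occluded_view))  # True
```

The same object is therefore treated as symmetric or not depending on which points are observable, which is the sense of "viewpoint dependent" used in the mapping above.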




Additionally, Jin teaches the following:
Paragraph [0048]: discloses identifying corresponding points between a pair of feature curves, where corresponding points on the feature curves are points on the feature curves that represent the actual same point on the object depicted on the images; and
Paragraph [0048]: further discloses transforming each feature curve into a 2D Bezier curve, where a 3D constraint generator 110 samples <read on feature-based sampling technique> each of the 2D Bezier curves at fixed intervals; the "feature-based sampling technique" is being interpreted as a neural network that selects representative data subsets by leveraging specific input features instead of random selection.
Therefore, applicant’s remark cannot be considered persuasive.

Regarding the arguments as to Claims 2-7 and 9-21, these claims depend directly or indirectly on independent Claims 1 and 22-23, respectively. Applicant does not argue anything other than independent Claims 1 and 22-23. The limitations in those claims, in conjunction with the combination, were previously established as explained.











Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.


Claims 1-2, 4, 7, 9-13, and 22-23 are rejected under 35 U.S.C. 103 as being unpatentable over Fu et al. (US 20220292698 A1, previously cited), hereinafter referenced as Fu, in view of Jin et al. (US 20130127847 A1, previously cited), hereinafter referenced as Jin.

Regarding Claim 1, Fu discloses a computer-implemented method for generating a plurality of training examples used for training a machine learning model (Fu, [0040]: teaches "a method <read on computer-implemented method> for category-level 6D pose and size estimation, including a 3D-OCR step for 3D Orientation-Consistent Representation, a GeoReS step for Geometry-constrained Reflection Symmetry, and a MPDE step for Mirror-Paired Dimensional Estimation"; [0042]: teaches "in the GeoReS step, an original input depth observation including pre-processed predicted category labels and potential masks of the target instances is received as input <read on generated training examples>," which is used for training a neural network <read on machine learning model>), comprising:
receiving data representing a three-dimensional model of an object with a physical pose (Fu, [0091]: teaches a neural network being trained on synthetic 3D models <read on receiving data>; [0061]: teaches the neural network estimating the 6D object pose and size of a set of unseen instances with known categories, presented by a partial point cloud; Note: paragraph [0122] of the specification defines a "physical pose" as "an object pose or actual pose");
[[modifying one or more features of the three-dimensional model to generate a mesh representing the object;]]
determining a viewpoint dependent symmetry for the object based on the mesh [[using a feature-based sampling technique]] (Fu, [0074]: teaches a Module GeoReS (Geometry-constrained Reflection Symmetry) that reasons symmetry properties <read on determining symmetry> of objects <read on mesh> for further 3D shape understanding, better perspective context decomposing, and dimension information learning; [0076]: teaches the GeoReS module learning a function $h_\gamma(\cdot)$ to enforce that a predicted reflection-paired $h_\gamma(O(R, t_n \mid c)) \equiv O'(R, t_n \mid c)$, which implicitly decouples the $(R; t_n)$ from the perspective-specific observation $O(R, t_n \mid c)$ <read on viewpoint dependent symmetry>; [0086]: teaches an example of a viewpoint dependent symmetry object, such as a mug being treated as a symmetric instance only under specific perspectives, where the handle is occluded), wherein
the viewpoint dependent symmetry includes a global symmetry and/or a partial symmetry of the object (Fu, [0074]: teaches reflection symmetry <read on global symmetry>; [0075]: teaches rotational symmetry <read on partial symmetry>); and
generating output data comprising the mesh and symmetry data based on the determined viewpoint dependent symmetry (Fu, [0077]: teaches a mirror-paired dimensional estimation (MPDE), which is built upon the GeoReS module, where it obtains a relative complete object shape <read on mesh> beneficial to center localization and size regression, by grouping the output <read on output data> of the GeoReS branch and the input; [0077]: further teaches combining the observation and ground truth symmetric points <read on symmetry data> during training, namely the grouped points; [0076]: teaches the GeoReS module learning a function $h_\gamma(\cdot)$ to enforce that a predicted reflection-paired $h_\gamma(O(R, t_n \mid c)) \equiv O'(R, t_n \mid c)$, which implicitly decouples the $(R; t_n)$ from the perspective-specific observation $O(R, t_n \mid c)$ <read on viewpoint dependent symmetry>); and
providing the output data to a machine learning model as input for predicting the physical pose of a training image of the object at a particular viewpoint during training (Fu, [0082]: teaches the GeoReS module <read on machine learning model> learning "to predict potential mirror-paired points, to reason the pointwise reflective points based on the input <read on provided output data> observation points <read on particular viewpoint>," where this is performed during symmetry prediction loss <read on during training>; [0083]: teaches the input observation points being the observable surface points, which is being interpreted as points that are visible from a certain or particular point of view; [0087]: teaches training the neural network with synthetic depth images <read on training image>; [0093]: teaches the estimated 6D pose and size being visualized as tight oriented bounding boxes around the target instances, which are the predicted poses <read on predicted physical pose of object> as shown in the top row of FIG. 4).
[Image: media_image1.png]


However, Fu does not expressly disclose
modifying one or more features of the three-dimensional model to generate a mesh representing the object; and
determining a viewpoint dependent symmetry for the object based on the mesh using a feature-based sampling technique.

Jin discloses
modifying one or more features of the three-dimensional model to generate a mesh representing the object (Jin, [0038]: teaches generating an image-based 3D model from single-view feature curves identified in digital images of an object and from multi-view feature curves identified in digital images of an object; [0030]: teaches a 3D model generator 100 that identifies feature curves that indicate shape (i.e., curves and/or edges) of an object depicted in the digital images, where the feature curves are selectable and modifiable); and
determining a viewpoint dependent symmetry for the object based on the mesh using a feature-based sampling technique (Jin, [0048]: teaches identifying corresponding points between a pair of feature curves, where corresponding points on the feature curves are points on the feature curves that represent the actual same point on the object depicted on the images; [0048]: further teaches transforming each feature curve into a 2D Bezier curve, where a 3D constraint generator 110 samples <read on feature-based sampling technique> each of the 2D Bezier curves at fixed intervals; Note: the "feature-based sampling technique" is being interpreted as a neural network that selects representative data subsets by leveraging specific input features instead of random selection).
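For illustration only, the fixed-interval sampling of a 2D Bezier curve cited from Jin can be sketched as follows. This is a minimal sketch and not Jin's 3D constraint generator 110: it assumes that "fixed intervals" means evenly spaced values of the curve parameter t (Jin may instead mean arc length), and the function names and control points are hypothetical.

```python
def bezier_point(ctrl, t):
    """Evaluate a 2D Bezier curve at t in [0, 1] with de Casteljau's algorithm."""
    pts = [tuple(p) for p in ctrl]
    while len(pts) > 1:
        # Linearly interpolate between each pair of neighbouring points.
        pts = [((1 - t) * x0 + t * x1, (1 - t) * y0 + t * y1)
               for (x0, y0), (x1, y1) in zip(pts, pts[1:])]
    return pts[0]


def sample_bezier(ctrl, n_samples):
    """Sample the curve at n_samples evenly spaced parameter values."""
    if n_samples < 2:
        raise ValueError("n_samples must be at least 2")
    return [bezier_point(ctrl, i / (n_samples - 1)) for i in range(n_samples)]


# Hypothetical cubic control polygon for one feature curve.
ctrl = [(0.0, 0.0), (1.0, 2.0), (3.0, 2.0), (4.0, 0.0)]
samples = sample_bezier(ctrl, 5)
print(samples[0], samples[2], samples[-1])
```

The first and last samples coincide with the end control points, and the middle sample is the curve point at t = 0.5.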

Jin is analogous art with respect to Fu because they are from the same field of endeavor, namely generating a 3D model of an object based on input images. Before the effective filing date of the claimed invention, it would have been obvious to a person of ordinary skill in the art to incorporate a 3D model generator that identifies feature curves of an object using a 3D constraint generator, as taught by Jin, into the teaching of Fu. The suggestion for doing so is that it would allow the system to automatically detect curves and/or edges of said object, which can then be modified to determine global or partial symmetry, thereby improving the training process. Therefore, it would have been obvious to combine Jin with Fu.

Regarding Claim 22, it recites limitations that are similar in scope to those of Claim 1, but in a system. As shown in the rejection, the combination of Fu and Jin discloses the limitations of Claim 1. Additionally, Fu discloses a system (Fu, [0057]: teaches a digital computer <read on system>) comprising
one or more computers (Fu, [0057]: teaches the processors <read on computers> of the digital computer) and
one or more storage devices storing instructions that are operable, when executed by the one or more computers, to cause the one or more computers to perform operations (Fu, [0057]: teaches processors receiving instructions and data from read-only memory/RAM, where the processor performs instructions <read on operations> from one or more memory devices <read on storage devices>) comprising…

Thus, Claim 22 is met by Fu according to the mapping presented in the rejection of Claim 1, given the computer-implemented method corresponds to a system.

Regarding Claim 23, it recites limitations that are similar in scope to those of Claim 1, but in one or more non-transitory computer storage media. As shown in the rejection, the combination of Fu and Jin discloses the limitations of Claim 1. Additionally, Fu discloses one or more non-transitory computer storage media storing instructions that when executed by one or more computers cause the one or more computers to perform operations (Fu, [0057]: teaches processors <read on computers> receiving instructions and data from read-only memory/RAM <read on non-transitory computer storage media>, where the processor performs instructions <read on operations> from one or more memory devices <read on storage devices>) comprising:…

Thus, Claim 23 is met by Fu according to the mapping presented in the rejection of Claim 1, given the computer-implemented method corresponds to one or more non-transitory computer storage media.

Regarding Claim 2, the combination of Fu and Jin discloses the method of Claim 1. Additionally, Fu further discloses wherein the three-dimensional model of the object is
a computer-aided design (CAD) model or generated based on a plurality of images of the object taken from different views (Fu, [0087]: teaches rendering 60 different views for each instance to generate a 3D model reconstruction).





Regarding Claim 4, the combination of Fu and Jin discloses the method of Claim 1. Additionally, Fu further discloses wherein determining the viewpoint dependent symmetry for the object based on the mesh comprises:
[[determining multiple feature lines from the mesh;]]
[[determining a set of pairs of feature lines from the multiple feature lines, wherein]]
[[each pair of feature lines include two adjacent feature lines that are not parallel to each other so that the pair of feature lines uniquely define a coordinate frame;]]
[[repeatedly sampling two pairs of feature lines from the set of pairs of feature lines, and]]
for each two pairs, determining a candidate transformation from a first coordinate frame defined by a first pair of the two pairs to a second coordinate frame defined by a second pair of the two pairs (Fu, [0081]: teaches a set of candidate symmetry transformations; [0082]: teaches the GeoReS module learning to predict potential mirror-paired points <read on second pair>; [0093]: teaches "with the predicted poses and sizes of the target instances, the bounding boxes are transformed into the camera coordinate <read on candidate transformation from first coordinate frame to second coordinate frame> and then projected onto a 2D image frame with given camera intrinsic");
for each candidate transformation of the candidate transformations, determining a respective overlap measure for the candidate transformation (Fu, [0081]: teaches a set of candidate symmetry transformations; [0086]: teaches computing the average precision of 3D Intersection-Over-Union (IoU) <read on overlap measures>, which is used for evaluating 6D pose recovery); and
determining the symmetry with respect to a rotational axis based on the respective overlap measures (Fu, [0075]: teaches determining rotational symmetry by rotating a partial point around its symmetry-axis <read on rotational axis> in the object frame to allow the network to reason occluded parts from the observable one and to obtain a more complete shape for subsequent dimensional estimation; [0086]: teaches computing the average precision of 3D Intersection-Over-Union (IoU) <read on overlap measures>).
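For illustration only, the 3D Intersection-over-Union that the rejection reads on the claimed "overlap measure" can be sketched for the simplest case. The sketch assumes axis-aligned boxes given as (min corner, max corner); Fu's evaluation uses oriented bounding boxes, which this simplified version does not handle. The function name and the example boxes are hypothetical.

```python
def iou_3d(box_a, box_b):
    """IoU of two axis-aligned 3D boxes, each given as (min_corner, max_corner)."""
    (a_min, a_max), (b_min, b_max) = box_a, box_b

    # Intersection volume: product of the per-axis overlaps.
    inter = 1.0
    for lo_a, hi_a, lo_b, hi_b in zip(a_min, a_max, b_min, b_max):
        overlap = min(hi_a, hi_b) - max(lo_a, lo_b)
        if overlap <= 0:
            return 0.0  # boxes are disjoint (or only touch) along this axis
        inter *= overlap

    def volume(lo, hi):
        v = 1.0
        for l, h in zip(lo, hi):
            v *= h - l
        return v

    union = volume(a_min, a_max) + volume(b_min, b_max) - inter
    return inter / union


# Two hypothetical 2 x 2 x 2 boxes offset by 1 along x.
box_a = ((0.0, 0.0, 0.0), (2.0, 2.0, 2.0))
box_b = ((1.0, 0.0, 0.0), (3.0, 2.0, 2.0))
print(iou_3d(box_a, box_b))  # intersection 4, union 12
```

An IoU of 1 indicates identical boxes and 0 indicates no overlap, so a candidate transformation with a higher value maps the object more closely onto itself.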

However, Fu does not expressly disclose
determining multiple feature lines from the mesh;
determining a set of pairs of feature lines from the multiple feature lines, wherein
each pair of feature lines include two adjacent feature lines that are not parallel to each other so that the pair of feature lines uniquely define a coordinate frame;
repeatedly sampling two pairs of feature lines from the set of pairs of feature lines.

Jin discloses
determining multiple feature lines from the mesh (Jin, [0041]: teaches the system "receiving input which identifies a plurality of feature curves <read on determining feature lines> for the object"; Note: paragraph [0071] of the specification defines a "feature line" as a 3D object being used as grading footprints, where a grading footprint outlines the area to be graded, which dictates how the area should be modified);
determining a set of pairs of feature lines from the multiple feature lines (Jin, [0057]: teaches a two-view curve fitting method that is used to generate a 3D shape constraint, which comprises pairs of corresponding feature curves <read on set of pairs of feature lines>), wherein
each pair of feature lines include two adjacent feature lines that are not parallel to each other so that the pair of feature lines uniquely define a coordinate frame (Jin, [0067]: teaches a 3D surface approximator 120 creating a bounding box for each 2D single-view feature curve that has been identified by the user, where "the planes of the 3D frusta <read on adjacent feature lines that are not parallel to each other that uniquely defines coordinate frames> may constrain the 3D space which the visual hull can occupy");
repeatedly sampling two pairs of feature lines from the set of pairs of feature lines (Jin, [0051]: teaches the 3D constraint generator 110 calculating "a cost value which indicates the quality of point correspondence between points on two corresponding feature curves," where it considers "each possible point correspondence (i.e., pair of points) between the feature curve points when calculating the cost values"; [0051]: further teaches "for each possible point correspondence, 3D constraint generator may solve for an optimal 3D point which has a minimal reprojection error with respect to the pair of feature curve sample points <read on repeatedly sampling two pairs of feature lines>").
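For illustration only, the claim language of two non-parallel feature lines defining a coordinate frame, and a candidate transformation between two such frames, can be sketched as follows. This is neither Jin's nor Fu's method: it is a minimal sketch that assumes each feature line is reduced to a 3D direction vector, builds a right-handed orthonormal frame from each pair, and recovers only the rotation between frames (translation between frame origins is omitted). All names and direction vectors are hypothetical.

```python
import math


def cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def normalize(v):
    n = math.sqrt(sum(c * c for c in v))
    if n < 1e-12:
        raise ValueError("zero-length vector")
    return tuple(c / n for c in v)


def frame_from_lines(d1, d2):
    """Right-handed orthonormal axes (x, y, z) from two non-parallel directions."""
    x = normalize(d1)
    z = normalize(cross(d1, d2))  # raises if d1 and d2 are parallel
    y = cross(z, x)
    return (x, y, z)


def candidate_rotation(frame_a, frame_b):
    """Rotation matrix R that maps each axis of frame_a onto the same axis of frame_b."""
    # R is the sum over axes k of the outer product b_k * a_k^T.
    return [[sum(b[i] * a[j] for a, b in zip(frame_a, frame_b))
             for j in range(3)] for i in range(3)]


# Two hypothetical pairs of feature-line directions.
frame_a = frame_from_lines((1.0, 0.0, 0.0), (0.0, 1.0, 0.0))
frame_b = frame_from_lines((0.0, 1.0, 0.0), (-1.0, 0.0, 0.0))
R = candidate_rotation(frame_a, frame_b)  # a 90 degree rotation about z
```

Parallel lines are rejected because their cross product vanishes, which mirrors the claim requirement that the two lines of a pair not be parallel so that the frame is uniquely defined.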

Jin is analogous art with respect to Fu because they are from the same field of endeavor, namely generating a 3D model of an object based on input images. Before the effective filing date of the claimed invention, it would have been obvious to a person of ordinary skill in the art to incorporate a 3D model generator that identifies feature curves of an object using a 3D constraint generator, as taught by Jin, into the teaching of Fu. The suggestion for doing so is that it would allow the system to automatically detect curves and/or edges of said object, which can then be modified to determine global or partial symmetry, thereby improving the training process. Therefore, it would have been obvious to combine Jin with Fu.

Regarding Claim 7, the combination of Fu and Jin discloses the method of Claim 1. Additionally, Fu further discloses wherein
the view-point dependent symmetry is visible from a current viewpoint of the training image of the object (Fu, [0086]: teaches determining certain objects, such as a mug, as symmetric objects under specific perspectives <read on object symmetry being view-point dependent>, where the handle is occluded <read on object being visible from current viewpoint>; Note: although not expressly stated, the object must be visible in order to determine that an object exists in the first place).

Regarding Claim 9, the combination of Fu and Jin discloses the method of Claim 1. Additionally, Fu further discloses
determining a ground-truth label for a pose of the object in the training image (Fu, [0081]: teaches generating a ground truth set <read on determining ground-truth label for pose of object in training image> for symmetric instances), wherein
determining the ground-truth label for a pose of the object in the training image (Fu, [0081]: teaches generating the ground truth set <read on determining ground-truth label for pose of object in training image> for symmetric instances) comprises:
[[generating multiple two-dimensional images for the object based on the mesh data;]]
determining, [[from the multiple two-dimensional images for the object]], multiple symmetry generators for the object (Fu, [0081]: teaches the process of generating a ground truth set for symmetric instances includes generating a set of candidate symmetry transformations <read on symmetry generators for object>; Note: Paragraph [0078] of the specification defines a symmetry generator as a symmetry transformation);
determining, based on the multiple symmetry generators for the object, a representative pose of the object (Fu, [0069]: teaches extracting "the representative orientation <read on representative pose of object> from the observation of a category-known instance and map it onto the correspondent canonical template shape of the category c, inheriting the consistent orientation R"; [0070]: teaches the generated representation implicitly characterizes the predicted orientation to be consistent with input $O(R, t_n \mid c)$);
determining, as a canonical pose of the object, a pose generated from one of the multiple symmetry generators that is mostly aligned with the representative pose (Fu, [0069]: teaches extracting "the representative orientation from the observation and map it onto the correspondent canonical template shape <read on canonical pose of object> of the category c, inheriting the consistent orientation R"; [0070]: teaches the neural network learning a semantic correspondence between the partial observation and the 3D-OCR (3D Orientation-Consistent Representation) to maintain semantic alignment <read on determining pose generated from symmetry generators that are mostly aligned with representative pose>); and
assigning the canonical pose as the ground-truth pose of the object in the training image (Fu, [0082]: teaches the Geo-ReS module learning to predict potential mirror-paired points using the ground truth and predicted reflective points as input points, where it is built on a canonical category-specific template shape <read on the canonical pose being the ground-truth label>; [0069]: teaches a canonical category-specific template shape with category label c being predicted <read on assigning>).
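For illustration only, the claimed selection of a canonical pose "mostly aligned with the representative pose" can be sketched for a simple case. This is not the method of Fu or of the application: it assumes poses are 3x3 rotation matrices, that the symmetry transformations are the rotations of a 4-fold symmetry about the z axis, and that alignment is measured by the geodesic angle between rotations. All names and values are hypothetical.

```python
import math


def rot_z(theta):
    c, s = math.cos(theta), math.sin(theta)
    return [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]


def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]


def transpose(a):
    return [list(row) for row in zip(*a)]


def rotation_angle(a, b):
    """Geodesic angle in radians between rotations a and b."""
    rel = matmul(transpose(a), b)
    trace = rel[0][0] + rel[1][1] + rel[2][2]
    return math.acos(max(-1.0, min(1.0, (trace - 1.0) / 2.0)))


def canonical_pose(pose, symmetries, representative):
    """Among the symmetry-equivalent poses, pick the one closest to representative."""
    candidates = [matmul(pose, g) for g in symmetries]
    return min(candidates, key=lambda c: rotation_angle(representative, c))


# Hypothetical object with 4-fold rotational symmetry about z.
symmetries = [rot_z(k * math.pi / 2) for k in range(4)]
observed = rot_z(math.radians(100.0))
representative = rot_z(0.0)

canon = canonical_pose(observed, symmetries, representative)
print(round(math.degrees(rotation_angle(representative, canon)), 6))  # 10.0
```

Because all four candidates describe the same visual appearance, assigning the one nearest the representative pose as the label gives each training image a single consistent ground-truth pose.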



However, Fu does not expressly disclose
generating multiple two-dimensional images for the object based on the mesh data; and
determining, from the multiple two-dimensional images for the object, multiple symmetry generators for the object.

Jin discloses
generating multiple two-dimensional images for the object based on the mesh data (Jin, [0039]: teaches receiving a plurality of captured images <read on generated 2D images> of the object from different viewpoints of said object, where 3D surface information <read on mesh data> is extracted from the multiple images of the object by the 3D model generator 100 to create a 3D model of the object); and
determining, from the multiple two-dimensional images for the object, multiple symmetry generators for the object (Jin, [0039]: teaches receiving a plurality of captured images <read on generated 2D images> of the object from different viewpoints of said object).

Jin is analogous art with respect to Fu because they are from the same field of endeavor, namely generating a 3D model of an object based on input images. Before the effective filing date of the claimed invention, it would have been obvious to a person of ordinary skill in the art to incorporate a 3D model generator that identifies feature curves of an object using a 3D constraint generator, as taught by Jin, into the teaching of Fu. The suggestion for doing so is that it would allow the system to automatically detect curves and/or edges of said object, which can then be modified to determine global or partial symmetry, thereby improving the training process. Therefore, it would have been obvious to combine Jin with Fu.

Regarding Claim 10, the combination of Fu and Jin discloses the method of Claim 9. Additionally, Fu further discloses wherein the representative pose of the object according to the symmetry is determined based on
an orientation of a center of the object and an orientation of a center of a viewer (Fu, [0083]: teaches the neural network learning the distance-weighted vector pointing to the center from the visible points or translation offset between them <read on orientation of center of object>; [0093]: teaches "with the predicted poses and sizes of the target instances, the bounding boxes are transformed into the camera coordinate and then projected onto a 2D image frame with given camera intrinsic <read on orientation of center of viewer>").
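For illustration only, projecting a point in camera coordinates onto a 2D image frame "with given camera intrinsic", as quoted from Fu [0093], can be sketched with the standard pinhole model. This is a minimal sketch that assumes no lens distortion and no skew; the function name and the intrinsic values (fx, fy, cx, cy) are hypothetical and not taken from Fu.

```python
def project_point(point_cam, fx, fy, cx, cy):
    """Project a 3D point in camera coordinates to pixel coordinates (u, v)."""
    x, y, z = point_cam
    if z <= 0:
        raise ValueError("point must lie in front of the camera (z > 0)")
    u = fx * x / z + cx
    v = fy * y / z + cy
    return (u, v)


# Hypothetical bounding-box corner 2 m in front of the camera.
corner = (0.1, -0.2, 2.0)
u, v = project_point(corner, fx=500.0, fy=500.0, cx=320.0, cy=240.0)
print(u, v)
```

Applying the same projection to all eight corners of a 3D bounding box yields the 2D outline that would be drawn over the image.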

Regarding Claim 11, the combination of Fu and Jin discloses the method of Claim 1. Additionally, Fu further discloses
determining multiple sparse keypoints for the object based on the mesh for the object (Fu, [0069]: teaches the template shape T being sampled by the Farthest Point Sampling (FPS) algorithm for preserving the geometry of the object into a sparse K keypoints representations <read on determining sparse keypoints for object>).
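The Farthest Point Sampling step that the Action attributes to Fu, [0069] can be sketched as follows. This is a minimal illustration of the general FPS technique, not code from Fu or from the application; the function name `farthest_point_sampling`, the `start` parameter, and the toy vertex list are assumptions made for the example.

```python
import math


def farthest_point_sampling(points, k, start=0):
    """Return indices of k points, each chosen as the point farthest
    from all previously chosen points (hypothetical helper)."""
    if not 0 < k <= len(points):
        raise ValueError("k must be between 1 and len(points)")
    chosen = [start]
    # min_dist[i] holds the distance from point i to its nearest chosen keypoint.
    min_dist = [math.dist(p, points[start]) for p in points]
    while len(chosen) < k:
        # Pick the point whose nearest chosen keypoint is farthest away.
        nxt = max(range(len(points)), key=min_dist.__getitem__)
        chosen.append(nxt)
        # Update nearest-keypoint distances with the newly chosen point.
        for i, p in enumerate(points):
            d = math.dist(p, points[nxt])
            if d < min_dist[i]:
                min_dist[i] = d
    return chosen


# Toy mesh vertices (assumed data, for illustration only).
vertices = [(0, 0, 0), (1, 0, 0), (0.1, 0, 0), (0, 2, 0), (0.9, 0.1, 0)]
keypoints = farthest_point_sampling(vertices, 3)
```

Each new keypoint depends on a distance measure to the previously sampled keypoints, which is the relationship recited in Claim 12.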

Regarding Claim 12, the combination of Fu and Jin discloses the method of Claim 11. Additionally, Fu further discloses wherein determining the multiple sparse keypoints for the object comprises:
sampling a point as a keypoint from the mesh based on a distance measure between the point and a previously-sampled keypoint (Fu, [0081]: teaches the neural network learning to reconstruct keypoints <read on sampling point as keypoint>; [0083]: teaches the neural network learning the distance-weighted vector <read on distance measure> pointing to the center from the visible points <read on previously-sampled keypoint> or translation offset between them).

Regarding Claim 13, the combination of Fu and Jin discloses the method of Claim 11. Additionally, Fu further discloses wherein determining the multiple sparse keypoints for the object comprises:
sampling a point as the keypoint from the mesh based on a vector specifying a local geometry for the point (Fu, [0071]: teaches K-dimensional vectors being derived from the last n layers, which then generates a K-dimension output vector, where the neural network generates a per-point embedding <read on sampling point as keypoint>, which is then fed into the decoder, "which aims to perform orientation-guided characterization under perspective context extracted from the partial observations <read on specifying local geometry>"), wherein
the vector is used to determine a level saliency for the point (Fu, [0071]: teaches the neural network using a PointNet-like structure for aggregating low-level and high-level features <read on determining level saliency>, which "conducts the maxpooling of each K-dimensional vector derived from the last n layers and concatenates them as a multiple latent feature"; Note: it should be noted that "level saliency" is being interpreted as a level of interest or types of features).






Claim 3 is rejected under 35 U.S.C. 103 as being unpatentable over Fu et al. (US 20220292698 A1, previously cited), hereinafter referenced as Fu, in view of Jin et al. (US 20130127847 A1, previously cited), hereinafter referenced as Jin as applied to Claim 1 above respectively, and further in view of Arisoy et al. (US 20150367578 A1, previously cited), hereinafter referenced as Arisoy.

Regarding Claim 3, the combination of Fu and Jin discloses the method of Claim 1. The combination of Fu and Jin does not expressly disclose the limitations of Claim 3; however, Arisoy discloses wherein modifying the one or more features to generate the mesh comprises
beveling or smoothing one or more sharp edges of the three-dimensional model (Arisoy, [0043]: teaches the system automatically detecting problematic sharp cusp regions, where the user is able to remove <read on smoothing> the detected sharp cusp regions within the identified regions of interest, thereby preserving a local shape region that only retains intended sharp edges).

Arisoy is analogous art with respect to Fu, in view of Jin because they are from the same field of endeavor, namely processing 3D mesh models. Before the effective filing date of the claimed invention, it would have been obvious to a person of ordinary skill in the art to implement automatic detection of sharp cusps within the identified regions of interest on the 3D model as taught by Arisoy into the teaching of Fu, in view of Jin. The suggestion for doing so would allow the system to remove problematic sharp edges/points while preserving intended sharp edges/points, thereby maintaining a more accurate local shape region that is suitable for neural network training. Therefore, it would have been obvious to combine Arisoy with Fu, in view of Jin.

Claims 5-6 are rejected under 35 U.S.C. 103 as being unpatentable over Fu et al. (US 20220292698 A1, previously cited), hereinafter referenced as Fu, in view of Jin et al. (US 20130127847 A1, previously cited), hereinafter referenced as Jin as applied to Claim 1 above respectively, and further in view of Ilic et al. (US 20200211220 A1, previously cited), hereinafter referenced as Ilic.

Regarding Claim 5, the combination of Fu and Jin discloses the method of Claim 1. Additionally, Fu further discloses wherein determining the viewpoint dependent symmetry for the object based on the mesh comprises:
[[for each sampling of multiple samplings, determining a coordinate frame for multiple locations selected in the sampling;]]
[[generating a set of sampling pairs from the multiple samplings,]]
[[each sampling pair in the set includes a pair of coordinate frames associated with the pair of sampling;]]
[[for each sampling pair of the set of sampling pairs]], determining a pose transformation between the pair of coordinate frames (Fu, [0093]: teaches "with the predicted poses and sizes of the target instances <read on determining pose transformation>, the bounding boxes are transformed into the camera coordinate and then projected onto a 2D image frame with given camera intrinsic <read on pair of coordinate frames>"); and
clustering the pose transformations to determine the symmetry for the object with respect to a rotational axis (Fu, [0075]: teaches rotational symmetric instances/objects being symmetry-axis constrained, where a partial point is rotated 180 degrees around its symmetry-axis <read on determine object symmetry with respect to rotational axis> in the object frame to generate paired points to obtain a more complete shape for subsequent dimensional estimation <read on clustering pose transformations>).
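The 180 degree pairing that the Action cites from Fu, [0075] can be illustrated with a short sketch. It assumes the symmetry axis passes through the origin of the object frame and uses the identity that a half-turn about a unit axis a maps a point p to 2(a·p)a − p; the function name `rotate_180_about_axis` and the sample values are hypothetical and do not come from Fu or the application.

```python
import math


def rotate_180_about_axis(point, axis):
    """Rotate a 3D point by 180 degrees about an axis through the origin
    (hypothetical helper; the axis need not be normalized)."""
    norm = math.sqrt(sum(a * a for a in axis))
    if norm == 0:
        raise ValueError("axis must be non-zero")
    unit = [a / norm for a in axis]
    # Scalar projection of the point onto the axis.
    proj = sum(u * p for u, p in zip(unit, point))
    # Half-turn: keep the axial component, negate the perpendicular one.
    return tuple(2 * proj * u - p for u, p in zip(unit, point))


# Pair an observed point with its counterpart across a vertical symmetry axis.
observed = (1.0, 2.0, 3.0)
paired = rotate_180_about_axis(observed, (0.0, 0.0, 1.0))
```

Applying the function twice returns the original point, as expected for a half-turn.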

However, the combination of Fu and Jin does not expressly disclose
for each sampling of multiple samplings, determining a coordinate frame for multiple locations selected in the sampling;
generating a set of sampling pairs from the multiple samplings,
each sampling pair in the set includes a pair of coordinate frames associated with the pair of sampling; and
for each sampling pair of the set of sampling pairs, determining a pose transformation between the pair of coordinate frames.

Ilic discloses
for each sampling of multiple samplings, determining a coordinate frame for multiple locations selected in the sampling (Ilic, [0096]: teaches $L_{pairs}$ being a pairwise term, where "it is defined over a set $P$ of sample pairs $(s_i; s_j)$" such that "samples within an individual pair come from the same object 10, with either a very similar orientation or the same orientation with different image recording conditions <read on determine coordinate frame for multiple locations selected in sampling>"; Note: it should be noted that "coordinate frame" is being interpreted as coordinates);
generating a set of sampling pairs from the multiple samplings (Ilic, [0096]: teaches $L_{pairs}$ being a pairwise term, where "it is defined over a set $P$ of sample pairs $(s_i; s_j)$" such that "samples within an individual pair come from the same object 10 <read on generating set of sampling pairs>, with either a very similar orientation or the same orientation with different image recording conditions"),
each sampling pair in the set includes a pair of coordinate frames associated with the pair of sampling (Ilic, [0096]: teaches $L_{pairs}$ being a pairwise term, where "it is defined over a set $P$ of sample pairs $(s_i; s_j)$ <read on sampling pair including pair of coordinate frames>" such that "samples within an individual pair come from the same object 10, with either a very similar orientation or the same orientation with different image recording conditions"); and
for each sampling pair of the set of sampling pairs, determining a pose transformation between the pair of coordinate frames (Ilic, [0096]: teaches a set of sample pairs <read on sampling pair from set of sampling pairs>).

Ilic is analogous art with respect to Fu, in view of Jin because they are from the same field of endeavor, namely identifying object instances in one or more images. Before the effective filing date of the claimed invention, it would have been obvious to a person of ordinary skill in the art to have the neural network track keypoints across multiple images that correspond to a similar/same location on the detected object using pairwise sampling points as taught by Ilic into the teaching of Fu, in view of Jin. The suggestion for doing so would allow the neural network to better understand the dimensionality of the detected object through inference, thereby yielding predictable results. Therefore, it would have been obvious to combine Ilic with Fu, in view of Jin.


Regarding Claim 6, the combination of Fu, Jin, and Ilic discloses the method of Claim 5. The combination of Fu and Jin does not expressly disclose the limitations of Claim 6; however, Ilic discloses wherein
each sampling is determined based on a surface descriptor (Ilic, [0076]: teaches improving the process of determining rotation information by first "introducing a dynamic margin into the loss function, so that more rapid training and shorter descriptors are made possible, and then by producing a rotational invariance by learning rotations in the plane, including surface normals <read on surface descriptor> as a strong and complementary modality for RGB-D data"; [0075]: teaches the dynamic margin being an equation, "where q represents the orientation of the respective sample <read on sampling> as a quaternion, c denoting the object identity"), wherein
each sampling pair of the set of sampling pairs is determined based on a level of compatibility between the corresponding surface descriptors (Ilic, [0077]: teaches introducing a dynamic margin into the manifold learning triplet loss function <read on sampling pair>, where "such a loss function may be configured to map images of different objects and their orientation into a descriptor space with lower dimension, it being possible to apply efficient nearest-neighbor search methods to the descriptor space <read on level of compatibility between surface descriptors>"; [0092]: teaches an individual triplet 38 comprising a pair of similar samples $s_i, s_j$ and a pair of dissimilar samples $s_i, s_k$).



Ilic is analogous art with respect to Fu, in view of Jin because they are from the same field of endeavor, namely identifying object instances in one or more images. Before the effective filing date of the claimed invention, it would have been obvious to a person of ordinary skill in the art to have the neural network track keypoints across multiple images that correspond to a similar/same location on the detected object using pairwise sampling points as taught by Ilic into the teaching of Fu, in view of Jin. The suggestion for doing so would allow the neural network to better understand the dimensionality of the detected object through inference, thereby yielding predictable results. Therefore, it would have been obvious to combine Ilic with Fu, in view of Jin.

Claims 14, 17, and 21 are rejected under 35 U.S.C. 103 as being unpatentable over Fu et al. (US 20220292698 A1, previously cited), hereinafter referenced as Fu, in view of Jin et al. (US 20130127847 A1, previously cited), hereinafter referenced as Jin as applied to Claim 1 above respectively, and further in view of Sud et al. (US 20230040793 A1), hereinafter referenced as Sud.

Regarding Claim 14, the combination of Fu and Jin discloses the method of Claim 1. Additionally, Fu further discloses
training the machine learning model using the plurality of training examples (Fu, [0085]: teaches the neural network being trained with synthetic data using the NOCS dataset <read on training examples>), wherein
training the machine learning model comprises: receiving an input including an image of an object and data representing a ground-truth pose of the object in accordance with the symmetry data (Fu, [0088]: teaches taking RGB-D images as input for a neural network; [0077]: teaches a Mirror-Paired Dimensional Estimation (MPDE) obtaining the observation $O(R, t_n \mid c)$ from a relative complete object shape that is beneficial to center localization and size regression <read on image of object> and ground truth symmetric points $O'(R, t_n \mid c)$ <read on ground-truth pose of object>, which are combined together during training);
[[processing the image using a keypoints machine learning model to predict output for the image;]]
[[based on the output generated by the keypoints machine learning model, computing correspondence information between pixels in the image and locations on a three-dimensional model of the object;]]
[[providing the correspondence information to a pose estimation module to generate an initial pose for the object;]]
[[providing the initial pose generated for the object to a refinement machine learning model to generate a refined pose of the object based on the correspondence; and]]
[[computing a measure of loss based on comparing the refined pose of the object to the ground-truth pose of the object in accordance with the symmetry data.]]

However, the combination of Fu and Jin does not expressly disclose
processing the image using a keypoints machine learning model to predict output for the image;
based on the output generated by the keypoints machine learning model, computing correspondence information between pixels in the image and locations on a three-dimensional model of the object;
providing the correspondence information to a pose estimation module to generate an initial pose for the object;
providing the initial pose generated for the object to a refinement machine learning model to generate a refined pose of the object based on the correspondence; and
computing a measure of loss based on comparing the refined pose of the object to the ground-truth pose of the object in accordance with the symmetry data.

Sud discloses
processing the image using a keypoints machine learning model to predict output for the image (Sud, [0027]: teaches supplying images for processing to an expert model, such as a keypoint detection model <read on keypoints machine learning model> that forms a probability heatmap for each joint <read on predicted output of image>);
based on the output generated by the keypoints machine learning model, computing correspondence information between pixels in the image and locations on a three-dimensional model of the object (Sud, [0089]: teaches single-view pose estimation that produces per-camera rough 3D pose estimates $Q = \{q_{c,j}\}$ <read on 3D model of object> given an image 3c from that camera, where "these single-image estimates $q_{c,j}$ <read on output of keypoints machine learning model> are assumed to be in the camera frame, meaning that first two spatial coordinates of $q_{c,1}$ <read on computed correspondence information> correspond to pixel coordinates of joint $j$ <read on location on 3D model> on image 3c, and the third coordinate corresponds to its single-image relative zero-mean depth estimate");
providing the correspondence information to a pose estimation module to generate an initial pose for the object (Sud, [0092]: teaches acquiring an initial guess for the 3D pose <read on generated initial pose> and cameras using single-view rough camera-frame 3D pose estimates Q <read on correspondence information> using an external single-view weakly-supervised 3D pose estimation network <read on pose estimation module> as shown in FIG. 5);
[Image: media_image2.png]

providing the initial pose generated for the object to a refinement machine learning model to generate a refined pose of the object based on the correspondence (Sud, [0094]: teaches training a neural optimizer <read on refinement machine learning model> to predict iterative refinement <read on generated refined pose> that minimizes the reprojection error with the ground truth re-projection, using the current guess and joint heatmaps as an input as shown in FIG. 6); and
[Image: media_image3.png]


computing a measure of loss based on comparing the refined pose of the object to the ground-truth pose of the object in accordance with the symmetry data (Sud, [0121]: teaches training computing system 150 including a model trainer 160 that trains the machine-learned models 120 using backpropagated loss functions <read on computed measure of loss> to update the parameters of the models; [0030]: teaches "a loss (e.g., a “ground truth loss” or “reprojection loss”) can compare the predicted output (e.g., a projection of the predicted output <read on refined pose of object>) with the ground truth output <read on ground-truth pose of object>," where "the loss function can be backpropagated through the meta-optimization neural network to modify (e.g., iteratively optimize) the parameters of the meta-optimization neural network"; [0104]: teaches the predicted updates respect symmetry <read on symmetry data>).

Sud is analogous art with respect to Fu, in view of Jin because they are from the same field of endeavor, namely training neural networks on accurate pose detection. Before the effective filing date of the claimed invention, it would have been obvious to a person of ordinary skill in the art to implement a model trainer to train a plurality of internal neural networks in a system as taught by Sud into the teaching of Fu, in view of Jin. The suggestion for doing so would allow for concurrent training sessions for all neural network models, thereby yielding improved and predictable results. Therefore, it would have been obvious to combine Sud with Fu, in view of Jin.




Regarding Claim 17, the combination of Fu, Jin, and Sud discloses the method of Claim 14. Additionally, Fu further discloses wherein the machine learning model comprises
a single stage neural network (Fu, [0070]: teaches a 3D-OCR (3D Orientation-Consistent Representation) Module <read on single stage neural network>), wherein
the single stage neural network is configured to receive as input the image and one or more sparse keypoints determined for the object in the image (Fu, [0070]: teaches the 3D-OCR Module <read on single stage neural network> learning to reconstruct the 3D-OCR, where the generated representation implicitly characterizes the predicted orientation to be consistent with the input, which uses keypoints <read on sparse keypoints>), and
is configured to generate output that, for the input image, includes at least a predicted class for each input bounding box (Fu, [0071]: teaches the neural network using a PointNet-like structure, where "the normalized partial observation $O(R, t_n \mid c)$ and the category-specific <read on predicted class for each input bounding box> canonical template representation $T_K(R_0, t_0 \mid c)$ are fed into the network and the encoders aim to extract feature FO and FT, respectively"),
a respective score for each of the predicted classes (Fu, [0092]: teaches calculating the average precision (AP) results in multiple difficult categories <read on respective score for each predicted classes>), and
one or more regressed keypoints associated with the locations on the three-dimensional model of the object (Fu, [0080]: teaches the system obtaining group mirror-paired points $G$, where the network directly regresses the uniform scale $s_n$ <read on regressed keypoints>).

Regarding Claim 21, the combination of Fu and Jin discloses the method of Claim 1. The combination of Fu and Jin does not expressly disclose the limitations of Claim 21; however, Sud discloses wherein
the symmetry data adjusts a measure of loss between a predicted pose and a target pose using a canonical pose that is equivalent to the target pose based on the viewpoint dependent symmetry for the particular viewpoint (Sud, [0121]: teaches training computing system 150 including a model trainer 160 that trains the machine-learned models 120 using backpropagated loss functions <read on measure of loss> to update the parameters of the models; [0030]: teaches "a loss (e.g., a “ground truth loss” or “reprojection loss”) can compare the predicted output (e.g., a projection of the predicted output <read on predicted pose>) with the ground truth output <read on target pose>," where "the loss function can be backpropagated through the meta-optimization neural network to modify (e.g., iteratively optimize) the parameters of the meta-optimization neural network"; [0106]: teaches concatenating view-invariant inputs $J_i$ and $L$ to each row of view-dependent inputs $C_i, G, K_i$ <read on viewpoint dependent symmetry>, which are then passed through a permutation-equivalent MLP with aggregation layers, and then apply a mean aggregation and a non-permutation-equivalent MLP to get the final pose update <read on canonical pose>).





Sud is analogous art with respect to Fu, in view of Jin because they are from the same field of endeavor, namely training neural networks on accurate pose detection. Before the effective filing date of the claimed invention, it would have been obvious to a person of ordinary skill in the art to implement a model trainer to train a plurality of internal neural networks in a system as taught by Sud into the teaching of Fu, in view of Jin. The suggestion for doing so would allow for concurrent training sessions for all neural network models, thereby yielding improved and predictable results. Therefore, it would have been obvious to combine Sud with Fu, in view of Jin.
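
For illustration only, the Claim 21 limitation of measuring loss against a canonical pose that is equivalent to the target pose under a symmetry can be sketched as follows. This is a minimal Python sketch and is not taken from Sud, Fu, Jin, or the application: the function names (`rot_z`, `geodesic_angle`, `symmetry_aware_loss`), the choice of a geodesic rotation angle as the loss, and the two-fold symmetry about the z-axis are all assumptions made for the example.

```python
import math


def rot_z(theta):
    """3x3 rotation matrix about the z-axis (hypothetical helper)."""
    c, s = math.cos(theta), math.sin(theta)
    return [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]


def matmul(a, b):
    """Product of two 3x3 matrices stored as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]


def geodesic_angle(r_pred, r_target):
    """Rotation angle (radians) between two rotation matrices."""
    # trace(r_pred^T r_target) equals the element-wise inner product.
    trace = sum(r_pred[i][j] * r_target[i][j]
                for i in range(3) for j in range(3))
    cos_angle = max(-1.0, min(1.0, (trace - 1.0) / 2.0))
    return math.acos(cos_angle)


def symmetry_aware_loss(r_pred, r_target, symmetries):
    """Loss against the equivalent target pose closest to the prediction.

    Each matrix in `symmetries` is assumed to leave the object's appearance
    unchanged, so r_target @ s is treated as equivalent to r_target.
    """
    candidates = [matmul(r_target, s) for s in symmetries]
    canonical = min(candidates, key=lambda r: geodesic_angle(r_pred, r))
    return geodesic_angle(r_pred, canonical), canonical


# Assumed example: an object that looks identical after a 180-degree turn.
two_fold = [rot_z(0.0), rot_z(math.pi)]
target = rot_z(0.1)
prediction = rot_z(math.pi + 0.1)
loss, _ = symmetry_aware_loss(prediction, target, two_fold)
naive = geodesic_angle(prediction, target)
```

In this example the symmetry-aware loss is close to zero because the prediction matches an equivalent pose, while the plain geodesic angle between prediction and target is close to pi.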

Claim 15 is rejected under 35 U.S.C. 103 as being unpatentable over Fu et al. (US 20220292698 A1, previously cited), hereinafter referenced as Fu, in view of Jin et al. (US 20130127847 A1, previously cited), hereinafter referenced as Jin, and further in view of Sud et al. (US 20230040793 A1), hereinafter referenced as Sud as applied to Claim 14 above respectively, and further in view of Rawat et al. (US 20220335274 A1, previously cited), hereinafter referenced as Rawat.

Regarding Claim 15, the combination of Fu, Jin, and Sud discloses the method of Claim 14. Additionally, Fu further discloses wherein the keypoints machine learning model comprises
[[a two-stage neural network, comprising:]]
[[a first neural network in a first stage, and]]
[[a second neural network in a second stage after the first stage, wherein]]
[[the first neural network is configured to receive as input]]
the image and one or more sparse keypoints determined for the object in the image (Fu, [0069]: teaches the template shape T being sampled by the Farthest Point Sampling (FPS) algorithm for preserving the geometry of the object into a sparse K keypoints representations <read on determining sparse keypoints for object in image>), and
generate, for the image, an output including one or more candidate bounding boxes each with a predicted score (Fu, [0093]: teaches visualizing "the estimated 6D pose and size as the tight oriented bounding box <read on generate candidate bounding boxes> around the target instances," which is then projected onto a 2D image frame with the given camera intrinsics; [0092]: teaches the average precision (AP) results <read on predicted score>); wherein
the [[second]] neural network is configured to receive as input at least a portion of the one or more candidate bounding boxes (Fu, [0093]: teaches visualizing "the estimated 6D pose and size as the tight oriented bounding box <read on portion of candidate bounding boxes> around the target instances," which is then projected onto a 2D image frame with the given camera intrinsics using the depth-based estimator <read on received input> of their system <read on neural network>; Note: the evaluation metrics include objects with partially occluded portions), and
generate output that, [[for the input to the second neural network]], includes a predicted class for each input bounding box (Fu, [0071]: teaches the neural network using a PointNet-like structure, where "the normalized partial observation $O(R, t_n \mid c)$ and the category-specific <read on predicted class for each input bounding box> canonical template representation $T_K(R_0, t_0 \mid c)$ are fed into the network and the encoders aim to extract feature FO and FT, respectively"),
a respective score for each of the predicted classes (Fu, [0092]: teaches calculating the average precision (AP) results in multiple difficult categories <read on respective score for each predicted classes>), and
one or more regressed keypoints associated with the locations on the three-dimensional model of the object (Fu, [0080]: teaches the system obtaining group mirror-paired points $G$, where the network directly regresses the uniform scale $s_n$ <read on regressed keypoints>).

However, the combination of Fu, Jin, and Sud does not expressly disclose
a two-stage neural network, comprising:
a first neural network in a first stage, and
a second neural network in a second stage after the first stage, wherein
the first neural network is configured to receive as input…
the second neural network is configured to receive as input at least a portion of the one or more candidate bounding boxes, and
generate output that, for the input to the second neural network, includes a predicted class for each input bounding box.

Rawat discloses
a two-stage neural network (Rawat, FIG. 1 teaches a two-stage neural network inference system 150), comprising:
[Figure: Rawat, FIG. 1 (media_image4.png, greyscale)]


a first neural network in a first stage (Rawat, FIG. 1 teaches the two-stage neural network including a first neural network 110 <read on first stage>), and
a second neural network in a second stage after the first stage (Rawat, FIG. 1 teaches the two-stage neural network including a second neural network 120 <read on second stage>, which comes after the first neural network 110), wherein
the first neural network is configured to receive as input (Rawat, [0039]: teaches the system 150 processing the new network input 104 using the first neural network 110)…
the second neural network is configured to receive as input at least a portion of the one or more candidate bounding boxes (Rawat, [0036]: teaches the system 150 providing a new network input 152 as input to the second neural network 120, which happens after the first neural network), and
generate output that, for the input to the second neural network, includes a predicted class for each input bounding box (Rawat, FIG. 1 teaches a second neural network 120).

Rawat is analogous art with respect to the combination of Fu, Jin, and Sud because they are from the same field of endeavor, namely utilizing neural networks for data processing. Before the effective filing date of the claimed invention, it would have been obvious to a person of ordinary skill in the art to incorporate two neural networks that can train each other, as taught by Rawat, into the combined teaching of Fu, Jin, and Sud. The suggestion for doing so would allow the system to classify new input data through distillation, thereby allowing the neural networks to maintain high quality predictions, including for seemingly symmetrical objects. Therefore, it would have been obvious to combine Rawat with the combination of Fu, Jin, and Sud.
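
For illustration only, the Farthest Point Sampling (FPS) step cited from Fu, [0069] in the Claim 15 mapping above (reducing a template shape to K sparse keypoints) can be sketched in Python as follows. This is the generic greedy form of the algorithm, not code from Fu; the function name `farthest_point_sampling`, the `start` parameter, and the sample points are assumptions made for the example.

```python
import math


def farthest_point_sampling(points, k, start=0):
    """Return indices of k points chosen greedily to be mutually far apart."""
    if not 0 < k <= len(points):
        raise ValueError("k must be between 1 and the number of points")
    chosen = [start]
    # Distance from every point to its nearest already-chosen point.
    min_dist = [math.dist(p, points[start]) for p in points]
    while len(chosen) < k:
        # Pick the point that is farthest from everything chosen so far.
        nxt = max(range(len(points)), key=lambda i: min_dist[i])
        chosen.append(nxt)
        for i, p in enumerate(points):
            min_dist[i] = min(min_dist[i], math.dist(p, points[nxt]))
    return chosen


# Assumed toy point cloud on a line.
cloud = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (10.0, 0.0, 0.0), (0.5, 0.0, 0.0)]
keypoint_indices = farthest_point_sampling(cloud, 3)
```

Because each new keypoint maximizes its distance to the keypoints already chosen, the selected subset tends to cover the extent of the shape with few points.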

Claim 16 is rejected under 35 U.S.C. 103 as being unpatentable over Fu et al. (US 20220292698 A1, previously cited), hereinafter referenced as Fu, in view of Jin et al. (US 20130127847 A1, previously cited), hereinafter referenced as Jin, and further in view of Sud et al. (US 20230040793 A1), hereinafter referenced as Sud, and further in view of Rawat et al. (US 20220335274 A1, previously cited), hereinafter referenced as Rawat as applied to Claim 15 above respectively, and further in view of Chen et al. (US 20160379371 A1, previously cited), hereinafter referenced as Chen.

Regarding Claim 16, the combination of Fu, Jin, Sud, and Rawat discloses the method of Claim 15. The combination of Fu, Jin, Sud, and Rawat does not expressly disclose the limitations of Claim 16; however, Chen discloses wherein
at least the portion of the one or more candidate bounding boxes are sampled from the one or more candidate bounding boxes based on respective objective function scores (Chen, [0040]: teaches detection performance being measured by an F-Score value <read on respective objective function scores>, where "the threshold corresponding to a maximum F-Score can be taken as the optimal threshold of the object bounding box detector"; [0129]: teaches an object bounding box detector and an object contour detector are applied to roughly estimate the locations of the object of a given semantic category, so that the problem of ambiguous sample selection and classification under weakly-supervised learning condition can be avoided).



Chen is analogous art with respect to the combination of Fu, Jin, Sud, and Rawat because they are from the same field of endeavor, namely detecting real-world objects in images/video frames. Before the effective filing date of the claimed invention, it would have been obvious to a person of ordinary skill in the art to implement an F-Score value to monitor confidence values of object bounding boxes from the object bounding box detector as taught by Chen into the combined teaching of Fu, Jin, Sud, and Rawat. The suggestion for doing so would allow the system to estimate the locations of an object of a given semantic category using inference, thereby enabling the system to assign appropriate bounding boxes of partly-occluded objects. Therefore, it would have been obvious to combine Chen with the combination of Fu, Jin, Sud, and Rawat.
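
For illustration only, the threshold selection attributed to Chen, [0040] (taking the threshold at the maximum F-Score as the detector's optimal threshold) can be sketched in Python as follows. This is not code from Chen; the function names (`f_score`, `best_threshold`), the exhaustive search over observed scores, and the sample detections are assumptions made for the example.

```python
def f_score(tp, fp, fn, beta=1.0):
    """F-beta score from true positive, false positive, false negative counts."""
    b2 = beta * beta
    denom = (1 + b2) * tp + b2 * fn + fp
    return (1 + b2) * tp / denom if denom else 0.0


def best_threshold(scores, labels):
    """Return (best F-score, threshold) over all observed score values."""
    best = (0.0, None)
    for t in sorted(set(scores)):
        tp = sum(1 for s, y in zip(scores, labels) if s >= t and y)
        fp = sum(1 for s, y in zip(scores, labels) if s >= t and not y)
        fn = sum(1 for s, y in zip(scores, labels) if s < t and y)
        f = f_score(tp, fp, fn)
        if f > best[0]:
            best = (f, t)
    return best


# Assumed toy detections: confidence score and whether the box is correct.
box_scores = [0.9, 0.8, 0.4, 0.3]
box_is_correct = [True, True, False, False]
best_f, threshold = best_threshold(box_scores, box_is_correct)
kept = [s for s in box_scores if s >= threshold]
```

Here the boxes retained for a later stage are those whose score meets the selected threshold, which is one simple way to sample candidates based on a score.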

Claims 18-19 are rejected under 35 U.S.C. 103 as being unpatentable over Fu et al. (US 20220292698 A1, previously cited), hereinafter referenced as Fu, in view of Jin et al. (US 20130127847 A1, previously cited), hereinafter referenced as Jin, and further in view of Sud et al. (US 20230040793 A1), hereinafter referenced as Sud as applied to Claim 14 above respectively, and further in view of Vajda et al. (US 20190172223 A1, previously cited), hereinafter referenced as Vajda.

Regarding Claim 18, the combination of Fu, Jin, and Sud discloses the method of Claim 14. Additionally, Fu further discloses
[[determining a plurality of image features including gradients and magnitudes in each channel of the image;]]
[[determining a plurality of candidate features for the object based on data representing the pose of the object;]]
[[for each of the plurality of candidate features, selecting, according to one or more criteria, an image feature of the plurality of image features that corresponds to the candidate feature to generate a pair of features including the candidate feature and the selected image feature; and]]
updating the predicted pose of the object based on the pairs of feature (Fu, [0074]: teaches a pair of observation $O(R, t_n \mid c)$ and $O'(R, t_n \mid c)$ <read on pairs of feature>, which are used for point-wise paired prediction <read on update predicted pose of object>).

However, the combination of Fu, Jin, and Sud does not expressly disclose
determining a plurality of image features including gradients and magnitudes in each channel of the image;
determining a plurality of candidate features for the object based on data representing the pose of the object; and
for each of the plurality of candidate features, selecting, according to one or more criteria, an image feature of the plurality of image features that corresponds to the candidate feature to generate a pair of features including the candidate feature and the selected image feature.

Vajda discloses
determining a plurality of image features including gradients and magnitudes in each channel of the image (Vajda, [0112]: teaches a regional feature map <read on image features> of an ROI of an image including additional data, such as it being 3D, having a height size (H), a width size (W), and a color channel size (C); Note: the color channel is interpreted as including gradients and magnitudes, such as transparency/opacity and saturation);
determining a plurality of candidate features for the object based on data representing the pose of the object (Vajda, [0087]: teaches the system generating a plurality of region feature maps <read on candidate features> for the ROIs based on the feature map <read on data representing predicted pose of object>; FIG. 3 teaches step 370, which generates keypoint masks based on the target regional feature maps, where the keypoint's location and the keypoint head may be tasked with predicting K masks); and
[Figure: Vajda, FIG. 3 (media_image5.png, greyscale)]

for each of the plurality of candidate features, selecting, according to one or more criteria, an image feature of the plurality of image features that corresponds to the candidate feature to generate a pair of features including the candidate feature and the selected image feature (Vajda, [0108]: teaches the RPN being trained to process a given input image and generate N candidate ROIs, and the detection head is trained to select M ROIs <read on selecting image feature that corresponds to candidate feature>, where each of the ROIs have 3D feature maps; [0111]: teaches the kernel being applied to two different but neighboring ROI feature maps <read on pair of features> to prevent incorrect sampling; [0088]: teaches the system generating regional feature maps of a predefined dimension <read on criteria> for each ROI).

Vajda is analogous art with respect to the combination of Fu, Jin, and Sud because they are from the same field of endeavor, namely object instance detection from image/video frame inputs. Before the effective filing date of the claimed invention, it would have been obvious to a person of ordinary skill in the art to generate region feature maps for regions of interest (ROI) for different parts of an object as taught by Vajda into the combined teaching of Fu, Jin, and Sud. The suggestion for doing so would allow the system to generate keypoint mask predictions based on the regional feature maps and confidence values, which would lead to more confident object classifications. Therefore, it would have been obvious to combine Vajda with the combination of Fu, Jin, and Sud.

Regarding Claim 19, the combination of Fu, Jin, Sud, and Vajda discloses the method of Claim 18. The combination of Fu, Jin, and Sud does not expressly disclose the limitations of Claim 19; however, Vajda discloses wherein the plurality of candidate features comprise
one or more local maxima model gradients of multiple modalities (Vajda, [0133]: teaches for each joint heat-map, the first few (e.g., one or two) local maxima may be found and used as candidates using mean shift algorithm based on the type of joints <read on modalities>).

Vajda is analogous art with respect to the combination of Fu, Jin, and Sud because they are from the same field of endeavor, namely object instance detection from image/video frame inputs. Before the effective filing date of the claimed invention, it would have been obvious to a person of ordinary skill in the art to generate region feature maps for regions of interest (ROI) for different parts of an object as taught by Vajda into the combined teaching of Fu, Jin, and Sud. The suggestion for doing so would allow the system to generate keypoint mask predictions based on the regional feature maps and confidence values, which would lead to more confident object classifications. Therefore, it would have been obvious to combine Vajda with the combination of Fu, Jin, and Sud.
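
For illustration only, selecting the first few local maxima of a heat-map as candidates, as characterized from Vajda, [0133], can be sketched in Python as follows. The cited passage refers to a mean shift algorithm; this sketch substitutes a plain 8-neighbor comparison, which is a simpler technique and not Vajda's method. The function name `top_local_maxima` and the sample heat-map are assumptions made for the example.

```python
def top_local_maxima(heatmap, k=2):
    """Return (row, col) of the k strongest strict local maxima."""
    rows, cols = len(heatmap), len(heatmap[0])
    peaks = []
    for r in range(rows):
        for c in range(cols):
            value = heatmap[r][c]
            # Values of the up-to-eight neighbours inside the grid.
            neighbours = [
                heatmap[rr][cc]
                for rr in range(max(0, r - 1), min(rows, r + 2))
                for cc in range(max(0, c - 1), min(cols, c + 2))
                if (rr, cc) != (r, c)
            ]
            if all(value > n for n in neighbours):
                peaks.append((value, r, c))
    peaks.sort(reverse=True)  # strongest response first
    return [(r, c) for _, r, c in peaks[:k]]


# Assumed toy heat-map with two peaks.
joint_heatmap = [
    [0.1, 0.2, 0.1, 0.0],
    [0.2, 0.9, 0.2, 0.0],
    [0.1, 0.2, 0.1, 0.6],
    [0.0, 0.0, 0.3, 0.2],
]
candidates = top_local_maxima(joint_heatmap, k=2)
```

The cell with value 0.3 is not returned because its neighbor with value 0.6 is larger, so only the two true peaks become candidates.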

Claim 20 is rejected under 35 U.S.C. 103 as being unpatentable over Fu et al. (US 20220292698 A1, previously cited), hereinafter referenced as Fu, in view of Jin et al. (US 20130127847 A1, previously cited), hereinafter referenced as Jin, and further in view of Sud et al. (US 20230040793 A1), hereinafter referenced as Sud, and further in view of Vajda et al. (US 20190172223 A1, previously cited), hereinafter referenced as Vajda as applied to Claim 18 above respectively, and further in view of Nakamura et al. (US 20080013836 A1, previously cited), hereinafter referenced as Nakamura.

Regarding Claim 20, the combination of Fu, Jin, Sud, and Vajda discloses the method of Claim 18. The combination of Fu, Jin, Sud, and Vajda does not expressly disclose the limitations of Claim 20; however, Nakamura discloses wherein the one or more criteria comprise at least one of:
an image feature being a local maximum in a gradient direction, a threshold difference in a direction between a candidate feature and an image feature, or a threshold difference in magnitude between an image feature and a maximum of the candidate feature along a search line (Nakamura, [0123]: teaches that, among the local points (local maxima and local minima) of images DI1, DI2, DI3 <read on image feature>, … at the respective levels output through the DoG filter, those whose position does not change under resolution changes within a predetermined range are detected as feature points by the feature point extractor 61; [0138]: teaches density gradient information of a feature point neighboring area, where the length and direction of the arrowheads indicate the gradient magnitude and gradient orientation <read on gradient direction> respectively).

Nakamura is analogous art with respect to the combination of Fu, Jin, Sud, and Vajda because they are from the same field of endeavor, namely extracting keypoints of a detected object. Before the effective filing date of the claimed invention, it would have been obvious to a person of ordinary skill in the art to have object feature points include local gradient information of the neighboring area, which includes the gradient magnitude and gradient orientation as taught by Nakamura into the combined teaching of Fu, Jin, Sud, and Vajda. The suggestion for doing so would allow the system to compare these values amongst the input images, thereby enabling the system to better associate identical keypoints from different perspectives, which results in an improved system. Therefore, it would have been obvious to combine Nakamura with the combination of Fu, Jin, Sud, and Vajda.
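
For illustration only, the first Claim 20 criterion (an image feature being a local maximum in a gradient direction), together with the gradient magnitude and orientation discussed for Nakamura, [0138], can be sketched in Python as follows. This is not code from Nakamura or the application; the central-difference gradient, the rounding of the direction to a one-pixel step, the function names (`gradient_at`, `is_local_max_along_gradient`), and the sample image are assumptions made for the example.

```python
import math


def gradient_at(image, r, c):
    """Central-difference gradient magnitude and orientation at (r, c)."""
    gx = (image[r][c + 1] - image[r][c - 1]) / 2.0
    gy = (image[r + 1][c] - image[r - 1][c]) / 2.0
    return math.hypot(gx, gy), math.atan2(gy, gx)


def is_local_max_along_gradient(image, r, c):
    """True if the magnitude at (r, c) is not exceeded one step along or
    against its own gradient direction.

    Assumes (r, c) lies at least two pixels inside the image border.
    """
    mag, theta = gradient_at(image, r, c)
    if mag == 0.0:
        return False
    # Round the direction to the nearest of the eight pixel steps.
    dr, dc = round(math.sin(theta)), round(math.cos(theta))
    ahead, _ = gradient_at(image, r + dr, c + dc)
    behind, _ = gradient_at(image, r - dr, c - dc)
    return mag >= ahead and mag >= behind


# Assumed toy image: a soft vertical edge between columns 2 and 4.
edge_image = [[0.0, 0.0, 0.0, 0.5, 1.0, 1.0, 1.0] for _ in range(5)]
on_edge = is_local_max_along_gradient(edge_image, 2, 3)
off_edge = is_local_max_along_gradient(edge_image, 2, 2)
```

Only the pixel at the center of the edge passes the test, which is the behavior commonly used to thin a gradient response down to candidate edge features.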

Conclusion
The prior art made of record and not relied upon is considered pertinent to applicant's disclosure.
Desappan et al. (US 20160117569 A1) discloses selecting feature points within an image to determine a set of candidate feature points;
Han et al. (US 20190392632 A1) discloses reconstructing a 3D model of an object;
Hu et al. (US 20220300738 A1) discloses detecting and labeling target objects in a 2D image;
Tkach et al. (US 20200349772 A1) discloses receiving images that include color and depth data for determining viewpoints associated with an AR/VR environment display; and
Zheng et al. (US 20230196617 A1) discloses utilizing pre-trained artificial neural networks for human model recovery using body keypoints from images of a person.

Applicant's amendment necessitated the new ground(s) of rejection presented in this Office action. Accordingly, THIS ACTION IS MADE FINAL. See MPEP § 706.07(a). Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a).
A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action. In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any nonprovisional extension fee (37 CFR 1.17(a)) pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action. In no event, however, will the statutory period for reply expire later than SIX MONTHS from the mailing date of this final action.

Any inquiry concerning this communication or earlier communications from the examiner should be directed to KARL TRUONG whose telephone number is (703)756-5915. The examiner can normally be reached 10:30 AM - 7:30 PM.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Kent Chang can be reached at (571) 272-7667. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.

/K.D.T./Examiner, Art Unit 2614
/KENT W CHANG/Supervisory Patent Examiner, Art Unit 2614
Prosecution Timeline

Dec 08, 2023: Application Filed
Sep 18, 2025: Non-Final Rejection mailed (§103)
Dec 15, 2025: Interview Requested
Dec 23, 2025: Applicant Interview (Telephonic)
Jan 28, 2026: Examiner Interview Summary
Mar 18, 2026: Response Filed
Apr 13, 2026: Final Rejection mailed (§103, current)
Precedent Cases

Applications granted by this same examiner with similar technology

18/364,590 (Patent 12633012): OPHTHALMIC INFORMATION PROCESSING METHOD, OPHTHALMIC APPARATUS, AND STORAGE MEDIUM STORING OPHTHALMIC INFORMATION PROCESSING PROGRAM. 2y 9m to grant; granted May 19, 2026.
18/126,424 (Patent 12608881): FORMATION OF BOUNDING VOLUME HIERARCHIES. 3y 0m to grant; granted Apr 21, 2026.
18/324,617 (Patent 12573149): DATA PROCESSING METHOD AND APPARATUS, DEVICE, COMPUTER-READABLE STORAGE MEDIUM, AND COMPUTER PROGRAM PRODUCT. 2y 9m to grant; granted Mar 10, 2026.
18/455,592 (Patent 12561875): ANIMATION FRAME DISPLAY METHOD AND APPARATUS, DEVICE, AND STORAGE MEDIUM. 2y 6m to grant; granted Feb 24, 2026.
18/211,149 (Patent 12494013): AUTODECODING LATENT 3D DIFFUSION MODELS. 2y 5m to grant; granted Dec 09, 2025.
Study what changed to get past this examiner. Based on 5 most recent grants.
Prosecution Projections

Expected OA Rounds: 3-4
Grant Probability: 52%
With Interview (+33.5%): 85%
Median Time to Grant: 2y 7m (~1m remaining)
PTA Risk: Moderate
Based on 33 resolved cases by this examiner. Grant probability derived from career allowance rate.