DETAILED ACTION
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA. A request for continued examination under 37 CFR 1.114, including the fee set forth in 37 CFR 1.17(e), was filed in this application after final rejection. Since this application is eligible for continued examination under 37 CFR 1.114, and the fee set forth in 37 CFR 1.17(e) has been timely paid, the finality of the previous Office action has been withdrawn pursuant to 37 CFR 1.114. The Amendment filed 22 December 2025 has been entered and considered. Claim 1 has been amended. Claims 1-37 are pending, of which claims 21-37 are withdrawn. Claims 13-20 are rejected. Claims 1-12 are allowed.
Response to Amendment
Independent claim 1 and its dependent claims
In view of the amendments to independent claim 1, the previously applied prior art rejections are withdrawn. Claims 1-12 are allowed.
Independent claim 13 and its dependent claims
Independent claim 13
On pages 14-16 of the Amendment, Applicant asserts that the applied art does not teach or suggest the following steps of independent claim 13: “transform the combined feature volume to generate output image data comprising a plurality of image views of the object; select at least one of the plurality of image views of the object based on at least a latent loss computed based on a latent representation of a query image and a latent representation of one or more of the plurality of image views of the object”. In support of this assertion, Applicant provides several arguments, each of which is addressed below since claim 13 remains unamended.
On pages 14-15 of the Amendment, Applicant argues that Sundermeyer does not teach or suggest that the plurality of image views of the object whose latent representations are used to compute the claimed latent loss are the same plurality of image views generated by transforming the combined feature volume.
Initially, the Examiner notes that one cannot show nonobviousness by attacking references individually where the rejections are based on combinations of references. See In re Keller, 642 F.2d 413, 208 USPQ 871 (CCPA 1981); In re Merck & Co., 800 F.2d 1091, 231 USPQ 375 (Fed. Cir. 1986).
The rejection relies on the combination of Sitzmann and Sundermeyer to teach the combination of elements noted by the Applicant here. Specifically, Sitzmann discloses a “DeepVoxels” 3D feature volume which is transformed to generate novel views of an object (Section 3 and Figs. 1-2). Sundermeyer similarly discloses learning a 3D representation of the object in a latent space, and rendering synthetic views of the learned object with varying poses (Abstract, Section 1, and Fig. 4). Sundermeyer additionally discloses that those same rendered synthetic views are represented in latent form in a codebook used in the claimed latent loss querying process (Fig. 6 and Section 3.5). Thus, it is the combination of Sitzmann and Sundermeyer that teaches a plurality of image views of an object that are: 1) generated by transforming the combined feature volume and 2) assigned latent representations used to compute the claimed latent loss. This is clearly articulated in the rejection of claim 13 in the prior action. Thus, Applicant errs in attacking Sundermeyer alone with respect to these claimed elements.
On page 15 of the Amendment, Applicant also asserts that Sundermeyer does not disclose computing a latent representation of the plurality of image views generated by transforming the combined feature volume or selecting from among those generated views. In support of this assertion, Applicant argues that Sundermeyer computes similarity between a test code and codebook codes and returns rotation matrices associated with the nearest codebook entries. The Examiner respectfully submits that the acknowledged teaching of Sundermeyer teaches the very limitation in question.
Importantly, Sundermeyer discloses that the latent codes that are created to form the codebook are created from the generated synthetic image views (Figs. 4 and 6 and Section 3, portions of which are reproduced and annotated below).
For example, Section 3.4 of Sundermeyer discloses that the neural architecture “render[s] 200000 views of each object uniformly at random 3D orientations”, as shown in Fig. 4:
[Annotated reproduction of Sundermeyer Fig. 4]
These same rendered synthetic views are then encoded into latent codes that form the codebook. Indeed, Section 3.5 of Sundermeyer discloses that the “Create a codebook” step 3 is performed based on the “Render[ed] clean, synthetic object views” of Step 1. This can be clearly seen visually when relating Figs. 4 and 6:
[Annotated reproduction of Sundermeyer Figs. 4 and 6]
Here, Sundermeyer clearly discloses computing a latent representation of the generated image views, as claimed, contrary to Applicant’s assertions.
Then, at test time, Sundermeyer discloses generating a latent code z_test representing a query image, “comput[ing] cosine similarity between the test code z_test…and all codes z_i…from the codebook” to select the generated views having “the highest similarities” as estimates of the “object orientation” in the query/test image (Section 3.5). Stated another way, the query object is effectively compared (via latent code comparison) with each of the rendered synthetic views to identify those having a pose that most closely matches the pose of the query object, as shown in Fig. 6:
[Annotated reproduction of Sundermeyer Fig. 6]
Since the “highest similarities…are returned as estimates of the 3D object orientation” of the query object in the query image, Sundermeyer clearly discloses a selection from among the generated views – particularly, a selection of those that most closely match the pose of the query object in the test image (Section 3.5). That is, Sundermeyer clearly discloses selecting from among the generated novel views, as claimed, contrary to Applicant’s assertions.
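For clarity of the record, the codebook-based selection discussed above may be summarized by the following simplified sketch (Python/NumPy). The sketch is the Examiner's own illustration of the cited retrieval process, not code taken from Sundermeyer, and the function and variable names (e.g., build_codebook, select_views, encoder) are hypothetical:

import numpy as np

def build_codebook(encoder, rendered_views, rendered_poses):
    # Encode each rendered synthetic view into a latent code z_i ("Create a codebook").
    codes = np.stack([encoder(view) for view in rendered_views])     # shape (N, D)
    codes = codes / np.linalg.norm(codes, axis=1, keepdims=True)     # unit-normalize each code
    return codes, np.asarray(rendered_poses)

def select_views(encoder, query_image, codes, poses, k=4):
    # Encode the query image, compare its code z_test against every codebook code z_i by
    # cosine similarity, and return the k rendered views (and poses) with the highest similarity.
    z_test = encoder(query_image)
    z_test = z_test / np.linalg.norm(z_test)
    cos_sim = codes @ z_test                       # cosine similarity against all z_i
    top_k = np.argsort(-cos_sim)[:k]               # most similar rendered views
    return top_k, poses[top_k], cos_sim[top_k]

In this sketch, selecting the indices with the highest cosine similarity is equivalent to selecting the generated views with the smallest latent loss relative to the query image.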
Applicant later acknowledges the citation of Sitzmann for generating a 3D volumetric representation and synthesizing novel views, but contends that the rejection does not show that the selection is performed from among the plurality of image views of the object produced from the combined feature volume. The Examiner respectfully disagrees.
As clearly articulated in the rejection, both Sitzmann and Sundermeyer disclose generating novel views based on a 3D latent representation of an object:
Sitzmann Abstract (emphasis added): “We apply our persistent 3D scene representation to the problem of novel view synthesis”.
Sitzmann Fig. 1 caption (emphasis added): “During training, we learn a persistent DeepVoxels representation that encodes the view-dependent appearance of a 3D scene from a dataset of posed multi-view images (top). At test time, DeepVoxels enable novel view synthesis (bottom).”
Sitzmann Section 1 (emphasis added): “The goal of the DeepVoxels approach is to condense posed input images of a scene into a persistent latent representation without explicitly having to model its geometry (see Fig. 1). This representation can then be applied to the task of novel view synthesis to generate unseen perspectives of a 3D scene”.
Sundermeyer Abstract (emphasis added): “Augmented Autoencoder… provides an implicit representation of object orientations defined by samples in a latent space”.
Sundermeyer Section 3.4: “we render 20000 views of each object” using “convolutional Autoencoder architecture”.
Sitzmann additionally teaches that the novel views are rendered based on a combined feature volume, and Sundermeyer additionally teaches that the novel views are used for pose estimation via selection based on latent loss. Thus, in combination, Sitzmann and Sundermeyer teach novel views that are both: 1) rendered based on a combined feature volume, and 2) used for pose estimation via selection based on latent loss, contrary to Applicant’s arguments. The reasons for combining the references are articulated in the rejection.
On page 16 of the Amendment, Applicant argues that the Office’s continued reliance on Afzal to supply depth “up front” indicates that the cited portions do not clearly establish the recited “depth information associated with each of the first image and the second image”. It is unclear precisely what Applicant is arguing here. As best understood, Applicant appears to contend that the limitation in quotations is not taught. The Examiner respectfully disagrees and submits that both Sitzmann and Afzal teach the limitation in quotations.
Section 3.2 of Sitzmann discloses that 2D features of the multiple source views are lifted into respective temporary 3D volumes which are used to generate the 3D DeepVoxels representation. This process is able “to resolve the depth ambiguity” (Section 3.2). That is, Sitzmann discloses the derivation of depth information from the 2D images in service of the 3D reconstruction.
Afzal, like Sitzmann, is directed to a multi-view system for 3D reconstruction (Abstract, Section III). Afzal discloses that visual and depth information acquired by RGB-D cameras are used to estimate poses of cameras in the multi-view system with the “final goal” of getting “a holistic 3D reconstruction” (Section III).
That is, both Sitzmann and Afzal disclose depth information associated with each image, contrary to Applicant’s assertions. The difference between the teachings of the references is that Sitzmann derives depth information for each source image, whereas Afzal’s system receives the depth information directly (without the need for derivation). Since claim 13 requires that the input image data includes depth information (i.e., depth information is received, not derived), Afzal is relied upon for this aspect of the claim. The reasons for combining the references are described in the rejection of the claim.
On page 16 of the Amendment, Applicant argues that the reasons for combining Sitzmann and Sundermeyer do not supply the alleged missing linkage created by claim 13’s recitation that the selection step is performed from the views generated by transforming the combined feature volume. The Examiner respectfully disagrees.
At least for all the reasons discussed above, there is no missing linkage between Sitzmann and Sundermeyer. Furthermore, as clearly articulated in the rejection, it would have been obvious to modify Sitzmann to use the generated novel views for object pose estimation in the manner disclosed by Sundermeyer (i.e., via selection based on latent loss). As detailed above, both references disclose novel view generation. Sitzmann’s novel views are generated by transforming a combined feature volume. Sundermeyer uses the generated novel views for pose estimation via selection based on latent loss. In combination, the references teach generating novel views by transforming a combined feature volume, and using the novel views that were generated by transforming the combined feature volume for pose estimation via selection based on latent loss. The linkage is clear.
On page 16 of the Amendment, Applicant further asserts that the rejection does not explain why a person of ordinary skill in the art would modify the cited systems so that the latent loss computation and selection are performed on the image views generated by transforming the combined feature volume.
The feature that links the two references is the generation of the novel views. Sitzmann details the process of generating the novel views, and Sundermeyer details the process of using generated novel views. Thus, if an ordinarily skilled artisan were to look to Sundermeyer to modify Sitzmann, it would clearly be for the purpose of using the generated novel views – namely for pose estimation. Sundermeyer provides an explicit application for such pose estimation – “robotic manipulation” of imaged objects (Section 1 of Sundermeyer). Accordingly, as discussed in the rejection, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify Sitzmann to use the generated novel views for object pose estimation to yield the predictable result of improving robotic manipulation. These reasons for combining the references are not “conclusory”, as Applicant alleges; rather, they are firmly rooted in the evidence supplied by the references.
Finally, on page 16, Applicant asserts that the rejection does not articulate how Sundermeyer’s retrieval-based codebook selection would be integrated into the specific combined feature volume pipeline without redesign. As discussed above in great detail, as well as in the rejection, Sitzmann discloses that the combined feature volume is used to generate novel views of the object. Sundermeyer discloses an analogous novel view generation process, and a further use of the novel views for pose estimation of a query object using codebook selection. Contrary to Applicant’s implications, no redesign would be required to integrate the two.
Importantly, Applicant fails to articulate any convincing evidence or reasoning as to why a skilled artisan would not have combined the references in the manner clearly described in the rejection.
For all the foregoing reasons, the rejection of independent claim 13 is maintained.
Dependent claims
The arguments regarding dependent claims 4 and 10-12 are rendered moot in view of the amendments to independent claim 1 which result in allowability of these claims.
On pages 17-18, Applicant asserts that the applied art does not teach or suggest image views that include mask image data, as required by claim 18. In support of this assertion, Applicant argues that the alpha mask produced by Lombardi as an output of a rendering pipeline is not mask image data included in an image view of an object. However, Applicant does not provide any reasoning or rationale for this statement. How, precisely, is Lombardi’s alpha mask different from the generically claimed mask image data? And how is the difference captured in the claim language? According to an exemplary embodiment of a “rendering system” shown in Fig. 4 of the subject application, mask data 424 is output from the rendering system as a separate output from image view 418. That is, Lombardi appears to disclose the claimed feature in a manner identical to the subject invention.
Applicant also asserts that it is unclear how the image views can include both the depth image data and the mask image data. The claim does not preclude an interpretation in which the generated views include separate images – the view itself, the separate mask, and the corresponding depth image. This interpretation appears consistent with the subject invention, as discussed above. Thus, the art's disclosure of rendering systems which generate separate images for each of the view itself, the separate mask, and the corresponding depth image reads on the claimed invention.
On page 18 of the Amendment, Applicant argues that the rejection of claim 20 does not show where the cited art discloses calculating depth loss based on the recited inputs and then using that calculated depth loss as the basis for pose estimation. In support of this argument, Applicant contends that Sundermeyer’s optional ICP refinement step is an interpretation, rather than a disclosure. The Examiner respectfully disagrees.
As identified in the rejection, Sundermeyer discloses obtaining a test/query image of the object and calculating an “estimate” of its pose based on a comparison between the query image and the synthetic novel views (Sections 3.5-3.6; see detailed discussion above). Then “the estimate is refined on depth data using a standard ICP approach” (Section 3.6). Iterative Closest Point (ICP), by definition, minimizes the difference between two point clouds, and its disclosure thus necessarily involves the claimed depth loss (the difference). Further, since the “estimate” that is refined is initially based on the query image and the novel views, the refinement is necessarily also based on the query image and the novel views. Nothing in the claim language precludes this interpretation, particularly in view of the very broad linking language “based on” (which is broader than “inputs”, the language noted by the Applicant).
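For clarity, the point-set difference minimized by a standard ICP step may be illustrated by the following simplified sketch (Python/NumPy/SciPy). This is the Examiner's generic paraphrase of an ICP iteration, not code taken from Sundermeyer, and all names are hypothetical:

import numpy as np
from scipy.spatial import cKDTree

def depth_loss(source_points, target_points):
    # Mean squared distance between each source point and its nearest target point:
    # the point-set difference (depth loss) that ICP iteratively minimizes.
    distances, _ = cKDTree(target_points).query(source_points)
    return np.mean(distances ** 2)

def icp_step(source_points, target_points):
    # One rigid-alignment (Kabsch/SVD) update that reduces the depth loss above.
    _, nearest = cKDTree(target_points).query(source_points)
    matched = target_points[nearest]
    src_centroid, tgt_centroid = source_points.mean(axis=0), matched.mean(axis=0)
    H = (source_points - src_centroid).T @ (matched - tgt_centroid)
    U, _, Vt = np.linalg.svd(H)
    R = Vt.T @ U.T
    if np.linalg.det(R) < 0:            # correct an improper rotation (reflection)
        Vt[-1, :] *= -1
        R = Vt.T @ U.T
    t = tgt_centroid - R @ src_centroid
    return source_points @ R.T + t, R, t

As the sketch shows, each refinement step is driven by the residual distance between the two point clouds, i.e., a depth loss.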
On page 19 of the Amendment, Applicant argues that the rejection does not identify where the cited art discloses the recitation of estimating another object pose based on at least one view and estimating the object pose based on both the other object pose and the combined feature volume, as required by claim 19. The Examiner respectfully disagrees.
As identified in the rejection, Sundermeyer discloses that a codebook of poses is generated for each generated synthetic view (claimed “another object pose” generated based on novel views), and the object pose of the object in the test image (claimed “object pose”) is calculated by comparing the test object pose estimated by the network with the stored object poses for the synthetic views (Sections 3.5-3.6). Here, Sundermeyer clearly discloses estimating another object pose based on at least one view and estimating the object pose based on the other object pose, contrary to Applicant’s assertions.
Additionally, as discussed in great detail above, Sitzmann discloses generating novel views from the claimed 3D image volume. Thus, in combination, Sundermeyer’s novel views used in the estimation of the object pose are based on the 3D image volume by virtue of being generated therefrom. Nothing in the claim language precludes this interpretation, particularly in view of the very broad linking language “based on”.
For all the foregoing reasons, the rejections of these dependent claims are maintained.
Claim Rejections - 35 USC § 103
In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.
Claims 13-17 and 19-20 are rejected under 35 U.S.C. 103 as being unpatentable over “DeepVoxels: Learning Persistent 3D Feature Embeddings” by Sitzmann et al. (cited in the IDS filed 6/29/21; hereinafter “Sitzmann”) in view of “Implicit 3D Orientation Learning for 6D Object Detection from RGB Images” by Sundermeyer et al. (cited in the IDS filed 6/29/21; hereinafter “Sundermeyer”), in further view of “Category-Specific Object Reconstruction from a Single Image” by Kar et al. (hereinafter “Kar”), and further in view of “RGB-D Multi-view System Calibration for Full 3D Scene Reconstruction” by Afzal et al. (hereinafter “Afzal”).
As to independent claim 13, Sitzmann discloses a computer system comprising one or more processors and computer readable memory storing executable instructions that, as a result of being executed by the one or more processors, cause the computer system to at least (the Abstract and Fig. 2 disclose that Sitzmann is directed to a deep learning model having an encoder-decoder based architecture for 3D scene representation and novel view synthesis, such processing requiring implementation by software instructions stored in memory and executed by a processor; for example, Section 5 discloses a GPU and memory): obtain input image data comprising at least a first image of an object and a second image of the object (Section 3 discloses a training corpus comprising M source views S_i of an object which are input to the network architecture of Fig. 2; any two of the source views S_i correspond to the claimed first and second images of the object); process the input image data to generate a first three-dimensional feature volume corresponding to the first image and a second three-dimensional feature volume corresponding to the second image (Section 3 and Fig. 2 disclose a lifting layer that lifts a 3D feature volume from each 2D feature map extracted from the respective source views S_i); combine the first and second three-dimensional feature volumes to generate a combined feature volume (Section 3 and Fig. 2 disclose generating a persistent 3D DeepVoxels representation of the object by integrating the lifted 3D feature volumes using a recurrent fusion process); transform the combined feature volume to generate output image data comprising a plurality of image views of the object (Section 3 discloses that the trained rendering network processes the 3D DeepVoxels representation to generate multiple novel views of the object; see also Fig. 2).
Sitzmann does not expressly disclose: select at least one of the plurality of image views of the object based on at least a latent loss computed based on a latent representation of a query image and a latent representation of one or more of the plurality of image views of the object, and estimate an object pose based on the selected at least one of the plurality of image views of the object; or the input image data further comprising a binary mask and depth information associated with each of the first image and the second image.
Sundermeyer, like Sitzmann, is directed to a trained deep network architecture that inputs 2D images of an object, learns a 3D representation of the object in a latent space, and renders synthetic views of the learned object with varying poses (Abstract, Section 1, and Fig. 4). In particular, Sundermeyer discloses a trained Augmented Autoencoder (“AAE”) that generates the synthetic views (Sections 3.2-3.4). In addition, Sundermeyer discloses performing pose estimation for a test image by generating a codebook including latent codes z for all synthetic object views and the corresponding pose of the object therein (Section 3.5). At test time, a query image including the object is input to the encoder portion of the trained AAE, the resulting code z_test is output by the encoder and compared with all codes z_i from the codebook, and the images having codes with the highest similarity to the query image code z_test are returned as estimates of the 3D object orientation for the query image (Fig. 6 and Section 3.5). Sundermeyer further discloses a similar process of codebook comparison for estimating translation of the query object (Section 3.6 and Fig. 1). Since the image codes z correspond to the synthetic views, Sundermeyer’s process of selecting the closest matching image codes constitutes a selection of the closest matching synthetic views to the query image. For example, see Fig. 14, which shows four novel images on the right which have the highest similarity to the query image on the left and which have been retrieved via the codebook retrieval process discussed above.
That is, Sundermeyer discloses select at least one of the plurality of image views of the object based on at least a latent loss (cosine distance cos_i) computed based on a latent representation of a query image (z_test) and a latent representation of one or more of the plurality of image views of the object (z_i, wherein Sections 3.1 and 3.5 disclose that “z” is used to indicate a “latent representation z” or “latent codes z”; see equation 4); and estimate an object pose based on the selected at least one of the plurality of image views of the object (Fig. 6 and Section 3.5 disclose that a query image including the object is input to the encoder portion of the trained AAE, the resulting code z_test is output by the encoder and compared with all codes z_i from the codebook (equation 4), and the images having codes with the highest similarity to the query image code z_test are returned as estimates of the 3D object orientation for the query image; see Fig. 14, for example).
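For reference, the cosine similarity of equation 4, as characterized in the cited passages, has the standard form cos_i = (z_i · z_test) / (||z_i|| ||z_test||), reproduced here in plain notation; selecting the views whose codes z_i maximize this similarity with z_test corresponds to selecting the views with the smallest latent loss relative to the query image.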
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify Sitzmann to use the generated novel views for object pose estimation, as taught by Sundermeyer, to arrive at the claimed invention discussed above. Such a modification is the result of combining prior art elements according to known methods to yield predictable results. More specifically, Sitzmann’s multi-view system for 3D reconstruction and synthetic view rendering as modified by Sundermeyer’s multi-view system for 3D representation learning and synthetic view rendering which uses the rendered synthetic views for object pose estimation can yield a predictable result of improving robotic manipulation, as taught in Section 1 of Sundermeyer. Thus, a person of ordinary skill would have appreciated including in Sitzmann’s system the ability to use rendered synthetic views for object pose estimation since the claimed invention is merely a combination of old elements, and in the combination each element merely would have performed the same function as it did separately, and one of ordinary skill in the art would have recognized that the results of the combination were predictable.
The proposed combination of Sitzmann and Sundermeyer does not expressly disclose the input data further comprising a binary mask and depth information associated with each of the first image and the second image.
Kar, like Sitzmann, is directed to training a model to generate a 3D reconstruction of an object in the training images (Abstract, Figs. 1-2). Kar discloses that each training image is provided to the model along with a binary mask of the object such that all keypoints of the object lie inside its binary mask (see Section 2, equation 2, and Fig. 2). That is, Kar teaches the input data further comprising a binary mask (Section 2, equation 2, and Fig. 2).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the proposed combination of Sitzmann and Sundermeyer to provide a mask of the object along with each image, as taught by Kar, to arrive at the claimed invention discussed above. Such a modification is the result of combining prior art elements according to known methods to yield predictable results. More specifically, Sitzmann’s model for generating a 3D reconstruction of an object as modified by Kar’s model for generating a 3D reconstruction of an object in multiple images based on corresponding binary masks can yield a predictable result of “speed[ing] up computation” since image data outside of the mask would not need to be processed, while also creating “more accurate viewpoints” (Section 2.1 of Kar). Thus, a person of ordinary skill would have appreciated including in Sitzmann’s system the ability to process an input image according to a binary mask provided therewith since the claimed invention is merely a combination of old elements, and in the combination each element merely would have performed the same function as it did separately, and one of ordinary skill in the art would have recognized that the results of the combination were predictable.
Section 3.2 of Sitzmann discloses that 2D features of the multiple source views are lifted into respective temporary 3D volumes which are used to generate the 3D DeepVoxels representation. This process is able “to resolve the depth ambiguity” (Section 3.2). That is, Sitzmann discloses the derivation of depth information from the 2D images in service of the 3D reconstruction. Thus, Sitzmann does not expressly disclose that such depth information is provided to the system up front. Accordingly, the proposed combination of Sitzmann, Sundermeyer, and Kar does not expressly disclose the input data further comprising depth information associated with each of the first image and the second image.
Afzal, like Sitzmann, is directed to a multi-view system for 3D reconstruction (Abstract, Section III). Afzal discloses that visual and depth information acquired by RGB-D cameras are used to estimate poses of cameras in the multi-view system with the “final goal” of getting “a holistic 3D reconstruction” (Section III). That is, Afzal discloses the input data further comprising depth information associated with each of the first image and the second image (Abstract and Section III).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify Sitzmann to generate the 3D reconstruction using depth information from the acquired images, as taught by Afzal, to arrive at the claimed invention discussed above. Such a modification is the result of combining prior art elements according to known methods to yield predictable results. More specifically, Sitzmann’s system of generating a 3D reconstruction by deriving depth information as modified by Afzal’s system of generating a 3D reconstruction using depth information acquired at the time of image capture can yield a predictable result of saving computational resources since receiving depth information is less expensive computationally than deriving such depth information. Thus, a person of ordinary skill would have appreciated including in Sitzmann’s system the ability to receive depth information along with the images since the claimed invention is merely a combination of old elements, and in the combination each element merely would have performed the same function as it did separately, and one of ordinary skill in the art would have recognized that the results of the combination were predictable.
As to claim 14, the proposed combination of Sitzmann and Sundermeyer does not expressly disclose that the input image data further comprises a first binary mask based on the first image of the object and a second binary mask based on the second image of the object.
Kar, like Sitzmann, is directed to training a model to generate a 3D reconstruction of an object in the training images (Abstract, Figs. 1-2). Kar discloses that each training image is provided to the model along with a binary mask of the object such that all keypoints of the object lie inside its binary mask (see Section 2, equation 2, and Fig. 2).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the proposed combination of Sitzmann and Sundermeyer to provide a mask of the object along with each training image, as taught by Kar, to arrive at the claimed invention discussed above. Such a modification is the result of combining prior art elements according to known methods to yield predictable results. The reasons for combining the references are the same as those discussed above in conjunction with claim 13.
As to claim 15, Sitzmann further teaches that processing the image data comprises generating the first and second three-dimensional feature volumes based on a camera model comprising camera parameters, the camera parameters comprising one or more focal lengths of a camera, coordinate data of a principal point associated with the camera, or at least one of rotation or translation of the camera (Section 3.2 of Sitzmann discloses that the network architecture that generates the respective 3D feature volumes follows a perspective pinhole camera model comprising extrinsic and intrinsic camera parameters which include coordinates u and v and depth data d of voxel centers from the camera, and rotation and translation of the camera).
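For illustration, a perspective pinhole projection of voxel centers using intrinsic parameters (focal lengths and principal point) and extrinsic parameters (rotation and translation) may be sketched as follows (Python/NumPy). The sketch is the Examiner's own generic example, not code from Sitzmann, and all names are hypothetical:

import numpy as np

def project_voxel_centers(voxel_centers, fx, fy, cx, cy, R, t):
    # Extrinsics: rotate and translate world-space voxel centers into the camera frame.
    cam = voxel_centers @ R.T + t
    d = cam[:, 2]                         # depth d of each voxel center from the camera
    # Intrinsics: focal lengths (fx, fy) and principal point (cx, cy) map to pixel coordinates (u, v).
    u = fx * cam[:, 0] / d + cx
    v = fy * cam[:, 1] / d + cy
    return u, v, d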
As to claim 16, Sitzmann further teaches that combining the first and second three-dimensional feature volumes comprises fusing the first and second three-dimensional feature volumes using a recurrent neural network that sequentially integrates the first and second three-dimensional feature volumes to generate the combined feature volume (Section 3.2 of Sitzmann discloses that the 3D feature volumes are integrated to generate the 3D DeepVoxels representation using a gated recurrent neural network architecture that integrates the 3D feature volumes incrementally and sequentially; for example, equations 2-5 show that a 3D feature volume of a current timestamp lifted from a source image affects the trainable parameters of the recurrent neural network, and a 3D feature volume of a subsequent timestamp lifted from a subsequently input source image further affects the trainable parameters of the recurrent neural network; by this iterative process, the recurrent fusion is performed to integrate the 3D feature volumes and thereby generate the 3D DeepVoxels representation).
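For illustration, the claimed sequential integration may be sketched as a gated recurrent unit shared across voxels (Python/PyTorch). This is a simplified stand-in for Sitzmann's architecture, not its actual implementation, and all names are hypothetical:

import torch
import torch.nn as nn

class VolumeFusion(nn.Module):
    # Sequentially integrates lifted 3D feature volumes into one combined feature volume
    # using a gated recurrent unit shared across all voxels.
    def __init__(self, feature_dim):
        super().__init__()
        self.gru = nn.GRUCell(feature_dim, feature_dim)

    def forward(self, lifted_volumes):
        # lifted_volumes: list of (D, H, W, C) tensors, one per source view.
        D, H, W, C = lifted_volumes[0].shape
        state = lifted_volumes[0].new_zeros(D * H * W, C)    # persistent volume, initially empty
        for volume in lifted_volumes:                         # incremental, sequential integration
            state = self.gru(volume.reshape(-1, C), state)
        return state.reshape(D, H, W, C)                      # combined feature volume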
As to claim 17, Sitzmann further teaches that transforming the combined feature volume comprises processing the combined feature volume using a first neural network and a second neural network to provide an updated feature volume used to generate a two-dimensional feature grid based on at least a camera model comprising one or more camera parameters (Section 3 and Fig. 3 of Sitzmann disclose that the occlusion-aware projection operation which processes the 3D DeepVoxels representation into the novel views includes sampling the 3D DeepVoxels representation into a view volume and collapsing the view volume in the depth direction to generate a 2D feature grid used to generate the novel views; see also Fig. 2; Fig. 3 shows that the feature grid is dependent on the camera model, and Section 3 discloses that the camera model includes intrinsic and extrinsic camera parameters; Section 3 “Occlusion Module” discloses that prior to the collapsing, the 3D DeepVoxels representation is compressed to a low-dimensional feature vector by a single 3D convolutional layer (interpreted as the claimed first neural network) which is input to a 3D U-Net (interpreted as the claimed second neural network)).
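For illustration, the occlusion-aware collapse of a view volume to a 2D feature grid may be sketched as follows (Python/PyTorch). The single convolutions below are much-simplified stand-ins for the cited compression layer and 3D U-Net, not Sitzmann's actual networks, and all names are hypothetical:

import torch
import torch.nn as nn

class OcclusionCollapse(nn.Module):
    # Compresses voxel features (stand-in for the claimed first neural network), predicts
    # per-depth visibility weights (stand-in for the 3D U-Net / second neural network), and
    # collapses the view volume along the depth direction into a 2D feature grid.
    def __init__(self, in_channels, low_channels):
        super().__init__()
        self.compress = nn.Conv3d(in_channels, low_channels, kernel_size=1)
        self.visibility = nn.Conv3d(low_channels, 1, kernel_size=3, padding=1)

    def forward(self, view_volume):
        # view_volume: (B, C, D, H, W), resampled from the combined feature volume.
        features = self.compress(view_volume)
        weights = torch.softmax(self.visibility(features), dim=2)    # normalize along depth D
        return (features * weights).sum(dim=2)                       # (B, low_channels, H, W) grid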
As to claim 19, Sitzmann does not expressly disclose that estimating the object pose comprises: estimating another object pose based on at least one of the plurality of image views of the object; and estimating the object pose based on the other object pose and the combined feature volume. However, Sundermeyer discloses that a codebook of poses is generated for each generated synthetic view, and the object pose of the object in the test image is calculated by comparing the test object pose estimated by the network with the stored object poses for the synthetic views to estimate object pose for the object in the test image (Section 3.5-3.6). This portion of Sundermeyer further discloses that each of the generated synthetic views and the test object pose estimated by the network are based on the 3D representation of the object in a latent space characterized by the trained network. It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify Sitzmann to estimate object pose of a query/test image based on a comparison with stored object poses for the synthetic views, as taught by Sundermeyer, to arrive at the claimed invention discussed above. Such a modification is the result of combining prior art elements according to known methods to yield predictable results. It is predictable that the proposed modification would have more accurately estimated the pose of the object in the test/query image.
As to claim 20, Sitzmann does not expressly disclose that estimating the object pose comprises: calculating depth loss based on image data of the query image and image data of at least one of the plurality of image views of the object; and estimating the object pose based on the calculated depth loss, wherein the object pose is associated with the query image.
However, Sundermeyer discloses obtaining a test/query image of the object and estimating a pose (translation and rotation) of the object in that test/query image. Section 3.6 of Sundermeyer further discloses that the pose estimation involves pose refinement using an iterative closest point (“ICP”) approach on depth data of the provided test image. Specifically, Appendix A.4 discloses projecting the depth images of the test/query image and the synthetic view into respective 3D point clouds, generating random points on the surface of the respective object models, and performing ICP matching to minimize the difference (interpreted as the claimed depth loss) between these point sets to arrive at the pose estimate.
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify Sitzmann to perform pose estimation of a query image based on a minimized depth loss between the object in the query image and the rendered synthetic views, as taught by Sundermeyer, to arrive at the claimed invention discussed above. Such a modification is the result of combining prior art elements according to known methods to yield predictable results. It is predictable that doing so would have resulted in efficiently estimating pose of an object in a query image.
Claim 18 is rejected under 35 U.S.C. 103 as being unpatentable over Sitzmann in view of Sundermeyer, Kar, and Afzal and further in view of “Neural Volumes: Learning Dynamic Renderable Volumes from Images” by Lombardi et al. (cited in IDS filed 6/29/21; hereinafter “Lombardi”).
As to claim 18, the proposed combination of Sitzmann, Sundermeyer, Kar, and Afzal further teaches that the plurality of image views of the object comprises a first image view including first depth image data and a second image view including second depth image data (Section III of Afzal discloses that the images are RGB-D images, each including a color image and a depth image portion; the reasons for combining the references are analogous to those discussed above in conjunction with claim 13).
The proposed combination of Sitzmann, Sundermeyer, Kar and Afzal does not expressly disclose that the first image view includes first mask image data and the second image view includes second mask image data.
Lombardi, like Sitzmann, is directed to modeling an object as a 3D volume in latent space based on multiple images of the object and rendering novel views of the object based on the latent 3D volume representation (Abstract and Fig. 2). Lombardi discloses that each of the rendered novel views comprises an associated alpha mask (Fig. 2).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the proposed combination of Sitzmann, Sundermeyer, Kar and Afzal to output an associated mask with each rendered novel view, as taught by Lombardi, to arrive at the claimed invention discussed above. Such a modification is the result of combining prior art elements according to known methods to yield predictable results. It is predictable that the proposed modification would have resulted in saving computational resources since image data outside of the mask would not need to be processed.
Allowable Subject Matter
Claims 1-12 are allowed.
The following is an examiner’s statement of reasons for allowance: Independent claim 1 recites a computer-implemented method, comprising: generating a three-dimensional image volume based on a plurality of image volumes derived from two-dimensional image data comprising at least an RGB image, a binary mask, and depth information; performing processing on the three-dimensional image volume to generate image data comprising a plurality of two-dimensional image views of an object; selecting at least one of the plurality of two-dimensional image views of the object based on at least one loss function comprising a latent loss computed based on a latent representation of a query image and a latent representation of one or more of the two-dimensional image views; and using the selected at least one of the plurality of two-dimensional image views of the object to estimate an object pose, wherein estimating the object pose comprises updating one or more pose parameters using gradient optimization based on a gradient of the latent loss with respect to the one or more pose parameters. The cited art of record does not teach or suggest such a combination of features.
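For clarity of the record, the allowed gradient-based refinement may be contrasted with the retrieval-based selection of the cited art by the following simplified sketch (Python/PyTorch). The sketch is the Examiner's own illustration under the assumption of a differentiable renderer and encoder; it is not taken from the subject application, and all names are hypothetical:

import torch

def refine_pose(render, encoder, query_image, initial_pose, steps=100, lr=1e-2):
    # Pose parameters to be updated by gradient optimization.
    pose = torch.tensor(initial_pose, dtype=torch.float32, requires_grad=True)
    z_query = encoder(query_image).detach()          # latent representation of the query image
    optimizer = torch.optim.Adam([pose], lr=lr)
    for _ in range(steps):
        optimizer.zero_grad()
        z_view = encoder(render(pose))               # latent representation of the view rendered at the current pose
        latent_loss = 1.0 - torch.cosine_similarity(z_view.flatten(), z_query.flatten(), dim=0)
        latent_loss.backward()                       # gradient of the latent loss w.r.t. the pose parameters
        optimizer.step()                             # update the pose parameters
    return pose.detach()

Unlike the codebook lookup of Sundermeyer, this refinement updates the pose parameters themselves by descending the gradient of the latent loss, which is the feature the cited art does not teach or suggest.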
Sitzmann is directed to a deep learning model having an encoder-decoder based architecture for 3D scene representation and novel view synthesis. Sitzmann discloses generating a persistent 3D DeepVoxels representation of an object by integrating 3D feature volumes lifted from 2D feature maps extracted from source views S_i of the object. Sitzmann further discloses that the training corpus comprises M source views S_i (i=1:M) of the object which are input to the network architecture of Fig. 2, the 3D DeepVoxels representation being generated based on these M source views S_i, wherein the trained rendering network comprises a 2D U-Net architecture which takes as input a flattened canonical view volume from the occlusion network based on the 3D DeepVoxels representation to generate multiple 2D novel views.
Sundermeyer, like Sitzmann, is directed to a trained deep network architecture that inputs 2D images of an object, learns a 3D representation of the object in a latent space, and renders synthetic views of the learned object with varying poses. In particular, Sundermeyer discloses a trained Augmented Autoencoder (“AAE”) that generates the synthetic views. In addition, Sundermeyer discloses performing pose estimation for a test image by generating a codebook including latent codes z for all synthetic object views and the corresponding pose of the object therein. At test time, a query image including the object is input to the encoder portion of the trained AAE, the resulting code z_test is output by the encoder and compared with all codes z_i from the codebook, and the images having codes with the highest similarity to the query image code z_test are returned as estimates of the 3D object orientation for the query image. Sundermeyer further discloses a similar process of codebook comparison for estimating translation of the query object. Also, Sundermeyer’s system operates on RGB images. Since the image codes z correspond to the synthetic views, Sundermeyer’s process of selecting the closest matching image codes constitutes a selection of the closest matching synthetic views to the query image. For example, see Fig. 14, which shows four novel images on the right which have the highest similarity to the query image on the left and which have been retrieved via the codebook retrieval process discussed above.
Kar, like Sitzmann, is directed to training a model to generate a 3D reconstruction of an object in the training images. Kar discloses that each training image is provided to the model along with a binary mask of the object such that all keypoints of the object lie inside its binary mask.
Afzal, like Sitzmann, is directed to a multi-view system for 3D reconstruction. Afzal discloses that visual and depth information acquired by RGB-D cameras are used to estimate poses of cameras in the multi-view system with the “final goal” of getting “a holistic 3D reconstruction”.
However, even if the teachings of Sitzmann, Sundermeyer, Kar, and Afzal were to be combined, the resulting combination would not teach or suggest that estimating the object pose comprises updating one or more pose parameters using gradient optimization based on a gradient of the latent loss with respect to the one or more pose parameters, within the context of the remaining features of independent claim 1.
“Perspective-n-Learned-Point: Pose Estimation from Relative Depth” by Piasco et al. (hereinafter “Piasco”), similar to Sundermeyer, is directed to estimating pose of an object in a query image by comparing a compact descriptor of the image with a plurality of descriptors corresponding to reference images of the object with known poses to identify an initial pose. Piasco discloses that the initial pose is then refined based on a reconstructed depth map of the closest matching reference image to derive the final pose of the object in the query image.
However, Piasco does not teach or suggest that estimating the object pose comprises updating one or more pose parameters using gradient optimization based on a gradient of the latent loss with respect to the one or more pose parameters, within the context of the remaining features of independent claim 1. The remaining cited art of record, alone or in combination, does not cure this deficiency. Accordingly, claim 1 is allowed. Claims 2-12 are allowed by virtue of their dependency on claim 1.
Any comments considered necessary by applicant must be submitted no later than the payment of the issue fee and, to avoid processing delays, should preferably accompany the issue fee. Such submissions should be clearly labeled “Comments on Statement of Reasons for Allowance.”
Conclusion
Any inquiry concerning this communication or earlier communications from the examiner should be directed to SEAN M CONNER whose telephone number is (571)272-1486. The examiner can normally be reached 10 AM - 6 PM Monday through Friday, and some Saturday afternoons.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Greg Morse, can be reached at (571) 272-3838. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.
/SEAN M CONNER/Primary Examiner, Art Unit 2663