DETAILED ACTION
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
Claim Objections
Claim 11 is objected to because of the following informalities: Claim 11 does not have clear antecedent basis for “the machine-learned pose estimator model” and “the latent scene representation.” (For examination, claim 11 is considered to depend on claim 10, which recites “a machine-learned pose estimator model” (via claim 5) and “a latent scene representation.”) Appropriate correction is required.
Claim Rejections - 35 USC § 112
The following is a quotation of 35 U.S.C. 112(b):
(b) CONCLUSION.—The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the inventor or a joint inventor regards as the invention.
Claims 7 and 10 are rejected under 35 U.S.C. 112(b) as being indefinite for failing to particularly point out and distinctly claim the subject matter which the inventor or a joint inventor regards as the invention. Claim 7 recites “the latent scene representation.” Claim 10 recites “the portion of the training target image.” There is insufficient antecedent basis for these limitations in the claims.
Claim Rejections - 35 USC § 102
The following is a quotation of the appropriate paragraphs of 35 U.S.C. 102 that form the basis for the rejections under this section made in this Office action:
A person shall be entitled to a patent unless –
(a)(2) the claimed invention was described in a patent issued under section 151, or in an application for patent published or deemed published under section 122(b), in which the patent or application, as the case may be, names another inventor and was effectively filed before the effective filing date of the claimed invention.
Claims 1-4, 6-9, 17, and 19-20 are rejected under 35 U.S.C. 102(a)(2) as being anticipated by Guizilini et al. (US 2024/0029286).
Regarding claim 1, Guizilini teaches: A computer-implemented method for image view synthesis, the method comprising:
obtaining, by a computing system (Guizilini Fig. 4: system 400), one or more source images of a scene (Guizilini [0037] “images of a scene are captured by a plurality of cameras”);
obtaining, by the computing system, a query associated with a target view of the scene, wherein at least a portion of the query is parameterized in a latent pose space (Guizilini [0042] “The GSR architecture 10 is designed and trained so that only a camera embedding is needed to query the latent scene representation 112 and with the decoder 120 can generate an estimated depth map 150 (or sparse depth maps 151) and an estimated scene image 140 (or sparse RGB image 141)” [0052] “a latent scene representation encoding a pointcloud from images of a scene captured by a plurality of cameras each with known intrinsics and poses and generating a virtual camera having a viewpoint different from the viewpoints of the plurality of cameras”); and
generating, by the computing system and using a machine-learned image view synthesis model, an output image of the scene associated with the target view (Guizilini [0039] “the GSR architecture 10 implements a neural network or other machine-learning model that receives images from a plurality of cameras and corresponding camera embeddings to learn and ultimately generate estimated depth maps for arbitrary viewpoints within the scene … the GSR architecture 10 introduces view synthesis as an auxiliary task, decoded from the same latent representation”).
Regarding claim 2, Guizilini teaches: The computer-implemented method of claim 1, wherein the latent pose space was learned by reconstructing, using the machine-learned image view synthesis model, training target views of training scenes from training source images (Guizilini [0043]-[0044] “the decoder processes the latent scene representation based on the viewpoint of the virtual camera, which results in a sparse RGB image 141 and a sparse depth map 151 ... during training of the GSR architecture utilizes the generated sparse RGB image 141 and the sparse depth map 151 to improve the GSR architecture's 10 ability to learn a geometric scene representation”).
Regarding claim 3, Guizilini teaches: The computer-implemented method of claim 2, wherein the latent pose space was learned by generating, using a machine-learned pose estimator model, latent pose values respectively associated with the training target views (Guizilini [0039] “the GSR architecture 10 implements a neural network or other machine-learning model that receives images from a plurality of cameras and corresponding camera embeddings to learn and ultimately generate estimated depth maps for arbitrary viewpoints within the scene” [0042] “To improve the advance the training of the GSR architecture 10 many views of a scene may be needed to learn a multi-view, consistent latent representation of the scene”).
Regarding claim 4, Guizilini teaches: The computer-implemented method of claim 3, wherein the latent pose values were used by the machine-learned image view synthesis model to reconstruct the training target views (Guizilini [0039] “the GSR architecture 10 introduces view synthesis as an auxiliary task, decoded from the same latent representation” [0042] “To improve the advance the training of the GSR architecture 10 many views of a scene may be needed to learn a multi-view, consistent latent representation of the scene”).
Regarding claim 6, Guizilini teaches: The computer-implemented method of claim 1, wherein the machine-learned image view synthesis model is trained for at least one cycle without explicit ground-truth pose data (Guizilini [0028] “the aforementioned feature leverages this property at training time as well, generating additional supervision in the form of virtual cameras with corresponding ground-truth RGB images and depth maps obtained by projecting available information onto these new viewpoints” [0039] “the GSR architecture 10 introduces view synthesis as an auxiliary task, decoded from the same latent representation, which improves depth estimation performance without requiring any additional ground-truth source”).
Regarding claim 7, Guizilini teaches: The computer-implemented method of claim 1, wherein the machine-learned image view synthesis model is configured to generate the latent scene representation from the one or more source images and process the latent scene representation in view of the query to obtain the output image (Guizilini [0021] “using a fixed-size N.sub.l×C.sub.l latent scene representation R 112, and learning to project high-dimensional N.sub.e×C.sub.e embeddings onto this latent representation using cross-attention layers 114. The architecture then performs self-attention 116 in this lower-dimensional space, producing a conditioned latent representation R.sub.c 118, that can be queried using N.sub.d×C.sub.d embeddings during the decoding stage 120 to generate estimates, such as estimated scene images 140 and estimated depth maps 150”).
Regarding claim 8, Guizilini teaches: The computer-implemented method of claim 7, wherein the latent scene representation is generated by performing, using the machine-learned image view synthesis model, self-attention over image features extracted from the source images (Guizilini [0020]-[0021] “the GSR architecture 10 ingests images and camera embeddings 102, 104, 106 from a plurality of calibrated cameras ... using a fixed-size N.sub.l×C.sub.l latent scene representation R 112, and learning to project high-dimensional N.sub.e×C.sub.e embeddings onto this latent representation using cross-attention layers 114. The architecture then performs self-attention 116 in this lower-dimensional space, producing a conditioned latent representation R.sub.c 118”).
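For clarity of the record, the Perceiver IO-style encoding described in the passages cited for claims 7-8 (cross-attention projecting input embeddings onto a fixed-size latent, followed by self-attention in the latent space) can be sketched as follows. This is the examiner's illustration only, not Guizilini's actual implementation; it assumes PyTorch, assumes the image/camera embeddings have already been projected to the latent channel width, and all identifiers (e.g., LatentSceneEncoder) are hypothetical.

```python
# Examiner's illustration only; identifiers are hypothetical.
import torch
import torch.nn as nn

class LatentSceneEncoder(nn.Module):
    def __init__(self, n_latents=512, dim=256, heads=8):
        super().__init__()
        # Fixed-size learned latent scene representation R (N_l x C_l).
        self.latent = nn.Parameter(torch.randn(n_latents, dim))
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, embeddings):
        # embeddings: (B, N_e, C_e) image + camera embeddings, assumed
        # already projected so that C_e == dim.
        b = embeddings.shape[0]
        latent = self.latent.unsqueeze(0).expand(b, -1, -1)
        # Cross-attention projects the high-dimensional embeddings onto
        # the fixed-size latent (cf. Guizilini [0021]).
        latent, _ = self.cross_attn(latent, embeddings, embeddings)
        # Self-attention in the lower-dimensional latent space produces
        # the conditioned latent representation R_c.
        conditioned, _ = self.self_attn(latent, latent, latent)
        return conditioned
```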
Regarding claim 9, Guizilini teaches: The computer-implemented method of claim 1, comprising, for a respective portion of the output image:
determining, by the computing system, a respective location-indexed query based on the query and an index value for the respective portion (Guizilini [0020] “The GSR architecture 10 processes this information according to the modality into different pixel-wise embeddings that serve as input to the backbone of GSR architecture 10. This encoded information can be queried using only camera embeddings 132” [0034] “the architecture enables querying at specific image coordinates”);
determining, by the computing system and based on the respective location-indexed query, relevant features of a latent scene representation for generating the respective portion (Guizilini [0004] “embeddings from cameras with arbitrary calibration (i.e., intrinsics and extrinsics) can be generated and queried to produce per-pixel estimates”); and
generating, based on the relevant features, the respective portion (Guizilini [0004] “embeddings from cameras with arbitrary calibration (i.e., intrinsics and extrinsics) can be generated and queried to produce per-pixel estimates”);
wherein the machine-learned image view synthesis model generates the respective portion using a decoding transformer that cross-attends over the latent scene representation based on the respective location-indexed query (Guizilini [0021] “producing a conditioned latent representation R.sub.c 118, that can be queried using N.sub.d×C.sub.d embeddings during the decoding stage 120 to generate estimates, such as estimated scene images 140 and estimated depth maps 150, using cross-attention layers implemented by a depth decoder 122 and a RGB decoder 124, respectively”).
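The queried decoding mapped to claim 9 (location-indexed queries cross-attending over the latent scene representation to produce per-pixel estimates) can likewise be sketched as follows; this is the examiner's illustration under the same assumptions as above, not the reference's code, and all identifiers are hypothetical.

```python
# Examiner's illustration only; identifiers are hypothetical.
import torch.nn as nn

class QueryDecoder(nn.Module):
    def __init__(self, dim=256, heads=8, out_channels=3):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.head = nn.Linear(dim, out_channels)

    def forward(self, queries, conditioned_latent):
        # queries: (B, N_d, C_d) location-indexed camera embeddings, one
        # per output-image portion (e.g., pixel); C_d assumed == dim.
        # conditioned_latent: (B, N_l, C_l) from the encoding stage.
        feats, _ = self.cross_attn(queries, conditioned_latent, conditioned_latent)
        # Each query attends to the relevant latent features and is
        # decoded into its per-pixel estimate (RGB here).
        return self.head(feats)
```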
Regarding claim 17, Guizilini teaches: The computer-implemented method of claim 1, comprising:
obtaining, by the computing system, the one or more source images of an environment from an imaging sensor of a computing device in the environment, wherein the environment comprises the scene (Guizilini [0037] “images of a scene are captured by a plurality of cameras ... The plurality of cameras 430 provide the computing device 410 with image data 449C”); and
generating the output image using the computing device (Guizilini [0039] “the computing device 410 implements a GSR architecture 10 … the GSR architecture 10 introduces view synthesis as an auxiliary task, decoded from the same latent representation”).
Claim 19 recites limitations similar in scope to those of claim 1 and is rejected for the same reasons. Guizilini further teaches one or more non-transitory computer-readable media storing instructions (Guizilini Fig. 4: memory component 440).
Claim 20 recites limitations similar in scope to those of claim 1 and is rejected for the same reasons. Guizilini further teaches one or more processors; and one or more non-transitory computer-readable media storing instructions (Guizilini Fig. 4: processor 445 and memory component 440).
Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.
Claims 5 and 10-11 are rejected under 35 U.S.C. 103 as being unpatentable over Guizilini et al. (US 2024/0029286) in view of Lv et al. (US 2022/0239844).
Regarding claim 5, Guizilini teaches/suggests: The computer-implemented method of claim 1, comprising:
obtaining, by the computing system, training source images of a training scene (Guizilini [0042] “the computing device 410 with an encoder 100 of the GSR architecture 10 encodes the received images and camera embeddings into a latent scene representation 112. The GSR architecture 10 is designed and trained so that only a camera embedding is needed to query the latent scene representation 112”),
wherein the training source images are associated with a training target image associated with a training target view of the training scene (Guizilini [0039] “the GSR architecture 10 introduces view synthesis as an auxiliary task, decoded from the same latent representation, which improves depth estimation performance without requiring any additional ground-truth source” [An initial ground-truth image is considered a training target image.]);
generating, by the computing system and using a machine-learned pose estimator model, one or more latent pose values associated with the training target view (Guizilini [0042] “The GSR architecture 10 is designed and trained so that only a camera embedding is needed to query the latent scene representation 112”);
generating, by the computing system and using the machine-learned image view synthesis model, a training output image associated with the training target view (Guizilini [0039] “the GSR architecture 10 introduces view synthesis as an auxiliary task, decoded from the same latent representation”); and
training, by the computing system, at least one of the machine-learned pose estimator model or the machine-learned image view synthesis model (Guizilini [0042] “The GSR architecture 10 is designed and trained so that only a camera embedding is needed to query the latent scene representation 112” [0039] “the GSR architecture 10 introduces view synthesis as an auxiliary task, decoded from the same latent representation”).
Guizilini is silent regarding based on a comparison of the training output image and the training target image. Lv, however, teaches/suggests based on a comparison of the training output image and the training target image (Lv [0032] “During training, the pixel value (e.g., color and/or opacity) of a pixel (e.g., 130a) may be generated using NeRF and volume rendering and compared to a ground truth pixel value of that pixel captured in a frame of a training video”). Before the effective filing date of the claimed invention, it would have been obvious to one of ordinary skill in the art to modify the training of Guizilini such that the generated output image is compared with the ground-truth (training target) image as taught/suggested by Lv, in order to provide a supervision signal for the training.
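The combination as articulated above can be illustrated by the following minimal training-step sketch. This is the examiner's illustration only; view_synth_model and pose_estimator are hypothetical stand-ins for the models recited in claim 5, and a mean-squared-error loss is merely one conventional way to implement Lv's pixel-value comparison.

```python
# Examiner's illustration only; identifiers are hypothetical.
import torch.nn.functional as F

def training_step(view_synth_model, pose_estimator, source_images,
                  target_image, optimizer):
    # Generate latent pose values associated with the training target view.
    latent_pose = pose_estimator(target_image)
    # Generate a training output image associated with the training target view.
    output_image = view_synth_model(source_images, latent_pose)
    # Train based on a comparison of the training output image and the
    # training target image (cf. Lv [0032]).
    loss = F.mse_loss(output_image, target_image)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss
```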
Regarding claim 10, Guizilini as modified by Lv teaches/suggests: The computer-implemented method of claim 5, wherein the machine-learned pose estimator model is configured to process the portion of the training target image and at least a portion of a latent scene representation to generate the latent pose value (Guizilini [0021] “using a fixed-size N.sub.l×C.sub.l latent scene representation R 112, and learning to project high-dimensional N.sub.e×C.sub.e embeddings onto this latent representation using cross-attention layers 114. The architecture then performs self-attention 116 in this lower-dimensional space, producing a conditioned latent representation R.sub.c 118”).
Regarding claim 11, Guizilini as modified by Lv teaches/suggests: The computer-implemented method of claim [10], wherein the machine-learned pose estimator model comprises a transformer encoder configured to attend over the latent scene representation (Guizilini [0020]-[0021] “embodiments utilize Perceiver IO as the general-purpose transformer backbone. During the encoding stage 100, the GSR architecture 10 ingests images and camera embeddings 102, 104, 106 from a plurality of calibrated cameras … using a fixed-size N.sub.l×C.sub.l latent scene representation R 112, and learning to project high-dimensional N.sub.e×C.sub.e embeddings onto this latent representation using cross-attention layers 114. The architecture then performs self-attention 116 in this lower-dimensional space, producing a conditioned latent representation R.sub.c 118”).
Claim 12 is rejected under 35 U.S.C. 103 as being unpatentable over Guizilini et al. (US 2024/0029286) in view of Lv et al. (US 2022/0239844) as applied to claim 11 above, and further in view of Yu et al. (US 2024/0020854).
Regarding claim 12, Guizilini as modified by Lv does not teach/suggest: The computer-implemented method of claim 11, wherein the machine-learned pose estimator model attends over a selected subset of the latent scene representation that corresponds to a reference view of the one or more source images. Yu, however, teaches/suggests attends over a selected subset of the latent scene representation that corresponds to a reference view of the one or more source images (Yu [0051] “Cross-attention and bilateral attention are applied with the reference frame features” [0020] “Some modified approaches use a spatial local attention to mitigate resulting challenges, where the attention is only computed between each query token and its surrounding key tokens within a spatial local window”). Before the effective filing date of the claimed invention, it would have been obvious to one of ordinary skill in the art to modify the cross-attention of Guizilini as modified by Lv to be computed within a spatial local window (the selected subset) as taught/suggested by Yu, to mitigate the challenges that result from computing attention between each query token and all key tokens.
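The modification articulated above can be sketched as follows (examiner's illustration only; cross_attn is assumed to be a standard attention module such as torch.nn.MultiheadAttention with batch_first=True, and the mask construction is hypothetical), restricting attention to the subset of latent tokens corresponding to the reference view:

```python
# Examiner's illustration only; identifiers are hypothetical.
def attend_over_reference_subset(cross_attn, queries, latent, ref_view_mask):
    # ref_view_mask: (N_l,) boolean tensor marking the latent tokens that
    # correspond to the reference view of the one or more source images.
    subset = latent[:, ref_view_mask, :]
    # Attention is computed only against the selected subset, in the
    # manner of Yu's spatial local attention.
    out, _ = cross_attn(queries, subset, subset)
    return out
```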
Claims 13 and 16 are rejected under 35 U.S.C. 103 as being unpatentable over Guizilini et al. (US 2024/0029286) in view of Bhogal et al. (US 10970330).
Regarding claim 13, Guizilini does not teach/suggest: The computer-implemented method of claim 1, wherein the query is obtained using a view navigator that provides an interactive interface for mapping pose inputs to the latent pose space. Bhogal, however, teaches/suggests the query is obtained using a view navigator (Bhogal col. 5 ll. 1-38 “the search engine receives input from a user for rotation of the object in the image … the search engine executes a search query based on the new orientation vector” col. 8 line 65 – col. 9 line 19 “the visual indication to the user of available viewpoints of an object 704 in image search results is shown as a three-dimensional sphere 706 … the user may interact with an input device by a rotational gesture, and the image search engine causes a rotation of the sphere 706 on the display device. The new orientation vector 704 of the sphere 706 offers a preview to the user as to what the new viewpoint of the object 702 will be”). Before the effective filing date of the claimed invention, it would have been obvious to one of ordinary skill in the art to modify the system of Guizilini to include the interactive interface of Bhogal so that a user can intuitively specify, and preview, the desired viewpoint when navigating among views.
As such, Guizilini as modified by Bhogal teaches/suggests a view navigator that provides an interactive interface for mapping pose inputs to the latent pose space (Guizilini [0042] “The GSR architecture 10 is designed and trained so that only a camera embedding is needed to query the latent scene representation 112” Bhogal col. 8 line 65 – col. 9 line 19 “the visual indication to the user of available viewpoints of an object 704 in image search results is shown as a three-dimensional sphere 706 … the user may interact with an input device by a rotational gesture, and the image search engine causes a rotation of the sphere 706 on the display device. The new orientation vector 704 of the sphere 706 offers a preview to the user as to what the new viewpoint of the object 702 will be”).
Regarding claim 16, Guizilini as modified by Bhogal teaches/suggests: The computer-implemented method of claim 13, wherein the view navigator is configured to explore the latent pose space and determine one or more control vectors that correspond to interpretable pose controls (Guizilini [0042] “The GSR architecture 10 is designed and trained so that only a camera embedding is needed to query the latent scene representation 112” Bhogal col. 8 line 65 – col. 9 line 19 “the visual indication to the user of available viewpoints of an object 704 in image search results is shown as a three-dimensional sphere 706 … the user may interact with an input device by a rotational gesture, and the image search engine causes a rotation of the sphere 706 on the display device. The new orientation vector 704 of the sphere 706 offers a preview to the user as to what the new viewpoint of the object 702 will be”). The orientation vectors selectable by the user meet the interpretable pose controls. The same rationale to combine as set forth in the rejection of claim 13 above is incorporated herein.
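The combined teaching of a view navigator mapping interactive pose inputs into the latent pose space via control vectors can be sketched as follows (examiner's illustration only; all identifiers are hypothetical). Control vectors of this kind could, for example, be obtained as the principal axes addressed in the rejection of claim 14 below.

```python
# Examiner's illustration only; identifiers are hypothetical.
import numpy as np

def navigate(base_latent_pose, control_vectors, control_inputs):
    # base_latent_pose: (latent_dim,) current latent pose value.
    # control_vectors: (k, latent_dim) directions in the latent pose
    # space corresponding to interpretable pose controls.
    # control_inputs: (k,) user inputs (e.g., rotation-gesture amounts).
    return base_latent_pose + control_inputs @ control_vectors
```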
Claim 14 is rejected under 35 U.S.C. 103 as being unpatentable over Guizilini et al. (US 2024/0029286) in view of Bhogal et al. (US 10970330) as applied to claim 13 above, and further in view of Ramage et al. (US 2016/0063393).
Regarding claim 14, Guizilini as modified by Bhogal teaches/suggests: The computer-implemented method of claim 13, wherein the view navigator maps one or more interactive input elements to the latent pose space that correspond to interpretable pose controls (Guizilini [0042] “The GSR architecture 10 is designed and trained so that only a camera embedding is needed to query the latent scene representation 112” Bhogal col. 8 line 65 – col. 9 line 19 “the visual indication to the user of available viewpoints of an object 704 in image search results is shown as a three-dimensional sphere 706 … the user may interact with an input device by a rotational gesture, and the image search engine causes a rotation of the sphere 706 on the display device. The new orientation vector 704 of the sphere 706 offers a preview to the user as to what the new viewpoint of the object 702 will be”). The orientation vectors selectable by the user meet the interpretable pose controls. The same rationale to combine as set forth in the rejection of claim 13 above is incorporated herein.
Guizilini and Bhogal are silent regarding one or more principal axes of the latent pose space. Ramage, however, teaches/suggests one or more principal axes of the latent pose space (Ramage [0072]-[0073] “using principal component analysis, the computing system 120 can find the eigenvectors of the covariance to determine the principal axes of the vector space … the computing system 120 may determine that the data along the particular principal axis represent substantive data ... The n observations are mapped to the reduced vector space accordingly”). Before the effective filing date of the claimed invention, it would have been obvious to one of ordinary skill in the art to modify the rotational gestures of Guizilini as modified by Bhogal to be mapped to the principal axes of the latent pose space as taught/suggested by Ramage, in order to reduce the control space to the axes that carry substantive variation.
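Ramage's principal-axes determination, as applied here, can be sketched as follows (examiner's illustration only; a standard PCA over a collection of latent pose vectors, with hypothetical identifiers):

```python
# Examiner's illustration only; identifiers are hypothetical.
import numpy as np

def principal_pose_axes(latent_poses, k=3):
    # latent_poses: (n_observations, latent_dim) latent pose values.
    centered = latent_poses - latent_poses.mean(axis=0)
    cov = np.cov(centered, rowvar=False)
    # Eigenvectors of the covariance give the principal axes of the
    # vector space (cf. Ramage [0072]).
    eigvals, eigvecs = np.linalg.eigh(cov)
    order = np.argsort(eigvals)[::-1]  # sort by explained variance
    return eigvecs[:, order[:k]].T     # top-k principal axes
```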
Claims 15 and 18 are rejected under 35 U.S.C. 103 as being unpatentable over Guizilini et al. (US 2024/0029286).
Regarding claim 15, Guizilini teaches/suggests: The computer-implemented method of claim 1, comprising:
obtaining, by the computing system, latent pose values for each of a plurality of helper images (Guizilini [0039] “the GSR architecture 10 implements a neural network or other machine-learning model that receives images from a plurality of cameras and corresponding camera embeddings to learn and ultimately generate estimated depth maps for arbitrary viewpoints within the scene” [The received images meet the helper images.]); and
obtaining, by the computing system, one or more target latent pose values for a target view by interpolating (Guizilini [0039] “the GSR architecture 10 is designed to learn a geometric scene representation for depth synthesis, including estimation, interpolation, and extrapolation”).
Guizilini is silent regarding interpolating between the latent pose values of the helper images. However, the concept and advantages of interpolating between known pose values to obtain values for intermediate viewpoints are well known and expected in the art (Official Notice is taken). Before the effective filing date of the claimed invention, it would have been obvious to include such interpolation in Guizilini to determine the camera embeddings at the arbitrary viewpoints.
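The interpolation of which Official Notice is taken can be illustrated as follows (examiner's illustration only; linear interpolation is one conventional scheme, and identifiers are hypothetical):

```python
# Examiner's illustration only; identifiers are hypothetical.
def interpolate_latent_pose(pose_a, pose_b, t):
    # pose_a, pose_b: latent pose values of two helper images.
    # t in [0, 1] selects an intermediate target viewpoint:
    # t=0 returns pose_a, t=1 returns pose_b.
    return (1.0 - t) * pose_a + t * pose_b
```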
Regarding claim 18, Guizilini teaches/suggests: The computer-implemented method of claim 17, wherein the computing device is part of a robotic system that controls a motion of the robotic system (Guizilini [0039] “Downstream tasks may include robot or autonomous vehicle navigation”). Guizilini is silent regarding based on the output image. However, the concept and advantages of using a synthesized output image for navigation are well known and expected in the art (Official Notice is taken). Before the effective filing date of the claimed invention, it would have been obvious to use the output image for the navigation in Guizilini so that the robot/vehicle can perceive its surroundings.
Conclusion
The prior art made of record and not relied upon is considered pertinent to applicant's disclosure:
US 2019/0147221 – pose estimation
US 2021/0294834 – images having similar pose
US 2022/0138249 – pose search
Any inquiry concerning this communication or earlier communications from the examiner should be directed to ANH-TUAN V NGUYEN whose telephone number is 571-270-7513. The examiner can normally be reached on M-F 9AM-5PM ET.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, JASON CHAN can be reached on 571-272-3022. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of an application may be obtained from the Patent Application Information Retrieval (PAIR) system. Status information for published applications may be obtained from either Private PAIR or Public PAIR. Status information for unpublished applications is available through Private PAIR only. For more information about the PAIR system, see http://pair-direct.uspto.gov. Should you have questions on access to the Private PAIR system, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative or access to the automated information system, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.
/ANH-TUAN V NGUYEN/
Primary Examiner, Art Unit 2619