Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.
Claims 1-3, 5-12, and 14-20 are rejected under 35 U.S.C. 103 as being unpatentable over Sajjadi (US20240169662A1) and Ngo (US20240233097A1).
Regarding claim 1, Sajjadi teaches one or more processors comprising: one or more circuits to (Sajjadi; ¶0006, describes one or more processors and a non-transitory computer-readable medium (CRM) storing instructions) determine, based at least on an input indicating one or more characteristics of a scene, a plurality of estimated views of the scene corresponding to a texture (¶0047, describes obtaining source images from different views of a scene where, ¶0049, source images can capture lighting conditions, colors, textures, and shapes. This teaches that the determination is based at least on source image input describing characteristics of the scene (textures and shapes).
Sajjadi further discloses, ¶0080, that the pose estimator processes a target view hint to generate latent pose data, p~, where p~ can be referred to as the estimated pose (¶0103). Sajjadi describes, ¶0098, that the decoder uses the latent pose p~ as a query to render the full novel view. “Estimated views” encompasses estimated viewpoint/pose parameters such as p~. ¶0105 describes that multiple estimated views are supported (“5 input views and 3 novel target views”) and, ¶0089, that at inference time “Pose estimator 208 can generate latent pose data for each of the two known images” and interpolate to “obtain novel views”. The views correspond to a texture because the source images capture “textures” (¶0049).) render, from a model of the texture, a plurality of renders of the texture, at least one render of the plurality of renders being associated with a corresponding estimated view of the plurality of estimated views (¶0096-0097, describes that the target view y is rendered by a decoder after the input views are encoded into a set-latent scene representation S that captures the contents of the scene. The latent scene representation S captures the contents of the scene from source images that capture textures (¶0049). This teaches a model of the texture (latent scene representations S used for rendering scene appearance/texture).
Sajjadi further discloses, ¶0098, that the decoder uses estimated pose features p~ to render the full novel view y, and, ¶0105, that multiple renders are supported by using “5 input views and 3 novel target views”; per ¶0098, each render is associated with its corresponding estimated pose/view. This teaches rendering, from the model of the texture, a plurality of renders of the texture where at least one render of the plurality of renders is associated with a corresponding estimated view of the plurality of estimated views.) update the model of the texture based at least on the plurality of renders and the plurality of estimated views (¶0076, describes a training loop where a model trainer evaluates a rendered training output image against a target image and updates one or more parameters of the image view synthesis model based on the evaluation. The renders are produced by the decoder conditioned on estimated pose p~ (¶0098), where the training uses “5 input views and 3 novel target views” to predict target views and minimize error between the predicted output and the target view (¶0105). Because y~ (the rendered output) is generated from p~, the evaluation is a function of both the rendered outputs and the estimated pose p~ (the renders are evaluated in view of the estimated views used to generate them). This teaches updating the model of the texture (image view synthesis model) based at least on the plurality of renders (training output images) and the plurality of estimated views (latent pose parameters p~ used to condition the rendering).) and update the plurality of estimated views based at least on the plurality of renders (¶0076, describes updating the pose generator/estimator based on evaluation of rendered outputs during training; the trainer can update the training pose generator based on the evaluation of rendered training output images against target images, and “gradients flowed to and through the Pose Estimator” (¶0105). Sajjadi further describes, ¶0067, that the pose estimator can “estimate a latent pose query associated with a training output image”. Because the pose estimator outputs the latent pose p~ (estimated view) and is updated via gradients derived from the evaluation of rendered outputs (¶0076), the estimated views (latent pose outputs) are updated (via the pose estimator) based at least on the plurality of renders. This teaches updating the plurality of estimated views based at least on the plurality of renders.)
However, Sajjadi does not explicitly disclose determining the estimated views using a denoiser.
Ngo describes, ¶0029, performing noise filtering operations where “the low-resolution 3D mesh may be used to further enhance the IVDM 240 (e.g., surface normal, inpainting, de-noising, etc.).” and, “an image filtering operation is performed on the inpainted 2D view (e.g., de-noising based on depth of the 3D mesh)”. Ngo further describes, ¶0038, the IVDM is further modified by a noise filtering operation and rendered as denoised IVDM, and, ¶0062, applies an image filtering operation where the noise filtering reduces noise while preserving edges. This teaches using a denoiser (noise filtering/de-noising) that can be applied to image data.
It would have been obvious to one of ordinary skill in the art, before the effective filing date, to modify the view synthesis technique as taught by Sajjadi with the noise filtering technique of Ngo because both references relate to generating viewpoint dependent rendered views of environments, and doing so provides the benefit of reducing noise in the input used for estimating the view/pose, improving stability/quality of the rendered views.
Claim 10 recites limitations similar to those of claim 1 and is therefore rejected under the same rationale as claim 1.
Claim 18 recites limitations similar to those of claim 1 and is therefore rejected under the same rationale as claim 1.
Regarding claim 2, Sajjadi in view of Ngo teaches the one or more processors of claim 1, wherein the one or more circuits are to update the model of the texture based at least on a consistency loss determined according to the plurality of renders and the plurality of estimated views (Sajjadi; ¶0127, describes training based on a comparison of the training output image and the training target image using a loss function, “e.g., a reconstruction loss, a perceptual loss, etc.”, where “Update gradients can flow to or through the image view synthesis model”. Sajjadi further describes, ¶0105, that the training objective is to render multiple target views by “minimizing the mean-squared error between the predicted output and the target view” using multiple target views per training instance (“5 input views and 3 novel target views”). Because the multiple renders are evaluated against their corresponding target views and the resulting losses are combined and used to update a shared model, the loss enforces consistency across views. If the model produces an accurate render for one view but an inaccurate render for another, the combined loss reflects the inconsistency and the model is updated to be more consistent across all views.
The combined multiple-view loss reads on a “consistency loss” determined according to the plurality of renders and the plurality of estimated views. This teaches updating the model of the texture based at least on a consistency loss determined according to the plurality of renders and the plurality of estimated views.)
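For illustration only, the combined multiple-view loss discussed above can be sketched as follows (a minimal Python/PyTorch sketch under the assumption that renders and targets are paired tensors; not code from the reference):

    import torch
    import torch.nn.functional as F

    def combined_view_loss(renders, targets):
        # Per-view MSE terms are averaged into one shared loss (¶0105); an
        # inaccurate render for any single view raises the combined loss,
        # driving updates toward consistency across all views.
        return torch.stack([F.mse_loss(r, t) for r, t in zip(renders, targets)]).mean()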
Claim 11 recites limitations similar to those of claim 2 and is therefore rejected under the same rationale as claim 2.
Claim 19 recites limitations similar to those of claim 2 and is therefore rejected under the same rationale as claim 2.
Regarding claim 3, Sajjadi in view of Ngo teaches the one or more processors of claim 1, wherein the denoiser operates in an image space for the scene (Ngo; ¶0039, describes performing a noise filtering operation (image filtering) on the IVDM and describes neighboring pixel blending/edge preservation (bilateral filtering) as part of the noise filtering. Ngo, ¶0062, further describes applying an image filtering operation on an enhanced 2D view (a rendered 2D image), which is also image space filtering. This teaches a denoiser that operates in an image space for the scene.)
It would have been obvious to one of ordinary skill in the art, before the effective filing date, to modify the view synthesis technique as taught by Sajjadi to operate in image space as taught by Ngo, with the benefit of reducing complexity and improving efficiency.
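For illustration only, the edge-preserving image space filtering mapped above can be sketched as follows (a minimal Python sketch using OpenCV's cv2.bilateralFilter as a generic stand-in for Ngo's noise filtering; the file names and parameter values are hypothetical):

    import cv2

    view = cv2.imread("rendered_view.png")           # a rendered 2D view in image space
    denoised = cv2.bilateralFilter(view, 9, 75, 75)  # blends neighboring pixels while preserving edges (¶0039)
    cv2.imwrite("denoised_view.png", denoised)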
Claim 12 recites limitations similar to those of claim 3 and is therefore rejected under the same rationale as claim 3.
Regarding claim 5, Sajjadi in view of Ngo teaches the one or more processors of claim 1, wherein the one or more circuits are to update the model over a plurality of iterations until a convergence criterion is satisfied (Sajjadi; ¶0106, describes that all models used to obtain the example results were trained for 3M steps. This teaches iteratively updating the model over a plurality of iterations (3 million training steps). Sajjadi further describes training “from an initialized state to a fully trained state (e.g., when the model exhibits a desired performance profile, such as based on accuracy, precision, recall, etc.).” This teaches updating the model until a convergence criterion is satisfied (a desired performance profile/fully trained state is achieved).) the convergence criterion comprising at least one of a threshold for the plurality of iterations or a threshold for one or more losses associated with the plurality of estimated views and the plurality of renders (Sajjadi; ¶0106, describes training for 3M steps, which corresponds to using a predetermined iteration count (3 million steps) as a stopping threshold. This teaches a threshold for the plurality of iterations.)
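For illustration only, the mapped convergence criterion can be sketched as follows (a minimal Python sketch; the loss threshold value is hypothetical, as Sajjadi discloses only the iteration count):

    def converged(step, loss, max_steps=3_000_000, loss_threshold=1e-4):
        # Stop when either threshold is met: the iteration threshold
        # (3M steps, ¶0106) or a threshold on the training loss.
        return step >= max_steps or loss <= loss_threshold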
Claim 14 recites limitations similar to those of claim 5 and is therefore rejected under the same rationale as claim 5.
Claim 20 recites limitations similar to those of claim 5 and is therefore rejected under the same rationale as claim 5.
Regarding claim 6, Sajjadi in view of Ngo teaches the one or more processors of claim 1, wherein the scene comprises an object corresponding to the one or more characteristics (Sajjadi; ¶0049, describes source images can capture scene characteristics such as “lighting conditions, colors, textures, and shapes” and can include “foreground and background elements of the scene”. The “foreground and background elements” of the scene include objects in the scene and the characteristics (such as textures and shapes) correspond to said objects. This reads on the scene comprising an object corresponding to the one or more characteristics.)
Claim 15 recites limitations similar to those of claim 6 and is therefore rejected under the same rationale as claim 6.
Regarding claim 7, Sajjadi in view of Ngo teaches the one or more processors of claim 1, wherein at least one estimated view of the plurality of estimated views corresponds to a different camera perspective of the scene (Sajjadi; ¶0028, describes novel view synthesis in which the model generates “an image of a target view of a scene based on one or more source images of the scene”, where the target view can be specified at inference time by a query parameterized in a latent pose space. Sajjadi; ¶0077, describes the latent pose space reflects semantically meaningful axes with respect to camera views such as “ camera height, camera rotation, camera distance, etc.” and that traversals in the latent pose space correspond to camera motion (translation, pitch, and tilt). This teaches at least one estimated view corresponding to a camera perspective that is different from another view.)
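For illustration only, obtaining estimated views at different camera perspectives by traversing the latent pose space (¶0089) can be sketched as follows (a minimal Python/PyTorch sketch; linear interpolation is an assumption for illustration, not a formula disclosed by the reference):

    import torch

    def interpolated_views(p_a, p_b, num_views=3):
        # Interpolate between the latent poses of two known images (¶0089);
        # each interpolant is an estimated view at a different camera perspective.
        alphas = torch.linspace(0.0, 1.0, num_views)
        return [(1 - a) * p_a + a * p_b for a in alphas]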
Regarding claim 8, Sajjadi in view of Ngo teaches the one or more processors of claim 1, wherein the model of the texture is a three-dimensional (3D) model comprising parameters of one or more geographic elements or one or more 3D constructs representing 3D information (Sajjadi; ¶0077, describes the latent representation used by the framework enables the system to “capture the 3D structure of complex real-world scenes” and that “the model has correctly estimated the depth of the scene”. A model that captures 3D structure and depth constitutes a 3D model representing 3D information (comprises parameters representing 3D information). This teaches the model of the texture is a 3D model comprising parameters of one or more 3D constructs representing 3D information (latent representation encoding 3D structure and depth).)
Claim 17 recites limitations similar to those of claim 8 and is therefore rejected under the same rationale as claim 8.
Regarding claim 9, Sajjadi in view of Ngo teaches the one or more processors of claim 1, wherein the one or more processors are comprised in at least one of: a system for performing simulation operations; a system for performing collaborative content creation for 3D assets; a system for generating synthetic data; a system comprising one or more vision language models (VLMs); a system comprising one or more large language models (LLMs); a system for performing conversational AI operations; a system for performing light transport simulation; a system for performing deep learning operations; a system for performing digital twin operations; a control system for an autonomous or semi-autonomous machine; a perception system for an autonomous or semi-autonomous machine; a system incorporating one or more virtual machines (VMs); a system implemented using a robot; a system implemented using an edge device; a system implemented at least partially in a data center; or a system implemented at least partially using cloud computing resources, specifically a system for performing deep learning operations (Sajjadi; ¶0053, describes that the image view synthesis model can include one or more machine-learned models including one or more transformer blocks and, ¶0076 and ¶0105, that the model is trained using gradient-based optimization with loss functions. Machine-learned models including transformers trained via gradient descent are deep learning systems. This teaches a system for performing deep learning operations.)
Claims 4 and 13 are rejected under 35 U.S.C. 103 as being unpatentable over Sajjadi (US20240169662A1), Ngo (US20240233097A1), and Scribano (Scribano, Carmelo, et al. "Denoising diffusion models on model-based latent space." Algorithms 16.11 (2023): 501.).
Regarding claim 4, Sajjadi in view of Ngo teaches the one or more processors of claim 1. Sajjadi; ¶0028, discloses “The query can be parameterized in a latent pose space” and, ¶0088, the framework processes images to obtain latent pose values in that latent space, where these latent pose values represent estimated views (as previously discussed in claim 1). However, Sajjadi in view of Ngo does not explicitly disclose wherein the denoiser operates in a latent space, and the one or more circuits are to use an encoder to convert the plurality of estimated views from the latent space to an image space of the plurality of renders.
Scribano describes, Abstract, “defining the generative process in the latent space” of an encoder, which “renders the learning of the generative process more manageable while significantly reducing computational and memory demands” and further discloses, Sec. 2.3, Equation (6), that defining the generative process in latent space means redefining the loss as a function of z, where the denoising model operates on latent representations z_t. This teaches that the denoiser operates in a latent space.
Scribano further describes, Sec. 2.3, using an encoder-decoder architecture to convert between latent and image spaces: “an encoder E(x) = z which maps an image x into a latent representation z, and a decoder D(z) = x which reconstructs x in the image space”. This teaches converting latent representations to image space outputs (renders). In the combined system, Sajjadi’s plurality of estimated views (latent pose values) specify/condition the target views for rendering, and Scribano’s encoder/decoder conversion produces image space renders from latent representations. Scribano; Sec. 2.3, “encoder” reads on the encoding/decoding conversion mechanism that performs the conversion between latent space and image space. This teaches using an encoder to convert from the latent space to an image space of the plurality of renders.
It would have been obvious to one of ordinary skill in the art, before the effective filing date, to modify the view synthesis of Sajjadi in view of Ngo with Scribano’s latent space denoising approach because, as Scribano explains (Abstract), defining the generative process in the latent space “renders the learning of the generative process more manageable while significantly reducing computational and memory demands,” thereby improving efficiency while maintaining output quality.
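For illustration only, the combined latent space arrangement can be sketched as follows (a minimal Python sketch; encoder, denoiser, and decoder are hypothetical stand-ins for Scribano's E, denoising model, and D):

    def denoise_in_latent_space(encoder, denoiser, decoder, image, t):
        z = encoder(image)       # E(x) = z: image space -> latent space (Sec. 2.3)
        z = denoiser(z, t)       # the denoising model operates on the latent z_t
        return decoder(z)        # D(z): latent space -> image space (render)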
Claim 13 recites limitations similar to those of claim 4 and is therefore rejected under the same rationale as claim 4.
Conclusion
Any inquiry concerning this communication or earlier communications from the examiner should be directed to DAN F KALHORI whose telephone number is (571)272-5475. The examiner can normally be reached Mon-Fri 8:30-5:30 ET.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, DEVONA E FAULK can be reached at (571) 272-7515. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.
/DAN F KALHORI/Examiner, Art Unit 2618
/DEVONA E FAULK/Supervisory Patent Examiner, Art Unit 2618