Prosecution Insights
Last updated: April 19, 2026
Application No. 18/671,805

GENERATION OF 3D ASSETS USING NOVEL POSE ESTIMATION

Status: Non-Final OA (§103)
Filed: May 22, 2024
Examiner: PARK, HYORIM NMN
Art Unit: 2615
Tech Center: 2600 (Communications)
Assignee: All3D Inc.
OA Round: 1 (Non-Final)
Grant Probability: 100% (Favorable)
Expected OA Rounds: 1-2
Time to Grant: 2y 9m
With Interview: 99%

Examiner Intelligence

Grants 100% (above average)
Career Allow Rate: 100% (1 granted / 1 resolved; +38.0% vs TC average)
Interview Lift: +100.0% (strong; across resolved cases with interview)
Typical Timeline: 2y 9m average prosecution; 9 applications currently pending
Career History: 10 total applications across all art units

Statute-Specific Performance

§101: 4.0% (-36.0% vs TC avg)
§103: 60.0% (+20.0% vs TC avg)
§102: 20.0% (-20.0% vs TC avg)
§112: 16.0% (-24.0% vs TC avg)
Tech Center averages are estimates. Based on career data from 1 resolved case.

Office Action

§103
DETAILED ACTION

Notice of Pre-AIA or AIA Status

The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Information Disclosure Statement

The information disclosure statement (IDS) submitted on 05/22/2024 is in compliance with the provisions of 37 CFR 1.97. Accordingly, the information disclosure statement is being considered by the examiner.

Claim Rejections - 35 USC § 103

The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:

A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

This application currently names joint inventors. In considering patentability of the claims, the examiner presumes that the subject matter of the various claims was commonly owned as of the effective filing date of the claimed invention(s) absent any evidence to the contrary. Applicant is advised of the obligation under 37 CFR 1.56 to point out the inventor and effective filing dates of each claim that was not commonly owned as of the effective filing date of the later invention in order for the examiner to consider the applicability of 35 U.S.C. 102(b)(2)(C) for any potential 35 U.S.C. 102(a)(2) prior art against the later invention.

Claims 1, 3-4, 6, 8-11, and 13-19 are rejected under 35 U.S.C. 103 as being unpatentable over Park et al. (US 20250225713 A1) (hereinafter "Park") in view of Shi et al. ("Zero123++: a single image to consistent multi-view diffusion base model," arXiv preprint arXiv:2310.15110 (2023)) (hereinafter "Shi").

Regarding claim 1, Park discloses a computing system for generating a three-dimensional asset of an object, the computing system comprising: processing circuitry and memory containing instructions that, when executed, cause the processing circuitry to: (see claim 14, "A rendering device comprising: a memory configured to store a view change model and a scene restoration model; and at least one processor configured to: ..."; also see para. [0017], "Referring to FIG. 8, a rendering device 800 may include a processor 810 and a memory 820. The memory 820 may be connected to the processor 810, and may store instructions executable by the processor 810, data to be computed by the processor 810, or data processed by the processor 810."; also see Abstract, "generating a scene restoration model based on the input image at the input viewpoint and the plurality of augmented images at the plurality of augmented viewpoints; and restoring a scene image of a target view of the object using the scene restoration model.");

receive an initial image of the object in a first perspective view; (see para. [0016], "obtain an input image of an object, based on an input viewpoint corresponding to the input image ...");
wherein the diffusion-based generative model has been trained with a plurality of training data sets, each training data set comprising a plurality of training images with different perspective views corresponding to a set of points on an imaginary unit sphere around a training object, wherein the set of points is organized along lines on the imaginary unit sphere around the training object such that neighboring points along a line are uniformly separated with similar angular distances; and (see para. [0062], "The scene restoration model may be a model that is designed and trained to output information (e.g., scene information) used to restore a scene from a given view in which an object positioned in a 3D space is viewed from the given view (e.g., a viewpoint and a view direction). The input image at the input viewpoint and the augmented images at the augmented viewpoints may be used as training data (e.g., ground truth (GT) data) for the scene restoration model. A training input of the training data may be an input that indicates a view (e.g., a viewpoint and a view direction), and a training output may be a corresponding image (e.g., an image showing a scene observed according to a corresponding view). However, the configuration of the training data is not limited thereto."; also see para. [0070], "The rendering device may determine the input viewpoint 321 and the plurality of augmented viewpoints 322 in the 3D virtual space based on the input image and the object 390. The rendering device may determine augmented views having a view direction r2 from viewpoints 320 surrounding the object 390 individually toward the object 390 around the object 390 in the 3D virtual space. For example, the rendering device may determine positions along a surface of a virtual solid FIG. 310 surrounding the object 390 in the 3D space as the plurality of augmented viewpoints 322. A shape of the solid FIG. 310 may be, for example, a sphere or hemisphere.");

[Image: media_image1.png (375 x 617, greyscale)]

perform surface reconstruction using the plurality of novel view images to generate the three-dimensional asset. (see Abstract, "generating a scene restoration model based on the input image at the input viewpoint and the plurality of augmented images at the plurality of augmented viewpoints; and restoring a scene image of a target view of the object using the scene restoration model.")
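For readers tracing the claimed viewpoint layout, the limitation describes what amounts to a latitude/longitude grid of camera positions on a unit sphere, with uniform angular spacing between neighboring points along each line. A minimal sketch (illustrative only; the function and its names are not from Park, Shi, or the application record):

```python
# Illustrative sketch of a viewpoint grid on an imaginary unit sphere:
# points are organized along latitude lines, and neighboring points along
# each line are uniformly separated in angle (cf. claims 1 and 5).
import math

def sphere_viewpoint_grid(lines_first_axis: int, lines_second_axis: int):
    """Return (x, y, z) unit-sphere positions at the grid intersections."""
    points = []
    for i in range(lines_second_axis):
        # Uniform polar spacing between latitude lines, excluding the poles.
        theta = math.pi * (i + 1) / (lines_second_axis + 1)
        for j in range(lines_first_axis):
            # Uniform azimuthal spacing along each latitude line.
            phi = 2.0 * math.pi * j / lines_first_axis
            points.append((math.sin(theta) * math.cos(phi),
                           math.sin(theta) * math.sin(phi),
                           math.cos(theta)))
    return points

# A 16 x 9 grid yields the 144 viewpoints recited in claim 5.
assert len(sphere_viewpoint_grid(16, 9)) == 144
```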
However, Park does not explicitly disclose: perform depth estimation on the initial image to generate depth information; and generate a plurality of novel view images with perspective views different from the first perspective view using a diffusion-based generative model, the initial image, and the depth information.

Shi more explicitly teaches, in the context of three-dimensional asset generation, perform depth estimation on the initial image to generate depth information; (see 4. Depth ControlNet for Zero123++, "In addition to the base Zero123++ model, we also release a depth-controlled version of Zero123++ built with ControlNet. We render normalized linear depth images corresponding to the target RGB images and train a ControlNet to control Zero123++ on the geometry via depth." and "We may use a single view as the input image to Zero123++ (the first example)").

Shi also teaches generate a plurality of novel view images with perspective views different from the first perspective view using a diffusion-based generative model, the initial image, and the depth information (see 4. Depth ControlNet for Zero123++ and Fig. 9, "Fig. 9 shows two example generations from depth controlled Zero123++. We may use a single view as the input image to Zero123++ (the first example) or generate the input image from depth with vanilla depth-controlled Stable Diffusion as well to eliminate any need for input colors (the second example)").

[Image: media_image2.png (471 x 358, greyscale)]

As both Park and Shi are from the same field of endeavor, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to include performing depth estimation on the initial image to generate depth information and generating a plurality of novel view images with perspective views different from the first perspective view using a diffusion-based generative model, the initial image, and the depth information in the system for generating a three-dimensional asset of an object disclosed by Park according to the teachings of Shi, in order to overcome common issues like texture degradation and geometric misalignment and to excel at producing high-quality, consistent multi-view images from a single image (see Abstract of Shi).
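As context for the combination, the overall flow the examiner maps onto Park and Shi (depth estimation, then depth-conditioned multi-view diffusion, then surface reconstruction) can be summarized in pseudocode. The sketch below is a hypothetical outline; every function is a placeholder introduced for illustration, not an API from Park, Shi, or any library:

```python
# Hypothetical outline of the claimed pipeline; all functions are placeholders.

def estimate_depth(image):
    """Monocular depth estimation on the initial image (claim 1)."""
    raise NotImplementedError  # e.g., a learned depth estimator

def generate_novel_views(image, depth, n_views=144):
    """Depth-conditioned diffusion model producing novel perspective views,
    in the spirit of Shi's depth-controlled Zero123++."""
    raise NotImplementedError

def reconstruct_surface(views):
    """Multi-view surface reconstruction into a 3D asset (claim 1)."""
    raise NotImplementedError

def image_to_3d_asset(initial_image):
    depth = estimate_depth(initial_image)
    views = generate_novel_views(initial_image, depth)
    return reconstruct_surface(views)
```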
Regarding claim 3, Park in view of Shi discloses all the limitations of claim 1, and Shi further discloses wherein the instructions, when executed, further cause the processing circuitry to: perform background removal on the initial image, wherein the plurality of novel view images is generated using the diffusion-based generative model, the background-removed initial image, and the depth information. (see 3.1. Image to Multi-view of Shi, "We use SAM [8] for background removal. Zero123++ generates consistent and high-quality multi-view images, and can generalize to out-of-domain AI-generated and 2D illustration images.")

Regarding claim 4, Park in view of Shi discloses all the limitations of claim 1, and Park further discloses wherein the perspective views of the plurality of novel view images are different perspective views corresponding to a second set of points on an imaginary unit sphere around the object, wherein the second set of points is organized along lines on the imaginary unit sphere around the object. (see para. [0070] of Park, "The rendering device may determine the input viewpoint 321 and the plurality of augmented viewpoints 322 in the 3D virtual space based on the input image and the object 390. The rendering device may determine augmented views having a view direction r2 from viewpoints 320 surrounding the object 390 individually toward the object 390 around the object 390 in the 3D virtual space. For example, the rendering device may determine positions along a surface of a virtual solid FIG. 310 surrounding the object 390 in the 3D space as the plurality of augmented viewpoints 322. A shape of the solid FIG. 310 may be, for example, a sphere or hemisphere.")

Regarding claim 6, Park in view of Shi discloses all the limitations of claim 1, and Park further discloses wherein the instructions, when executed, further cause the processing circuitry to: select a subset of the plurality of novel view images, wherein the surface reconstruction is performed using the selected subset, exclusive of the novel view images outside the selected subset. (see para. [0008] of Park, "selecting an augmented image at the each augmented viewpoint from among the plurality of candidate images.")

Regarding claim 8, Park in view of Shi discloses all the limitations of claim 6, and Park further discloses wherein the plurality of novel view images comprises pluralities of similar view images, each plurality of similar view images corresponding to a respective perspective view of the object, and wherein selecting the subset of the plurality of novel view images comprises, for each plurality of similar view images, selecting an image based on at least a quality criterion. (see para. [0010] of Park, "The selecting of the augmented image may include: calculating a learned perceptual image patch similarity (LPIPS) loss between the retransformed image and the corresponding reference image; and selecting a candidate image having a smallest LPIPS loss from among the plurality of candidate images as the augmented image.")
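The LPIPS-based selection Park describes in para. [0010] is straightforward to express in code. A minimal sketch, assuming the `lpips` PyPI package and PyTorch, with inputs as RGB tensors of shape (1, 3, H, W) scaled to [-1, 1]; the helper name is illustrative:

```python
import torch
import lpips  # pip install lpips

# Learned Perceptual Image Patch Similarity with the AlexNet backbone.
loss_fn = lpips.LPIPS(net="alex")

def select_augmented_image(candidates, reference):
    """Return the candidate with the smallest LPIPS loss vs. the reference,
    mirroring the selection step quoted from Park's para. [0010]."""
    with torch.no_grad():
        losses = [loss_fn(c, reference).item() for c in candidates]
    return candidates[losses.index(min(losses))]
```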
Regarding claim 9, Park in view of Shi discloses all the limitations of claim 6, and Park further discloses wherein performing the surface reconstruction comprises: performing a first surface reconstruction using the subset of the plurality of novel view images; performing a second surface reconstruction using a direct methodology based on the initial image; and performing a joint reconstruction using the first surface reconstruction and the second surface reconstruction to generate the three-dimensional asset. (see para. [0062] of Park, "At operation 270, the rendering device may generate a scene restoration model using the input image at the input viewpoint and the augmented images at the plurality of augmented viewpoints.")

Regarding claim 10, Park in view of Shi discloses all the limitations of claim 1, and Park further discloses wherein each of the plurality of training data sets is generated using a training three-dimensional asset corresponding to a respective training object. (see para. [0064] of Park, "For example, an image generated by the scene restoration method may be used as 3D training data for various tasks (e.g., autonomous driving and object recognition).")

Regarding claim 11, the method claim is similar in scope to claim 1 and is rejected under the same rationale. Regarding claim 13, the method claim is similar in scope to claim 3 and is rejected under the same rationale. Regarding claim 14, the method claim is similar in scope to claim 4 and is rejected under the same rationale. Regarding claim 15, the method claim is similar in scope to claim 5 and is rejected under the same rationale. Regarding claim 16, the method claim is similar in scope to claim 6 and is rejected under the same rationale. Regarding claim 17, the method claim is similar in scope to claim 8 and is rejected under the same rationale. Regarding claim 18, the method claim is similar in scope to claim 9 and is rejected under the same rationale. Regarding claim 19, the method claim is similar in scope to claim 10 and is rejected under the same rationale.

Claim 5 is rejected under 35 U.S.C. 103 as being unpatentable over Park et al. (US 20250225713 A1) (hereinafter "Park") in view of Shi et al. ("Zero123++: a single image to consistent multi-view diffusion base model," arXiv preprint arXiv:2310.15110 (2023)) (hereinafter "Shi"), and further in view of Chu et al. (CN 112330825 A) (hereinafter "Chu").

Regarding claim 5, Park in view of Shi fails to disclose wherein the plurality of novel view images comprises one hundred forty-four images with different perspective views corresponding to intersecting points of a grid of sixteen lines in a first axis and nine lines in a second axis on an imaginary unit sphere around the object. Chu teaches, in the context of three-dimensional asset generation, wherein the plurality of novel view images comprises one hundred forty-four images with different perspective views. (see para. [0074], "The acquired two-dimensional views should at least cover all the outer surfaces of the preprocessed three-dimensional model, and all two-dimensional views should be of equal size. 144 viewpoints were defined for each 3D model, with one viewpoint set every 30 degrees of latitude and longitude. For each viewpoint, the 3D model can be photographed, resulting in 144 views surrounding the entire 3D model. The viewpoint settings are shown in Figure 4."

[Image: media_image3.png (235 x 213, greyscale)]

Also see para. [0076], "In this embodiment of the invention, 12 virtual cameras are placed around the grid every 30 degrees along the longitude direction, and 12 virtual cameras are placed around the grid every 30 degrees along the latitude direction to create 144 rendering views in a 12*12 grid, as shown in Figure 5.")

It would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains to have generated the three-dimensional asset of Park in view of Shi with one hundred forty-four images with different perspective views according to the teachings of Chu, in order to improve the accuracy and speed of 3D model generation (see para. [0002] of Chu).

Park in view of Shi and further in view of Chu fail to disclose wherein the perspective views correspond to intersecting points of a grid of sixteen lines in a first axis and nine lines in a second axis on an imaginary unit sphere around the object. However, per MPEP 2144.05, even where claimed ranges or amounts do not overlap with the prior art but are merely close, a prima facie case of obviousness exists. Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to obtain intersecting points of a grid of sixteen lines in a first axis and nine lines in a second axis through routine experimentation. In addition, given that Chu teaches one hundred forty-four images and a 12*12 grid, it would have been obvious to one of ordinary skill in the art to try a grid of sixteen lines in a first axis and nine lines in a second axis from a finite number of identified, predictable solutions, with a reasonable expectation of success, in order to improve the accuracy and speed of 3D model generation (see para. [0002] of Chu).
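For the count at issue in claim 5, note that Chu's 30-degree spacing gives 360 / 30 = 12 lines per axis, and both grids total the same 144 views; a quick arithmetic check (illustrative only):

```python
# Both viewpoint grids discussed for claim 5 contain 144 views in total.
assert 360 // 30 == 12  # Chu: one camera every 30 degrees along an axis
assert 12 * 12 == 144   # Chu: 12 x 12 grid of rendering views
assert 16 * 9 == 144    # claim 5: sixteen lines in a first axis, nine in a second
```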
Claim 7 is rejected under 35 U.S.C. 103 as being unpatentable over Park et al. (US 20250225713 A1) (hereinafter "Park") in view of Shi et al. ("Zero123++: a single image to consistent multi-view diffusion base model," arXiv preprint arXiv:2310.15110 (2023)) (hereinafter "Shi"), and further in view of Lin et al. ("Modeling 3D shapes by reinforcement learning," European Conference on Computer Vision, Cham: Springer International Publishing, 2020) (hereinafter "Lin").

Regarding claim 7, Park in view of Shi discloses all the limitations of claim 6, and Park further discloses wherein the subset is selected based on at least a quality criterion using a machine learning model. (see para. [0009] of Park, "The selecting of the augmented image may include: obtaining a retransformed image by transforming each candidate image from among the plurality of candidate images to a corresponding reference viewpoint using the view change model; and selecting the augmented image based on a comparison between the retransformed image and a corresponding reference image.")

However, Park does not explicitly disclose a machine learning model trained with reinforcement learning. Lin more explicitly teaches, in the context of three-dimensional asset generation, a machine learning model trained with reinforcement learning. (see Abstract, "We explore how to enable machines to model 3D shapes like human modelers using deep reinforcement learning (RL)" and "we introduce a novel training algorithm that combines heuristic policy, imitation learning and reinforcement learning.")

As both Park and Lin are from the same field of endeavor, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to include a machine learning model trained with reinforcement learning in the system for generating a three-dimensional asset of an object disclosed by Park according to the teachings of Lin, in order to select the subset more effectively based on the quality criterion (see Abstract of Lin).
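To make the reinforcement-learning angle of claim 7 concrete, here is a toy sketch of quality-driven selection framed as a multi-armed bandit. It is purely illustrative; neither Park nor Lin discloses this particular formulation, and all names are hypothetical:

```python
import random

def epsilon_greedy_select(n_candidates, reward_fn, episodes=500, eps=0.1):
    """Toy RL (multi-armed bandit) learner: estimates per-candidate quality
    from a reward signal and converges on the best candidate."""
    value = [0.0] * n_candidates
    count = [0] * n_candidates
    for _ in range(episodes):
        if random.random() < eps:
            a = random.randrange(n_candidates)  # explore
        else:
            a = value.index(max(value))         # exploit
        r = reward_fn(a)                        # quality criterion as reward
        count[a] += 1
        value[a] += (r - value[a]) / count[a]   # incremental mean update
    return value.index(max(value))

# Usage: the reward could be, e.g., a negative LPIPS loss vs. a reference view.
best = epsilon_greedy_select(5, reward_fn=lambda a: -abs(a - 3) + random.gauss(0.0, 0.1))
print(best)  # converges to candidate 3, whose expected reward is highest here
```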
Claims 2, 12, and 20 are rejected under 35 U.S.C. 103 as being unpatentable over Park et al. (US 20250225713 A1) (hereinafter "Park") in view of Shi et al. ("Zero123++: a single image to consistent multi-view diffusion base model," arXiv preprint arXiv:2310.15110 (2023)) (hereinafter "Shi") as applied to claim 1, and further in view of Shi et al. (Shi, Dingfeng, Yifan Zhao, and Jia Li, "Reconstructing Part-Level 3D Models From a Single Image," ICME, 2020) (hereinafter "Shi et al.").

Regarding claim 2, Park in view of Shi discloses all the limitations of claim 1, but Park in view of Shi does not explicitly disclose wherein the instructions, when executed, further cause the processing circuitry to: segment the initial image to isolate a component of the object, wherein the plurality of novel view images comprises images of the component, and wherein the generated three-dimensional asset comprises a three-dimensional asset of the component.

Shi et al. more explicitly teach, in the context of three-dimensional asset generation, wherein the instructions, when executed, further cause the processing circuitry to: segment the initial image to isolate a component of the object (see Fig. 1 and 1. Introduction, "As illustrated in Fig. 1, the classical object-level reconstruction task aims to estimate the holistic object from a single monocular input, while the part-level task has the potential to provide fine-grained information, e.g., the armrests and backrests of chairs.")

[Image: media_image4.png (300 x 454, greyscale)]

and wherein the generated three-dimensional asset comprises a three-dimensional asset of the component (see Abstract, "we make the first attempt to reconstruct the 3D models with part-level representations in a unified framework.").

As Park, Shi, and Shi et al. are from the same field of endeavor, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to include wherein the instructions, when executed, further cause the processing circuitry to: segment the initial image to isolate a component of the object, wherein the plurality of novel view images comprises images of the component, and wherein the generated three-dimensional asset comprises a three-dimensional asset of the component, in the system for generating a three-dimensional asset of an object disclosed by Park in view of Shi according to the teachings of Shi et al., in order to generate reliable part-level structures while achieving state-of-the-art performance in object-level recovery (see Abstract of Shi et al.).

The combination of Park, Shi, and Shi et al. teaches wherein the plurality of novel view images (see para. [0016] of Park, "determine a plurality of augmented viewpoints surrounding the object in a three-dimensional (3D) space including the object, generate a plurality of augmented images at the plurality of augmented viewpoints, wherein each augmented image from among the plurality of augmented images corresponds to a view of the object from a corresponding augmented viewpoint from among the plurality of augmented viewpoints, and wherein each augmented image is generated based on an image at a different viewpoint using the view change model") comprises images of the component (see Fig. 1 and 1. Introduction of Shi et al., "As illustrated in Fig. 1, the classical object-level reconstruction task aims to estimate the holistic object from a single monocular input, while the part-level task has the potential to provide fine-grained information, e.g., the armrests and backrests of chairs.").

Regarding claim 12, the method claim is similar in scope to claim 2 and is rejected under the same rationale. Regarding claim 20, the system claim is similar in scope to claims 1 and 2 and is rejected under the same rationale. Also, Shi et al. further disclose performing asset reconstruction using the three-dimensional assets of the plurality of components to generate the three-dimensional asset of the object. (see 2.1. Overview of Shi et al., "we present the part-level 3D reconstruction framework, which is composed of two modules, i.e., the Feature Enhancement Encoder and 3D Part Generator.")

Conclusion

Any inquiry concerning this communication or earlier communications from the examiner should be directed to Hyorim Park, whose telephone number is (571) 272-3859. The examiner can normally be reached Monday - Friday. Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO-supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice. If attempts to reach the examiner by telephone are unsuccessful, the examiner's supervisor, Jason Chan, can be reached at (571) 272-3022. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.

Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users.
To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.

/Hyorim Park/
Examiner, Art Unit 2619

/JASON CHAN/
Supervisory Patent Examiner, Art Unit 2619

Prosecution Timeline

May 22, 2024: Application Filed
Jan 16, 2026: Non-Final Rejection, §103 (current)


Prosecution Projections

Expected OA Rounds: 1-2
Grant Probability: 100%
With Interview: 99% (+100.0%)
Median Time to Grant: 2y 9m
PTA Risk: Low
Based on 1 resolved case by this examiner. Grant probability derived from career allow rate.
