DETAILED ACTION
Notice of Pre-AIA or AIA Status
1. The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA. In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
Information Disclosure Statement
2. The information disclosure statements (IDS) submitted on the following dates are in compliance with the provisions of 37 CFR 1.97 and are being considered by the Examiner: 05/06/2025; 12/03/2025.
Claim Rejections - 35 USC § 103
3. The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.
This application currently names joint inventors. In considering patentability of the claims the examiner presumes that the subject matter of the various claims was commonly owned as of the effective filing date of the claimed invention(s) absent any evidence to the contrary. Applicant is advised of the obligation under 37 CFR 1.56 to point out the inventor and effective filing dates of each claim that was not commonly owned as of the effective filing date of the later invention in order for the examiner to consider the applicability of 35 U.S.C. 102(b)(2)(C) for any potential 35 U.S.C. 102(a)(2) prior art against the later invention.
4. Claims 1-20 are rejected under 35 U.S.C. 103 as being unpatentable over “Text-Guided Texturing by Synchronized Multi-View Diffusion” by Yuxin Liu (“Liu”) in view of “DreamFace: Progressive Generation of Animatable 3D Faces under Text Guidance” by Longwen Zhang (“Zhang”).
Regarding claim 1, Liu discloses a system (Liu- page 1, section 1 Introduction, 1st paragraph, at least discloses “Existing rendering systems predominantly utilize polygonal geometric primitives, like triangles, and apply texture mapping to enhance visual appeal”) comprising:
receiving a prompt from a developer (Liu- page 1, Abstract, at least discloses “a novel approach to synthesize texture to dress up a 3D object, given a text prompt”; page 3, section 4, at least discloses “a T2I diffusion process is assigned to synthesize each of these 2D views {z_t^(v_1), z_t^(v_2), …, z_t^(v_n)}, using the text prompt y as guidance and conditional images (depth or normal map rendered from the corresponding viewpoints) as the condition”);
accessing a three dimensional head mesh (Liu- page 4, section 4.1, 2nd paragraph at least discloses “the source mesh from views V = {v_i}_{i=1}^N using this map as the texture, to obtain 3D consistent initial views Z_T = {z_T^(v_i)}_{i=1}^N of the object”; see also figures, where the textured 3D meshes are predominantly character avatars, each of which possesses a "head", as claimed);
generating a textured three dimensional head mesh (Liu- page 4, section 4.1, 2nd paragraph at least discloses “At initial time step T, we first initialize a latent texture w_T with standard normal distribution. Then we render the source mesh from views V = {v_i}_{i=1}^N using this map as the texture, to obtain 3D consistent initial views Z_T = {z_T^(v_i)}_{i=1}^N of the object”; see also figures, where the textured 3D meshes are predominantly character avatars, each of which possesses a "head", as claimed) by:
inputting the prompt into a stable diffusion model (Liu- page 3, left column, 2nd paragraph, at least discloses “position map serves as a 2D representation of the input geometry, and is used as a conditional input to a 2D T2I diffusion model for generating texture in UV domain”; page 3, section 3, 2nd paragraph, at least discloses “we utilize Stable Diffusion model […] which is trained to denoise in low-resolution latent space z = E(x) encoded by a pre-trained VAE encoder E, as it can significantly reduce the computational cost”; page 3, section 4, 1st paragraph at least discloses “[…] We first surround the target 3D object m with multiple cameras {v_1, v_2, …, v_n}, each covers part of the object. Then, a T2I diffusion process is assigned to synthesize each of these 2D views {z_t^(v_1), z_t^(v_2), …, z_t^(v_n)}, using the text prompt y as guidance and conditional images (depth or normal map rendered from the corresponding viewpoints) as the condition […]”);
retrieving a plurality of two dimensional images for the texture (Liu- page 3, left column, 2nd paragraph, at least discloses “position map serves as a 2D representation of the input geometry, and is used as a conditional input to a 2D T2I diffusion model for generating texture in UV domain”; page 3, section 4, 1st paragraph, at least discloses “Given the object geometry and a known camera, the ground truth depth map or normal map can be rendered to condition the above 2D image generation, enabling the generation of 2D views of the desired textured 3D object […] We first surround the target 3D object m with multiple cameras {v_1, v_2, …, v_n}, each covers part of the object. Then, a T2I diffusion process is assigned to synthesize each of these 2D views {z_t^(v_1), z_t^(v_2), …, z_t^(v_n)}, using the text prompt y as guidance and conditional images (depth or normal map rendered from the corresponding viewpoints) as the condition […]”); and
projecting the plurality of two dimensional images onto the three dimensional head mesh (Liu- Fig. 2 shows that denoising results in different views of the same object could diverge into different directions, leading to seams when projecting to an output texture; page 3, section 4, 1st paragraph, at least discloses “With sufficient views, we can obtain the complete texture map covering the whole object, by projecting the generated pixels from each view onto the object surface, which in turn can be mapped to the texture domain (UV space)”).
Liu does not explicitly disclose a system comprising: at least one processor; and at least one memory component storing instructions that, when executed by the at least one processor, cause the at least one processor to perform operations; accessing a camera feed from a camera system of a user, the camera feed including a head of the user; and applying a first content augmentation corresponding to the textured three dimensional head mesh to the head of the user in the camera feed.
However, Zhang discloses
a system comprising: at least one processor; and at least one memory component storing instructions that, when executed by the at least one processor, cause the at least one processor to perform operations (Zhang- page 5, left column, 2nd paragraph, at least discloses “The texture LDM is trained using augmented UV texture datasets with physically-based ones”; page 8, left column, section 5.1.1 Data Collection, at least discloses “We collect the UV texture dataset from multiple sources, including our captured facial scans using the multi-view photometric capture system, public datasets [Yang et al. 2020], and textures from commercial datasets”; page 14, section 7.1, at least discloses “We train the texture LDM on two Nvidia A6000 GPUs” and “enabling the generation of a high-quality facial asset within 5 minutes on a single Nvidia A6000 GPU”);
accessing a camera feed from a camera system of a user, the camera feed including a head of the user (Zhang- page 4, right column, 1st paragraph, at least discloses “Some face-tracking techniques [Apple 2023; Feng et al. 2021; Fyffe et al. 2015; Somepalli et al. 2021] can provide a lightweight capture solution for facial animation with a single RGB camera input.”; page 8, left column, section 5.1.1 Data Collection, at least discloses “We collect the UV texture dataset from multiple sources, including our captured facial scans using the multi-view photometric capture system”); and
applying a first content augmentation corresponding to the textured three dimensional head mesh to the head of the user in the camera feed (Zhang- Fig. 8 shows “Generated facial assets of celebrities. Our approach generates facial assets of celebrities that capture their personalized characteristics and achieve a high degree of resemblance. By generating physically-based textures, our facial assets achieve photo-realistic results”; Fig. 9 shows “Generated facial assets from descriptions. Our approach generates facial assets that faithfully match the characteristics described in the prompts. Through our animatability empowerment, the generated facial assets can be animated using a single RGB image and rendered photo-realistically in modern CG pipelines”; Fig. 10 shows “Generation out of distribution. The upper row shows the rendering results from the differentiable renderer, and the lower row shows the corresponding diffuse maps. Our framework faithfully reveals the facial characteristics of characters”; page 10, right column, section 6, at least discloses “our framework also empowers the animatability of the generated facial asset”).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified Liu to incorporate the teachings of Zhang, applying Zhang’s captured facial scans from the multi-view photometric capture system to Liu’s teachings, for accessing a camera feed from a camera system of a user, the camera feed including a head of the user; and applying a first content augmentation corresponding to the textured three dimensional head mesh to the head of the user in the camera feed.
Doing so would pave the way for novices to conveniently customize 3D content.
Regarding claim 2, Liu in view of Zhang, discloses the system of claim 1, and further discloses wherein the plurality of two dimensional images include four views of a head that correspond to the prompt (Liu- Fig. 5 shows Comparison of object texturing results with text prompt; page 3, section 4, at least discloses “We first surround the target 3D object m with multiple cameras {v_1, v_2, …, v_n}, each covers part of the object. Then, a T2I diffusion process is assigned to synthesize each of these 2D views {z_t^(v_1), z_t^(v_2), …, z_t^(v_n)}, using the text prompt y as guidance and conditional images (depth or normal map rendered from the corresponding viewpoints) as the condition”; Zhang- page 6, section 4.1, at least discloses “We then render the front and left/right 3/4 views of the selected face geometry […] To find the best match from candidate geometries, we first render the front and left/right 3/4 views of the selected face geometry under 10 directional lightings from different angles”).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified Liu to incorporate the teachings of Zhang, applying Zhang’s rendered views to Liu’s teachings so that the plurality of two dimensional images include four views of a head that correspond to the prompt.
The same motivation that was utilized in the rejection of claim 1 applies equally to this claim.
Regarding claim 3, Liu in view of Zhang, discloses the system of claim 2, and further discloses wherein the four views include a front view, a left view, a right view, and a top view (Liu- Fig. 5 shows Comparison of object texturing results with text prompt; page 3, section 4, at least discloses “We first surround the target 3D object m with multiple cameras {v_1, v_2, …, v_n}, each covers part of the object. Then, a T2I diffusion process is assigned to synthesize each of these 2D views {z_t^(v_1), z_t^(v_2), …, z_t^(v_n)}, using the text prompt y as guidance and conditional images (depth or normal map rendered from the corresponding viewpoints) as the condition”; Zhang- page 6, section 4.1, at least discloses “We then render the front and left/right 3/4 views of the selected face geometry […] To find the best match from candidate geometries, we first render the front and left/right 3/4 views of the selected face geometry under 10 directional lightings from different angles”).
Regarding claim 4, Liu in view of Zhang, discloses the system of claim 2, and further discloses wherein the operations further comprise:
assigning a weighting of a certain facial feature based on the type of view for the plurality of two dimensional images (Liu- page 2, left column, 3rd paragraph, at least discloses “The overlapping regions among different views on the textured object (Fig. 2, left) serve as the information exchange sites. During each denoising step, we share (blend in our case) the latent from different views in the UV texture domain, if they have an overlap”; Zhang- page 2, at least discloses “Our generated neutral assets naturally support blendshapes-based facial animations, thanks to the unified geometric topology. We further improve the animation ability with personalized deformation characteristics”), and projecting a texture of the facial feature onto the three dimensional head mesh from one of the two dimensional images based on the weightings (Liu- page 2, left column, 3rd paragraph, at least discloses “The overlapping regions among different views on the textured object (Fig. 2, left) serve as the information exchange sites. During each denoising step, we share (blend in our case) the latent from different views in the UV texture domain, if they have an overlap”; page 3, section 4, 1st paragraph, at least discloses “With sufficient views, we can obtain the complete texture map covering the whole object, by projecting the generated pixels from each view onto the object surface, which in turn can be mapped to the texture domain (UV space)”).
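For illustration only, the view-dependent weighting and blending in the UV texture domain discussed above can be sketched as follows. This is a minimal sketch under stated assumptions, not the implementation of Liu, Zhang, or the application: the names VIEW_WEIGHTS, blend_views, and pick_dominant_view, the weight values, and the use of None for texels a view does not cover are all hypothetical.

```python
from typing import Dict, List, Optional

# Hypothetical per-view weights for one facial feature (e.g. the nose).
# The numbers are illustrative assumptions, not values from Liu or Zhang.
VIEW_WEIGHTS: Dict[str, float] = {
    "front": 0.7, "left": 0.15, "right": 0.15, "top": 0.0,
}

Texel = Optional[float]  # None marks a texel that a view does not cover


def blend_views(view_texels: Dict[str, List[Texel]],
                weights: Dict[str, float]) -> List[Texel]:
    """Weighted average, per UV texel, over the views that cover that texel."""
    n = len(next(iter(view_texels.values())))
    out: List[Texel] = []
    for i in range(n):
        num, den = 0.0, 0.0
        for view, texels in view_texels.items():
            value = texels[i]
            w = weights.get(view, 0.0)
            if value is None or w <= 0.0:
                continue  # view does not contribute to this texel
            num += w * value
            den += w
        out.append(num / den if den > 0.0 else None)
    return out


def pick_dominant_view(weights: Dict[str, float]) -> str:
    """Select the single view with the largest weight for the feature."""
    return max(weights, key=weights.get)
```

Under these assumptions, blend_views corresponds to blending overlapping views, while pick_dominant_view corresponds to projecting the feature from one of the two dimensional images based on the weightings.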
Regarding claim 5, Liu in view of Zhang, discloses the system of claim 1, and further discloses wherein the operations further comprise:
generating a first two dimensional view from the textured three dimensional head mesh (Liu- Fig. 2, 2nd column from the left, “View 1, 2, …N” shown as two dimensional views, with View 1 as a first two dimensional view);
adding noise to the two dimensional view to generate a second two dimensional view (Liu- Fig. 2, 2nd column from the left, “View 1, 2, …N” shown as noisy; page 3, right column, at least discloses “(a) Given noisy latent image z_t, the model predicts the noise ε_θ(z_t, t) in the current latent image […] The obtained noisy latent image will be used as input at the next time step t − 1.”);
denoising the second two dimensional view to generate a third two dimensional view (Liu- Fig. 2, 3rd and 6th columns from the left, “View 1, 2, …N” shown texturing the 3D mesh after "Denoise"; page 3, right column, at least discloses “A clean intermediate state z_{0|t} can be obtained by removing the noise from z_t (modifications on z_{0|t} can be applied to affect the subsequent denoising process).”); and
projecting the third two dimensional view onto the textured three dimensional head mesh to generate an updated textured three dimensional head mesh (Liu- Fig. 2, 7th column from the left, “View 1, 2, …N” where View N corresponds to the third two dimensional view being projected onto the textured three dimensional head mesh at the 6th column to generate an updated textured three dimensional head mesh at the 7th column; page 3, right column, at least discloses “Decode the fully denoised z_0 with the VAE decoder D to obtain the output image x = D(z_0)”).
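For illustration only, the render, add-noise, denoise, and project sequence addressed above can be sketched on a toy one-dimensional texture. This is a minimal sketch under stated assumptions, not the implementation of Liu or the application: the function names, the texel-index representation of a view, and the parameter alpha_bar are hypothetical, and the noise estimate is simply passed in where a trained diffusion model would predict it. The noising and clean-state formulas are the standard DDPM relations.

```python
import math
import random
from typing import List, Tuple


def render_view(texture: List[float], texel_ids: List[int]) -> List[float]:
    """First 2D view: sample the texels visible from one (hypothetical) camera."""
    return [texture[i] for i in texel_ids]


def add_noise(z0: List[float], alpha_bar: float,
              rng: random.Random) -> Tuple[List[float], List[float]]:
    """Second 2D view: z_t = sqrt(alpha_bar) * z_0 + sqrt(1 - alpha_bar) * eps."""
    eps = [rng.gauss(0.0, 1.0) for _ in z0]
    a, b = math.sqrt(alpha_bar), math.sqrt(1.0 - alpha_bar)
    return [a * x + b * e for x, e in zip(z0, eps)], eps


def predict_clean(zt: List[float], eps_hat: List[float],
                  alpha_bar: float) -> List[float]:
    """Third 2D view: z_{0|t} = (z_t - sqrt(1 - alpha_bar) * eps_hat) / sqrt(alpha_bar)."""
    a, b = math.sqrt(alpha_bar), math.sqrt(1.0 - alpha_bar)
    return [(x - b * e) / a for x, e in zip(zt, eps_hat)]


def project_to_texture(texture: List[float], view: List[float],
                       texel_ids: List[int]) -> List[float]:
    """Write the denoised view back onto the texels it covers; others are unchanged."""
    updated = list(texture)
    for value, idx in zip(view, texel_ids):
        updated[idx] = value
    return updated
```

In this sketch a perfect noise estimate recovers the original view exactly; with a learned model the estimate would differ, and the projected view would change the texture.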
Regarding claim 6, Liu in view of Zhang, discloses the system of claim 1, and further discloses wherein the operations further comprise:
comparing the updated textured three dimensional head mesh with the textured three dimensional head mesh to identify a loss (Liu- Fig. 2, 3rd and 6th columns from the left, suggests updating textured three dimensional head mesh after multiple iterations; page 3, right column, at least discloses “Compute the latent image z_{t−1} as a linear combination of z_t and z_{0|t} using time-step-related coefficients. The obtained noisy latent image will be used as input at the next time step t − 1 […] Decode the fully denoised z_0 with the VAE decoder D to obtain the output image x = D(z_0)”; Zhang- page 6, at least discloses “We rely on the Score Distillation Sampling (SDS) loss of the pretrained generic LDM, i.e., Stable Diffusion [Rombach et al. 2022], for guiding the details carving”); and
further modifying the updated textured three dimensional head mesh causing a reduction in the loss (Zhang- page 6, equation 6, at least discloses "We can write the SDS loss of generic LDM on rendered image I as follows: […] where xd= [V d,N d,/1*} are the optimizing parameters"; equation 7, "Besides the SDS loss, we further add regularization terms to ensure the rationality of generated details. The additional regularization losses include […] where Laplacian(·) represents the Laplacian smooth loss between two meshes, and L_map regularizes both the gradient and divergence of the detailed normal map for smoothing"; see also page 9, “Latent space SDS”, and page 10, “image space SDS”).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified Liu to incorporate the teachings of Zhang, applying Zhang’s SDS loss of the generic LDM to Liu’s teachings, for comparing the updated textured three dimensional head mesh with the textured three dimensional head mesh to identify a loss; and further modifying the updated textured three dimensional head mesh causing a reduction in the loss.
The same motivation that was utilized in the rejection of claim 1 applies equally to this claim.
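For illustration only, the compare-then-reduce-loss sequence addressed for claim 6 can be sketched as a single gradient step. This is a minimal sketch under stated assumptions: it substitutes a plain mean squared error between two toy textures for the SDS loss cited from Zhang, and the names mse and gradient_step and the learning rate lr are hypothetical.

```python
from typing import List


def mse(updated: List[float], reference: List[float]) -> float:
    """Loss identified by comparing the updated texture with the reference texture."""
    return sum((u - r) ** 2 for u, r in zip(updated, reference)) / len(updated)


def gradient_step(updated: List[float], reference: List[float],
                  lr: float = 0.1) -> List[float]:
    """Modify the updated texture along the negative MSE gradient.

    d(mse)/d(u_i) = 2 * (u_i - r_i) / n, so each texel moves toward its reference.
    """
    n = len(updated)
    return [u - lr * 2.0 * (u - r) / n for u, r in zip(updated, reference)]
```

For a small positive lr, each step shrinks every texel difference by the same factor, so the loss after the step is lower than before unless it is already zero.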
Regarding claim 7, Liu in view of Zhang, discloses the system of claim 1, and further discloses wherein the operations further comprise training the stable diffusion model to generate a plurality of two dimensional images based on inputted prompts (Liu- page 3, section 3 Diffusion Model Preliminaries, at least discloses “In our work, we utilize Stable Diffusion model [Rombach et al. 2022], which is trained to denoise in low-resolution latent space z = E(x) encoded by a pre-trained VAE encoder E, as it can significantly reduce the computational cost. Then, an image can be generated through the following steps […] In addition to text conditioning using built-in attention mechanisms, several external encoders and adapters [Mou et al. 2023; Zhang et al. 2023] have been designed to enable diffusion models to be conditioned on other modalities. ControlNet as one of these methods, allows diffusion models to generate images conditioned on screen-space depth or normal maps”; page 3, section 4, at least discloses “a T2I diffusion process is assigned to synthesize each of these 2D views {z_t^(v_1), z_t^(v_2), …, z_t^(v_n)}, using the text prompt y as guidance and conditional images (depth or normal map rendered from the corresponding viewpoints) as the condition”).
The methods of claims 8-14 are similar in scope to the functions performed by the system of claims 1-7 and therefore claims 8-14 are rejected under the same rationale.
Regarding claims 15-20, the claims are directed toward a non-transitory computer-readable storage medium storing instructions that correspond to the operations of claims 1-6, and are rejected under the same rationale.
Conclusion
5. The prior art made of record and not relied upon is considered pertinent to applicant's disclosure. They are as recited in the attached PTO-892 form.
6. Any inquiry concerning this communication or earlier communications from the examiner should be directed to MICHAEL LE whose telephone number is (571)272-5330. The examiner can normally be reached 9am-5pm.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Kent Chang, can be reached at (571) 272-7667. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.
/MICHAEL LE/
Primary Examiner, Art Unit 2614