DETAILED ACTION
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Response to Amendment
This is in response to applicant’s amendment/response filed on 10/30/2025, which has been entered and made of record. Claims 11 and 13-14 have been amended. Claim 12 has been cancelled. Claim 21 has been added. Claims 1-11 and 13-21 are pending in the application.
Response to Arguments
Applicant's arguments filed on 10/30/2025 have been fully considered but they are not persuasive. Applicant submitted amended claims. Accordingly, new grounds of rejection are set forth below. The new grounds of rejection have been necessitated by Applicant's amendments to the claims.
Applicant states that “Applicant respectfully submits that Chen and Cao, whether taken individually or in combination, fail to disclose or render obvious at least "wherein training the neural network model of the image processor comprises relating pixels across images within a set of 3D images using a self-attention layer of the neural network model, wherein the self-attention layer relates pixels of a first image of the set of 3D images with pixels of a second image of the set of 3D images, and wherein the first image has the same subject as the second image, and the first image has a first view orientation that is different from a second view orientation of the second image," as recited in claim 1”. The Examiner disagrees. Applicant did not raise any specific argument or evidence to support this conclusion. The Examiner directs Applicant to the claim rejections below for detailed analysis.
Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.
Claim(s) 11 and 13-17 is/are rejected under 35 U.S.C. 103 as being unpatentable over WO2024/206918 to Chen et al. in view of China PGPubs 115393841 to Cao et al.
Regarding claim 11, Chen et al. teach a method for training an image processor having a neural network model (abstract), the method comprising:
generating a first training batch that comprises a plurality of sets of 3D images (page 13, lines 30-35, sampling a batch of training images, with each set corresponding to a respective subject);
generating respective view embeddings as inputs for the neural network model (page 11, lines 27-29, generating an image embedding of the image for the neural network); and
training the neural network model of the image processor for multi-view image diffusion using the first training batch and the view embeddings (page 13, lines 30-35, page 14, lines 1-2, and page 14, lines 21-29, generating a fine-tuned diffusion neural network model for the corresponding subject).
However, Chen et al. are silent as to each 3D image within a set of 3D images having a same subject and a different view orientation; generating respective view embeddings as inputs for the neural network model that represent the different view orientations, wherein training the neural network model of the image processor comprises relating pixels across images within a set of 3D images using a self-attention layer of the neural network model, wherein the self-attention layer relates pixels of a first image of the set of 3D images with pixels of a second image of the set of 3D images, and wherein the first image has the same subject as the second image, and the first image has a first view orientation that is different from a second view orientation of the second image.
In a related endeavor, Cao et al. teach generating a first training batch that comprises a plurality of sets of 3D images, each 3D image within a set of 3D images having a same subject and a different view orientation (Fig 3, par 0007-0008, obtaining the multi-view angle graph of the to-be-identified three-dimensional object and inputting the multi-view angle graph of the to-be-identified three-dimensional object to a neural network);
generating respective view embeddings as inputs for the neural network model that represent the different view orientations (par 0009, combining the position information of each view image with view independent information, embedding the feature code of the corresponding view image, obtaining the first embedded feature code, par 0012, embedding the position information of the corresponding view angle graph and view irrelevant information in the sampling result to obtain the second embedding characteristic code);
wherein training the neural network model of the image processor comprises relating pixels across images within a set of 3D images using a self-attention layer of the neural network model (page 5, lines 1-2, page 10, lines 11-12, the diffusion neural network 110 performs a diffusion process in pixel space, so that the images operated on and generated by the diffusion neural network have values for each pixel that specify color values, e.g., RGB values or another color encoding scheme, page 5, lines 30-33, page 8, lines 13-26, the diffusion decoder can include a set of convolutional layers. Optionally, the diffusion decoder can also include one or more attention layers, e.g., self-attention layers or cross-attention layers, that each update a representation of the image using a representation of the text, a representation of the time step, or both), wherein the self-attention layer relates pixels of a first image of the set of 3D images with pixels of a second image of the set of 3D images (par 0004, “The present invention provides a 3D object recognition method and system based on a self-attention mechanism in order to overcome the problem of ignoring the connection between multiple views and feature information redundancy during 3D object recognition in the prior art, fully considering multi-view images The connection between them filters the similar feature information between multi-view graphs, aggregates the feature information of multi-view graphs on a representative view graph, and reduces the redundancy of feature information”, par 0124, “The self-attention network model can generate correlations between multi-view images, further aggregate the feature information of the sampled and retained multi-view images on several representative view images, and reduce the redundancy of feature information again”), and wherein the first image has the same subject as the second image, and the first image has a first view orientation that is different from a second view orientation of the second image (par 0007-0008, obtaining the multi-view angle graph of the to-be-identified three-dimensional object and inputting the multi-view angle graph of the to-be-identified three-dimensional object to a neural network, Fig 3, same elevation with view angles spanning 360 degrees).
It would have been obvious to a person of ordinary skill in the art before the effective filing date of the claimed invention to modify Chen et al. to include each 3D image within a set of 3D images having a same subject and a different view orientation; generating respective view embeddings as inputs for the neural network model that represent the different view orientations, wherein training the neural network model of the image processor comprises relating pixels across images within a set of 3D images using a self-attention layer of the neural network model, wherein the self-attention layer relates pixels of a first image of the set of 3D images with pixels of a second image of the set of 3D images, and wherein the first image has the same subject as the second image, and the first image has a first view orientation that is different from a second view orientation of the second image, as taught by Cao et al., in order to provide a multi-view 3D data set, with position information and view-independent information embedded into the feature code of the corresponding view image, to a diffusion neural network training model, thereby increasing the comprehensiveness and accuracy of the feature information and reducing feature information redundancy.
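For illustration only, the claimed cross-image self-attention can be sketched as follows. The sketch is not taken from Chen et al., Cao et al., or the application. It assumes that the pixels of every view are flattened into one token sequence and that queries, keys, and values are the raw pixel features (no learned projections); the function name cross_view_self_attention and the sample values are hypothetical.

```python
import math

def softmax(scores):
    # Numerically stable softmax over a list of floats.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def cross_view_self_attention(views):
    """views: list of views, each a list of pixel feature vectors.

    Assumption: all views are concatenated into one sequence, so each
    pixel attends to pixels of its own view and of every other view.
    """
    tokens = [pixel for view in views for pixel in view]
    dim = len(tokens[0])
    scale = 1.0 / math.sqrt(dim)
    weights, outputs = [], []
    for q in tokens:
        # Scaled dot-product score of this pixel against every pixel.
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in tokens]
        w = softmax(scores)
        weights.append(w)
        outputs.append(
            [sum(w[j] * tokens[j][d] for j in range(len(tokens))) for d in range(dim)]
        )
    return weights, outputs

# Two hypothetical views of one subject, two pixels each, two features per pixel.
view_a = [[1.0, 0.0], [0.0, 1.0]]
view_b = [[1.0, 1.0], [0.5, 0.0]]
weights, outputs = cross_view_self_attention([view_a, view_b])
```

In this sketch, weights[0][2] is the attention that the first pixel of the first view places on the first pixel of the second view, which is the sense in which pixels of one image are related to pixels of another.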
Regarding claim 13, Chen et al. as modified by Cao et al. teach all the limitations of claim 11, and Chen et al. further teach wherein training the neural network model of the image processor comprises combining the respective view embeddings with a corresponding diffusion timestep as a residual for the neural network model (page 4, lines 29-33, the diffusion input to the neural network includes the timestep).
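For illustration only, one way to combine a view embedding with a diffusion timestep as a residual is sketched below. This is an assumption and is not drawn from the cited references or the application: the timestep is encoded with a standard sinusoidal embedding, summed with the view embedding, and the sum is added to a hidden feature vector. The names sinusoidal_embedding and add_view_time_residual are hypothetical.

```python
import math

def sinusoidal_embedding(value, dim):
    """Sinusoidal encoding of a scalar; dim is assumed to be even."""
    half = dim // 2
    freqs = [math.exp(-math.log(10000.0) * i / half) for i in range(half)]
    return [math.sin(value * f) for f in freqs] + [math.cos(value * f) for f in freqs]

def add_view_time_residual(hidden, view_embedding, timestep):
    """Add (view embedding + timestep embedding) to hidden as a residual."""
    time_embedding = sinusoidal_embedding(timestep, len(hidden))
    residual = [v + t for v, t in zip(view_embedding, time_embedding)]
    return [h + r for h, r in zip(hidden, residual)]

# Hypothetical 4-dimensional features for a single view at timestep 0.
hidden = [0.5, 0.5, 0.5, 0.5]
view_embedding = [0.1, 0.2, 0.3, 0.4]
out = add_view_time_residual(hidden, view_embedding, timestep=0)
```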
Regarding claim 14, Chen et al. as modified by Cao et al. teach all the limitations of claim 11, and Chen et al. further teach wherein each 3D image within the set of 3D images corresponds to a same input prompt; wherein training the neural network model comprises: generating a text embedding that represents the same input prompt, wherein the text embedding is combined with the view embeddings (page 1, lines 29-33, Recent text-to-image generation models have shown great progress in generating highly realistic, accurate, and diverse images from a given text prompt, page 11, lines 27-34 and page 12, lines 1-6, claim 13, processing the candidate new training image using an image encoder neural network to generate an image embedding of the image; processing the training text description using a text encoder neural network to generate a text embedding of the image; and determining the quality score based on a similarity between the text embedding and the image embedding); and providing the text embedding to a cross attention layer of the neural network model (page 6, lines 5-7, page 8, lines 6-26, the diffusion decoder can include a set of convolutional layers. Optionally, the diffusion decoder can also include one or more attention layers, e.g., self-attention layers or cross-attention layers, that each update a representation of the image using a representation of the text, a representation of the time step, or both).
Regarding claim 15, Chen et al. as modified by Cao et al. teach all the limitations of claim 11, and Cao et al. further teach wherein generating the respective view embeddings comprises generating the respective view embeddings using a multi-layer perceptron (par 0010-0015, inputting the embedded feature codes into the first self-attention network model and second self-attention network model).
Regarding claim 16, Chen et al. as modified by Cao et al. teach all the limitations of claim 11, and Cao et al. further teach wherein generating the first training batch that comprises the plurality of sets of 3D images comprises generating 3D images, for each set of 3D images, to have view orientations having a same elevation angle at uniformly distributed azimuth angles (par 0007-0008, obtaining the multi-view angle graph of the to-be-identified three-dimensional object and inputting the multi-view angle graph of the to-be-identified three-dimensional object to a neural network, Fig 3, same elevation with view angles spanning 360 degrees).
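For illustration only, view orientations at a same elevation angle and uniformly distributed azimuth angles can be enumerated as sketched below. The function name uniform_view_orientations, the choice of four views, and the 30 degree elevation are hypothetical and are not taken from Cao et al. or the application.

```python
def uniform_view_orientations(num_views, elevation_deg):
    """Return (elevation, azimuth) pairs in degrees.

    Every view shares elevation_deg; azimuths are spaced evenly
    over the full 360 degree circle, starting at 0.
    """
    if num_views <= 0:
        raise ValueError("num_views must be positive")
    step = 360.0 / num_views
    return [(elevation_deg, i * step) for i in range(num_views)]

# Four views at a hypothetical 30 degree elevation.
views = uniform_view_orientations(4, 30.0)
# views == [(30.0, 0.0), (30.0, 90.0), (30.0, 180.0), (30.0, 270.0)]
```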
Regarding claim 17, Chen et al. as modified by Cao et al. teach all the limitations of claim 16, and Chen et al. further teach wherein generating the first training batch further comprises generating a plurality of individual 2D images having different subjects from each other (Fig 2, page 6, lines 10-12, disclosing image sets for three different subjects).
Claim(s) 18-19 is/are rejected under 35 U.S.C. 103 as being unpatentable over WO2024/206918 to Chen et al. in view of China PGPubs 115393841 to Cao et al., further in view of Seo et al. (Junyoung Seo, Wooseok Jang, Min-Seop Kwak, Jaehoon Ko, Hyeonsu Kim, Junho Kim, Jin-Hwa Kim, Jiyoung Lee, and Seungryong Kim. Let 2d diffusion model know 3d-consistency for robust text-to-3d generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2023).
Regarding claim 18, Chen et al. as modified by Cao et al. teach all the limitations of claim 17, but are silent as to further comprising: pre-training the neural network model as a 2D diffusion model for transfer learning; and fine-tuning the neural network model using the plurality of sets of 3D images and the plurality of individual 2D images.
In a related endeavor, Seo et al. teach further comprising: pre-training the neural network model as a 2D diffusion model for transfer learning (abstract, “we introduce a training strategy that enables the 2D diffusion model learns to handle the errors and sparsity within the coarse 3D structure for robust generation, as well as a method for ensuring semantic consistency throughout all viewpoints of the scene.”, section 4.3, “Our approach aims to incorporate 3D awareness into pretrained 2D diffusion models, and to achieve this we construct a coarse 3D representation of a given initial image and project it to a target viewpoint to make a sparse depth map”, i.e., pre-training a 2D diffusion model) and fine-tuning the neural network model using the plurality of sets of 3D images and the plurality of individual 2D images (section 1, “we propose a novel framework, named 3DFuse, that effectively injects 3D awareness into pretrained 2D diffusion models. Given a text prompt, we first sample semantic code to fasten the semantic identity of the generated scene. The semantic code consists of a generated 2D image and a prompt embedding optimized from the pretrained diffusion model”, Fig 3, section 4, providing pivotal tuning to optimize the embedding and tune the LoRA layer).
It would have been obvious to a person of ordinary skill in the art before the effective filing date of the claimed invention to modify Chen et al. as modified by Cao et al. to include further comprising: pre-training the neural network model as a 2D diffusion model for transfer learning; and fine-tuning the neural network model using the plurality of sets of 3D images and the plurality of individual 2D images, as taught by Seo et al., in order to learn 3D diffusion from 2D images with robust normalization and de-normalization operations and thereby advance image synthesis.
Regarding claim 19, Chen et al. as modified by Cao et al. and Seo et al. teach all the limitations of claim 18, and further teach wherein fine-tuning the neural network model comprises: receiving a plurality of identity text/image pairs of a subject; and fine-tuning parameters of the neural network model using a parameter preservation loss (Chen et al.: page 11, lines 6-21, page 12, lines 23-33, fine-tuning a pre-trained diffusion neural network using a loss check, Seo et al.: section 1, “we propose a novel framework, named 3DFuse, that effectively injects 3D awareness into pretrained 2D diffusion models. Given a text prompt, we first sample semantic code to fasten the semantic identity of the generated scene. The semantic code consists of a generated 2D image and a prompt embedding optimized from the pretrained diffusion model”, Fig 3, section 4, providing pivotal tuning to optimize the embedding and tune the LoRA layer).
Claim(s) 20 is/are rejected under 35 U.S.C. 103 as being unpatentable over WO2024/206918 to Chen et al. in view of China PGPubs 115393841 to Cao et al., further in view of Karnewar et al. (Animesh Karnewar, Andrea Vedaldi, David Novotny, and Niloy J Mitra. Holodiffusion: Training a 3d diffusion model using 2d images. In CVPR, 2023).
Regarding claim 20, Chen et al. as modified by Cao et al. teach all the limitations of claim 16, but are silent as to wherein training the neural network model comprises sharing a diffusion timestep among each 3D image within the set of 3D images.
In a related endeavor, Karnewar et al. teach wherein training the neural network model comprises sharing a diffusion timestep among each 3D image within the set of 3D images (Fig 2, section 3.1, setting a linear time schedule for all 3D images in the diffusion processing).
It would have been obvious to a person of ordinary skill in the art before the effective filing date of the claimed invention to modify Chen et al. as modified by Cao et al. to include wherein training the neural network model comprises sharing a diffusion timestep among each 3D image within the set of 3D images, as taught by Karnewar et al., in order to obtain diffusion models that are scalable, train robustly, and are competitive with existing approaches for 3D generative modeling in terms of sample quality and fidelity.
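For illustration only, sharing a diffusion timestep among the images of a set can be sketched as drawing one timestep per set and assigning it to every view in that set. The sketch is an assumption and does not reproduce Karnewar et al. or the application; the name sample_shared_timesteps, the view labels, and the value of 1000 timesteps are hypothetical.

```python
import random

def sample_shared_timesteps(batch, num_timesteps, rng):
    """batch: list of sets, each a list of view identifiers.

    One timestep is drawn per set and shared by all views in that set,
    rather than drawing an independent timestep for each view.
    """
    assigned = []
    for view_set in batch:
        t = rng.randrange(num_timesteps)  # single draw for the whole set
        assigned.append([(view, t) for view in view_set])
    return assigned

# Two hypothetical subjects with four views each.
batch = [["a0", "a1", "a2", "a3"], ["b0", "b1", "b2", "b3"]]
assigned = sample_shared_timesteps(batch, num_timesteps=1000, rng=random.Random(0))
```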
Claim(s) 21 is/are rejected under 35 U.S.C. 103 as being unpatentable over WO2024/206918 to Chen et al. in view of China PGPubs 115393841 to Cao et al., further in view of U.S. PGPubs 2023/0394630 to Bergner et al.
Regarding claim 21, Chen et al. as modified by Cao et al. teach all the limitations of claim 11, but are silent as to wherein training the neural network model of the image processor further comprises iteratively denoising images with the set of 3D images to generate a denoised images.
In a related endeavor, Bergner et al. teach wherein training the neural network model of the image processor further comprises iteratively denoising images with the set of 3D images to generate a denoised images (par 0038, “As shown in FIG. 3, the block 133 may train a neural network model 510 on the noisy image 313 using the initial image 311 as ground truth. In some embodiments, the block 133 may train the neural network model 510 on each of the additional noisy images 333 using the corresponding additional initial images 331 as ground truth for those training iterations”, par 0044-0049, “The block 139 may configure the neural network model 510 to denoise the image 391. In some embodiments, the block 139 may configure the neural network model 510 to predict noise in the noisy image 313 and to remove the predicted noise from the noisy image 313 to generate the clean or denoised image 315. Typically, if the neural network model 510 is effective, the use of the second value 515 applied to the noisy image 313 should result in a denoised image 315 cleaner than the initial image 311”, par 0051, “FIG. 5A shows a noisy image 391 that the methods described below may be applied to in order to implement the neural network model 510 described herein. The noisy image 391 is then input to a system applying the denoising convolutional neural network (CNN) 510 trained using the method discussed herein. When such a noisy image 391 is denoised using the first value 513 for the tuning variable, as discussed above, the output is very similar to the level of noise present in the initial images 311, 331 discussed above, which include a baseline of noise”, par 0075, “The method of FIG. 6 shows one iteration of training a neural network model 510 in steps 601-605. As discussed above, these first few steps may be repeated many times, followed by a tuning process shown in the method”).
It would have been obvious to a person of ordinary skill in the art before the effective filing date of the claimed invention to modify Chen et al. as modified by Cao et al. to include wherein training the neural network model of the image processor further comprises iteratively denoising images with the set of 3D images to generate a denoised images, as taught by Bergner et al., in order to produce images of better quality than the ground truth images on which the model was trained.
Conclusion
Applicant's amendment necessitated the new ground(s) of rejection presented in this Office action. Accordingly, THIS ACTION IS MADE FINAL. See MPEP § 706.07(a). Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a).
A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action. In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any nonprovisional extension fee (37 CFR 1.17(a)) pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action. In no event, however, will the statutory period for reply expire later than SIX MONTHS from the mailing date of this final action.
Any inquiry concerning this communication or earlier communications from the examiner should be directed to Jin Ge whose telephone number is (571)272-5556. The examiner can normally be reached 8:00 to 5:00.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Jason Chan, can be reached at (571)272-3022. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.
JIN GE
Examiner
Art Unit 2619
/JIN GE/Primary Examiner, Art Unit 2619