Prosecution Insights
Last updated: April 19, 2026
Application No. 18/453,248

TECHNIQUES FOR GENERATING IMAGES OF OBJECT INTERACTIONS

Non-Final OA (§103, §112)
Filed: Aug 21, 2023
Examiner: Varndell, Ross E
Art Unit: 2674
Tech Center: 2600 (Communications)
Assignee: Nvidia Corporation
OA Round: 3 (Non-Final)
Grant Probability: 85% (Favorable)
Expected OA Rounds: 3-4
Time to Grant: 2y 4m
With Interview: 98%

Examiner Intelligence

Career Allow Rate: 85%, above average (520 granted / 615 resolved; +22.6% vs TC avg)
Interview Lift: +13.0% (moderate), measured over resolved cases with interview
Typical Timeline: 2y 4m average prosecution; 28 applications currently pending
Career History: 643 total applications across all art units
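As a sanity check, the headline figures above follow from simple arithmetic on the career counts this report lists. The sketch below is illustrative only; the tool's actual model may weight recency or art-unit mix differently.

```python
# Figures taken from the examiner profile above.
granted, resolved = 520, 615

career_allow_rate = granted / resolved        # 0.8455..., displayed as 85%
interview_lift = 0.13                         # +13.0 points with interview
implied_tc_avg = round(career_allow_rate, 2) - 0.226  # from the +22.6% delta

with_interview = round(career_allow_rate, 2) + interview_lift

print(f"career allow rate: {career_allow_rate:.1%}")  # 84.6%
print(f"with interview:    {with_interview:.0%}")     # 98%
```

The "98% with interview" projection is just the rounded allow rate plus the interview lift, consistent with the Prosecution Projections section below.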

Statute-Specific Performance

§101: 6.3% (-33.7% vs TC avg)
§103: 66.9% (+26.9% vs TC avg)
§102: 6.4% (-33.6% vs TC avg)
§112: 10.7% (-29.3% vs TC avg)
Deltas shown are relative to the Tech Center average estimate; based on career data from 615 resolved cases.
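The per-statute deltas can be cross-checked the same way: subtracting each delta from the examiner's rate recovers the Tech Center baseline the chart compared against, and in this report's data every pair implies the same 40.0% baseline. (How these per-statute rates are defined is not stated in the report; the snippet only recomputes the listed numbers.)

```python
# Examiner's statute-specific rates and deltas vs the Tech Center average,
# exactly as listed above (percent values).
examiner_rate = {"101": 6.3, "103": 66.9, "102": 6.4, "112": 10.7}
delta_vs_tc   = {"101": -33.7, "103": 26.9, "102": -33.6, "112": -29.3}

# Recover the TC baseline implied by each pair: rate - delta.
implied_tc = {s: round(examiner_rate[s] - delta_vs_tc[s], 1) for s in examiner_rate}
print(implied_tc)
```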

Office Action

Grounds of rejection: §103, §112
DETAILED ACTION

Notice of Pre-AIA or AIA Status

The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Priority

Priority to provisional application 63/384,080, filed 11/16/2022, is acknowledged.

Continued Examination Under 37 CFR 1.114

A request for continued examination under 37 CFR 1.114, including the fee set forth in 37 CFR 1.17(e), was filed in this application after final rejection. Since this application is eligible for continued examination under 37 CFR 1.114, and the fee set forth in 37 CFR 1.17(e) has been timely paid, the finality of the previous Office action has been withdrawn pursuant to 37 CFR 1.114. Applicant's submission filed on 3/12/2026 has been entered.

Annotations

In this and subsequent office actions, the examiner has used strikethrough text to indicate that a reference does not teach a particular limitation. Usually, the following paragraph will reflect where the secondary reference is relied upon for the teaching of that limitation. The limitations a reference does not teach are thus struck through so the rejection can be easily understood.

Response to Arguments

This office action is in response to the amendment filed 3/12/2026. Claims 1-21 are pending in this application and have been considered below. Applicant's arguments with respect to claims 1-21 have been considered but are moot in view of the new ground(s) of rejection necessitated by the amendments.

Claim Rejections - 35 USC § 112

The following is a quotation of 35 U.S.C. 112(b):

(b) CONCLUSION.—The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the inventor or a joint inventor regards as the invention.

The following is a quotation of 35 U.S.C. 112 (pre-AIA), second paragraph:

The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the applicant regards as his invention.

Claims 3 and 13 are rejected under 35 U.S.C. 112(b) or 35 U.S.C. 112 (pre-AIA), second paragraph, as being indefinite for failing to particularly point out and distinctly claim the subject matter which the inventor or a joint inventor (or for applications subject to pre-AIA 35 U.S.C. 112, the applicant) regards as the invention. Claims 3 and 13 recite "performing the one or more first denoising diffusion operations." The word "diffusion" has no antecedent in independent claims 1 and 11, which recite "one or more first denoising operations."

Claim Rejections - 35 USC § 103

The following is a quotation of 35 U.S.C. 103, which forms the basis for all obviousness rejections set forth in this Office action:

A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

The factual inquiries for establishing a background for determining obviousness under 35 U.S.C. 103 are summarized as follows:

1. Determining the scope and contents of the prior art.
2. Ascertaining the differences between the prior art and the claims at issue.
3. Resolving the level of ordinary skill in the pertinent art.
4. Considering objective evidence present in the application indicating obviousness or nonobviousness.

This application currently names joint inventors.
In considering patentability of the claims, the examiner presumes that the subject matter of the various claims was commonly owned as of the effective filing date of the claimed invention(s) absent any evidence to the contrary. Applicant is advised of the obligation under 37 CFR 1.56 to point out the inventor and effective filing dates of each claim that was not commonly owned as of the effective filing date of the later invention in order for the examiner to consider the applicability of 35 U.S.C. 102(b)(2)(C) for any potential 35 U.S.C. 102(a)(2) prior art against the later invention.

Claims 1-2, 4, 6-7, 11-12, 15, 17, and 20-21 are rejected under 35 U.S.C. 103 as being unpatentable over Corona et al. (GanHand: Predicting Human Grasp Affordances in Multi-Object Scenes – hereinafter "Corona") in view of Nichol et al. (GLIDE: Towards Photorealistic Image Generation and Editing with Text-Guided Diffusion Models – hereinafter "Nichol").

Claim 1. Corona discloses a computer-implemented method for generating an image, the method comprising:

performing one or more first denoising operations based on a first machine learning model and an input image that includes a first object to generate a mask that is based on the first object, wherein the mask represents a second object that is not included in the input image (Corona p. 5032: "Yet, the key difference with our approach and all methods discussed in this section is that in our case hands are not visible in the input images and all reasoning is done from an image of the object alone"; p. 5034: "given an image I, we train a model M that provides a hand pose P and shape V, and grasp type C for every object of interest in I." The input I is an image of objects with no hands; the output includes the hand configuration, something not present in I.), and

a spatial arrangement associated with a second object interacting with the first object (Corona p. 5031: "given a single RGB image of a scene with an arbitrary number of objects, we aim to predict human grasp affordances, i.e. predict multiple plausible solutions of how a human would grasp each one of the observed objects" and "In order to predict feasible human grasps, we introduce GanHand, a multi-task GAN architecture that given solely one input image: 1) estimates the 3D shape/pose of the objects; 2) predicts the best grasp type according to a taxonomy with 33 classes [18]; 3) refines the hand configuration"; pp. 5034-5: "Our goal is to predict how a human would naturally grasp one or several objects, given a single RGB image of these objects. This implies producing valid hand configurations showing several contact points with the target object"; p. 5035, Segmentation Mask Generation: "During training, one object is randomly selected at a time, its 3D shape is projected onto the image plane to obtain a segmentation mask that is then concatenated with the input image and fed to the grasp prediction network. The mask indicates which object has to be focused on").

Corona discloses all of the subject matter as described above except for specifically teaching "performing one or more second denoising operations based on a second machine learning model, the input image, and the mask to generate an image of the second object interacting with the first object." However, Nichol, in the same field of endeavor, teaches performing one or more second denoising operations based on a second machine learning model, the input image, and the mask to generate an image of the second object interacting with the first object (Fig. 2 demonstrates the model receiving an original image with an erased region and a mask and generating new objects absent from the original scene, corresponding to an image of a second object (e.g., a hand) not present in the input, interacting with the first object; Sections 4.1-4.3: the "model predicts p(xt-1|xt, c)," conditioning on both image context and the mask channel at each diffusion step.).

Therefore, it would have been obvious to a person of ordinary skill in the art ("POSITA") to combine Corona and Nichol before the effective filing date of the claimed invention. A POSITA would recognize it would have been obvious to combine Corona's affordance prediction, which identifies where an absent hand spatially interacts with an object, with Nichol's image-conditioned inpainting diffusion model to render that predicted spatial arrangement as a 2D mask and use it to condition image generation. This combination applies each reference's known technique according to its established function to yield the predictable result of generating a realistic image of the absent hand interacting with the first object. The instant application's specification acknowledges that performing denoising operations conditioned on a user-specified region to generate an image of an absent object interacting with a first object was known in the prior art (see ¶68, Fig. 8A).

Claims 2 and 12. The combination of Corona and Nichol discloses the computer-implemented method of claim 1, further comprising receiving an input position associated with the second object, wherein the one or more first denoising operations are further based on the input position (Corona p. 5: "we represent the absolute translation of the hand w.r.t the camera as the T = Tobject + ∆T. Similarly we represent the absolute hand rotation as R = Ro + ∆R … We then build a Fully Connected Network fed with (HC, Tobject, Ro) that predicts {∆H, ∆T, ∆R}, to compute the absolute rigid pose of the hand." Corona teaches using object position (Tobject) as input to guide the grasp prediction process.).

Claim 4.
The combination of Corona and Nichol discloses the computer-implemented method of claim 1, wherein each of the one or more first denoising operations and the one or more second denoising operations includes one or more denoising diffusion operations (Nichol p. 6, Section 4.3: "we explicitly fine-tune our model to perform inpainting, similar to Saharia et al. (2021a). During fine-tuning, random regions of training examples are erased, and the remaining portions are fed into the model along with a mask channel as additional conditioning information." This shows the denoising diffusion operations conditioning on both the input image and the mask channel; p. 3, Section 2.1, Diffusion Models: "pθ(xt-1|xt) … gradually reducing the noise." This is the backward denoising process iteratively removing noise using diffusion operations.).

Claims 6 and 15. The combination of Corona and Nichol discloses the computer-implemented method of claim 1, wherein the second machine learning model comprises an encoder-decoder neural network (Nichol p. 6, Section 4.1: "We adopt the ADM model architecture proposed by Dhariwal & Nichol (2021)," where the ADM/U-Net is an encoder-decoder architecture.).

Claims 7 and 17. The combination of Corona and Nichol discloses the computer-implemented method of claim 1, wherein the second object comprises a portion of a human body (Corona p. 1, Title: "GanHand: Predicting Human Grasp Affordances in Multi-Object Scenes").

Claim 11.
The combination of Corona and Nichol discloses the one or more non-transitory computer-readable media storing instructions that, when executed by at least one processor (Corona p. 5: "a pretrained and fine tuned ResNet-50, followed by a classification network … We then build a Fully Connected Network fed with (HC, Tobject, Ro)"; p. 6: "We perform a hyperparameter grid search to maximize [19] and finally train all models using LR=0.0001, BS=32 … using Adam optimizer … Training models for single object (ObMan) or multi-object (YCB-Affordance) scenes takes approximately 6 and 8 days respectively on a V100 GPU." Neural networks must be stored as instructions/parameters in computer memory.), cause the at least one processor to perform steps for … The combination of Corona and Nichol renders claim 11 obvious for the reasons discussed above in claim 1, mutatis mutandis.

Claim 20. The combination of Corona and Nichol discloses a system, comprising: one or more memories storing instructions; and one or more processors that are coupled to the one or more memories and, when executing the instructions (Corona p. 5: "a pretrained and fine tuned ResNet-50, followed by a classification network … We then build a Fully Connected Network fed with (HC, Tobject, Ro)"; p. 6: "We perform a hyperparameter grid search to maximize [19] and finally train all models using LR=0.0001, BS=32 … using Adam optimizer … Training models for single object (ObMan) or multi-object (YCB-Affordance) scenes takes approximately 6 and 8 days respectively on a V100 GPU." Neural networks must be stored as instructions/parameters in computer memory.), are configured to … The combination of Corona and Nichol renders claim 20 obvious for the reasons discussed above in claim 1, mutatis mutandis.

Claims 5 and 14 are rejected under 35 U.S.C. 103 as being unpatentable over Corona and Nichol as applied to claims 1 and 11 above, and further in view of Jaderberg et al. (Spatial Transformer Networks – hereinafter "Jaderberg").

Claims 5 and 14. The combination of Corona and Nichol discloses the computer-implemented method of claim 1, wherein the first machine learning model comprises an encoder neural network (Corona p. 6: "We use a pre-trained ResNet-50 as image encoder." and Saharia p. 3: "Palette uses a UNet architecture [Ho et al. 2020] with several modifications inspired by recent work [Dhariwal and Nichol 2021; Saharia et al. 2021; Song et al. 2021]. The network architecture is based on the 256x256 class-conditional UNet model." U-Net = encoder-decoder architecture). Corona and Nichol disclose all of the subject matter as described above except for specifically teaching "a spatial transformer neural network." However, Jaderberg, in the same field of endeavor, teaches a spatial transformer neural network (Abstract: "the Spatial Transformer, which explicitly allows the spatial manipulation of data within the network. This differentiable module can be inserted into existing convolutional architectures, giving neural networks the ability to actively spatially transform feature maps, conditional on the feature map itself, without any extra training supervision or modification to the optimization process."). Therefore, it would have been obvious to one of ordinary skill in the art to combine Corona, Nichol, and Jaderberg before the effective filing date of the claimed invention. The motivation for this combination of references would have been: (1) known limitations: fixed CNN encoders struggle with geometric variations in object poses; (2) proven enhancement: Jaderberg demonstrates that spatial transformers improve spatial reasoning in CNN-based systems; (3) natural integration: spatial transformers are designed to augment encoder networks as used in Corona; (4) performance improvement: expected benefits include better handling of rotations, scaling, and perspective variations in object interactions.
Claims 8 and 16 are rejected under 35 U.S.C. 103 as being unpatentable over Corona and Nichol as applied to claims 1 and 11 above, and further in view of Ye et al. (What's in your hands? 3D Reconstruction of Generic Objects in Hands – hereinafter "Ye").

Claims 8 and 16. The combination of Corona and Nichol discloses the computer-implemented method of claim 1. Corona and Nichol disclose all of the subject matter as described above except for specifically teaching "performing one or more operations to generate three-dimensional geometry corresponding to the second object as set forth in the image of the second object interacting with the first object." However, Ye, in the same field of endeavor, teaches performing one or more operations to generate three-dimensional geometry corresponding to the second object as set forth in the image of the second object interacting with the first object (Abstract: "Our work aims to reconstruct hand-held objects given a single RGB image … our work reconstructs generic handheld object without knowing their 3D templates."; p. 3897: "Given an image depicting a hand holding an object, we aim to reconstruct the 3D shape of the underlying object … This network … maps a query 3D point to a signed distance from the object surface, and the zero-level set of this function can be extracted as the object surface"). Therefore, it would have been obvious to one of ordinary skill in the art to combine Corona, Nichol, and Ye before the effective filing date of the claimed invention. The motivation for this combination of references would have been to extend the 2D interaction image generation of Corona and Nichol with Ye's 3D reconstruction techniques to provide complete 3D geometric information about the interacting objects.

Claims 9, 10, 18, and 19 are rejected under 35 U.S.C. 103 as being unpatentable over Corona and Nichol as applied to claims 1 and 11 above, and further in view of Zhang et al. (Learning Object Placement by Inpainting for Compositional Data Augmentation – hereinafter "Zhang").

Claims 9 and 18. The combination of Corona and Nichol discloses the computer-implemented method of claim 1. Corona and Nichol disclose all of the subject matter as described above except for specifically teaching "detecting the second object as set forth in one or more training images of the second object interacting with the first object; determining one or more training parameter vectors based on the one or more training images and the second object detected in the one or more training images; and performing a plurality of operations to train the first machine learning model based on the one or more training images and the one or more training parameter vectors." However, Zhang, in the same field of endeavor, teaches:

detecting the second object as set forth in one or more training images of the second object interacting with the first object (p. 570: "Our system leverages existing instance segmentation dataset and a self-supervised image inpainting network to generate the necessary training data for learning object placement. Our insight is that we can generate such training data by removing objects from the background scenes. With an instance segmentation mask, we first cut out the object regions and then fill in the holes with an image inpainting network.");

determining one or more training parameter vectors based on the one or more training images and the second object detected in the one or more training images (p. 570: "After that, we simultaneously obtain a clean background scene without objects in it and the corresponding ground truth plausible placement locations and scales for placing these objects into the scene."; Fig. 2: "we first cut out the object region with the instance segmentation mask, and save the original bounding boxes as the ground truth plausible placement locations and scales."); and

performing a plurality of operations to train the first machine learning model based on the one or more training images and the one or more training parameter vectors (p. 567: "The 'free' labeled object-background pairs are then fed into our proposed PlaceNet, which predicts the location and scale to insert the object into the background."; p. 570: "Overall, our proposed data acquisition technique provides a way to generate large-scale training data for learning object placement without any human labeling.").

Therefore, it would have been obvious to one of ordinary skill in the art to combine Corona, Nichol, and Zhang before the effective filing date of the claimed invention. The motivation for this combination of references would have been to enhance Corona's hand-object interaction training system with Zhang's proven data quality pipeline, including automated object detection, segmentation, and artifact removal procedures, to improve the quality and scale of training data for more robust interaction prediction models, and then use Nichol's diffusion-based inpainting for the final high-quality content generation.

Claims 10 and 19. The combination of Corona, Nichol, and Zhang discloses the computer-implemented method of claim 1, further comprising:

performing one or more operations to separate the second object from one or more training images to generate one or more segmented images (Zhang p. 570, Fig. 2: "we crop out segmented objects corresponding to the bounding boxes");

inpainting one or more portions of the one or more training images based on the one or more segmented images to generate one or more inpainted images (Zhang p. 570, Fig. 2: "Finally, we use inpainting network to fill the holes of the occluded region and generate the clean background.");

performing one or more operations to remove one or more artifacts from the one or more inpainted images to generate one or more inpainted images with artifacts removed (Zhang Abstract: "This seemingly simple task is difficult for current learning based approaches because of the lack of labeled training pair of foreground objects paired with cleaned background scenes"; p. 570: "After that, we simultaneously obtain a clean background scene without objects in it and the corresponding ground truth plausible placement locations and scales for placing these objects into the scene."; Nichol, Section 4.3: "replacing the known region of the image with a sample from q(xt|x0) after each sampling step."); and

training the second machine learning model based on the one or more training images, the one or more inpainted images, and the one or more inpainted images with artifacts removed (Zhang p. 567: "The 'free' labeled object-background pairs are then fed into our proposed PlaceNet, which predicts the location and scale to insert the object into the background."; Zhang p. 570: "Overall, our proposed data acquisition technique provides a way to generate large-scale training data for learning object placement without any human labeling.").

The combination of Corona, Nichol, and Zhang renders claim 10 obvious for the reasons discussed above for claim 9, mutatis mutandis.

Claim 21. (New) The computer-implemented method of claim 1, wherein performing the one or more first denoising operations comprises: generating, via the first machine learning model, a first parameter vector indicating a spatial layout of the second object interacting with the first object (Corona p. 5034: "M: I => {C, V, P}", i.e., grasp type, hand shape, and hand pose, representing the spatial layout of the hand); and generating the mask based on the first parameter vector (GLIDE-style inpainting requires a 2D mask input; admitted prior art in Fig. 8A, ¶68).

Allowable Subject Matter

Claims 3 and 13 are objected to as being dependent upon a rejected base claim, but would be allowable if rewritten in independent form, including all of the limitations of the base claim and any intervening claims, AND overcoming the 35 USC 112(b) rejection above.

Conclusion

Any inquiry concerning this communication or earlier communications from the examiner should be directed to Ross Varndell, whose telephone number is (571) 270-1922. The examiner can normally be reached M-F, 9-5 EST. Examiner interviews are available via telephone, in person, and video conferencing using a USPTO-supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice. If attempts to reach the examiner by telephone are unsuccessful, the examiner's supervisor, O'Neal Mistry, can be reached at (313) 446-4912. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.

Information regarding the status of an application may be obtained from the Patent Application Information Retrieval (PAIR) system. Status information for published applications may be obtained from either Private PAIR or Public PAIR. Status information for unpublished applications is available through Private PAIR only. For more information about the PAIR system, see https://ppair-my.uspto.gov/pair/PrivatePair. Should you have questions on access to the Private PAIR system, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free).
If you would like assistance from a USPTO Customer Service Representative or access to the automated information system, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.

/Ross Varndell/
Primary Examiner, Art Unit 2674
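For readers mapping the §103 combination onto the underlying technique: the GLIDE-style inpainting step the rejection cites runs a learned reverse step p(xt-1|xt, c) conditioned on the erased image plus a mask channel, and re-imposes the known region with a sample from q(xt|x0) after each sampling step (Nichol, Section 4.3). The sketch below is an illustrative toy, not Nichol's implementation: `denoise_step` is a stand-in for the trained U-Net, and the noise schedule and step count are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def q_sample(x0, t, alpha_bar):
    """Forward process q(x_t | x_0): scale the clean image and add Gaussian noise."""
    return np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * rng.normal(size=x0.shape)

def denoise_step(x_t, t, cond):
    """Stand-in for the learned reverse step p(x_{t-1} | x_t, c).
    A real model would be a U-Net conditioned on `cond` (erased image + mask channel)."""
    masked_image, mask = cond
    return 0.9 * x_t + 0.1 * masked_image  # toy update: drift toward the known content

def inpaint(x0, mask, T=50):
    """Mask-conditioned sampling with known-region replacement after each step."""
    alpha_bar = np.linspace(0.9999, 0.01, T)   # hypothetical noise schedule
    cond = (x0 * mask, mask)                   # conditioning: erased image + mask
    x = rng.normal(size=x0.shape)              # start from pure noise
    for t in range(T - 1, -1, -1):
        x = denoise_step(x, t, cond)
        # replace the known (unmasked) region with a sample from q(x_t | x_0)
        x = mask * q_sample(x0, t, alpha_bar) + (1.0 - mask) * x
    return x

image = rng.normal(size=(8, 8))
mask = np.ones((8, 8))
mask[2:6, 2:6] = 0.0   # 0 = region to fill in (e.g., where the absent hand goes)
result = inpaint(image, mask)
```

By the final (lowest-noise) step the known region matches the input almost exactly, while the masked region is filled by the (here trivial) generator; with a trained model the fill would be the synthesized second object.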

Prosecution Timeline

Aug 21, 2023
Application Filed
Aug 04, 2025
Non-Final Rejection — §103, §112
Oct 24, 2025
Response Filed
Jan 13, 2026
Final Rejection — §103, §112
Mar 12, 2026
Request for Continued Examination
Mar 15, 2026
Response after Non-Final Action
Mar 19, 2026
Non-Final Rejection — §103, §112 (current)

Precedent Cases

Applications granted by this same examiner with similar technology

Patent 12603810: System and Method for Communications Beam Recovery (2y 5m to grant; granted Apr 14, 2026)
Patent 12597238: Automatic Image Variety Simulation for Improved Deep Learning Performance (2y 5m to grant; granted Apr 07, 2026)
Patent 12582348: Device and Method for Inspecting a Hair Sample (2y 5m to grant; granted Mar 24, 2026)
Patent 12579441: Systems and Methods for Image Reconstruction (2y 5m to grant; granted Mar 17, 2026)
Patent 12579786: System and Method for Property Typicality Determination (2y 5m to grant; granted Mar 17, 2026)
Study what changed to get past this examiner. Based on 5 most recent grants.


Prosecution Projections

3-4
Expected OA Rounds
85%
Grant Probability
98%
With Interview (+13.0%)
2y 4m
Median Time to Grant
High
PTA Risk
Based on 615 resolved cases by this examiner. Grant probability derived from career allow rate.
