Prosecution Insights
Last updated: April 19, 2026
Application No. 18/781,731

AMODAL SEGMENTATION BY SYNTHESIZING WHOLE OBJECTS

Non-Final OA (§102, §103)
Filed: Jul 23, 2024
Examiner: CHOW, JEFFREY J
Art Unit: 2618
Tech Center: 2600 — Communications
Assignee: The Trustees of Columbia University in the City of New York
OA Round: 1 (Non-Final)

Grant Probability: 77% (Favorable)
Expected OA Rounds: 1-2
Median Time to Grant: 3y 1m
With Interview: 92%

Examiner Intelligence

Career Allow Rate: 77% — above average (502 granted / 655 resolved; +14.6% vs TC avg)
Interview Lift: strong, +15.8% among resolved cases with an interview
Typical Timeline: 3y 1m average prosecution; 27 applications currently pending
Career History: 682 total applications across all art units
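
The percentages above are straightforward ratios of the career counts. As a sanity check, here is a minimal Python sketch, assuming the dashboard derives them this way; the 62% Tech Center baseline is back-solved from the stated +14.6% delta, not taken from a published source.

```python
# Recomputes the headline figures from the raw career counts (illustrative only).
granted = 502           # resolved cases ending in allowance
resolved = 655          # all resolved cases for this examiner
interview_lift = 0.158  # stated lift among resolved cases with an interview
tc_avg = 0.62           # assumed TC 2600 baseline, back-solved from the +14.6% delta

allow_rate = granted / resolved
print(f"Career allow rate: {allow_rate:.1%}")                   # 76.6%, shown as 77%
print(f"Delta vs TC avg:   {allow_rate - tc_avg:+.1%}")         # +14.6%
print(f"With interview:    {allow_rate + interview_lift:.1%}")  # 92.4%, shown as 92%
```

The same arithmetic reproduces the 92% with-interview figure used in the Prosecution Projections below.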

Statute-Specific Performance

§101: 11.2% (-28.8% vs TC avg)
§102: 27.1% (-12.9% vs TC avg)
§103: 40.2% (+0.2% vs TC avg)
§112: 10.6% (-29.4% vs TC avg)
TC averages are estimates • Based on career data from 655 resolved cases

Office Action

Rejections: §102, §103
Notice of Pre-AIA or AIA Status

The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Claim Rejections - 35 USC § 102

The following is a quotation of the appropriate paragraphs of 35 U.S.C. 102 that form the basis for the rejections under this section made in this Office action: A person shall be entitled to a patent unless – (a)(2) the claimed invention was described in a patent issued under section 151, or in an application for patent published or deemed published under section 122(b), in which the patent or application, as the case may be, names another inventor and was effectively filed before the effective filing date of the claimed invention.

Claims 1, 3-5, 7-11, and 13-19 are rejected under 35 U.S.C. 102(a)(2) as being anticipated by Zhang et al. (US 2024/0169541).

Regarding independent claim 1, Zhang teaches a method comprising: receiving a prompt (paragraph 68: a user 110 can provide an initial image 510 to an image processing system 130, where the initial image may include at least one occluded object 515 and an occluding object 512) selecting an object in an input image (paragraph 107: The diffusion model can also receive the class of the occluded object as a text prompt or description that can guide the generation of the missing part of the occluded object); applying the input image (paragraph 69: At operation 620, a segmentation head can identify a plurality of object instance 512, 515 in the image 510 . . . The outputs of the first step are then provided to the diffusion model to perform amodal mask completion) to a trained conditional generative model that generates an amodal image of the selected object based on the prompt and the input image (paragraph 74: At operation 670, a diffusion model can generate a complete image 590 of the occluded object 515, and provide the amodal segmentation mask with or without the completed image back to the user 110; paragraph 47: the computation and parameters in a diffusion model 260 take part in the learned mapping function which reduces noise at each timestep (denoted as F). The model takes as input x (i.e., noisy or partially denoised image depending on the timestep), the timestep t, and conditioning information the model was trained to use); and outputting the amodal image (paragraph 74: The diffusion model can generate a complete image 590 of the occluded object 515 based on the original image 510 and the amodal segmentation mask, where the trained diffusion model can fill in the unseen, occluded region of the occluded object 515).

Regarding dependent claim 3, Zhang teaches wherein the object is an occluded object that is partially visible in the input image (paragraph 68: a user 110 can provide an initial image 510 to an image processing system 130, where the initial image may include at least one occluded object 515 and an occluding object 512).

Regarding dependent claim 4, Zhang teaches creating a mask of a visible portion of the selected object based on the prompt; and inputting the mask into the trained conditional generative model, wherein the trained conditional generative model generates the amodal image of the selected object by synthesizing occluded portions of the object (paragraph 71: At operation 640, an instance mask can be generated for each of the object instances 512, 515, where the instance masks specify the visible portions of each of the objects 512, 515. The instance masks can be generated by a trained mask network, that can distinguish separate objects, where the instance masks can be generated based on the output of the instance segmentation. An occlusion mask can be generated for the overlapping visible region of the occluding object 512 and occluded object 515. The occlusion mask is not necessarily the exact overlapping visible region of the occluding object 512 and occluded object 515. The occlusion mask could be as small as the ground-truth invisible region of the occluded object 515, or as large as the union of object masks of both occluding object 512 and occluded object 515. The object masks can be generated for each of the different instances).

Regarding dependent claim 5, Zhang teaches wherein the trained conditional generative model is based on a training dataset comprising training occluded images comprising occluded objects and training counterpart images of whole object counterparts of the occluded objects (paragraph 30: The diffusion model can be trained to perform mask completion of an amodal segmentation mask using unoccluded object images labeled with ground truths, partially occluded object images with ground truths, and amodal segmentation masks as ground truths (i.e., true masks)).

Regarding dependent claim 7, Zhang teaches wherein the conditional generative model is a conditional diffusion model (paragraph 47: the computation and parameters in a diffusion model 260 take part in the learned mapping function which reduces noise at each timestep (denoted as F). The model takes as input x (i.e., noisy or partially denoised image depending on the timestep), the timestep t, and conditioning information the model was trained to use).

Regarding dependent claim 8, Zhang teaches conditioning a trained latent diffusion model using the training dataset to train the conditional diffusion model (paragraph 115: One or more machine learning models including a mask network and a diffusion model can be trained through supervised learning. The mask training method 1200 can involve adjusting parameters of transformers, encoders, decoders, deep neural networks, and diffusion models based on error scores between ground truth masks and predicted segmentation masks; paragraph 52: Diffusion models may also be characterized by whether the noise is added to the image itself, or to image features generated by an encoder (i.e., latent diffusion)).

Regarding dependent claim 9, Zhang teaches wherein conditioning the trained latent diffusion model further comprises, for each occluded image of the training dataset (paragraph 123: 40,000 images with multiple annotated instances can be used as the training set): receiving a prompt that selects an occluded object in the training occluded image (paragraph 55: The text prompt 340 can be encoded using a text encoder 350 (e.g., a multimodal encoder) to obtain guidance features 360 in guidance space 370; paragraph 107: The diffusion model can also receive the class of the occluded object as a text prompt or description that can guide the generation of the missing part of the occluded object); generating a mask of a visible region of the occluded object based on the prompt (paragraph 116: At operation 1210, a training image can be identified, where the training image can include an occluded object having a visible region and an occluded region. An occluding object is in front of the occluded region of the occluded object. The training image can include a ground truth segmentation mask for the occluded object, where a ground truth amodal segmentation mask indicates a region that includes the visible region and the occluded region of the occluded object); applying noise to the training counterpart image (paragraph 53: Diffusion models work by iteratively adding noise to the data during a forward process and then learning to recover the data by denoising the data during a reverse process. For example, during training, guided latent diffusion model 300 may take an original image 301 in a pixel space 305 as input and apply forward diffusion process 310 to gradually add noise to the original image 305 to obtain noisy images 320 at various noise levels); and conditioning the trained latent diffusion model based on the prompt, the mask, the training occluded image, and the noised training counterpart image (paragraph 55: The guidance features 360 can be combined with the noisy images 320 at one or more layers of the reverse diffusion process 330 to ensure that the output image 390 includes content described by the text prompt 340; paragraph 122: At operation 1250, the parameters of the mask network can be updated, for example, through back propagation based on the comparison, to more closely match the expected position and area of the masks and amodal segmentation mask. Updating the parameters of the diffusion model of the mask network based on the comparison can train the diffusion model to generate more accurate segmentation masks).

Regarding dependent claim 10, Zhang teaches wherein the conditioned trained latent diffusion model generates a reconstruction of the training counterpart image by denoising the noised training counterpart image (paragraph 54: Next, a reverse diffusion process 330 (e.g., a U-Net Artificial Neural Network (ANN)) gradually removes the noise from the noisy images 320 at the various noise levels to obtain an output image 390. In some cases, an output image 390 is created from each of the various noise levels. The output image 390 can be compared to the original image 301 to train the reverse diffusion process 330).

Regarding independent claim 11, Zhang teaches a system, comprising: a memory storing instructions; and a processor communicatively coupled to the memory (paragraph 124) and configured to execute the instructions to: receive a prompt (paragraph 68: a user 110 can provide an initial image 510 to an image processing system 130, where the initial image may include at least one occluded object 515 and an occluding object 512) selecting an object in an input image (paragraph 107: The diffusion model can also receive the class of the occluded object as a text prompt or description that can guide the generation of the missing part of the occluded object); apply the input image (paragraph 69: At operation 620, a segmentation head can identify a plurality of object instance 512, 515 in the image 510 . . . The outputs of the first step are then provided to the diffusion model to perform amodal mask completion) to a trained conditional generative model that generates an amodal image of the selected object based on the prompt and the input image (paragraph 74: At operation 670, a diffusion model can generate a complete image 590 of the occluded object 515, and provide the amodal segmentation mask with or without the completed image back to the user 110; paragraph 47: the computation and parameters in a diffusion model 260 take part in the learned mapping function which reduces noise at each timestep (denoted as F). The model takes as input x (i.e., noisy or partially denoised image depending on the timestep), the timestep t, and conditioning information the model was trained to use); and output the amodal image (paragraph 74: The diffusion model can generate a complete image 590 of the occluded object 515 based on the original image 510 and the amodal segmentation mask, where the trained diffusion model can fill in the unseen, occluded region of the occluded object 515).

Regarding dependent claim 13, Zhang teaches wherein the object is an occluded object that is partially visible in the input image (paragraph 68: a user 110 can provide an initial image 510 to an image processing system 130, where the initial image may include at least one occluded object 515 and an occluding object 512).

Regarding dependent claim 14, Zhang teaches wherein the processor is further configured to execute the instructions to: create a mask of a visible portion of the selected object based on the prompt; and input the mask into the trained conditional diffusion model, wherein the trained conditional diffusion model generates the amodal image of the selected object by synthesizing occluded portions of the object (paragraph 71: At operation 640, an instance mask can be generated for each of the object instances 512, 515, where the instance masks specify the visible portions of each of the objects 512, 515. The instance masks can be generated by a trained mask network, that can distinguish separate objects, where the instance masks can be generated based on the output of the instance segmentation. An occlusion mask can be generated for the overlapping visible region of the occluding object 512 and occluded object 515. The occlusion mask is not necessarily the exact overlapping visible region of the occluding object 512 and occluded object 515. The occlusion mask could be as small as the ground-truth invisible region of the occluded object 515, or as large as the union of object masks of both occluding object 512 and occluded object 515. The object masks can be generated for each of the different instances).

Regarding dependent claim 15, Zhang teaches wherein the trained conditional diffusion model is based on a training dataset comprising training occluded images comprising occluded objects and training counterpart images of whole object counterparts of the occluded objects (paragraph 30: The diffusion model can be trained to perform mask completion of an amodal segmentation mask using unoccluded object images labeled with ground truths, partially occluded object images with ground truths, and amodal segmentation masks as ground truths (i.e., true masks)).
Regarding dependent claim 16, Zhang teaches wherein the conditional generative model is a conditional diffusion model (paragraph 47: the computation and parameters in a diffusion model 260 take part in the learned mapping function which reduces noise at each timestep (denoted as F). The model takes as input x (i.e., noisy or partially denoised image depending on the timestep), the timestep t, and conditioning information the model was trained to use).

Regarding dependent claim 17, Zhang teaches wherein the processor is further configured to execute the instructions to: condition a trained latent diffusion model using the training dataset to train the conditional diffusion model (paragraph 115: One or more machine learning models including a mask network and a diffusion model can be trained through supervised learning. The mask training method 1200 can involve adjusting parameters of transformers, encoders, decoders, deep neural networks, and diffusion models based on error scores between ground truth masks and predicted segmentation masks; paragraph 52: Diffusion models may also be characterized by whether the noise is added to the image itself, or to image features generated by an encoder (i.e., latent diffusion)).

Regarding dependent claim 18, Zhang teaches wherein conditioning the trained latent diffusion model further comprises, for each occluded image of the training dataset (paragraph 123: 40,000 images with multiple annotated instances can be used as the training set): receiving a prompt that selects an occluded object in the training occluded image (paragraph 55: The text prompt 340 can be encoded using a text encoder 350 (e.g., a multimodal encoder) to obtain guidance features 360 in guidance space 370; paragraph 107: The diffusion model can also receive the class of the occluded object as a text prompt or description that can guide the generation of the missing part of the occluded object); generating a mask of a visible region of the occluded object based on the prompt (paragraph 116: At operation 1210, a training image can be identified, where the training image can include an occluded object having a visible region and an occluded region. An occluding object is in front of the occluded region of the occluded object. The training image can include a ground truth segmentation mask for the occluded object, where a ground truth amodal segmentation mask indicates a region that includes the visible region and the occluded region of the occluded object); applying noise to the training counterpart image (paragraph 53: Diffusion models work by iteratively adding noise to the data during a forward process and then learning to recover the data by denoising the data during a reverse process. For example, during training, guided latent diffusion model 300 may take an original image 301 in a pixel space 305 as input and apply forward diffusion process 310 to gradually add noise to the original image 305 to obtain noisy images 320 at various noise levels); and conditioning the trained latent diffusion model based on the prompt, the mask, the training occluded image, and the noised training counterpart image (paragraph 55: The guidance features 360 can be combined with the noisy images 320 at one or more layers of the reverse diffusion process 330 to ensure that the output image 390 includes content described by the text prompt 340; paragraph 122: At operation 1250, the parameters of the mask network can be updated, for example, through back propagation based on the comparison, to more closely match the expected position and area of the masks and amodal segmentation mask. Updating the parameters of the diffusion model of the mask network based on the comparison can train the diffusion model to generate more accurate segmentation masks).

Regarding independent claim 19, Zhang teaches a non-transitory machine-readable medium having instructions stored therein, which when executed by a processor, cause the processor to perform operations (paragraph 134), the operations comprising: building a synthetically curated training dataset by generating training data pairs from source images (paragraph 123: 40,000 images with multiple annotated instances can be used as the training set), each training data pair comprising a training occluded image of an occluded object and a training counterpart image of a whole object corresponding to the occluded object (paragraph 30: The diffusion model can be trained to perform mask completion of an amodal segmentation mask using unoccluded object images labeled with ground truths, partially occluded object images with ground truths, and amodal segmentation masks as ground truths (i.e., true masks)); and conditioning a latent diffusion model to generate an amodal image of an occluded object in an input image based on the synthetically curated training dataset (paragraph 74: At operation 670, a diffusion model can generate a complete image 590 of the occluded object 515, and provide the amodal segmentation mask with or without the completed image back to the user 110; paragraph 47: the computation and parameters in a diffusion model 260 take part in the learned mapping function which reduces noise at each timestep (denoted as F). The model takes as input x (i.e., noisy or partially denoised image depending on the timestep), the timestep t, and conditioning information the model was trained to use).

Claim Rejections - 35 USC § 103

The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action: A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

The factual inquiries for establishing a background for determining obviousness under 35 U.S.C. 103 are summarized as follows:
1. Determining the scope and contents of the prior art.
2. Ascertaining the differences between the prior art and the claims at issue.
3. Resolving the level of ordinary skill in the pertinent art.
4. Considering objective evidence present in the application indicating obviousness or nonobviousness.

This application currently names joint inventors. In considering patentability of the claims the examiner presumes that the subject matter of the various claims was commonly owned as of the effective filing date of the claimed invention(s) absent any evidence to the contrary. Applicant is advised of the obligation under 37 CFR 1.56 to point out the inventor and effective filing dates of each claim that was not commonly owned as of the effective filing date of the later invention in order for the examiner to consider the applicability of 35 U.S.C. 102(b)(2)(C) for any potential 35 U.S.C. 102(a)(2) prior art against the later invention.

Claims 2 and 12 are rejected under 35 U.S.C. 103 as being unpatentable over Zhang et al. (US 2024/0169541) in view of Toizumi et al. (US 2021/0056343).

Regarding dependent claim 2, Zhang does not expressly disclose wherein the input image is a zero-shot image that has not been previously applied to the trained conditional generative model; however, Zhang does disclose a user 110 can provide an initial image 510 to an image processing system 130 (paragraph 68), and the image processing system 130 can include a transformer decoder 730 that can be configured to generate masks, tokens, labels, and confidence scores (paragraph 82). Toizumi discloses in the zero-shot recognition, labels of training data and test data are separated, and unsupervised recognition is performed with respect to the test data (paragraph 2). It would have been obvious for one of ordinary skill in the art at the time of the invention (pre-AIA) or at the time of the effective filing date of the application (AIA) to modify Zhang's system to input zero-shot images to a trained diffusion module that has not been specifically trained on the object of the zero-shot images. One would be motivated to do so because this would help test the trained diffusion module for accuracy of recognizing untested objects in zero-shot images (paragraphs 73-74).

Regarding dependent claim 12, Zhang does not expressly disclose wherein the input image is a zero-shot image that has not been previously applied to the trained conditional generative model; however, Zhang does disclose a user 110 can provide an initial image 510 to an image processing system 130 (paragraph 68), and the image processing system 130 can include a transformer decoder 730 that can be configured to generate masks, tokens, labels, and confidence scores (paragraph 82). Toizumi discloses in the zero-shot recognition, labels of training data and test data are separated, and unsupervised recognition is performed with respect to the test data (paragraph 2). It would have been obvious for one of ordinary skill in the art at the time of the invention (pre-AIA) or at the time of the effective filing date of the application (AIA) to modify Zhang's system to input zero-shot images to a trained diffusion module that has not been specifically trained on the object of the zero-shot images. One would be motivated to do so because this would help test the trained diffusion module for accuracy of recognizing untested objects in zero-shot images (paragraphs 73-74).
Allowable Subject Matter

Claims 6 and 20 are objected to as being dependent upon a rejected base claim, but would be allowable if rewritten in independent form including all of the limitations of the base claim and any intervening claims.

The following is an examiner's statement of reasons for allowance: the limitations "superimposing the first training counterpart image over the second source image to generate a training occluded image; and associating the second training counterpart image with the training occluded image, wherein the second training counterpart image and the training occluded image constitute a training data pair" are not taught or rendered obvious by the cited prior art. Zhang et al. (US 2024/0169541) discloses "The diffusion model can be trained to perform mask completion of an amodal segmentation mask using unoccluded object images labeled with ground truths, partially occluded object images with ground truths, and amodal segmentation masks as ground truths (i.e., true masks)" (paragraph 30), but does not disclose the process of superimposing objects/images to produce the claimed counterpart training data pair.

Any comments considered necessary by applicant must be submitted no later than the payment of the issue fee and, to avoid processing delays, should preferably accompany the issue fee. Such submissions should be clearly labeled "Comments on Statement of Reasons for Allowance."

Conclusion

Any inquiry concerning this communication or earlier communications from the examiner should be directed to JEFFREY J CHOW whose telephone number is (571)272-8078. The examiner can normally be reached 11AM-7PM. Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice. If attempts to reach the examiner by telephone are unsuccessful, the examiner's supervisor, Devona Faulk, can be reached at 571-272-7515. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.

Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.

/JEFFREY J CHOW/
Primary Examiner, Art Unit 2618
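
For context on the technology being examined, the conditioning process recited in claims 8-10 (and mirrored in claims 17-18) corresponds to a standard conditional diffusion training step: noise the whole-object counterpart image, then train the model to predict that noise given the prompt, the visible-region mask, and the occluded image. The sketch below is illustrative only, written against a PyTorch-style API; `model`, `text_encoder`, and the cosine noise schedule are assumptions made for the example, not anything disclosed in the application or in Zhang.

```python
import math
import torch
import torch.nn.functional as F

def training_step(model, text_encoder, occluded_img, counterpart_img,
                  mask, prompt, num_steps=1000):
    """One conditional denoising step of the kind claims 8-10 describe (schematic)."""
    batch = counterpart_img.shape[0]
    t = torch.randint(0, num_steps, (batch,))            # random timestep per sample
    noise = torch.randn_like(counterpart_img)            # forward-process noise
    # Illustrative cosine schedule; real systems precompute a noise schedule.
    alpha_bar = torch.cos(t.float() / num_steps * math.pi / 2) ** 2
    alpha_bar = alpha_bar.view(-1, 1, 1, 1)
    noised = alpha_bar.sqrt() * counterpart_img + (1 - alpha_bar).sqrt() * noise

    cond = torch.cat([occluded_img, mask], dim=1)        # image-space conditioning channels
    guidance = text_encoder(prompt)                      # prompt -> guidance features
    pred_noise = model(noised, t, cond, guidance)        # conditioned denoiser
    return F.mse_loss(pred_noise, noise)                 # standard denoising objective
```

Note that the allowable subject matter (claims 6 and 20) turns on how the training pairs are constructed, by superimposing a counterpart image over a second source image, not on this training loop, which the examiner maps to Zhang.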

Prosecution Timeline

Jul 23, 2024: Application Filed
Jan 02, 2026: Non-Final Rejection (§102, §103)
Feb 12, 2026: Interview Requested
Feb 26, 2026: Applicant Interview (Telephonic)
Feb 27, 2026: Examiner Interview Summary

Precedent Cases

Applications granted by the same examiner in similar technology

Patent 12602845 • UNIVERSAL STATE REPRESENTATIONS OF VISUALIZATIONS FOR DIFFERENT TYPES OF DATA MODELS • Granted Apr 14, 2026 (2y 5m to grant)
Patent 12591949 • IMAGE GENERATION USING ONE OR MORE NEURAL NETWORKS • Granted Mar 31, 2026 (2y 5m to grant)
Patent 12586305 • 3D REFERENCE POINT DETECTION FOR SURVEY FOR VENUE MODEL CONSTRUCTION • Granted Mar 24, 2026 (2y 5m to grant)
Patent 12586267 • INTERACTION METHOD AND APPARATUS IN LIVE STREAMING ROOM, DEVICE, AND STORAGE MEDIUM • Granted Mar 24, 2026 (2y 5m to grant)
Patent 12579735 • A VISUALIZATION SYSTEM FOR CREATING A MIXED REALITY GAMING ENVIRONMENT • Granted Mar 17, 2026 (2y 5m to grant)
Study what changed to get past this examiner, based on the 5 most recent grants.

Prosecution Projections

Expected OA Rounds: 1-2
Grant Probability: 77%
With Interview: 92% (+15.8%)
Median Time to Grant: 3y 1m
PTA Risk: Low
Based on 655 resolved cases by this examiner. Grant probability derived from career allow rate.
