Prosecution Insights
Last updated: April 19, 2026
Application No. 18/626,427

SYSTEMS AND METHODS FOR IMAGE COMPOSITING VIA MACHINE LEARNING

Final Rejection §103

Filed: Apr 04, 2024
Examiner: WANG, YUEHAN
Art Unit: 2617
Tech Center: 2600 — Communications
Assignee: Yahoo Assets LLC
OA Round: 2 (Final)
Grant Probability: 83% (Favorable)
Expected OA Rounds: 3-4
Median Time to Grant: 2y 7m
Grant Probability With Interview: 96%

Examiner Intelligence

Career Allow Rate: 83% (above average; 404 granted / 485 resolved; +21.3% vs TC avg)
Interview Lift: +12.9% (moderate, roughly +13%, measured on resolved cases with interview)
Typical Timeline: 2y 7m average prosecution; 47 applications currently pending
Career History: 532 total applications across all art units

Statute-Specific Performance

§101: 4.3% (-35.7% vs TC avg)
§103: 69.6% (+29.6% vs TC avg)
§102: 8.3% (-31.7% vs TC avg)
§112: 6.6% (-33.4% vs TC avg)

Tech Center average figures are estimates. Based on career data from 485 resolved cases.

Office Action (§103)

DETAILED ACTION

Response to Amendment

Applicant's amendments filed on 04 February 2026 have been entered. Claims 1-20 are still pending in this application, with claims 1, 9 and 17 being independent.

Notice of Pre-AIA or AIA Status

The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Claim Rejections - 35 USC § 103

In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.

The following is a quotation of 35 U.S.C. 103, which forms the basis for all obviousness rejections set forth in this Office action:

A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102 of this title, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

Claims 1-20 are rejected under 35 U.S.C. 103 as being unpatentable over Zeng et al. (US 20250166237 A1), referred to herein as Zeng, in view of Shen et al. (US 20190361994 A1), referred to herein as Shen.

Regarding Claim 1, Zeng in view of Shen teaches a method comprising (Zeng Abst: using neural networks for generating multiple related images): training, by a processor, a machine learning model to create composite images from background scenes and foreground objects by providing the machine learning model with (Zeng [0088] FIG. 5 illustrates another block diagram for training a neural network to generate images with a same object… [0092] The self-attention layers 405 draw dependencies between the foreground images of the one or more input images (e.g., 502, 503) through self-attention of the concatenated image map 507. Additionally, the self-attention layer draws dependencies between the backgrounds of the one or more inputs (e.g., 502, 503). In effect, the self-attention layers 405 separate the foreground image from the background image through self-attention); the input image (502 or 503) comprises combined foreground and background images so as to be separable. Zeng does not teach, but Shen does teach, a plurality of sets of triplets each comprised of a training background scene, a training foreground object (Shen [0004] In order to train machine-learning models of the convolutional neural networks, triplets of training digital images are used. Each triplet includes a positive foreground digital image and a positive background digital image taken from the same digital image, e.g., through use of segmentation mask annotations. The triplet also contains a negative foreground or background digital image that is dissimilar to the positive foreground or background digital image that is also included as part of the triplet).
Zeng in view of Shen further teaches identifying, by the processor, a digital image file that comprises a background scene and an additional digital image file that comprises a foreground object (Shen [0068] a training data generation module 802 decomposes each of the digital images into background scenes and foreground objects. An example of this is illustrated in FIG. 8 in which an original digital image 804 is used to generate a positive background digital image 806 and a positive foreground digital image 808); compositing, by the machine learning model executed by the processor, the digital image file that comprises the background scene and the additional digital image file that comprises the foreground object to produce a composite digital image file that comprises the foreground object and the background scene by performing at least one of a channel concatenation step and a reverse diffusion sampling step (Zeng [0115] The generated images have as the foreground image input 601 with the subject "fox" 611 in different poses. The background of images 603, 604, 605, and 606 are scenes described by prompt 602. In at least one embodiment, image 603 is input 601 with a background scene of a forest in spring; [0100] input images 502, 503 can be better adapted to personalized image generation by adding an extra mask and input image channel to the diffusion model training; [0069] a diffusion model receives a random noise sample (e.g., drawn from a Gaussian distribution) and applies reverse diffusion steps to progressively denoise it (e.g., in each layer of a diffusion neural network). In at least one embodiment, a diffusion model is conditional, and it utilizes additional conditioning information such as text descriptions or class labels to steer generation towards a desired outcome; [0088] The one or more feature image maps 505, 506 are concatenated together by one or more concatenation layers 404); and causing display, by the processor, of the composite image file that comprises the foreground object and the background scene (Zeng [0143] generating one or more output images 912. In at least one embodiment, cross-attention layer 406 outputs images 519, 520. Images 519, 520 are images having inputs 502, 503 as the foreground image and background images described by prompts 514, 515. In effect, neural network 501 combines inputs 502, 503 with backgrounds as set forth in prompts 514, 515, generating personalized images 519, 520; [0427] perform pixel shading or other screen space operations, to produce a rendered image for display).

Shen discloses compositing-aware digital image search techniques and systems, which is analogous to the present patent application. It would have been obvious for a person of ordinary skill in the art before the effective filing date of the claimed invention to have modified Zeng to incorporate the teachings of Shen and apply triplets of training digital images to methods for using one or more neural networks to generate several images. Doing so would provide usable training triplets for the system to define compatibility of the foreground and background digital images with each other.
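For orientation, the triplet structure the rejection attributes to Shen [0004]/[0068] can be sketched as a dataset of (background scene, foreground object, composite) records. This is a minimal illustrative sketch only: the class name, tensor shapes, and on-disk layout are assumptions for exposition, not code from either reference.

```python
# Illustrative sketch only; neither Zeng nor Shen publishes code. The class
# name, shapes, and file layout below are assumptions made for exposition.
from typing import List, Tuple
import torch
from torch.utils.data import Dataset
from torchvision.io import read_image  # returns a uint8 (C, H, W) tensor

class CompositingTriplets(Dataset):
    """Yields (background scene, foreground object, composite) training
    triplets, mirroring the triplet structure cited from Shen [0004]."""

    def __init__(self, records: List[Tuple[str, str, str]]):
        # Each record is (background_path, foreground_path, composite_path).
        self.records = records

    def __len__(self) -> int:
        return len(self.records)

    def __getitem__(self, i):
        bg_path, fg_path, comp_path = self.records[i]
        to_float = lambda p: read_image(p).float() / 255.0  # normalize to [0, 1]
        return to_float(bg_path), to_float(fg_path), to_float(comp_path)
```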
Regarding Claim 2, Zeng in view of Shen teaches the method of claim 1, and further teaches wherein identifying, by the processor, the digital image file and the additional digital image file comprises receiving text instructions describing at least one of the foreground object and the background scene (Zeng [0074] neural network 202 generates one or more text descriptions 204 of one or more scenes creating description set {y.sub.m, y.sub.m+1, y.sub.m+2 . . . M}. For example, scenes may include one or more backgrounds, scenery, settings, environments, surroundings, one or more contexts, or other descriptions of the portion of an image that is not a foreground).

Regarding Claim 3, Zeng in view of Shen teaches the method of claim 2, and further teaches the method further comprising generating, by the machine learning model, at least one of the digital image file and the additional digital image file in response to receiving the text instructions (Zeng [0078] outputs 209 are images of the one or more input 201 subjects without backgrounds; [0079] training process 300 includes neural network 302 to receive one or more inputs 301 including one or more subjects, generate one or more background prompts 303a-303n, connect one or more intermediate images corresponding to outputs 209 with the one or more background prompts 303a-303n).

Regarding Claim 4, Zeng in view of Shen teaches the method of claim 1, and further teaches wherein compositing, by the machine learning model executed by the processor, the digital image file and the additional digital image file by performing the channel concatenation step comprises: adding the foreground object as at least one channel to an intermediate composite image (Zeng [0078] generating outputs 209 that are intermediate images. In at least one embodiment, outputs 209 are images of the one or more input 201 subjects without backgrounds); adding the background scene as at least one additional channel to the intermediate composite image (Zeng [0079] generate one or more background prompts 303a-303n, connect one or more intermediate images corresponding to outputs 209 with the one or more background prompts 303a-303n); and performing channel concatenation with the intermediate composite image such that a result of the concatenation preserves information from the foreground object, the background scene, and the intermediate composite image (Zeng [0085] The feature image maps generated by convolution layer 403 are received by one or more concatenation layers 404 that concatenate each of the feature image maps into a single feature map, where said single feature map is provided to one or more self-attention layers 405; [0129] neural network training includes receiving 804 subject-background prompt vector space of neural network 302 and receiving 806 intermediate images 209 of neural network 205; [0134] processing the concatenated feature maps using one or more self-attention neural networks or layers 906; [0143] generating one or more output images 912).
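Claim 4's channel-concatenation step amounts to stacking the foreground, the background, and the intermediate composite along the channel axis so that no source's information is discarded. A minimal sketch, assuming 3-channel float tensors; the function name and shapes are illustrative, not taken from Zeng:

```python
# Illustrative sketch of claim 4's channel-concatenation step. Assumes all
# three inputs are float tensors of shape (C, H, W); not code from Zeng.
import torch

def concat_channels(foreground: torch.Tensor,
                    background: torch.Tensor,
                    intermediate: torch.Tensor) -> torch.Tensor:
    """Stack foreground, background, and the intermediate composite along the
    channel axis so downstream layers see information from all three."""
    return torch.cat([foreground, background, intermediate], dim=0)

fg = torch.rand(3, 256, 256)     # foreground object, added as channels
bg = torch.rand(3, 256, 256)     # background scene, added as more channels
inter = torch.rand(3, 256, 256)  # intermediate composite image
fused = concat_channels(fg, bg, inter)
assert fused.shape == (9, 256, 256)  # information from all three is preserved
```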
Regarding Claim 5, Zeng in view of Shen teaches the method of claim 1, and further teaches wherein compositing, by the machine learning model executed by the processor, the digital image file and the additional digital image file by performing the reverse diffusion sampling step comprises encoding the foreground object into tokens and performing cross-attention on the tokens (Zeng [0092] The self-attention layers 405 draw dependencies between the foreground images of the one or more input images (e.g., 502, 503) through self-attention of the concatenated image map 507; [0139] processing the divided feature maps by one or more cross-attention layers 910. Processed feature maps 512, 513 are input to cross-attention layer 406; [0695] a token is a portion of input data. In at least one embodiment, a token is a word. In at least one embodiment, a token is a character; [0698] In at least one embodiment, an encoder encodes input data 4610 into one or more feature vectors. In at least one embodiment, an encoder encodes input data 4610 into a sentence embedding vector).
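The mechanism claim 5 recites (encoding the foreground object into tokens and cross-attending to them during reverse-diffusion sampling) can be sketched as follows. The patch size, embedding width, and module choices are assumptions for illustration, not Zeng's architecture:

```python
# Illustrative sketch of claim 5's mechanism: encode the foreground object
# into tokens, then let denoiser features cross-attend to those tokens.
# Dimensions and module choices are assumptions, not taken from Zeng.
import torch
import torch.nn as nn

d_model = 256
to_tokens = nn.Conv2d(3, d_model, kernel_size=16, stride=16)  # 1 token per 16x16 patch
cross_attn = nn.MultiheadAttention(embed_dim=d_model, num_heads=8, batch_first=True)

foreground = torch.rand(1, 3, 256, 256)
tokens = to_tokens(foreground).flatten(2).transpose(1, 2)  # (1, 256 tokens, d_model)

latent = torch.rand(1, 1024, d_model)  # features of the partially denoised image
fused, _ = cross_attn(query=latent, key=tokens, value=tokens)
# 'fused' carries foreground information into the reverse-diffusion pathway
```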
Regarding Claim 6, Zeng in view of Shen teaches the method of claim 1, and further teaches wherein providing the machine learning model with the plurality of sets of triplets comprises generating the plurality of sets of triplets (Shen [0077] These similar foreground digital images 902 are treated as compatible foregrounds for the positive background digital image 806, e.g., as new triplets of training digital images. In this way, the number of positive training pairs may be increased and also reduce noise in negative pair sampling. This may also be used to replace the positive background digital image 806 with a similar background digital image 904 that when combined with the positive foreground digital image 808 also acts to increase a number of triplets of training digital images).

Regarding Claim 7, Zeng in view of Shen teaches the method of claim 6, and further teaches generating the plurality of sets of triplets by compositing the training foreground object with the training background scene to create the training composite image via diffusion with classifier guidance that ensures that the training composite image contains a version of the training foreground object and a version of the training background scene (Shen [0033] The background and foreground features are usable to determine compatibility of a foreground image with a background image. In an implementation, this may also be aided through use of a category feature machine learning system 128 that is usable to learn category features from categorical data that is provided along with the foreground and background images. The categorical data, for instance, may define a category defining "what" is included in the foreground and background digital images and thus aid the search as further described below; [0048] In order to incorporate the category features by the context aware image search system 118, the category features 308 are encoded as part of the background features 210 and the foreground features 212. To do so, multimodal compact bilinear pooling (MCB) modules 310 are used in the illustrated example to take an outer product of the two vectors (e.g., the background features 210 and the category features 308; or the foreground features 212 and the category features 308) to form the combination, although other techniques are also contemplated. Feature transformation modules 314, 316 are then employed to adopt both an inner product and compact bilinear pooling along with a light computation CNN to generate scores through use of a score calculation module 214 that employs a triplet loss function).

Regarding Claim 8, Zeng in view of Shen teaches the method of claim 1, and further teaches the method further comprising: learning an appearance of a foreground object from one or more images (Shen [0074] The training data generation module 802 then employs matching criteria to find similar foreground and/or background digital images 902, 904. Examples of matching criteria include semantic context and shape information; [0076] shape information has increased effectiveness in finding similar foreground digital images. Additionally, foreground objects having a more diverse appearance may vary according to different scenes and therefore semantic context information has increased effectiveness in finding similar foreground digital images); and generating training triplets by one of adding the foreground object to a background scene using inpainting, or using a model with classifier guidance and a prompt corresponding to the learned object (Zeng [0063] processors use a neural network to receive different text prompts and generate images of a same subject (e.g., a same dog) in different backgrounds corresponding to different text prompts; [0081] training process 300 includes neural network 304 that receives as input the subject-background prompt vector space of neural network 302 and the outputs 209 of neural network 205 and outputs one or more images 305a-305n of the subjects of subject set 203 over one or more backgrounds described by the one or more background prompts 303a-303n. In at least one embodiment, neural network 304 is one or more of a text to image diffusion model, a latent text to image diffusion model, a stable diffusion inpainting model, or other neural networking model trained to connect text with images described by the text; [0191] In at least one embodiment, OpenVINO supports neural network models for various tasks and operations, such as classification, segmentation, object detection, face recognition, speech recognition, pose estimation (e.g., humans and/or objects), monocular depth estimation, image inpainting, style transfer, action recognition, colorization, and/or variations thereof).

Regarding Claims 9-16, Zeng in view of Shen teaches a non-transitory computer-readable storage medium for tangibly storing computer program instructions capable of being executed by a computer processor (Zeng Abst: using neural networks for generating multiple related images; [0001] processors, computing systems, devices, non-transitory computer medium, and/or methods for using neural networks for generating multiple related images). The metes and bounds of these claims substantially correspond to the claimed limitations set forth in claims 1-8; thus they are rejected on similar grounds and rationale as their corresponding limitations.

Regarding Claims 17-20, Zeng in view of Shen teaches a device comprising: a processor; and a storage medium for tangibly storing thereon logic for execution by the processor (Zeng Abst: using neural networks for generating multiple related images; [0001] processors, computing systems, devices, non-transitory computer medium, and/or methods for using neural networks for generating multiple related images). The metes and bounds of these claims substantially correspond to the claimed limitations set forth in claims 1-4; thus they are rejected on similar grounds and rationale as their corresponding limitations.
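Claims 7 and 8 reference diffusion with classifier guidance. In the standard formulation (generic, not specific to either reference), each reverse-diffusion step shifts the denoiser's predicted mean along the gradient of a classifier's log-probability that the required content (e.g., the foreground object) is present. A sketch with hypothetical stand-in models:

```python
# Generic classifier-guidance sketch as referenced for claims 7-8. The
# 'denoiser' and 'classifier' below are hypothetical stand-ins, not Zeng's
# or Shen's models; replace them with real networks in practice.
import torch

def guided_reverse_step(x, t, denoiser, classifier, target, scale=1.0):
    x = x.detach().requires_grad_(True)
    log_p = classifier(x, t).log_softmax(dim=-1)[:, target].sum()
    grad = torch.autograd.grad(log_p, x)[0]  # direction that raises p(target | x)
    mean, sigma = denoiser(x, t)             # predicted posterior mean / noise scale
    # Shift the mean toward samples the classifier says contain the target.
    return (mean + sigma**2 * scale * grad + sigma * torch.randn_like(x)).detach()

# Dummy stand-ins so the sketch runs end to end:
denoiser = lambda x, t: (x * 0.9, torch.tensor(0.1))
classifier = lambda x, t: x.mean(dim=(1, 2, 3)).unsqueeze(-1).repeat(1, 10)
x = torch.randn(2, 3, 32, 32)
x = guided_reverse_step(x, t=torch.tensor(0), denoiser=denoiser,
                        classifier=classifier, target=3, scale=2.0)
```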
Response to Arguments

Applicant's arguments filed on 04 February 2026 with respect to the §103 rejection have been fully considered, but they are not persuasive.

On page 8 of Applicant's Remarks, with respect to claims 1, 9 and 17, the applicant argues that Shen describes an image harmonization system, which is a fundamentally different task from image compositing. The examiner respectfully disagrees with this argument. Shen explicitly recites "An example of functionality incorporated by the image processing system 110 to process the digital image includes digital image compositing. Digital image compositing involves combining foreground objects and background scenes from different sources to generate a new composite digital image" (see [0031]). In addition to image compositing, Shen further discloses other composition-related technologies, such as identifying essential features and defining compatibility of the foreground and background of images. Regarding the first argument, it is respectfully noted that Shen is strongly analogous art to the primary reference Zeng and to the current application.

On page 9 of Applicant's Remarks, with respect to claims 1, 9 and 17, the applicant argues that at no point does Shen train a model to receive a separate background scene and a separate foreground object and produce a composite that combines them. The examiner respectfully disagrees with this argument. Shen explicitly recites "a background digital image 130 may be used as a basis to generate image feature data 132 that includes background features 134 that are used to determine compatibility with digital images 120 of a foreground. Likewise, a foreground digital image 136 may be used to generate image feature data 138 having foreground features 140 that are used to determine compatibility with digital images of a background" (see [0034]). The applicant further argues that Shen's model never learns to spatially compose separate elements into a unified scene. However, the disclosure that "the filled portion defines a size, aspect ratio, and location in the background scene that is to receive a foreground object" (see [0042]) reads exactly on the application's assertion of "learns to spatially compose separate elements into a unified scene". Furthermore, in response to applicant's argument that the references fail to show certain features of the invention, it is noted that the features upon which applicant relies (i.e., "learns to spatially compose separate elements into a unified scene") are not recited in the rejected claim(s). Although the claims are interpreted in light of the specification, limitations from the specification are not read into the claims. See In re Van Geuns, 988 F.2d 1181, 26 USPQ2d 1057 (Fed. Cir. 1993). Regarding the second argument, it is respectfully noted that Shen teaches the limitation of identifying, by the processor, a digital image file that comprises a background scene and an additional digital image file that comprises a foreground object, as claimed.

On page 9 of Applicant's Remarks, with respect to claims 1, 9 and 17, the applicant argues that Shen's model, by contrast, learns nothing about how to combine separate images; it only learns how to adjust the appearance of an already-combined image. The examiner respectfully disagrees with this argument.
The examiner relied on the Zeng reference to teach the limitation of "compositing, by the machine learning model executed by the processor,…" The applicant further argues that Zeng's system receives a single image together with a mask that designates the region to be regenerated, and the model fills in that masked region. Zeng explicitly recites "input 201 can include text, images, or a combination of text and images (e.g., inputs 101 from FIG. 1)" (see [0073]). By combining with Shen, a person of ordinary skill in the art would be able to apply the background and foreground images of Shen to the input images 201 of Zeng. In addition, the digital image files as claimed and the input images 201 are not mutually exclusive merely because a mask is included. Regarding the third argument, it is respectfully noted that Zeng teaches the limitation of compositing, by the machine learning model executed by the processor, the digital image file that comprises the background scene and the additional digital image file that comprises the foreground object to produce a composite digital image file… as claimed.

On page 10 of Applicant's Remarks, with respect to claims 1, 9 and 17, the applicant argues that inpainting a masked region of a single image is a structurally and functionally different operation from compositing two separate image files into a unified output. The examiner respectfully disagrees with this argument. Zeng explicitly recites "neural network 304 is one or more of a text to image diffusion model, a latent text to image diffusion model, a stable diffusion inpainting model, or other neural networking model trained to connect text with images described by the text" (see [0081]). The mask inpainting model is just one embodiment of the neural network trained for generating multiple related images. Further, claim 8 of the current application recites "generating training triplets by one of adding the foreground object to a background scene using inpainting, or using a model with classifier guidance and a prompt corresponding to the learned object", which provides strong evidence that the current application and the Zeng reference are not structurally and functionally different, but rather perform equivalent functions and achieve similar results. Regarding the fourth argument, it is respectfully noted that Zeng is strongly analogous art to the Shen reference and the current application.

Conclusion

THIS ACTION IS MADE FINAL. Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a).

A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action. In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any nonprovisional extension fee (37 CFR 1.17(a)) pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action. In no event, however, will the statutory period for reply expire later than SIX MONTHS from the mailing date of this final action.

Any inquiry concerning this communication or earlier communications from the examiner should be directed to Samantha (Yuehan) Wang, whose telephone number is (571) 270-5011. The examiner can normally be reached Monday-Friday, 8am-5pm.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO-supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice. If attempts to reach the examiner by telephone are unsuccessful, the examiner's supervisor, King Poon, can be reached at (571) 272-7440. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.

Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.

/Samantha (YUEHAN) WANG/
Primary Examiner, Art Unit 2617

Prosecution Timeline

Apr 04, 2024: Application Filed
Oct 31, 2025: Non-Final Rejection (§103)
Feb 04, 2026: Response Filed
Mar 02, 2026: Final Rejection (§103, current)

Precedent Cases

Applications granted by this same examiner with similar technology

Patent 12597178: VECTOR OBJECT PATH SEGMENT EDITING
Granted Apr 07, 2026 (2y 5m to grant)

Patent 12597506: ENDOSCOPIC EXAMINATION SUPPORT APPARATUS, ENDOSCOPIC EXAMINATION SUPPORT METHOD, AND RECORDING MEDIUM
Granted Apr 07, 2026 (2y 5m to grant)

Patent 12586286: DIFFERENTIABLE REAL-TIME RADIANCE FIELD RENDERING FOR LARGE SCALE VIEW SYNTHESIS
Granted Mar 24, 2026 (2y 5m to grant)

Patent 12586261: IMAGE PROCESSING APPARATUS, IMAGE PROCESSING METHOD, AND STORAGE MEDIUM
Granted Mar 24, 2026 (2y 5m to grant)

Patent 12567182: USING AUGMENTED REALITY TO VISUALIZE OPTIMAL WATER SENSOR PLACEMENT
Granted Mar 03, 2026 (2y 5m to grant)

Study what changed to get past this examiner, based on the 5 most recent grants.


Prosecution Projections

Expected OA Rounds: 3-4
Grant Probability: 83%
Grant Probability With Interview: 96% (+12.9%)
Median Time to Grant: 2y 7m
PTA Risk: Moderate
Based on 485 resolved cases by this examiner. Grant probability derived from career allow rate.
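As a quick consistency check, the headline figures compose by simple arithmetic (a sketch using the page's own numbers; the additive interview-lift model is the page's own, not a general statistical claim):

```python
# Recomputing the page's headline figures (illustrative only; the additive
# interview-lift model is taken from the page, not a statistical claim).
granted, resolved = 404, 485
base = granted / resolved            # 0.833 -> the 83% career allow rate
with_interview = base + 0.129        # +12.9% reported interview lift
print(f"base {base:.1%}, with interview {with_interview:.1%}")  # 83.3%, 96.2%
```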
