DETAILED ACTION
*Note in the following document:
1. Texts in italic bold format are limitations quoted either directly or conceptually from claims/descriptions disclosed in the instant application.
2. Texts in regular italic format are quoted directly from cited reference or Applicant’s arguments.
3. Texts with underlining are added by the Examiner for emphasis.
4. Acronym “PHOSITA” stands for “Person Having Ordinary Skill In The Art”.
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Claim Objections
Claim 18 is objected to because of the following informalities:
Claim 18 recites compare the output second prompts with the expected second prompts to determine a loss parameter for the first machine learning model; and update a characteristic of the first machine learning model based on the loss parameter (last two limitations). Suggest replacing compare and update with “comparing” and “updating” for consistency. Appropriate correction is required.
Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102 of this title, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.
In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
This application currently names joint inventors. In considering patentability of the claims the examiner presumes that the subject matter of the various claims was commonly owned as of the effective filing date of the claimed invention(s) absent any evidence to the contrary. Applicant is advised of the obligation under 37 CFR 1.56 to point out the inventor and effective filing dates of each claim that was not commonly owned as of the effective filing date of the later invention in order for the examiner to consider the applicability of 35 U.S.C. 102(b)(2)(C) for any potential 35 U.S.C. 102(a)(2) prior art against the later invention.
The factual inquiries set forth in Graham v. John Deere Co., 383 U.S. 1, 148 USPQ 459 (1966), that are applied for establishing a background for determining obviousness under 35 U.S.C. 103 are summarized as follows:
1. Determining the scope and contents of the prior art.
2. Ascertaining the differences between the prior art and the claims at issue.
3. Resolving the level of ordinary skill in the pertinent art.
4. Considering objective evidence present in the application indicating obviousness or nonobviousness.
Claims 1-3, 5-8, 10 and 12-20 are rejected under 35 U.S.C. 103 as being unpatentable over Couleaud et al. (US 2025/0078346A1) in view of Graham et al. (US 2024/0346731 A1).
Regarding Claim 1, Couleaud teaches or suggests a system (Abstract: Systems and methods are described for generating, using the first trained machine learning model and based on text input, a single-layer image comprising a plurality of objects; generating a plurality of masks associated with the plurality of objects) comprising:
at least one processor (Fig.8 and [0095]: Control circuitry 804 may be based on any suitable control circuitry such as processing circuitry 806. As referred to herein, control circuitry should be understood to mean circuitry based on one or more microprocessors, microcontrollers, digital signal processors, programmable logic devices, field-programmable gate arrays (FPGAs), application-specific integrated circuits (ASICs), etc., and may include a multi-core processor (e.g., dual-core, quad-core, hexa-core, or any suitable number of cores) or supercomputer); and
at least one memory component storing instructions that, when executed by the at least one processor, cause the at least one processor to perform operations ([0099]: Storage 808 may be used to store various types of content described herein as well as the image processing system or application data described above. Nonvolatile memory may also be used (e.g., to launch a boot-up routine and other) comprising:
receiving a first prompt of a first user via a user interface of a user device ([0038]: … For example, the image processing system may receive, as shown in FIG. 1, input of a text prompt 102. Such input may be received at a user interface of a computing device from a user in any suitable form …) indicating a user's intent (Fig.2: prompt 202 shows user’s intention to generate a picture of a young woman holding a cat in front of a Victorian era building, 1870, high quality, soft focus, f/18, 60 mm, in the style of Auguste Renoir);
[Couleaud Figs. 1 and 2 reproduced here in greyscale.]
processing the first prompt using a first machine learning model to generate a second prompt that is applied to a second machine learning model, the second prompt indicative of attributes identified in the first prompt ([0050]: In some embodiments, as shown in FIG. 1, images 116, 118 and 120 may be input (e.g., sequentially or in parallel) to trained machine learning model 126 (e.g., an image-to-text machine learning model). In some embodiments, machine learning model 126 may comprise a contrastive language-image pretraining (CLIP) model. Model 126 may generate a set of textual prompts 128, 130, and 132 corresponding to (e.g., describing or interpreting) the plurality of images 116, 118 and 120, receptively (which in turn may depict or represent objects 211, 213 and 215 of FIG. 2). In some embodiments, textual prompts 128, 130, and 132 of FIG. 1 may respectively correspond to textual prompts 228, 230 and 232 of FIG. 2. In some embodiments, textual prompts 127, 129, and 131 of FIG. 1 may correspond to textual prompts 227, 229, and 231 of FIG. 2. In some embodiments, textual prompts 228, 230 and 232 of FIG. 2 may correspond to objects 211, 213, and 215, respectively, and/or images 216, 218, and 220, respectively);
capturing an image ([0085]: … Additionally or alternatively, the image processing system may access one or more of 604 (and 704) by capturing and/or generating the images, … See Fig.8 camera 818. [0041]: In some embodiments, image 110 (and/or subsequent images generated by the techniques described herein) may be a photo; a picture; a still image; a live photo; a video; a movie; a media asset; a recording; a slow motion video; a panorama photo, a GIF, burst mode images; images from another type of mode; or any other suitable image; or any combination thereof);
processing a combination of the image (notice images 134, 136 and 138 in Fig.1 and images 234, 236 and 238 in Fig.2); and
applying the plurality of images to the live camera feed of the user device (Fig.2: notice the composition options 240, 242, 244 and 246).
Couleaud does not explicitly recite that the live photo can be a live image of the first user.
However Graham, in the same field of endeavor, discloses using a machine learning model to process a live image of a user for prompting a trained AI model(s) to output photoreal synthetic content in real-time ([0004]: FIG. 1 is a diagram illustrating an example technique for prompting a trained AI model(s) to output photoreal synthetic content in real-time. Also see Fig.2 and [0049]: At block 222(A), output video data 224(A) may be generated based at least in part on the input video data 208(A) and the AI-generated output data 204(A), and this output video data 224(A) may correspond to video content featuring the synthetic content 102 (e.g., the synthetic face 102(1) and/or a synthetic body part (e.g., synthetic body 102(2)) within the real-world scene (e.g., a real-world scene that is being captured by the video capture device 202, a pre-recorded real-world scene, etc.)).
[A figure from Graham reproduced here in greyscale.]
Therefore it would have been obvious to a PHOSITA before the effective filing date to incorporate the teaching of Graham into that of Couleaud for prompting a trained AI model(s) to output photoreal synthetic content in real-time as suggested by Graham ([0014]).
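For illustration only, the claim 1 pipeline as mapped above may be sketched as follows (a minimal Python sketch under the Examiner's own assumptions; every function below is a hypothetical stand-in, not code from Couleaud or Graham):

def first_model(first_prompt: str) -> str:
    # Stand-in for the claimed first machine learning model: rewrites the
    # user's prompt into a second prompt listing the identified attributes.
    attributes = [w for w in first_prompt.split() if len(w) > 3]  # toy extraction
    return "attributes: " + ", ".join(attributes)

def second_model(second_prompt: str, user_image: bytes) -> list:
    # Stand-in for the claimed second machine learning model: combines the
    # captured image of the first user with the second prompt to produce a
    # plurality of images.
    return [f"image_{i}({second_prompt})" for i in range(3)]

def apply_to_live_feed(images: list) -> None:
    # Stand-in for compositing the generated images onto the live camera feed.
    for img in images:
        print("overlaying on live feed:", img)

first_prompt = "a young woman holding a cat in front of a Victorian era building"
second_prompt = first_model(first_prompt)   # first model produces the second prompt
captured = b"live-image-of-first-user"      # captured via the device camera
apply_to_live_feed(second_model(second_prompt, captured))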
Regarding Claim 2, Couleaud teaches or suggests wherein the first prompt is received via the user interface that is displaying the live camera feed of the user device (see Fig.6/7).
[Couleaud Figs. 6 and 7 reproduced here in greyscale.]
Regarding Claim 3, Couleaud discloses In some embodiments, image 704 may be accessed automatically, e.g., without explicit user input inputting such image, such as, for example, as a recommendation to a user based on user preferences or historical user interactions ([0084]). Couleaud further discloses supporting prompt semantic merge and simplification ([0055]: For example, the image processing system may perform such semantic merge, updating and/or simplification to modify textual prompt … Also see [0090]: In some embodiments, generating a plurality of prompts from the first plurality of images and the first prompt comprises using a large language model to extract a first plurality of objects from the first prompt; using a large language model to extract a first plurality of style guidance for each object in the first prompt; using an image-to-text model to generate a second plurality of objects from each image in the first plurality of images; performing a semantic merge between the first plurality of objects and the second plurality of objects using a large language model to obtain a third plurality of objects; and applying the first plurality of style guidance to the third plurality of objects to obtain a plurality of prompts).
Therefore it would have been obvious to a PHOSITA before the effective filing date to modify the teaching of Couleaud to include the limitation of accessing historical interaction data of the first user with the system indicative of a first user's interaction with features provided to the first user by the system; identifying one or more preferences of the first user based on the accessed historical interaction data of the first user; and modifying the first prompt based on the identified one or more preferences of the first user, wherein processing the first prompt comprises processing the modified first prompt, in order to provide a prompt which might be a better option for the user.
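For illustration only, the claim 3 limitations mapped above might be sketched as follows (hypothetical Python; the toy preference extraction is the Examiner's own, not Couleaud's implementation):

from collections import Counter

def identify_preferences(history: list, top_k: int = 2) -> list:
    # Toy preference identification: the most frequent terms in the user's
    # historical interactions with features provided by the system.
    return [term for term, _ in Counter(history).most_common(top_k)]

def modify_prompt(first_prompt: str, preferences: list) -> str:
    # The modified first prompt is what gets processed downstream.
    return first_prompt + ", " + ", ".join(preferences)

history = ["soft focus", "Renoir style", "soft focus", "f/18"]
print(modify_prompt("a cat on a sofa", identify_preferences(history)))
# -> a cat on a sofa, soft focus, Renoir style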
Regarding Claim 5, Couleaud modified by Graham further teaches or suggests wherein processing the combination of the image of the first user with the second prompt includes inputting the captured image into the second machine learning model (see Graham Fig.2A). The same reason to combine as that of Claim 1 is applied.
Regarding Claim 6, Couleaud modified by Graham further teaches or suggests wherein processing the combination of the image of the first user with the second prompt includes identifying facial features of the first user via the image, and processing the facial features by the second machine learning model (Graham Fig.2A: 35 years old, implying facial feature changes). The same reason to combine as that of Claim 1 is applied.
Regarding Claim 7, Couleaud further teaches or suggests generating a plurality of second prompts by the second machine learning model, each of the plurality of second prompts indicative of a different scenario in response to the first prompt; and selecting the second prompt from the plurality of second prompts to be applied to the second machine learning model (Fig.2 and [0061]: As another example, composite image 240, 242, 244, or 246 may be generated based on receiving user input, e.g., to edit or move around objects and/or layers corresponding to images 234, 236, and 238).
Regarding Claim 8, Couleaud discloses using machine learning models to generate multiple composite images (Fig.2: 240, 242, 244 and 246). Graham teaches using a trained artificial intelligence (AI) model(s) to output photoreal synthetic content in real-time (Abstract) and the synthetic face (or other body part) can be overlaid on the face (or other body part) of the actor in real-time on the display, which is being viewed by the director, the actor, and/or others at the location where the scene is being filmed ([0015]). Therefore it would have been obvious to a PHOSITA before the effective filing date to add the limitation of wherein the selecting of the second prompt is performed randomly to help the director or actors explore character and scene.
Regarding Claim 10, Couleaud further teaches or suggests generating a plurality of second prompts by the second machine learning model, each of the plurality of second prompts indicative of a different scenario in response to the first prompt; processing each of the plurality of second prompts using the second machine learning model to generate corresponding images; and applying the each of the corresponding images to the live camera feed of the user device (notice the plurality of final composite images generated based on the plurality of second prompts, Fig.2).
Regarding Claim 12, Couleaud modified by Graham teaches or suggests wherein applying the plurality of images to the live camera feed of the user device comprises overlaying one or more of the images onto the live camera feed such that the one or more of the images align with a user's head position and user movements (Graham [0015]: In some examples, the synthetic content is a synthetic body part (e.g., a synthetic face), in which case the AI model(s) may be trained with a robust dataset of body part (e.g., face) images (e.g., a face (or other body part) captured from a comprehensive set of angles), and this trained AI model(s) may be prompted with body part data (e.g., face data) in order to generate the output data representing the synthetic face (or other body part), which is displayed in real-time. In these examples, an actor, for instance, may be performing a scene that is being captured by a video capture device (e.g., a video camera), and the synthetic face (or other body part) can be overlaid on the face (or other body part) of the actor in real-time on the display, which is being viewed by the director, the actor, and/or others at the location where the scene is being filmed). The same reason to combine as that of Claim 1 is applied.
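For illustration only, the overlay-alignment reading above may be sketched as follows (hypothetical Python; a real system would obtain head positions from a face tracker, which is simulated here):

def overlay_position(head_xy, offset=(0, -80)):
    # Place the overlay relative to the tracked head position (e.g., above it),
    # so the overlay follows the user's head movements frame by frame.
    return (head_xy[0] + offset[0], head_xy[1] + offset[1])

head_track = [(100, 200), (110, 202), (125, 205)]  # simulated per-frame positions
for frame, head in enumerate(head_track):
    x, y = overlay_position(head)
    print(f"frame {frame}: draw generated image at ({x}, {y})")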
Regarding Claim 13, Couleaud discloses In some embodiments, the systems, methods, and apparatuses described herein may be further configured to receive input of a particular image, wherein the particular image is included as an object of the plurality of objects in the generated single-layer image based on the received input of the particular image; generate, for display at a graphical user interface, the multi-layer image, wherein the graphical user interface comprises one or more options to modify the multi-layer image; receive selection of the one or more options; and modify the multi-layer image based on the received selection ([0017]). Therefore it would have been obvious to a PHOSITA to include the limitation of rotating the one or more images above the head of the user in the live camera feed in order to display different options to a user to allow the user to make a selection based on options being displayed in rotation.
Regarding Claim 14, Couleaud teaches receiving selection of the one or more options and modifying the multi-layer image based on the received selection ([0017]). Therefore it would have been obvious to a PHOSITA to further include the limitation of reducing speed of rotation until a final selected image is presented above the user's head in the live camera feed, i.e., automatically displaying the selection options at a suitably slow speed so that the user has enough time to make the decision.
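For illustration only, the rotation-and-slowdown behavior discussed for Claims 13-14 might look as follows (hypothetical Python; the decay factor and timing are the Examiner's assumptions):

options = ["option_A", "option_B", "option_C"]  # candidate images above the head
speed = 1.0                                     # initial rotation speed
t, idx = 0.0, 0
while speed > 0.1:                              # reduce speed until it stops
    idx = (idx + 1) % len(options)
    t += 1.0 / speed                            # slower speed -> shown longer
    speed *= 0.8                                # decay the rotation speed
    print(f"t={t:5.1f}s showing {options[idx]} (speed={speed:.2f})")
print("final selected image presented above the user's head:", options[idx])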
Regarding Claim 15, Graham discloses using AI to assist a director of a movie ([0015]: For example, a director of a movie may provide live prompts to the trained AI model(s) to cause the model(s) to generate output data representing synthetic content for a scene of the movie, which is displayed in real-time on a display that is being viewed by the director). Therefore it would have been obvious to a PHOSITA to add the limitation of identifying a second user in the live camera feed; capturing an image of the second user; and updating the images being applied to the live camera feed to reflect an identity of the second user, since it is a common movie scene with two or more actors or actresses.
Regarding Claim 16, Couleaud teaches or suggests wherein the operations further comprise: processing the second prompt using a third machine learning model, the third machine learning model being an LLM and being trained to generate third prompts from second prompts, the third prompt including instructions for the generation of images responsive to the first prompt, wherein processing the combination of the image of the first user with the second prompt using the second machine learning model comprises processing the combination of the image of the first user with the instructions ([0005]: In another approach, large language models can be used to extract object and layer information from a prompt, for use as a guide in a text-to-image generation. However, using the results of these queries to generate a set of images directly by inputting the resulting prompts into a text-to-image generation model would lead to undesirable results, since each prompt would be considered independent from the others, and the resulting layers would not convey a sense of consistency when assembled. Moreover, since in such approach the text-to-image model does not generate transparency, assembling in layers a set of images generated with these models (e.g., without further editing) would result in only the top layer being visible. [0015]: In some embodiments, the second machine learning model comprises a large language model (LLM), and generating, using the second trained machine learning model, the plurality of textual descriptions respectively corresponding to the plurality of objects comprises: inputting the text input to the second trained machine learning model, and generating, using the second trained machine learning model and based on the text input, the plurality of textual descriptions. [0052]: In some embodiments, such as, for example, where machine learning model 126 corresponds to a trained image-to-text machine learning model, to avoid confusing machine learning model 126 with holes 219 and 221 being present in input images 216 and 218, it may be desirable for the image processing system to perform additional cropping of regions of interest, and/or expansion or other modification of corresponding clipping mask(s), resulting in a set of images that represent the same object. Also see Fig.1/2).
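For illustration only, the chained third model of claim 16 may be sketched as follows (hypothetical Python; the stand-in "LLM" merely formats instructions):

def third_model(second_prompt: str) -> str:
    # Stand-in for the claimed third machine learning model (an LLM): expands
    # the second prompt into instructions for the generation of images.
    return (f"Render each object in '{second_prompt}' on its own transparent "
            f"layer; keep the style consistent across layers.")

second_prompt = "attributes: woman, cat, Victorian building, Renoir style"
instructions = third_model(second_prompt)  # instructions responsive to the first prompt
print(instructions)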
Regarding Claim 17, Couleaud further teaches or suggests wherein the operations further comprise: processing the second prompt using a third machine learning model, the third machine learning model being a diffusion model and being trained to generate images from second prompts, the generated images from the third machine learning model not maintaining an identity of the user, wherein the second machine learning model further processes the images that do not maintain the identity of the user to generate the plurality of images ([0002]: In yet another approach, a two-stage model includes a first stage that generates an image embedding given a text caption, and a diffusion-based decoder at the second stage generates an image conditioned on the image embedding from the first stage. [0039]: In some embodiments, the image processing system may receive input from the user indicating one or more parameters for machine learning model 108, such as, for example, a choice of sampler (e.g., an Euler sampler, a Heun sampler, a DPM (diffusion probabilistic model) Fast sampler, a DPM2 sampler, or any other suitable sampler, or any combination thereof); a number of iterations (e.g., to obtain a stable picture, meaning that the picture or image obtained after N iterations does not significantly differ from the picture obtained in the previous iteration); an attention factor (e.g., a level of attention the model gives to each of the words in text prompt 102); or any other suitable parameter(s), or any combination thereof).
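For illustration only, the iterative noise-based refinement quoted above (Couleaud [0002], [0039]) may be sketched as follows (a toy Python loop under the Examiner's assumptions; the "denoiser" is a stand-in, not a trained diffusion network):

import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal((8, 8))   # start from pure noise
target = np.zeros((8, 8))         # toy clean image the denoiser pulls toward

for t in range(10):               # number of iterations (a model parameter)
    noise_estimate = x - target   # toy denoiser: estimate the residual noise
    x = x - 0.3 * noise_estimate  # remove a fraction of the estimated noise
print("mean residual after 10 iterations:", float(np.abs(x).mean()))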
Regarding Claim 18, Couleaud teaches or suggests wherein the operations further comprise: training the first machine learning model by: identifying training first prompts and corresponding training second prompts expected for the training first prompts; applying the training first prompts to the first machine learning model to receive output second prompts (see Fig.1);
compare the output second prompts with the expected second prompts to determine a loss parameter for the first machine learning model; and update a characteristic of the first machine learning model based on the loss parameter ([0077]: In some embodiments, model 500 (and 510) may be trained with any suitable amount of training data from any suitable number and/or types of sources. In some embodiments, machine learning model 500 (and 510) may be trained by way of unsupervised learning, e.g., to recognize and learn patterns based on unlabeled data. In some embodiments, machine learning model 500 (and 510) may be trained by supervised training with labeled training examples to help the model converge to an acceptable error range, e.g., to refine parameters, such as weights and/or bias values and/or other internal model logic, to minimize a loss function. In some embodiments, each layer may comprise one or more nodes that may be associated with learned parameters (e.g., weights and/or biases), and/or connections between nodes may represent parameters (e.g., weights and/or biases) learned during training (e.g., using backpropagation techniques, and/or any other suitable techniques). In some embodiments, the nature of the connections may enable or inhibit certain nodes of the network. In some embodiments, the image processing system may be configured to receive (e.g., prior to training) user specification of (or automatic selection of) hyperparameters (e.g., a number of layers and/or nodes or neurons in each model)).
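For illustration only, the claim 18 training limitations may be sketched as follows (hypothetical Python; a scalar weight stands in for the "characteristic" of the first machine learning model that is updated based on the loss parameter):

def toy_first_model(prompt: str, weight: float) -> float:
    # Stand-in model: maps a training first prompt to a score; 'weight' is the
    # trainable characteristic.
    return weight * len(prompt)

weight = 0.5
pairs = [("sunset over lake", 8.0), ("a cat", 3.0)]  # (first prompt, expected output)
for prompt, expected in pairs:
    out = toy_first_model(prompt, weight)
    loss = (out - expected) ** 2                     # compare output vs. expected
    grad = 2 * (out - expected) * len(prompt)        # dLoss/dWeight
    weight -= 0.001 * grad                           # update the characteristic
    print(f"loss={loss:.3f} weight={weight:.4f}")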
Regarding Claim 19, Claim 19 is similar to Claim 1 except that it is in the format of a method. Therefore the same reasons for rejection applied to Claim 1 are also applied to Claim 19.
Regarding Claim 20, Claim 20 is similar to Claim 1 except that it is in the format of a non-transitory computer-readable storage medium. Therefore the same reasons for rejection applied to Claim 1 are also applied to Claim 20.
Claims 4 and 9 are rejected under 35 U.S.C. 103 as being unpatentable over Couleaud et al. (US 2025/0078346A1) in view of Graham et al. (US 2024/0346731 A1) as applied to Claims 3, 7 above, and further in view of Maiman et al. (US 2022/0335447 A1).
Regarding Claim 4, Couleaud modified by Graham fails to explicitly recite wherein identifying the one or more preferences comprises inputting the historical interaction data into a third machine learning model to generate an identity graph of the first user, the identity graph including the one or more preferences of the first user.
However Maiman discloses using a machine learning model to develop preference profiles based on historical user interaction data (Abstract: A system can include a machine learning model trained to develop preference profiles based on historical user interaction data and instructions that, when executed by the one or more processors, cause the computing system to perform operations). Therefore it would have been obvious to a PHOSITA before the effective filing date to incorporate the teaching of Maiman into that of Couleaud modified by Graham and to include the limitation of inputting the historical interaction data into a third machine learning model to generate an identity graph of the first user, the identity graph including the one or more preferences of the first user, in order to create a personalized avatar according to the user’s preferences.
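For illustration only, the "identity graph" limitation might be sketched as follows (hypothetical Python; the graph structure is the Examiner's assumption for illustration, and Maiman's actual preference-profile implementation may differ):

from collections import defaultdict

def build_identity_graph(user: str, history: list) -> dict:
    # Edges link the user node to preference nodes derived from historical
    # (feature, value) interactions.
    graph = defaultdict(set)
    for feature, value in history:
        graph[user].add((feature, value))
    return dict(graph)

history = [("style", "Renoir"), ("lighting", "soft focus"), ("style", "Renoir")]
print(build_identity_graph("first_user", history))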
Regarding Claim 9, Couleaud discloses image 604 may be accessed automatically, e.g., without explicit user input inputting such image, such as, for example, as a recommendation to a user based on user preferences or historical user interactions ([0081]). Therefore it would have been obvious to a PHOSITA before the effective filing date to incorporate the teaching of Couleaud and to include the limitation of wherein the selecting of the second prompt is based on one or more preferences of the user in order to provide a prompt which might be a better option for the user.
Couleaud further teaches or suggests the historical interaction data of the first user with the system indicative of the first user's interaction with features provided to the first user by the system ([0081] or [0084]).
But Couleaud modified by Graham fails to explicitly recite the one or more preferences of the user being identified by inputting historical interaction data into a third machine learning model to generate an identity graph of the first user, the identity graph including the one or more preferences of the first user.
However Maiman discloses using a machine learning model to develop preference profiles based on historical user interaction data (Abstract: A system can include a machine learning model trained to develop preference profiles based on historical user interaction data and instructions that, when executed by the one or more processors, cause the computing system to perform operations). Therefore it would have been obvious to a PHOSITA before the effective filing date to incorporate the teaching of Maiman into that of Couleaud modified by Graham and to include the limitation of the one or more preferences of the user being identified by inputting historical interaction data into a third machine learning model to generate an identity graph of the first user, the identity graph including the one or more preferences of the first user, in order to create a personalized avatar according to the user’s preferences.
Claim 11 is rejected under 35 U.S.C. 103 as being unpatentable over Couleaud et al. (US 2025/0078346A1) in view of Graham et al. (US 2024/0346731 A1) as applied to Claims 3, 7 above, and further in view of Tan et al. (“Rethinking the Up-Sampling Operations in CNN-based Generative Network for Generalizable Deepfake Detection”, Dec. 2023).
Regarding Claim 11, Couleaud modified by Graham fails to disclose wherein the second machine learning model includes a stable diffusion model that introduces noise iteratively to update pixel values in a generated image based on neighboring pixels.
However Tan discloses, in the same field of endeavor, that a PHOSITA had already known to include a stable diffusion model that introduces noise iteratively to update pixel values in a generated image based on neighboring pixels (Abstract: we introduce the concept of Neighboring Pixel Relationships (NPR) as a means to capture and characterize the generalized structural artifacts stemming from up-sampling operations. Also see Table 3, columns Stable Diffusion v1 and v2, using method NPR):
[Tan Table 3 reproduced here in greyscale.]
Note Tan further teaches (p.3, second paragraph):

[Excerpt from Tan reproduced here in greyscale.]
Therefore it would have been obvious to a PHOSITA before the effective filing date to incorporate the teaching of Tan into that of Couleaud modified by Graham and to include the limitation of wherein the second machine learning model includes a stable diffusion model that introduces noise iteratively to update pixel values in a generated image based on neighboring pixels in order to use this simple but effective artifact representation as suggested by Tan (p.9 Section 5. Conclusion lines 10-11).
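For illustration only, the neighboring-pixel computation underpinning Tan's NPR may be sketched as follows (toy Python under the Examiner's reading of Tan p.3; the exact formulation is in the paper):

import numpy as np

def npr(image: np.ndarray, l: int = 2) -> np.ndarray:
    # Within each l x l neighborhood, take the difference between each pixel
    # and the block's top-left anchor; up-sampling artifacts show up in these
    # neighboring-pixel residuals.
    h, w = image.shape
    out = np.zeros((h, w), dtype=float)
    for i in range(0, h - h % l, l):
        for j in range(0, w - w % l, l):
            block = image[i:i + l, j:j + l].astype(float)
            out[i:i + l, j:j + l] = block - block[0, 0]
    return out

img = np.arange(16).reshape(4, 4)
print(npr(img))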
Conclusion
Any inquiry concerning this communication or earlier communications from the examiner should be directed to YINGCHUN HE whose telephone number is (571)270-7218. The examiner can normally be reached M-F 8:00-5:00 MT.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Xiao M Wu can be reached at 571-272-7761. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.
/YINGCHUN HE/Primary Examiner, Art Unit 2613