DETAILED ACTION
This office action is responsive to applicant’s amendments and arguments filed 02/25/2026.
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Response to Arguments
Applicant's arguments filed 02/25/2026 regarding the rejection of the amended claim 1 have been fully considered but they are not persuasive.
Applicant argues that the cited references do not teach the amended limitation: “instructing a user, via an interaction application, to provide a plurality of input images for generating a personalized output image”. In particular, applicant argues against the combination of the primary reference of Gokul (US 20250173913 A1) with the secondary reference of Cao (US 20250124650 A1), which was cited to teach a similar limitation, because the purpose of Cao is removing a head-mounted display from video images of a user, rather than generating a “personalized output image”.
In response to applicant's argument, a recitation of the intended use of the claimed invention must result in a structural difference between the claimed invention and the prior art in order to patentably distinguish the claimed invention from the prior art. If the prior art structure is capable of performing the intended use, then it meets the claim.
Paragraph [0020] of the specification states: “A user might be interested in obtaining a personalized image, such as an image showing an AI-generated person with facial features resembling those of the user.” Cao teaches using a neural network to generate a modified image in which the section of a user’s face covered by an HMD is replaced with the user’s actual face. To erase the HMD, the network must create an AI-generated section of the user’s face showing “facial features resembling those of the user”, matching the definition suggested by the specification. Therefore, the broadest reasonable interpretation of a “personalized output image” includes the HMD replacement image of Cao. Though Cao uses the personalized output image for a more specific purpose, the generated image may still be considered a personalized output image.
Even if the output of Cao is not interpreted as a “personalized output image”, it has been held that a prior art reference must either be in the field of the inventor’s endeavor or, if not, then be reasonably pertinent to the particular problem with which the inventor was concerned, in order to be relied upon as a basis for rejection of the claimed invention. See In re Oetiker, 977 F.2d 1443, 24 USPQ2d 1443 (Fed. Cir. 1992). In this case, Cao is still in the same field of neural network-based facial image generation, and an inventor in that field may have reasonably looked to the invention of Cao for guidance regarding their invention.
Therefore, the rationale behind the rejection of the amended claim 1 has been maintained. However, in light of the amended claim 1 incorporating elements from the dependent claims, the rejection under 35 U.S.C. 102 over Gokul has been withdrawn, and a new ground of rejection is made under 35 U.S.C. 103 over Gokul in view of Cao.
Claim Interpretation
The claims in this application are given their broadest reasonable interpretation using the plain meaning of the claim language in light of the specification as it would be understood by one of ordinary skill in the art. The broadest reasonable interpretation of a claim element (also commonly referred to as a claim limitation) is limited by the description in the specification.
The following terms in the claims have been given the following interpretations in light of the specification:
“Personalized output image”: [0020] “A user might be interested in obtaining a personalized image, such as an image showing an AI-generated person with facial features resembling those of the user.”
Thus, any artificially generated image containing facial features tailored to resemble a specific person may be considered a “personalized output image”.
“Combined identity representation”: [0028] “A combined identity representation can include a vector or set of vectors (e.g., one or more feature vectors, latent-space representations, or embeddings) that integrate multiple identity representations generated from respective input images to form a unified profile or feature set that captures features to characterize an identity of an entity.”
Thus, a “combined identity representation” may include any of the aforementioned data representations.
“Structural conditions” (claim 10): [0031] “As mentioned above, a text prompt is an example of an image generation control. More specifically, a text prompt representation, obtained by processing the text prompt via a text encoder, can be used as the additional image generation control. Alternatively, or additionally, one or more structural conditions can be used as image generation controls. Examples of structural conditions include structural maps, edge maps, depth maps, or pose maps that guide image generation from a structural or spatial perspective. A structural condition might, for example, be provided as an additional input to specify where to position one or more objects relative to each other in the personalized output image.”
Thus, “structural conditions” may include “structural maps, edge maps, depth maps, or pose maps” to control image generation, but do not include the image or text inputs described elsewhere in the claims.
Should applicant wish different definitions, applicant should point to the portions of the specification that clearly show a different definition.
Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.
Claims 1-2, 4-5, 9, 19-20, and 22-24 are rejected under 35 U.S.C. 103 as being unpatentable over Gokul et al. (US 20250173913 A1, hereinafter "Gokul") in view of Cao (US 20250124650 A1).
Regarding claim 1, Gokul teaches: A system (fig. 3A) comprising:
at least one processor (fig. 3A processor 310; [0047] “As shown in FIG. 3A, computing device 300 includes a processor 310 coupled to memory 320. Operation of computing device 300 is controlled by processor 310.”);
at least one memory component storing instructions that, when executed by the at least one processor, cause the at least one processor to perform operations (fig. 3A memory 320; [0050] “In some examples, memory 320 may include non-transitory, tangible, machine readable media that includes executable code that when run by one or more processors (e.g., processor 310) may cause the one or more processors to perform the methods described in further detail herein.”) comprising:
enabling the user to select one or more images as part or all of the plurality of input images ([0051] “Or the computing device 300 may receive the input 340, such as prompts and reference images, from a user via the user interface.”);
accessing a plurality of input images provided by a user of an interaction application ([0051] “Or the computing device 300 may receive the input 340, such as prompts and reference images, from a user via the user interface.”), each of the plurality of input images depicting at least part of a subject ([0028] “In some embodiments, multiple reference images 128 may be used in the generation of a single output image 124. For example, each of the reference images may contain the same subject, but in different settings, poses, etc.”; another example is fig. 6 element 612);
encoding each input image of the plurality of input images to obtain, from the input image, an identity representation ([0097] “At step 702, the system encodes, via a first neural network based image model (e.g., reference UNet 155), the image containing the subject into a first latent feature vector.”; [0099] “In some embodiments, more than one image may be used as the input (e.g., multiple images of the same subject in different positions). A second image containing the subject may be received, via the data interface. The first neural network based image model may encode the second image into a third latent feature vector.”);
combining the identity representations to obtain a combined identity representation associated with the subject ([0028] “The latent representations generated by reference UNet 155 for each of the reference images 128 may be combined (e.g., via a weighted average) before being used to modify the latent representations of base UNet 150.”; [0099] “The system may generate a combined vector combining the values of the first latent feature vector and the third latent feature vector, for example by generating a weighted average of the first latent feature vector and the third latent feature vector.”);
generating the personalized output image via a generative machine learning model that processes the combined identity representation and at least one additional image generation control (Two-step process involving generating an attention score vector as an intermediate step: [0099] “Generating the attention score vector may include generating, via the self-attention layer of the second neural network based image model, the attention score vector based on the combined vector.”; [0100] “At step 705, the system generates, by the second neural network based image model, an output image (e.g., output image 124) based on the attention score vector and the text prompt.”); and
causing presentation, at a user device, of the personalized output image in a user interface of the interaction application ([0071] “Other applications 416 may include device interfaces and other display modules that may receive input and/or output information. For example, other applications 416 may contain software programs for asset management, executable by a processor, including a graphical user interface (GUI) configured to provide an interface to the user 440 to view generated images.”).
Gokul does not explicitly teach: instructing a user, via an interaction application, to provide a plurality of input images for generating a personalized output image;
causing launching of a real-time camera feed of an interaction application at a user device; or
enabling the user to capture and select one or more images as part or all of the plurality of input images via the real-time camera feed of the interaction application.
Cao teaches: instructing a user, via an interaction application, to provide a plurality of input images for generating a personalized output image (fig. 5B-5C; [0036] “In FIG. 5B, in response to instructions displayed on a user interface, a series of images (either individual still images or individual frames of a video image data) are captured where the user was instructed to make head movements by moving their head from the right to the left while maintaining a neutral eye position. In FIG. 5C, in response to instructions displayed on a user interface, a series of images (either individual still images or individual frames of a video image data) are captured where the user was instructed to make a plurality of different facial expressions at predefined time points so that images of the user making those expressions are captures.”; the generated image may be considered a “personalized output image” as discussed in “Claim Interpretation” section);
causing launching of a real-time camera feed of an interaction application at a user device; and enabling the user to capture and select one or more images as part or all of the plurality of input images via the real-time camera feed of the interaction application ([0035] “In operation an image capture apparatus such as a video or still camera is controlled to capture a plurality of different images of the user… In operation, the data collection processing is performed using a device having a user interface and an image capture apparatus such as a mobile phone whereby one or more series of instructions can be displayed on the user interface and provide the user with guidance on what movements and expressions should be made a particular time so that sufficient amount of image data of the user are captured.” – if a mobile phone is being used to capture video data, then one of ordinary skill in the art would infer that a video application must be launched since mobile phones are not dedicated video cameras which are consistently recording).
Gokul and Cao are analogous to the claimed invention because they are in the same field of neural network-based facial image generation. It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the invention of Gokul with the teachings of Cao to provide the user with a user-friendly mobile phone interface to capture input images. The motivation would have been to improve convenience and overall user experience.
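For illustration only, the weighted-average combination cited from Gokul [0028] and [0099] for the “combining” limitation can be sketched as follows. This is a minimal sketch and not Gokul's implementation: the function name `combine_latents`, the plain-list vector type, and the default of equal weights are assumptions introduced here.

```python
from typing import List, Optional, Sequence


def combine_latents(
    latents: Sequence[Sequence[float]],
    weights: Optional[Sequence[float]] = None,
) -> List[float]:
    """Return the weighted average of equal-length latent feature vectors.

    Hypothetical helper: one vector per input image, one weight per vector.
    """
    if not latents:
        raise ValueError("at least one latent vector is required")
    dim = len(latents[0])
    if any(len(v) != dim for v in latents):
        raise ValueError("all latent vectors must have the same length")
    if weights is None:
        # Assumption: equal weighting when no weights are supplied.
        weights = [1.0] * len(latents)
    if len(weights) != len(latents):
        raise ValueError("one weight per latent vector is required")
    total = sum(weights)
    if total == 0:
        raise ValueError("weights must not sum to zero")
    # Element-wise weighted sum, normalized by the total weight.
    return [
        sum(w * v[i] for w, v in zip(weights, latents)) / total
        for i in range(dim)
    ]
```

Under these assumptions, each input image contributes one latent vector, and the single returned vector plays the role of the combined representation that conditions generation.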
Regarding claim 2, the combination of Gokul in view of Cao teaches: The system of claim 1, wherein the at least one additional image generation control comprises a text prompt representation that is obtained from a text prompt (Gokul [0098] “At step 703, the system encodes, via a second neural network based image model (e.g., base UNet 150), a second latent feature vector based on the text prompt.”), and the operations further comprise: receiving, via the user device, user input comprising the text prompt (Gokul [0096] “At step 701, a system (e.g., computing device 300 or server 430) receives, via a data interface (e.g., data interface 315 or network interface 433), an image (e.g., reference image 128) containing a subject and a text prompt (e.g., prompt 126) describing the subject in a visual environment.”), wherein the personalized output image is generated in response to receiving the text prompt (Gokul [0100] “At step 705, the system generates, by the second neural network based image model, an output image (e.g., output image 124) based on the attention score vector and the text prompt.”).
Regarding claim 4, the combination of Gokul in view of Cao teaches: The system of claim 1, wherein each of the plurality of input images depicts a face of the subject and differs from the other input images in the plurality of input images (Gokul fig. 6 shows an example of image generation using an input of 4 different images, each depicting the subject’s face), and the combined identity representation comprises a representation of facial features of the subject (Gokul [0026] “By providing a reference image 128 to reference UNet 155, the latent representations of reference UNet 155 contain information relevant to the subject of the reference image 128.”; if the input images depict a subject’s face, then the latent representations will contain information relevant to the features of the subject’s face).
Regarding claim 5, the combination of Gokul in view of Cao teaches: The system of claim 4, wherein the operations further comprise: causing presentation, at the user device, of an instruction to provide, among the plurality of input images, at least one of depictions of the face of the subject from different angles or depictions of different facial expressions of the subject (Cao fig. 5B-5C; [0036] “In FIG. 5B, in response to instructions displayed on a user interface, a series of images (either individual still images or individual frames of a video image data) are captured where the user was instructed to make head movements by moving their head from the right to the left while maintaining a neutral eye position. In FIG. 5C, in response to instructions displayed on a user interface, a series of images (either individual still images or individual frames of a video image data) are captured where the user was instructed to make a plurality of different facial expressions at predefined time points so that images of the user making those expressions are captures.”).
Cao and Gokul are analogous to the claimed invention because they are in the same field of neural network-based image generation. It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the invention of Gokul with the teachings of Cao to provide the user with a dedicated interface to capture input images. The motivation would have been to improve convenience and overall user experience.
Regarding claim 9, the combination of Gokul in view of Cao teaches: The system of claim 1, wherein the generative machine learning model comprises a diffusion model (Gokul [0020] “The subject-driven image generation framework may comprise one or more denoising diffusion models (DDMs). DDMs are used for generating realistic images given a conditioning input (e.g., prompt and/or image).”).
Claims 19 and 20 are rejected with the same rationale as claim 1 because their limitations substantially correspond to the limitations of claim 1, along with the additional limitation in claim 20 of: A non-transitory computer-readable storage medium storing instructions that, when executed by at least one processor, cause the at least one processor to perform operations (Gokul [0050] “In some examples, memory 320 may include non-transitory, tangible, machine readable media that includes executable code that when run by one or more processors (e.g., processor 310) may cause the one or more processors to perform the methods described in further detail herein.”).
Regarding claim 22, the combination of Gokul in view of Cao teaches: The system of claim 1, wherein the one or more images captured and selected by the user via the real-time camera feed constitute all of the plurality of input images such that the plurality of input images is obtained without retrieving images from a pre-existing image repository stored on the user device (Gokul [0051] “The data interface 315 may comprise a communication interface, a user interface (such as a voice input interface, a graphical user interface, and/or the like). For example, the computing device 300 may receive the input 340 (such as a training dataset) from a networked database via a communication interface. Or the computing device 300 may receive the input 340, such as prompts and reference images, from a user via the user interface.” – the second embodiment in the cited section teaches receiving input from a user without the database).
Regarding claim 23, the combination of Gokul in view of Cao teaches: The system of claim 1, wherein:
the real-time camera feed is provided by a front-facing camera of the user device (Cao [0035] “In operation, the data collection processing is performed using a device having a user interface and an image capture apparatus such as a mobile phone”; modern mobile phone cameras are limited to front and rear, and it would have been obvious to try either of only 2 options with similar functionality), and
the one or more images captured via the real-time camera feed comprise selfie images of the user depicting at least part of a face of the user (Cao [0035] “In 403, a capture process is performed to capture the face of the user where there are plurality of images where the head of the user are moving in different directions. In 404, a capture process is performed to capture the face of the user where there are plurality of images where the user is making different facial expressions.”).
The motivation to combine the invention of Gokul in view of Cao with the additional teachings of Cao would have been similar to the motivation discussed for claim 1.
Regarding claim 24, the combination of Gokul in view of Cao teaches: The system of claim 23, wherein the operations further comprise:
presenting a sequence of instructions for guiding the user to capture the one or more images, the sequence of instructions directing the user to capture the selfie images depicting at least one of different facial expressions or different angles of the face of the user (Cao fig. 5B-5C; [0036] “In FIG. 5B, in response to instructions displayed on a user interface, a series of images (either individual still images or individual frames of a video image data) are captured where the user was instructed to make head movements by moving their head from the right to the left while maintaining a neutral eye position. In FIG. 5C, in response to instructions displayed on a user interface, a series of images (either individual still images or individual frames of a video image data) are captured where the user was instructed to make a plurality of different facial expressions at predefined time points so that images of the user making those expressions are captures.”).
The motivation to combine the invention of Gokul in view of Cao with the additional teachings of Cao would have been similar to the motivation discussed for claim 1.
Claims 7, 8, and 18 are rejected under 35 U.S.C. 103 as being unpatentable over Gokul (US 20250173913 A1) in view of Cao (US 20250124650 A1) as applied to claim 1 above, and further in view of Zhang et al. (US 20250175679 A1, hereinafter "Zhang").
Regarding claim 7, the combination of Gokul in view of Cao teaches: The system of claim 1, as well as a combined identity representation (Gokul [0099] “The system may generate a combined vector combining the values of the first latent feature vector and the third latent feature vector, for example by generating a weighted average of the first latent feature vector and the third latent feature vector.”).
The combination of Gokul in view of Cao may not explicitly teach: wherein generating of the personalized output image comprises providing the combined identity representation and the at least one additional image generation control to the generative machine learning model via a decoupled cross-attention mechanism that separately processes the combined identity representation and the at least one additional image generation control.
Zhang teaches: wherein generating of the personalized output image comprises providing the combined identity representation and the at least one additional image generation control to the generative machine learning model via a decoupled cross-attention mechanism that separately processes the combined identity representation and the at least one additional image generation control ([0023] “Specifically, the VDM may generate a video output through a U-Net denoising diffusion model, which iteratively removes noises from an initial noise vector conditioned on the image input and the text input. The U-Net denoising diffusion model may be built on a plurality of multimodal video blocks (MVB). Each MVB may comprise spatial temporal layers for representing video features, and a decoupled cross-attention layer for image attention and text attention separately to address image and text inputs for appearance conditioning.”).
Zhang and Gokul are analogous to the claimed invention because they are in the same field of neural network-based image generation. It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the invention of Gokul in view of Cao with the teachings of Zhang to use a decoupled cross-attention mechanism to separately and simultaneously condition the image generation on both image and text inputs. The motivation would have been to improve modularity and parallelism in processing.
Regarding claim 8, the combination of Gokul in view of Cao and Zhang teaches: The system of claim 7, wherein the generative machine learning model comprises separate cross-attention layers for the combined identity representation and the at least one additional image generation control, respectively (Zhang fig. 3C image cross-attention layer 323 and text cross-attention layer 324; further explained in Zhang paragraph [0044]).
Zhang and Gokul are analogous to the claimed invention because they are in the same field of neural network-based image generation. It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the invention of Gokul in view of Cao with the teachings of Zhang to use a decoupled cross-attention mechanism with separate layers for image and text inputs. The motivation would have been to improve modularity and parallelism in processing.
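For illustration only, a decoupled cross-attention arrangement of the kind discussed for claims 7 and 8 (one attention branch for image-derived features, a separate branch for text-derived features) can be sketched as follows. This is a simplified, single-head sketch without learned projections. Summing the two branch outputs and the `image_scale` parameter are assumptions made here for the example and are not taken from Zhang, and all function names are hypothetical.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def softmax(scores: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries: Sequence[Vector], context: Sequence[Vector]) -> List[List[float]]:
    """Scaled dot-product attention; the context tokens act as both keys and values."""
    dim = len(queries[0])
    outputs = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in context]
        attn = softmax(scores)
        outputs.append([sum(w * v[j] for w, v in zip(attn, context)) for j in range(dim)])
    return outputs


def decoupled_cross_attention(
    queries: Sequence[Vector],
    text_context: Sequence[Vector],
    image_context: Sequence[Vector],
    image_scale: float = 1.0,
) -> List[List[float]]:
    """Attend to text and image contexts in separate branches, then add the results.

    Assumption: the branches are merged by addition, with the image branch scaled.
    """
    text_out = cross_attention(queries, text_context)
    image_out = cross_attention(queries, image_context)
    return [
        [t + image_scale * i for t, i in zip(t_row, i_row)]
        for t_row, i_row in zip(text_out, image_out)
    ]
```

Because each branch computes its own attention weights, setting `image_scale` to zero in this sketch leaves the text-conditioned output unchanged, which is the sense in which the two conditions are processed separately.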
Regarding claim 18, the combination of Gokul in view of Cao teaches: The system of claim 1, but does not explicitly teach: wherein the personalized output image is one of a plurality of frames of a personalized video, and the personalized video is generated for the user, via the interaction application, based on the combined identity representation and the at least one additional image generation control.
Zhang teaches: wherein the personalized output image is one of a plurality of frames of a personalized video, and the personalized video is generated for the user, via the interaction application, based on the combined identity representation and the at least one additional image generation control ([0028] “FIG. 2 is a simplified diagram illustrating an exemplary training framework 200 for an example latent diffusion model that generates a video given a conditioning input such as a text description and an image input, according to embodiments described herein.”; [0071] describes the “interaction application”: “User device 610 may be utilized by a user 640 (e.g., a driver, a system admin, etc.) to access the various features available for user device 610, which may include processes and/or applications associated with the server 630 to receive a generated video.”).
Zhang and Gokul are analogous to the claimed invention because they are in the same field of neural network-based image generation. It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the invention of Gokul in view of Cao with the teachings of Zhang to generate video output as well as individual images. The motivation would have been to expand the invention’s applicability.
Claims 10 and 13 are rejected under 35 U.S.C. 103 as being unpatentable over Gokul (US 20250173913 A1) in view of Cao (US 20250124650 A1) as applied to claim 1 above, and further in view of Mou et al. (“T2I-Adapter: Learning Adapters to Dig out More Controllable Ability for Text-to-Image Diffusion Models”. arXiv preprint (20 Mar 2023). https://arxiv.org/abs/2302.08453v2; hereinafter “Mou”).
Regarding claim 10, the combination of Gokul in view of Cao teaches: The system of claim 1, wherein the at least one additional image generation control comprises a text prompt representation that is obtained from a text prompt (Gokul [0098] “At step 703, the system encodes, via a second neural network based image model (e.g., base UNet 150), a second latent feature vector based on the text prompt.”).
The combination of Gokul in view of Cao may not explicitly teach: wherein the at least one additional image generation control comprises one or more structural conditions to guide generation of the personalized output image.
Mou teaches: wherein the at least one additional image generation control comprises one or more structural conditions to guide generation of the personalized output image (fig. 8, caption “With our proposed T2I-Adapter, the SD model can generate high-quality images conditioned on color map, sketch, depth map, semantic segmentation map, depth, and keypose.”).
Mou is analogous to the claimed invention because it is in the same field of neural network-based image generation. It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the invention of Gokul in view of Cao with the teachings of Mou to include the option to add additional user-specified conditions for image generation. The motivation would have been to give a user greater control over the generated output.
Regarding claim 13, the combination of Gokul in view of Cao teaches: The system of claim 1, wherein the operations further comprise:
providing a pre-trained version of the generative machine learning model comprising predetermined parameters for processing the at least one additional image generation control (Gokul [0038] “Image generation using the framework in FIG. 1B may be accomplished using only a pre-trained UNet, with the same parameters for both the base UNet 150 and reference UNet 155. Improvements may be achieved in some instances by fine-tuning one or both of the Unets.”; subsequent paragraphs focus on embodiments where reference UNet 155 is fine-tuned while base UNet 150 is frozen, and mention processing text input);
defining parameters to process combined identity representations (Gokul fig. 1A shows an image generation framework which includes the following components:
Combined identity representation: [0028] “The latent representations generated by reference UNet 155 for each of the reference images 128 may be combined (e.g., via a weighted average) before being used to modify the latent representations of base UNet 150.”;
Processing combined identity representation: [0034] “Reference UNet 155 processes reference images 128 (e.g., sample images of a subject to be used in the personalized generated image) to collect the features before each Self-Attention (SA) layer of reference UNet 155 (e.g., SA layers 132, 136, and 142). The base UNet's SA layers are modified to “Reference Self-Attention” (RSA) layers (e.g., RSA layers 106, 110, 116, and 120) that allow conditioning on extra features. Using the collected reference features as input, the base UNet equipped with the RSA layers estimates the noise in the input to guide the image generation towards the reference objects.”;
Defining parameters: [0043] “Finetuning parameters of the reference UNet 155 (ϕ) may be performed to extract better reference features.”); and
performing training to adjust the new parameters while keeping the predetermined parameters frozen (Gokul [0038] “In some embodiments, parameters associated with all the layers of reference UNet 155 may be trained. In some embodiments, reference UNet 155 is trained, while base UNet 150 remains frozen. This may allow for reference UNet 155 to adapt to capture the relevant information from reference image 128 while allowing base UNet 150 to maintain its image generation and prompt following capabilities.”).
The combination of Gokul in view of Cao may not explicitly teach defining new parameters to process combined identity representations.
Mou teaches defining new parameters in an additional module which provides conditional control to a pre-trained, frozen machine learning model (fig. 2, “77M parameters”, see fig. 3 for additional detail).
Mou and Gokul are analogous to the claimed invention because they are in the same field of neural network-based image generation. It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the invention of Gokul in view of Cao with the teachings of Mou to include the option to add new parameters for conditions in which input image representations are combined. The motivation would have been to enhance user control and modularity.
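For illustration only, training newly defined parameters while keeping predetermined parameters frozen, as discussed for claim 13, can be sketched with a toy scalar model. This is a minimal sketch under stated assumptions: the one-parameter “base” and “adapter” weights, the squared-error loss, and the function name `train_new_parameter` are hypothetical and are not taken from Gokul or Mou.

```python
from typing import Sequence, Tuple

# Each sample is (base_input, condition_input, target_output).
Sample = Tuple[float, float, float]


def train_new_parameter(
    samples: Sequence[Sample],
    frozen_weight: float,
    new_weight: float = 0.0,
    learning_rate: float = 0.1,
    steps: int = 200,
) -> Tuple[float, float]:
    """Gradient descent that updates only the new weight.

    Toy model: prediction = frozen_weight * x + new_weight * cond.
    The frozen weight stands in for pre-trained parameters and is never modified.
    """
    for _ in range(steps):
        gradient = 0.0
        for x, cond, target in samples:
            prediction = frozen_weight * x + new_weight * cond
            # Derivative of the squared error with respect to new_weight only.
            gradient += 2.0 * (prediction - target) * cond
        new_weight -= learning_rate * gradient / len(samples)
    return frozen_weight, new_weight
```

In this sketch no gradient is ever computed for `frozen_weight`, so the pre-trained behavior is preserved while the added parameter learns how to use the extra conditioning input.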
Claim 12 is rejected under 35 U.S.C. 103 as being unpatentable over Gokul (US 20250173913 A1) in view of Cao (US 20250124650 A1) as applied to claim 1 above, and further in view of Liu et al. (US 20200265219 A1, hereinafter "Liu").
Regarding claim 12, the combination of Gokul in view of Cao discloses: the system of claim 1, wherein combining of the identity representations comprises processing the identity representations via a merging component to merge the identity representations into the combined identity representation (Gokul [0099] “The system may generate a combined vector combining the values of the first latent feature vector and the third latent feature vector, for example by generating a weighted average of the first latent feature vector and the third latent feature vector.”).
The combination of Gokul in view of Cao may not explicitly teach processing the identity representations via a machine learning-based merging component, the merging component being trained to generate, for a given set of identity representations encoded from respective training images of a person, a corresponding combined identity representation for the person.
Liu teaches processing the identity representations via a machine learning-based merging component, the merging component being trained to generate, for a given set of identity representations encoded from respective training images of a person, a corresponding combined identity representation for the person ([0047] “Then, at process block 204, a trained DR-GAN may be applied to generate an identity representation of the subject, or object. This step may include extracting the identity representation, in the form of features or feature vectors, by inputting received one or more images into one or more encoders of the DR-GAN. In some aspects, a pose of the subject or object in the received image(s) may be determined at process block 204. In particular, in some aspects, feature vectors obtained from multiple inputted images may be combined to generate a combined feature or feature vector indicative of the identity representation. To do so, the features or feature vectors may be weighted based on learned coefficients.”).
Liu is analogous to the claimed invention because it is in the same field of neural network-based image generation. It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the invention of Gokul in view of Cao with the teachings of Liu to use machine learning to derive weights for combining input image representations. The motivation would have been to automatically adjust the process for each unique combination of input images, improving output image accuracy.
Claims 14 and 15 are rejected under 35 U.S.C. 103 as being unpatentable over Gokul (US 20250173913 A1) in view of Cao (US 20250124650 A1) and further in view of Mou (“T2I-Adapter: Learning Adapters to Dig out More Controllable Ability for Text-to-Image Diffusion Models”) as applied to claim 13 above, and further in view of Liu (US 20200265219 A1).
Regarding claim 14, the combination of Gokul in view of Cao and further in view of Mou teaches: The system of claim 13, wherein combining of the identity representations comprises processing the identity representations to merge the identity representations into the combined identity representation (Gokul [0028] “In some embodiments, multiple reference images 128 may be used in the generation of a single output image 124. For example, each of the reference images may contain the same subject, but in different settings, poses, etc. The latent representations generated by reference UNet 155 for each of the reference images 128 may be combined (e.g., via a weighted average) before being used to modify the latent representations of base UNet 150.”).
The combination of Gokul in view of Cao and Mou does not explicitly teach: and the operations further comprise: defining further new parameters to generate, for a given set of identity representations encoded from respective images of a person, a corresponding combined identity representation for the person, wherein the training is performed to adjust the new parameters and the further new parameters.
Liu teaches: defining further new parameters to generate, for a given set of identity representations encoded from respective images of a person, a corresponding combined identity representation for the person, wherein the training is performed to adjust the new parameters and the further new parameters (Liu [0047] “In particular, in some aspects, feature vectors obtained from multiple inputted images may be combined to generate a combined feature or feature vector indicative of the identity representation. To do so, the features or feature vectors may be weighted based on learned coefficients.”).
Liu and the combination of Gokul in view of Cao and further in view of Mou are analogous to the claimed invention because they are in the same field of neural network-based image generation. It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the invention of Gokul in view of Cao and further in view of Mou with the teachings of Liu to generate parameters for combining input image representations. Furthermore, it would have been obvious to one of ordinary skill in the art that any new parameters should be adjusted via training in order to properly integrate them into the machine learning system. The motivation would have been to automatically adjust for each unique combination of input images, improving output image accuracy.
Regarding claim 15, the combination of Gokul in view of Cao and further in view of Mou and Liu teaches: The system of claim 14, wherein the new parameters form part of new layers of the generative machine learning model (Gokul [0026] “UNets 150 and 155 may have a series of internal layers, where a latent image representation (e.g., vector) is output by each layer, and that output is the input to the subsequent layer, as described further in FIG. 1B.”, where UNet 155 is the reference UNet which contains the “new parameters” which are not frozen), and the further new parameters form part of a machine-learning-based merging component that is trained to merge the identity representations into the combined identity representation (Liu [0047] “In particular, in some aspects, feature vectors obtained from multiple inputted images may be combined to generate a combined feature or feature vector indicative of the identity representation. To do so, the features or feature vectors may be weighted based on learned coefficients.”).
The motivation to combine would have been similar to that of claim 14.
Claims 16 and 17 are rejected under 35 U.S.C. 103 as being unpatentable over Gokul (US 20250173913 A1) in view of Cao (US 20250124650 A1) and further in view of Mou (“T2I-Adapter: Learning Adapters to Dig out More Controllable Ability for Text-to-Image Diffusion Models”) as applied to claim 13 above, and further in view of Li et al. (US 20240161369 A1, hereinafter "Li").
Regarding claim 16, the combination of Gokul in view of Cao and further in view of Mou teaches: The system of claim 13, but may not explicitly teach: wherein each of the plurality of input images is encoded by an image encoder, and parameters of the image encoder are kept frozen while performing the training with respect to the new parameters.
Li teaches: wherein each of the plurality of input images is encoded by an image encoder, and parameters of the image encoder are kept frozen while performing the training with respect to the new parameters (Li [0037] “In some embodiments, backpropagation 208 during subject-specific fine-tuning may update parameters of multimodal encoder 108, queries 110, text encoder 120, and/or image model 122… During fine-tuning, image encoder 104, text encoder 106, queries 110, feed forward 114, and/or multimodal encoder 108 may be frozen (i.e., their parameters unchanged).”).
Li is analogous to the claimed invention because it is in the same field of neural network-based image generation. It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the invention of Gokul in view of Cao and further in view of Mou with the teachings of Li to keep parameters of the image encoder frozen while training additional parameters associated with combinations of images. The motivation would have been to reduce computing cost and prevent overfitting, as taught by Li.
Regarding claim 17, the combination of Gokul in view of Cao and further in view of Mou teaches: The system of claim 13, but may not explicitly teach: wherein the at least one additional image generation control comprises a text prompt representation that is obtained from a text prompt via a text encoder, and parameters of the text encoder are kept frozen while performing the training with respect to the new parameters.
Li teaches: wherein the at least one additional image generation control comprises a text prompt representation that is obtained from a text prompt via a text encoder, and parameters of the text encoder are kept frozen while performing the training with respect to the new parameters (Li [0037] “In some embodiments, backpropagation 208 during subject-specific fine-tuning may update parameters of multimodal encoder 108, queries 110, text encoder 120, and/or image model 122. In some embodiments, text encoder 120 may be trained during the subject representation learning stage, and frozen during the fine-tuning stage to prevent over-fitting to a specific subject.”).
Li is analogous to the claimed invention because it is in the same field of neural network-based image generation. It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the invention of Gokul in view of Cao and further in view of Mou with the teachings of Li to keep parameters of the text encoder frozen while training additional parameters associated with combinations of images. The motivation would have been to reduce computing cost and prevent overfitting, as taught by Li.
Claim 21 is rejected under 35 U.S.C. 103 as being unpatentable over Gokul (US 20250173913 A1) in view of Cao (US 20250124650 A1) as applied to claim 1 above, and further in view of Wolfensparger et al. (US 20250124653 A1, hereinafter "Wolfensparger").
Regarding claim 21, the combination of Gokul in view of Cao teaches: The system of claim 1, but does not explicitly teach: wherein the operations further comprise:
associating the combined identity representation with a user profile of the user; and
in response to a subsequent image generation request from the user via the interaction application, automatically retrieving the combined identity representation associated with the user profile to generate a further personalized output image without requiring new input images from the user.
Wolfensparger teaches: associating the combined identity representation with a user profile of the user ([0094] “FIG. 3 is a data flow diagram for training a user-specific model ensemble for photorealistic facial representation according to some implementations of the present disclosure. FIG. 3 will be discussed in conjunction with FIG. 1. More specifically, a user-specific model ensemble 302 can be stored by the machine-learned model handler 38 of the computing system 10. For example, the model repository 40 of the machine-learned model handler 38 can store and manage the user-specific model ensemble 302. The user-specific model ensemble 302 can be a grouping of models that are personalized for a particular user.”;
[0107] “In some implementations, the computing system can update a user-specific model output repository for photorealistic facial representation based on the at least one optimized model output. The user-specific model output repository can store an optimized instance of each of the plurality of model outputs.”); and
in response to a subsequent image generation request from the user via the interaction application, automatically retrieving the combined identity representation associated with the user profile to generate a further personalized output image without requiring new input images from the user ([0024]-[0025] describes how user-specific model ensembles are used to generate facial images of corresponding users during a video conference session;
[0055]-[0057] discusses how the saved models are only updated when a change in the corresponding user’s appearance is detected;
[0067] provides one specific example of saved model data being stored and reused: “Alternatively, the facial representation module 74 can recognize the facial motion previously performed by the user and can indicate that particular facial motion to the facial representation module 16. In turn, the facial representation module 16 can retrieve an animation previously generated for the particular facial motion.”).
Wolfensparger is analogous to the claimed invention because it is in the same field of using a neural network to generate a representation of a user’s face. It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the invention of Gokul in view of Cao with the teachings of Wolfensparger to save and reuse a trained machine learning model for each user. The motivation would have been to reduce the computational load, as taught by Wolfensparger ([0067]).
References Cited
The prior art made of record and not relied upon is considered pertinent to applicant's disclosure.
Valentin et al. (US 20220343133 A1) teaches the limitation of claim 1: “instructing a user, via an interaction application, to provide a plurality of input images for generating a personalized output image” ([0062]).
Conclusion
Applicant's amendment necessitated the new ground(s) of rejection presented in this Office action. Accordingly, THIS ACTION IS MADE FINAL. See MPEP § 706.07(a). Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a).
A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action. In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any nonprovisional extension fee (37 CFR 1.17(a)) pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action. In no event, however, will the statutory period for reply expire later than SIX MONTHS from the mailing date of this final action.
Any inquiry concerning this communication or earlier communications from the examiner should be directed to BENJAMIN STATZ whose telephone number is (571)272-6654. The examiner can normally be reached Mon-Fri 8am-5pm.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Tammy Goddard, can be reached at (571)272-7773. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.
/BENJAMIN TOM STATZ/ Examiner, Art Unit 2611
/TAMMY PAIGE GODDARD/ Supervisory Patent Examiner, Art Unit 2611