DETAILED ACTION
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Specification
The disclosure is objected to because of the following informalities:
In paragraph 0028, lines 4-5, “In some examples, the network 140” is not a complete sentence.
In paragraph 0058, lines 2-4, “In other words, whether the context impacts a local area edit of the reference image, a restyle of the reference image, or a background of the reference image” is not a complete sentence.
Appropriate correction is required.
The use of the terms “DOCSIS”, “Wi-Fi”, and “WiMAX”, which are trade names or marks used in commerce, has been noted in this application. Each term should be accompanied by the generic terminology; furthermore, each term should be capitalized wherever it appears or, where appropriate, include a proper symbol indicating use in commerce, such as ™, SM, or ®, following the term.
Although the use of trade names and marks used in commerce (i.e., trademarks, service marks, certification marks, and collective marks) is permissible in patent applications, the proprietary nature of the marks should be respected and every effort made to prevent their use in any manner which might adversely affect their validity as commercial marks.
Claim Rejections - 35 USC § 101
35 U.S.C. 101 reads as follows:
Whoever invents or discovers any new and useful process, machine, manufacture, or composition of matter, or any new and useful improvement thereof, may obtain a patent therefor, subject to the conditions and requirements of this title.
Claims 16-20 are rejected under 35 U.S.C. 101 because the claimed invention is directed to non-statutory subject matter. The claims do not fall within at least one of the four categories of patent eligible subject matter because claim 16 is directed to a signal per se.
Claim 16 recites a computer-readable medium. The broadest reasonable interpretation of a claim drawn to a computer readable medium (also called machine readable medium and other such variations) typically covers forms of non-transitory tangible media and transitory propagating signals per se in view of the ordinary and customary meaning of computer readable media, particularly when the specification is silent. See MPEP 2111.01. When the broadest reasonable interpretation of a claim covers a signal per se, the claim must be rejected under 35 U.S.C. 101 as covering non-statutory subject matter. The USPTO recognizes that applicants may have claims directed to computer readable media that cover signals per se, which the USPTO must reject under 35 U.S.C. 101 as covering both non-statutory subject matter and statutory subject matter. A claim drawn to such a computer readable medium that covers both transitory and non-transitory embodiments may be amended to narrow the claim to cover only statutory embodiments to avoid a rejection under 35 U.S.C. § 101 by adding the limitation "non-transitory" to the claim. Such an amendment would typically not raise the issue of new matter, even when the specification is silent because the broadest reasonable interpretation relies on the ordinary and customary meaning that includes signals per se.
Applicant’s specification in paragraph 0023 recites “As defined herein, a ‘computer-readable storage medium,’ which refers to a non-transitory, physical or tangible storage medium (e.g., volatile or non-volatile memory device), may be differentiated from a ‘computer-readable transmission medium,’ which refers to an electromagnetic signal.” Since the Applicant does not specify whether the “computer-readable medium” in claim 16 is a computer-readable storage medium or a computer-readable transmission medium, it could be a signal.
As an additional note, a non-transitory computer readable medium having executable programming instructions stored thereon is considered statutory, as non-transitory computer readable media exclude transitory data signals.
Claims 17-20 are similarly rejected, as they are dependent on claim 16.
Claim Rejections - 35 USC § 103
In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis (i.e., changing from AIA to pre-AIA) for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.
Claims 1-2 are rejected under 35 U.S.C. 103 as being unpatentable over Wang et al. (Imagen Editor and EditBench: Advancing and Evaluating Text-Guided Image Inpainting), hereinafter Wang.
Regarding claim 1, Wang teaches a method (Paragraph 1 in 2nd Col. of Page 3 – “Imagen Editor is a text-guided image inpainting model targeting improved representation and reflection of linguistic inputs, fine-grained control and high fidelity outputs”) comprising:
receiving, via a user interface of a service, a reference image and an input comprising text associated with the reference image (Paragraph 1 in 2nd Col. of Page 3 – “Imagen Editor takes three inputs from the user, 1) the image to be edited, 2) a binary mask to specify the edit region, and 3) a text prompt – and all three inputs are used to guide the output samples”; Note: Imagen Editor provides the service, and a user interface is implied because the user would not be able to provide input without an interface), wherein the reference image comprises any one or more of an object or a person (Fig. 18 – The figure shows that the input/reference image has an object or person (or both) in it; see screenshot of Fig. 18 below);
Screenshot of Fig. 18 (taken from Wang)
determining, via a trained machine learning (ML) model, one or more features of the reference image (Paragraph 2 in 2nd Col. of Page 1, Paragraph 3 in 1st Col. of Page 2, Paragraph 3 in 2nd Col. of Page 3 – “Imagen Editor adds image and mask context to each diffusion stage via three convolutional downsampling image encoders… In terms of text-image alignment, Imagen Editor trained with object-masking is preferred…In Imagen Editor, we modify Imagen to condition on both the image and the mask by concatenating them with the diffusion latents along the channel dimension”; Note: the diffusion latents are the features, and they are derived using convolutional image encoders. Convolutional image encoders are a type of machine learning model. Imagen Editor, which is made up of convolutional image encoders (see Fig. 2 below), is trained);
modifying, via one or more trained latent diffusion models (LDMs), the reference image based upon the determined features and the received input, wherein any one or more of a background of the reference image, an area of the reference image or a style of the reference image are modified (Fig. 1 Caption Page 1, Fig. 2 Caption on Page 2, Paragraph 3 in 1st Col. of Page 2, Paragraph 3 in 2nd Col. of Page 3 – “Given an image, a user defined mask, and a text prompt, Imagen Editor makes localized edits to the designated areas. The model meaningfully incorporates the user’s intent and performs photorealistic edits… All of the diffusion models, i.e., the base model and super-resolution (SR) models, condition on high-resolution 1024×1024 image and mask inputs… In terms of text-image alignment, Imagen Editor trained with object-masking is preferred…In Imagen Editor, we modify Imagen to condition on both the image and the mask by concatenating them with the diffusion latents along the channel dimension”; Note: the mask area of the reference/input image is modified via Imagen Editor, which is trained and made up of latent diffusion models (see Fig. 2 below). The modifications are based on the inputs and the diffusion latents. The diffusion latents are the features);
and causing to display, via the user interface of the service, the modified image (Fig. 1 – The figure below shows the output modified image. It would have been obvious to display it to the user because the user requested the image by providing the input and thus would want to see the output).
Screenshot of Fig. 1 (taken from Wang)
Screenshot of Fig. 2 (taken from Wang)
Regarding claim 2, Wang teaches the method of claim 1. Wang further teaches wherein the input comprises a mask of an area in the image (Paragraph 1 in 2nd Col. of Page 3 – “Imagen Editor takes three inputs from the user, 1) the image to be edited, 2) a binary mask to specify the edit region, and 3) a text prompt – and all three inputs are used to guide the output samples”).
Claims 3, 6-10, 13-17, and 19-20 are rejected under 35 U.S.C. 103 as being unpatentable over Wang in view of Karpman et al. (US 11995803 B1), hereinafter Karpman.
Regarding claim 3, Wang teaches the method of claim 1. Wang does not teach transmitting prior to receiving the input, via the user interface, any one or more of a suggestion of a mask of the area in the image or a suggestion affecting the background or the style of the reference image. However, Karpman teaches transmitting prior to receiving the input, via the user interface, any one or more of a suggestion of a mask of the area in the image or a suggestion affecting the background or the style of the reference image (Fig. 4A, Col. 20 lines 56-63 – “The generation interface 400 also includes a style menu 404 that enables the user to browse and select among pre-set image styles for the image generation request. In the example of FIG. 4A, the style menu 404 displays a set (e.g., array) of style option tiles, each style option tile including a text description of the image style (e.g., anime, Van Gogh, oil painting, line drawing, digital art, etc.) and a sample image in the corresponding style”; Note: the user interface has suggestions for image style. The reference image was previously taught by Wang in the rejection of claim 1. Fig. 4A below shows that the style suggestions are populated prior to receiving input).
Screenshot of Fig. 4A (taken from Karpman)
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified Wang to incorporate the teachings of Karpman to provide suggestions of a style because “the software application layer 124 and inspiration affordance can assist the user in deciding on a type of image they wish to generate by providing visual examples of (possible) image styles and enabling users to select among pre-set styles without requiring the user to develop and/or phrase image generation prompts that include stylistic constraints, thereby reducing the cognitive burden on the user in operating the software application layer and increasing the likelihood that the image subsequently generated by the text-to-image diffusion model 112 will align with user expectations” (Karpman: Col. 21 lines 17-27). In other words, the suggestions can make it easier for the user to make decisions regarding the image edits and generation.
Regarding claim 6, Wang teaches the method of claim 1. Wang does not teach assessing, via a privacy LDM, whether any one or more of the reference image, received input or the modified image meets predetermined criteria prior to the reference image being displayed on the user interface of the service, wherein the predetermined criteria comprises any one or more of violence, profanity or nudity. However, Karpman teaches assessing, via a privacy LDM, whether any one or more of the reference image, received input or the modified image meets predetermined criteria prior to the reference image being displayed on the user interface of the service, wherein the predetermined criteria comprises any one or more of violence, profanity or nudity (Col. 7 lines 40-46, Col. 16 lines 60-67, Col. 17 lines 1-5 – “the set of content moderation models includes a multi-headed deep learning text classifier (hereinafter the “text classifier”) that is trained and/or configured to identify sexually explicit, hateful, and/or violent text prompts based on the semantic meaning of natural language prompts provided to the text-to-image diffusion model…the system can: execute the text classifier on the text prompt to generate a severity score in each supported text moderation class (e.g., by executing each model head in the text classifier on the text prompt); compare each severity score to a predetermined severity threshold; and, in response to one or more severity scores meeting (e.g., exceeding) the severity threshold, declining the image generation request and returning an error message. The system 100 can therefore leverage the text classifier to screen, filter, and/or reject requests to generate sexually explicit, hateful, obscene, or misleading images”; Note: the text-to-image diffusion model is equivalent to the privacy LDM since it is coupled with classifiers to identify if the input meets predetermined criteria). 
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified Wang to incorporate the teachings of Karpman to assess whether the input meets predetermined criteria of violence, profanity, or nudity “in order to enforce content guidelines and prevent abuse of the image generation platform” (Karpman: Col. 17 lines 3-5). In other words, it would help ensure proper usage of the software.
Regarding claim 7, Wang in view of Karpman teaches the method of claim 6. Wang does not teach transmitting, via the user interface based upon the assessment, a request to update any one or more of the reference image or the received input; and receiving, via the user interface, any one or more of an updated reference image or an updated input. However, Karpman teaches transmitting, via the user interface based upon the assessment, a request to update any one or more of the reference image or the received input (Col. 23 lines 6-23 – “In implementations where image generation tasks are subject to content moderation, the software application layer 124 is configured to transition display of the waiting screen back to the generation interface (e.g., the generation interface shown in FIG. 4A) in response to receiving the error message from the communication interface 122 (e.g., based on outputs of the text classifier, based on outputs of the visual classifier). In these implementations, the software application layer 124 can display the error message (e.g., as a pop-up dialog box over the generation interface or adjacent to the interactive text field) informing the user that system declined to generate the requested image (e.g., “Could not generate that image—please try again with a different prompt”). The error message displayed to the user may also include a reason for the moderation decision based on a classification of the prompt by the text classifier or based on a classification of the base image by the visual classifier”); and receiving, via the user interface, any one or more of an updated reference image or an updated input (Col. 23 lines 31-35 and 39-47 – “Results interface 414 displays (a preview of) the image(s) 1 to 6 generated by the text-to-image diffusion model 112 at 416 in response to the image generation request and a description field 418 that displays the text prompt submitted to the model…The results interface 414 may also include a regenerate affordance (e.g., a button, a link) that enables the user to submit another image generation request…More specifically, in response to receiving a user input (e.g., a touch input to a display, a click) over the description field, the software application can allow the user to edit text displayed in the description field (e.g., by displaying a touch keyboard below the description field, through keyboard inputs)”; Note: the user can re-enter a new input, and in response, generate an image).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified Wang to incorporate the teachings of Karpman to re-prompt the user and receive an updated user input for the benefit of giving the user a chance to change their input and use the software in the appropriate way. If the software did not allow for updating input, then it may be too restrictive and prevent users from using the service.
Regarding claim 8, Wang teaches a system (Paragraph 1 in 2nd Col. of Page 3 – “Imagen Editor is a text-guided image inpainting model targeting improved representation and reflection of linguistic inputs, fine-grained control and high fidelity outputs”; Note: Imagen Editor is the system) comprising instructions for:
receiving, via a user interface of a service, a reference image and an input comprising text associated with the reference image (Paragraph 1 in 2nd Col. of Page 3 – “Imagen Editor takes three inputs from the user, 1) the image to be edited, 2) a binary mask to specify the edit region, and 3) a text prompt – and all three inputs are used to guide the output samples”; Note: Imagen Editor provides the service, and a user interface is implied because the user would not be able to provide input without an interface);
determining, via a trained machine learning (ML) model, one or more features of the reference image (Paragraph 2 in 2nd Col. of Page 1, Paragraph 3 in 1st Col. of Page 2, Paragraph 3 in 2nd Col. of Page 3 – “Imagen Editor adds image and mask context to each diffusion stage via three convolutional downsampling image encoders… In terms of text-image alignment, Imagen Editor trained with object-masking is preferred…In Imagen Editor, we modify Imagen to condition on both the image and the mask by concatenating them with the diffusion latents along the channel dimension”; Note: the diffusion latents are the features, and they are derived using convolutional image encoders. Convolutional image encoders are a type of machine learning model. Imagen Editor, which is made up of convolutional image encoders (see Fig. 2 above), is trained);
modifying, via one or more trained latent diffusion models (LDMs), the reference image based upon the determined features and the received input, wherein any one or more of a background of the reference image, an area of the reference image or a style of the reference image are modified (Fig. 1 Caption Page 1, Fig. 2 Caption on Page 2, Paragraph 3 in 1st Col. of Page 2, Paragraph 3 in 2nd Col. of Page 3 – “Given an image, a user defined mask, and a text prompt, Imagen Editor makes localized edits to the designated areas. The model meaningfully incorporates the user’s intent and performs photorealistic edits… All of the diffusion models, i.e., the base model and super-resolution (SR) models, condition on high-resolution 1024×1024 image and mask inputs… In terms of text-image alignment, Imagen Editor trained with object-masking is preferred…In Imagen Editor, we modify Imagen to condition on both the image and the mask by concatenating them with the diffusion latents along the channel dimension”; Note: the mask area of the reference/input image is modified via Imagen Editor, which is trained and made up of latent diffusion models (see Fig. 2 above). The modifications are based on the inputs and the diffusion latents. The diffusion latents are the features);
and causing to display, via the user interface of the service, the modified image (Fig. 1 – The figure shows the output modified image. It would have been obvious to display it to the user because the user requested the image by providing the input and thus would want to see the output).
Wang does not teach a non-transitory memory with instructions stored thereon; and a processor operably coupled to the non-transitory memory and configured to execute the instructions. However, Karpman teaches a non-transitory memory with instructions stored thereon; and a processor operably coupled to the non-transitory memory and configured to execute the instructions (Col. 24 lines 26-33 – “Storage device 505 may include a non-transitory computer-readable storage medium holding information, instructions, or some combination thereof, for example instructions that when executed by the processor 501, cause processor 501 to be configured or operable to perform one or more operations of a method as described herein”; Note: it is implied that the processor and non-transitory memory are coupled since the processor would not be able to access or execute the instructions otherwise). It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified Wang to incorporate the teachings of Karpman to have a non-transitory memory with instructions coupled to a processor to execute the instructions because the diffusion models in Wang would not have been able to run properly without a processor and would not exist without code to define the models. Additionally, a non-transitory memory would provide a persistent and reliable storage for future use of the instructions and data.
Regarding claim 9, Wang in view of Karpman teaches the system of claim 8. Wang further teaches wherein the input comprises a mask of an area in the image (Paragraph 1 in 2nd Col. of Page 3 – “Imagen Editor takes three inputs from the user, 1) the image to be edited, 2) a binary mask to specify the edit region, and 3) a text prompt – and all three inputs are used to guide the output samples”).
Regarding claim 10, Wang in view of Karpman teaches the system of claim 8. Wang does not teach wherein the processor when further configured to execute the instructions of: transmitting prior to receiving the input, via the user interface, any one or more of a suggestion of a mask of the area in the image or a suggestion affecting the background or style of the reference image. However, Karpman teaches transmitting prior to receiving the input, via the user interface, any one or more of a suggestion of a mask of the area in the image or a suggestion affecting the background or the style of the reference image (Fig. 4A, Col. 20 lines 56-63 – “The generation interface 400 also includes a style menu 404 that enables the user to browse and select among pre-set image styles for the image generation request. In the example of FIG. 4A, the style menu 404 displays a set (e.g., array) of style option tiles, each style option tile including a text description of the image style (e.g., anime, Van Gogh, oil painting, line drawing, digital art, etc.) and a sample image in the corresponding style”; Note: the user interface has suggestions for image style. The reference image was previously taught by Wang in the rejection of claim 8. Fig. 4A above shows that the style suggestions are populated prior to receiving input). 
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified Wang to incorporate the teachings of Karpman to provide suggestions of a style because “the software application layer 124 and inspiration affordance can assist the user in deciding on a type of image they wish to generate by providing visual examples of (possible) image styles and enabling users to select among pre-set styles without requiring the user to develop and/or phrase image generation prompts that include stylistic constraints, thereby reducing the cognitive burden on the user in operating the software application layer and increasing the likelihood that the image subsequently generated by the text-to-image diffusion model 112 will align with user expectations” (Karpman: Col. 21 lines 17-27). In other words, the suggestions can make it easier for the user to make decisions regarding the image edits and generation.
Regarding claim 13, Wang in view of Karpman teaches the system of claim 8. Wang does not teach wherein the processor when further configured to execute the instructions of: assessing, via a privacy LDM, whether any one or more of the reference image, the received input or the modified image meets predetermined criteria prior to the reference image being displayed on the user interface of the service. However, Karpman teaches assessing, via a privacy LDM, whether any one or more of the reference image, received input or the modified image meets predetermined criteria prior to the reference image being displayed on the user interface of the service (Col. 7 lines 40-46, Col. 16 lines 60-67, Col. 17 lines 1-5 – “the set of content moderation models includes a multi-headed deep learning text classifier (hereinafter the “text classifier”) that is trained and/or configured to identify sexually explicit, hateful, and/or violent text prompts based on the semantic meaning of natural language prompts provided to the text-to-image diffusion model…the system can: execute the text classifier on the text prompt to generate a severity score in each supported text moderation class (e.g., by executing each model head in the text classifier on the text prompt); compare each severity score to a predetermined severity threshold; and, in response to one or more severity scores meeting (e.g., exceeding) the severity threshold, declining the image generation request and returning an error message. The system 100 can therefore leverage the text classifier to screen, filter, and/or reject requests to generate sexually explicit, hateful, obscene, or misleading images”; Note: the text-to-image diffusion model is equivalent to the privacy LDM since it is coupled with classifiers to identify if the input meets predetermined criteria. The predetermined criteria include sexually explicit, violent, and hateful concepts). 
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified Wang to incorporate the teachings of Karpman to assess whether the input meets predetermined criteria “in order to enforce content guidelines and prevent abuse of the image generation platform” (Karpman: Col. 17 lines 3-5). In other words, it would help ensure proper usage of the software.
Regarding claim 14, Wang in view of Karpman teaches the system of claim 13. Wang does not teach wherein the predetermined criteria comprises any one or more of violence, profanity or nudity. However, Karpman teaches wherein the predetermined criteria comprises any one or more of violence, profanity or nudity (Col. 7 lines 40-46, Col. 16 lines 60-67, Col. 17 lines 1-5 – “the set of content moderation models includes a multi-headed deep learning text classifier (hereinafter the “text classifier”) that is trained and/or configured to identify sexually explicit, hateful, and/or violent text prompts based on the semantic meaning of natural language prompts provided to the text-to-image diffusion model…The system 100 can therefore leverage the text classifier to screen, filter, and/or reject requests to generate sexually explicit, hateful, obscene, or misleading images”). It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified Wang to incorporate the teachings of Karpman to assess whether the input meets predetermined criteria of violence, profanity, or nudity “in order to enforce content guidelines and prevent abuse of the image generation platform” (Karpman: Col. 17 lines 3-5). In other words, it would help ensure proper usage of the software.
Regarding claim 15, Wang in view of Karpman teaches the system of claim 13. Wang does not teach wherein the processor when further configured to execute the instructions of: transmitting, via the user interface based upon the assessment, a request to update any one or more of the reference image or the received input; and receiving, via the user interface, any one or more of an updated reference image or an updated input. However, Karpman teaches transmitting, via the user interface based upon the assessment, a request to update any one or more of the reference image or the received input (Col. 23 lines 6-23 – “In implementations where image generation tasks are subject to content moderation, the software application layer 124 is configured to transition display of the waiting screen back to the generation interface (e.g., the generation interface shown in FIG. 4A) in response to receiving the error message from the communication interface 122 (e.g., based on outputs of the text classifier, based on outputs of the visual classifier). In these implementations, the software application layer 124 can display the error message (e.g., as a pop-up dialog box over the generation interface or adjacent to the interactive text field) informing the user that system declined to generate the requested image (e.g., “Could not generate that image—please try again with a different prompt”). The error message displayed to the user may also include a reason for the moderation decision based on a classification of the prompt by the text classifier or based on a classification of the base image by the visual classifier”); and receiving, via the user interface, any one or more of an updated reference image or an updated input (Col. 23 lines 31-35 and 39-47 – “Results interface 414 displays (a preview of) the image(s) 1 to 6 generated by the text-to-image diffusion model 112 at 416 in response to the image generation request and a description field 418 that displays the text prompt submitted to the model…The results interface 414 may also include a regenerate affordance (e.g., a button, a link) that enables the user to submit another image generation request…More specifically, in response to receiving a user input (e.g., a touch input to a display, a click) over the description field, the software application can allow the user to edit text displayed in the description field (e.g., by displaying a touch keyboard below the description field, through keyboard inputs)”; Note: the user can re-enter a new input, and in response, generate an image).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified Wang to incorporate the teachings of Karpman to re-prompt the user and receive an updated user input for the benefit of giving the user a chance to change their input and use the software in the appropriate way. If the software did not allow for updating input, then it may be too restrictive and prevent users from using the service.
Regarding claim 16, Wang teaches program instructions (Paragraph 1 in 2nd Col. of Page 3 – “Imagen Editor is a text-guided image inpainting model targeting improved representation and reflection of linguistic inputs, fine-grained control and high fidelity outputs”; Note: Imagen Editor comprises program instructions) that cause:
receiving, via a user interface of a service, a reference image and an input comprising text associated with the reference image (Paragraph 1 in 2nd Col. of Page 3 – “Imagen Editor takes three inputs from the user, 1) the image to be edited, 2) a binary mask to specify the edit region, and 3) a text prompt – and all three inputs are used to guide the output samples”; Note: Imagen Editor provides the service, and a user interface is implied because the user would not be able to provide input without an interface), wherein the reference image comprises any one or more of an object or a person (Fig. 18 – The figure shows that the input/reference image has an object or person (or both) in it; see screenshot of Fig. 18 above);
determining, via a trained machine learning (ML) model, one or more features of the reference image (Paragraph 2 in 2nd Col. of Page 1, Paragraph 3 in 1st Col. of Page 2, Paragraph 3 in 2nd Col. of Page 3 – “Imagen Editor adds image and mask context to each diffusion stage via three convolutional downsampling image encoders… In terms of text-image alignment, Imagen Editor trained with object-masking is preferred…In Imagen Editor, we modify Imagen to condition on both the image and the mask by concatenating them with the diffusion latents along the channel dimension”; Note: the diffusion latents are the features, and they are derived using convolutional image encoders. Convolutional image encoders are a type of machine learning model. Imagen Editor, which is made up of convolutional image encoders (see Fig. 2 above), is trained);
modifying, via one or more trained latent diffusion models (LDMs), the reference image based upon the determined features and the received input, wherein any one or more of a background of the reference image, an area of the reference image or a style of the reference image are modified (Fig. 1 Caption Page 1, Fig. 2 Caption on Page 2, Paragraph 3 in 1st Col. of Page 2, Paragraph 3 in 2nd Col. of Page 3 – “Given an image, a user defined mask, and a text prompt, Imagen Editor makes localized edits to the designated areas. The model meaningfully incorporates the user’s intent and performs photorealistic edits… All of the diffusion models, i.e., the base model and super-resolution (SR) models, condition on high-resolution 1024×1024 image and mask inputs… In terms of text-image alignment, Imagen Editor trained with object-masking is preferred…In Imagen Editor, we modify Imagen to condition on both the image and the mask by concatenating them with the diffusion latents along the channel dimension”; Note: the mask area of the reference/input image is modified via Imagen Editor, which is trained and made up of latent diffusion models (see Fig. 2 above). The modifications are based on the inputs and the diffusion latents. The diffusion latents are the features);
and causing to display, via the user interface of the service, the modified image (Fig. 1 – The figure shows the output modified image. It would have been obvious to display it to the user because the user requested the image by providing the input and thus would want to see the output).
Wang does not teach a computer readable medium comprising program instructions stored thereon which when executed by a processor effectuate the method. However, Karpman teaches a computer readable medium comprising program instructions stored thereon which when executed by a processor effectuate the method (Col. 24 lines 26-33 – “Storage device 505 may include a non-transitory computer-readable storage medium holding information, instructions, or some combination thereof, for example instructions that when executed by the processor 501, cause processor 501 to be configured or operable to perform one or more operations of a method as described herein”). It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified Wang to incorporate the teachings of Karpman to have a computer readable medium with instructions coupled to a processor to execute the instructions because the diffusion models in Wang would not have been able to run properly without a processor and would not exist without code to define the models. Additionally, a non-transitory memory would provide persistent and reliable storage for future use of the instructions and data.
Regarding claim 17, Wang in view of Karpman teaches the computer readable medium of claim 16. Wang does not teach wherein the program instructions which when executed by the processor further effectuate: transmitting prior to receiving the input, via the user interface, any one or more of a suggestion of a mask of the area in the image or a suggestion affecting the background or the style of the reference image. However, Karpman teaches transmitting prior to receiving the input, via the user interface, any one or more of a suggestion of a mask of the area in the image or a suggestion affecting the background or the style of the reference image (Fig. 4A, Col. 20 lines 56-63 – “The generation interface 400 also includes a style menu 404 that enables the user to browse and select among pre-set image styles for the image generation request. In the example of FIG. 4A, the style menu 404 displays a set (e.g., array) of style option tiles, each style option tile including a text description of the image style (e.g., anime, Van Gogh, oil painting, line drawing, digital art, etc.) and a sample image in the corresponding style”; Note: the user interface has suggestions for image style. The reference image was previously taught by Wang in the rejection of claim 16. Fig. 4A above shows that the style suggestions are populated prior to receiving input). 
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified Wang to incorporate the teachings of Karpman to provide suggestions of a style because “the software application layer 124 and inspiration affordance can assist the user in deciding on a type of image they wish to generate by providing visual examples of (possible) image styles and enabling users to select among pre-set styles without requiring the user to develop and/or phrase image generation prompts that include stylistic constraints, thereby reducing the cognitive burden on the user in operating the software application layer and increasing the likelihood that the image subsequently generated by the text-to-image diffusion model 112 will align with user expectations” (Karpman: Col. 21 lines 17-27). In other words, the suggestions can make it easier for the user to make decisions regarding the image edits and generation.
Regarding claim 19, Wang in view of Karpman teaches the computer readable medium of claim 16. Wang does not teach wherein the program instructions which when executed by the processor further effectuate: assessing, via a privacy LDM, whether any one or more of the reference image, the received input or the modified image meets predetermined criteria prior to the reference image being displayed on the user interface of the service, wherein the predetermined criteria comprises any one or more of violence, profanity or nudity. However, Karpman teaches assessing, via a privacy LDM, whether any one or more of the reference image, received input or the modified image meets predetermined criteria prior to the reference image being displayed on the user interface of the service, wherein the predetermined criteria comprises any one or more of violence, profanity or nudity (Col. 7 lines 40-46, Col. 16 lines 60-67, Col. 17 lines 1-5 – “the set of content moderation models includes a multi-headed deep learning text classifier (hereinafter the “text classifier”) that is trained and/or configured to identify sexually explicit, hateful, and/or violent text prompts based on the semantic meaning of natural language prompts provided to the text-to-image diffusion model…the system can: execute the text classifier on the text prompt to generate a severity score in each supported text moderation class (e.g., by executing each model head in the text classifier on the text prompt); compare each severity score to a predetermined severity threshold; and, in response to one or more severity scores meeting (e.g., exceeding) the severity threshold, declining the image generation request and returning an error message. 
The system 100 can therefore leverage the text classifier to screen, filter, and/or reject requests to generate sexually explicit, hateful, obscene, or misleading images”; Note: the text-to-image diffusion model is equivalent to the privacy LDM since it is coupled with classifiers to identify if the input meets predetermined criteria). It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified Wang to incorporate the teachings of Karpman to assess the input on whether or not it meets a predetermined criteria of violence, profanity, or nudity “in order to enforce content guidelines and prevent abuse of the image generation platform” (Karpman: Col. 17 lines 3-5). In other words, it would help ensure proper usage of the software.
Regarding claim 20, Wang in view of Karpman teaches the computer readable medium of claim 16. Wang does not teach wherein the program instructions which when executed by the processor further effectuate: transmitting, via the user interface based upon the assessment, a request to update any one or more of the reference image or the received input; and receiving, via the user interface, any one or more of an updated reference image or an updated input. However, Karpman teaches transmitting, via the user interface based upon the assessment, a request to update any one or more of the reference image or the received input (Col. 23 lines 6-23 – “In implementations where image generation tasks are subject to content moderation, the software application layer 124 is configured to transition display of the waiting screen back to the generation interface (e.g., the generation interface shown in FIG. 4A) in response to receiving the error message from the communication interface 122 (e.g., based on outputs of the text classifier, based on outputs of the visual classifier). In these implementations, the software application layer 124 can display the error message (e.g., as a pop-up dialog box over the generation interface or adjacent to the interactive text field) informing the user that system declined to generate the requested image (e.g., “Could not generate that image—please try again with a different prompt”). The error message displayed to the user may also include a reason for the moderation decision based on a classification of the prompt by the text classifier or based on a classification of the base image by the visual classifier”); and receiving, via the user interface, any one or more of an updated reference image or an updated input (Col. 
23 lines 31-35 and 39-47 – “Results interface 414 displays (a preview of) the image(s) 1 to 6 generated by the text-to-image diffusion model 112 at 416 in response to the image generation request and a description field 418 that displays the text prompt submitted to the model…The results interface 414 may also include a regenerate affordance (e.g., a button, a link) that enables the user to submit another image generation request…More specifically, in response to receiving a user input (e.g., a touch input to a display, a click) over the description field, the software application can allow the user to edit text displayed in the description field (e.g., by displaying a touch keyboard below the description field, through keyboard inputs)”; Note: the user can re-enter a new input, and in response, generate an image). It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified Wang to incorporate the teachings of Karpman to re-prompt the user and receive an updated user input for the benefit of giving the user a chance to change their input and use the software in the appropriate way. If the software did not allow for updating input, then it may be too restrictive and prevent users from using the service.
Claim 4 is rejected under 35 U.S.C. 103 as being unpatentable over Wang in view of Liu et al. (DiffProtect: Generate Adversarial Examples with Diffusion Models for Facial Privacy Protection), hereinafter Liu.
Regarding claim 4, Wang teaches the method of claim 1. Wang does not teach wherein an identity of the object or the person is preserved. However, Liu teaches wherein an identity of the object or the person is preserved (Paragraph 3 in 1st Col. of Page 2 – “we propose DiffProtect, which utilizes a pre-trained diffusion autoencoder [37] to generate adversarial images for facial privacy protection. The overall pipeline of DiffProtect is shown in Fig. 2. We first encode an input face image I into a high-level semantic code z and a low-level noise code xT. We then iteratively optimize an adversarial semantic code zadv such that the resulting protected image Ip generated by the conditional DDIM decoding process [37, 50] can fool the face recognition model”; Note: the identity of the person is preserved). It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified Wang to incorporate the teachings of Liu to preserve the identity of the person because “the widely deployed FR [face recognition] systems also pose a huge threat to personal privacy as billions of users have publicly shared their photos on social media. Through large-scale social media photo analysis, FR systems can be used for detecting user relationships [46], stalking victims [47], stealing identities [32], and performing massive government surveillance [42, 17, 34]. It is urgent to develop facial privacy protection techniques to protect individuals from unauthorized FR systems” (Liu: Paragraph 2 in 1st Col. of Page 1). In other words, preserving a person’s identity provides a layer of safety and privacy for the person.
Claim 5 is rejected under 35 U.S.C. 103 as being unpatentable over Wang in view of Kim et al. (Zoom-to-Inpaint: Image Inpainting with High-Frequency Details), hereinafter Kim.
Regarding claim 5, Wang teaches the method of claim 1. Wang further teaches detecting whether a resolution of the reference image exceeds a capacity of the LDMs (Paragraph 4 in 2nd Col. of Page 3, Paragraph 1 in 1st Col. of Page 4 – “The conditioning image and the corresponding mask input to Imagen Editor are always at 1024×1024 resolution. The base diffusion 64×64 model and the 64×64→256×256 super-resolution model operate at a smaller resolution, and thus require some form of downsampling to match the diffusion latent resolution”; Note: the conditioning/reference image is detected to have a higher resolution than the diffusion model); determining a section of the reference image being preserved (Paragraph 4 in 2nd Col. of Page 3 – “In Imagen Editor, we modify Imagen to condition on both the image and the mask by concatenating them with the diffusion latents along the channel dimension”; Note: the mask is a section of the image being preserved); and editing a section of the reference image being preserved and pasting the edited section into the reference image (Fig. 1 Caption on Page 1, Fig. 2 Caption on Page 2 – “Given an image, a user defined mask, and a text prompt, Imagen Editor makes localized edits to the designated areas… All of the diffusion models, i.e., the base model and super-resolution (SR) models, condition on high-resolution 1024×1024 image and mask inputs”; Note: Fig. 2 above shows that the mask area of the image is upscaled and edited. Fig. 1 above shows the edit being pasted onto the input/reference image (the spacesuit and headphones are pasted onto the image of the dog)). Wang does not teach that the edited section is zoomed-in, as recited in the limitation: “editing a zoomed-in section of the reference image being preserved and pasting the edited section into the reference image”. However, Kim teaches editing a zoomed-in section of the reference image being preserved (Fig. 5, Paragraph 2 in 2nd Col.
of Page 4 – “our proposed refinement is achieved by zooming in, refining, then zooming out back to the input resolution”; Note: Fig. 5 shows the zoomed-in section of the image based on the mask and how that section is edited to restore missing regions). It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified Wang to incorporate the teachings of Kim to zoom in on the mask because “This allows the refinement network to correct local irregularities at a finer level and to learn from high-resolution (HR) labels, thus effectively reducing the spectral bias at the desired resolution and injecting more high-frequency details into the resulting image” (Kim: paragraph 2 in 1st Col. of Page 2). In other words, it would help produce a better-quality output.
Screenshot of Fig. 5 (taken from Kim)
Claim 11 is rejected under 35 U.S.C. 103 as being unpatentable over Wang in view of Karpman and Liu.
Regarding claim 11, Wang in view of Karpman teaches the system of claim 8. Wang further teaches wherein: the reference image comprises any one or more of an object or a person (Fig. 18 – The figure shows that the input/reference image has an object or person (or both) in it; see screenshot of Fig. 18 above). Wang does not teach an identity of the object or the person is preserved. However, Liu teaches wherein an identity of the object or the person is preserved (Paragraph 3 in 1st Col. of Page 2 – “we propose DiffProtect, which utilizes a pre-trained diffusion autoencoder [37] to generate adversarial images for facial privacy protection. The overall pipeline of DiffProtect is shown in Fig. 2. We first encode an input face image I into a high-level semantic code z and a low-level noise code xT. We then iteratively optimize an adversarial semantic code zadv such that the resulting protected image Ip generated by the conditional DDIM decoding process [37, 50] can fool the face recognition model”; Note: the identity of the person is preserved). It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified Wang to incorporate the teachings of Liu to preserve the identity of the person because “the widely deployed FR [face recognition] systems also pose a huge threat to personal privacy as billions of users have publicly shared their photos on social media. Through large-scale social media photo analysis, FR systems can be used for detecting user relationships [46], stalking victims [47], stealing identities [32], and performing massive government surveillance [42, 17, 34]. It is urgent to develop facial privacy protection techniques to protect individuals from unauthorized FR systems” (Liu: Paragraph 2 in 1st Col. of Page 1). In other words, preserving a person’s identity provides a layer of safety and privacy for the person.
Claims 12 and 18 are rejected under 35 U.S.C. 103 as being unpatentable over Wang in view of Karpman and Kim.
Regarding claim 12, Wang in view of Karpman teaches the system of claim 8. Wang further teaches wherein the processor is further configured to execute the instructions of: detecting whether a resolution of the reference image exceeds a capacity of the LDMs (Paragraph 4 in 2nd Col. of Page 3, Paragraph 1 in 1st Col. of Page 4 – “The conditioning image and the corresponding mask input to Imagen Editor are always at 1024×1024 resolution. The base diffusion 64×64 model and the 64×64→256×256 super-resolution model operate at a smaller resolution, and thus require some form of downsampling to match the diffusion latent resolution”; Note: the conditioning/reference image is detected to have a higher resolution than the diffusion model); determining a section of the reference image being preserved (Paragraph 4 in 2nd Col. of Page 3 – “In Imagen Editor, we modify Imagen to condition on both the image and the mask by concatenating them with the diffusion latents along the channel dimension”; Note: the mask is a section of the image being preserved); and editing a section of the reference image being preserved and pasting the edited section into the reference image (Fig. 1 Caption on Page 1, Fig. 2 Caption on Page 2 – “Given an image, a user defined mask, and a text prompt, Imagen Editor makes localized edits to the designated areas… All of the diffusion models, i.e., the base model and super-resolution (SR) models, condition on high-resolution 1024×1024 image and mask inputs”; Note: Fig. 2 above shows that the mask area of the image is upscaled and edited. Fig. 1 above shows the edit being pasted onto the input/reference image (the spacesuit and headphones are pasted onto the image of the dog)). Wang does not teach that the edited section is zoomed-in, as recited in the limitation: “editing a zoomed-in section of the reference image being preserved and pasting the edited section into the reference image”.
However, Kim teaches editing a zoomed-in section of the reference image being preserved (Fig. 5, Paragraph 2 in 2nd Col. of Page 4 – “our proposed refinement is achieved by zooming in, refining, then zooming out back to the input resolution”; Note: Fig. 5 shows the zoomed-in section of the image based on the mask and how that section is edited to restore missing regions). It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified Wang to incorporate the teachings of Kim to zoom in on the mask because “This allows the refinement network to correct local irregularities at a finer level and to learn from high-resolution (HR) labels, thus effectively reducing the spectral bias at the desired resolution and injecting more high-frequency details into the resulting image” (Kim: paragraph 2 in 1st Col. of Page 2). In other words, it would help produce a better-quality output.
Regarding claim 18, Wang in view of Karpman teaches the computer readable medium of claim 16. Claim 18 further recites wherein the program instructions which when executed by the processor further effectuate: detecting whether a resolution of the reference image exceeds a capacity of the LDMs; determining a section of the reference image being preserved; and editing a zoomed-in section of the reference image being preserved and pasting the edited section into the reference image. Wang teaches detecting whether a resolution of the reference image exceeds a capacity of the LDMs (Paragraph 4 in 2nd Col. of Page 3, Paragraph 1 in 1st Col. of Page 4 – “The conditioning image and the corresponding mask input to Imagen Editor are always at 1024×1024 resolution. The base diffusion 64×64 model and the 64×64→256×256 super-resolution model operate at a smaller resolution, and thus require some form of downsampling to match the diffusion latent resolution”; Note: the conditioning/reference image is detected to have a higher resolution than the diffusion model); determining a section of the reference image being preserved (Paragraph 4 in 2nd Col. of Page 3 – “In Imagen Editor, we modify Imagen to condition on both the image and the mask by concatenating them with the diffusion latents along the channel dimension”; Note: the mask is a section of the image being preserved); and editing a section of the reference image being preserved and pasting the edited section into the reference image (Fig. 1 Caption on Page 1, Fig. 2 Caption on Page 2 – “Given an image, a user defined mask, and a text prompt, Imagen Editor makes localized edits to the designated areas… All of the diffusion models, i.e., the base model and super-resolution (SR) models, condition on high-resolution 1024×1024 image and mask inputs”; Note: Fig. 2 above shows that the mask area of the image is upscaled and edited. Fig.
1 above shows the edit being pasted onto the input/reference image (the spacesuit and headphones are pasted onto the image of the dog)). Wang does not teach that the edited section is zoomed-in, as recited in the limitation: “editing a zoomed-in section of the reference image being preserved and pasting the edited section into the reference image”. However, Kim teaches editing a zoomed-in section of the reference image being preserved (Fig. 5, Paragraph 2 in 2nd Col. of Page 4 – “our proposed refinement is achieved by zooming in, refining, then zooming out back to the input resolution”; Note: Fig. 5 shows the zoomed-in section of the image based on the mask and how that section is edited to restore missing regions). It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified Wang to incorporate the teachings of Kim to zoom in on the mask because “This allows the refinement network to correct local irregularities at a finer level and to learn from high-resolution (HR) labels, thus effectively reducing the spectral bias at the desired resolution and injecting more high-frequency details into the resulting image” (Kim: paragraph 2 in 1st Col. of Page 2). In other words, it would help produce a better-quality output.
Conclusion
The prior art made of record and not relied upon is considered pertinent to applicant's disclosure. Revanur et al. (US 12462449 B2) teaches a method of editing an input image based on a text prompt using a generator network. Schramowski et al. (Safe Latent Diffusion: Mitigating Inappropriate Degeneration in Diffusion Models) teaches a safe latent diffusion model that filters or suppresses inappropriate content from being generated.
Any inquiry concerning this communication or earlier communications from the examiner should be directed to MICHELLE HAU MA whose telephone number is (571)272-2187. The examiner can normally be reached M-Th 7-5:30.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, King Poon can be reached at (571) 270-0728. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.
/MICHELLE HAU MA/ Examiner, Art Unit 2617
/KING Y POON/Supervisory Patent Examiner, Art Unit 2617