DETAILED ACTION
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Information Disclosure Statement
The information disclosure statement (IDS) was submitted on 6/20/2024. The submission is in compliance with the provisions of 37 CFR 1.97. Accordingly, the information disclosure statement is being considered by the examiner.
Specification
The disclosure is objected to because of the following informalities:
¶ 0016, lines 1-2, “generative system generates encodes global context” appears to be a typo. Examiner suggests “generative system encodes global context”.
Appropriate correction is required.
Positive Statement Regarding 35 U.S.C. § 101
The Examiner’s 35 U.S.C. § 101 analysis recognizes that the claimed subject matter is directed to a practical application of a technical solution. The claimed elements, taken as a whole, improve the functioning of encoder-decoder based image editing methods by improving accuracy through local refinement (see ¶ 0021). Because the claims recite specific steps and structural elements that produce a tangible technical result, they are not directed to an abstract idea, and no additional inventive-concept analysis is required. Accordingly, the record supports a positive § 101 determination for the present claims.
Claim Rejections - 35 USC § 103
In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis (i.e., changing from AIA to pre-AIA) for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.
The factual inquiries for establishing a background for determining obviousness under 35 U.S.C. 103 are summarized as follows:
1. Determining the scope and contents of the prior art.
2. Ascertaining the differences between the prior art and the claims at issue.
3. Resolving the level of ordinary skill in the pertinent art.
4. Considering objective evidence present in the application indicating obviousness or nonobviousness.
Claims 1-4, 6, 8, 10-11, and 17-19 are rejected under 35 U.S.C. 103 as being unpatentable over Liu et al. (Liu, Difan, et al. "Asset: autoregressive semantic scene editing with transformers at high resolutions." ACM Transactions on Graphics (TOG) 41.4 (2022): 1-12) (hereafter, “Liu”) in view of Ren et al. (US 11,580,673) (hereafter, “Ren”).
Regarding claim 1, Liu discloses a computer-implemented method comprising: generating, utilizing an encoder neural network (Page 5, §Transformer encoder, our transformer encoder), a latent feature vector of a digital image by encoding global context information of the digital image into the latent feature vector (Page 5, Eqn. 2, §Transformer encoder, Specifically, for each position in the sequence, three 𝑑-dimensional learned embeddings are produced: (i) an image embedding 𝐸im(𝑥𝑙 ) representing the token 𝑥𝑙 at position 𝑙 in our sequence and in turn the corresponding RGB image region, (ii) an embedding 𝐸map(𝑝𝑙 ) of the semantic token 𝑝𝑙 at the same position, and finally (iii) a positional embedding 𝐸pos(𝑙 ) for that position 𝑙; §Transformer Decoder, with the help of the global context obtained through the transformer encoder); [determining a modified latent feature vector by trimming the latent feature vector to a feature subset corresponding to a masked portion of the digital image; generating, utilizing a generative decoder neural network on the modified latent feature vector, digital image data corresponding to the masked portion of the digital image]; and generating a modified digital image including the digital image data corresponding to the masked portion combined with additional portions of the digital image (Page 5, §3.3 Image decoder, To avoid this, we follow the same strategy as SESAME and retain only the generated pixels in the masked regions while the rest of the image is retrieved from the original image).
However, Liu fails to explicitly disclose determining a modified latent feature vector by trimming the latent feature vector to a feature subset corresponding to a masked portion of the digital image; generating, utilizing a generative decoder neural network on the modified latent feature vector, digital image data corresponding to the masked portion of the digital image.
Ren teaches determining a modified latent feature vector by trimming the latent feature vector to a feature subset corresponding to a masked portion of the digital image (Col. 8, lines 5-10, The image synthesis process with mask embedding represented in FIG. 2 may sample a mask constraint point in a lowest resolution manifold, e.g., by locating a correct partition via mask embedding input 208 and sampling a point within that partition via a latent features vector 210. Examiner considers the sampling of the latent features via the mask embedding as “trimming to a feature subset”); generating, utilizing a generative decoder neural network on the modified latent feature vector (Fig. 3, #300; Col. 8, line 55 – Col. 9, line 20, FIG. 3 is a diagram illustrating an example generator 300… The latent features vector may be a 100-dimensional vector thus the input of the latent projection path may be a 132-dimensional vector; Examiner considers the latent projection path in Fig. 3 as the “generative decoder neural network”), digital image data corresponding to the masked portion of the digital image (Col. 9, lines 28-29, The output of the projection layer may be the synthesized image).
Both Liu and Ren are analogous to the claimed invention because they both use encoder-decoder neural networks for image processing. It would have been obvious to a person of ordinary skill before the effective filing date of the claimed invention to incorporate the trimmed latent feature vector and masked image output of Ren into the image editing pipeline of Liu. The suggestion/motivation for including the latent features of Ren would have been for improved projection efficiency, as suggested by Ren at Col. 8, lines 52-53, These observations indicate that incorporating mask embedding input significantly improves the features projection efficiency. The suggestion/motivation for combining the output of the masked image region of Ren with the original image in Liu would have been for preventing minor edits in the unmasked region, as suggested by Liu at Page 5, §3.3 Image decoder, the reconstruction of the encoder-decoder pair is not perfect and leads to minor changes in the areas that are not edited. To avoid this, we … retain only the generated pixels in the masked regions while the rest of the image is retrieved from the original image.
This method of improving Liu was within the ordinary ability of one of ordinary skill in the art based on the teachings of Ren.
Therefore, it would have been obvious to one of ordinary skill in the art, before the effective filing date, to modify Liu with the teachings of Ren to obtain the invention as specified in claim 1.
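For illustration of the combination mapped above, the following Examiner-supplied sketch shows an encode, trim-to-masked-subset, decode, and composite pipeline corresponding to the elements of claim 1. All shapes, seeds, and the toy stand-in "networks" are hypothetical and are not drawn from either Liu or Ren; the sketch only illustrates the data flow the mapping relies on.

```python
import numpy as np

def encode(image):
    # Toy stand-in for an encoder neural network: one latent feature per 16x16 patch.
    h, w = image.shape[0] // 16, image.shape[1] // 16
    return np.random.default_rng(0).normal(size=(h * w, 8)), (h, w)

def trim_to_masked_subset(latent, mask, grid):
    # Keep only the features whose patches fall inside the masked region.
    h, w = grid
    patch_mask = mask.reshape(h, 16, w, 16).any(axis=(1, 3)).reshape(-1)
    return latent[patch_mask], patch_mask

def decode_masked(features):
    # Toy stand-in for a generative decoder: one 16x16x3 patch per feature.
    return np.zeros((features.shape[0], 16, 16, 3))

def composite(image, patches, patch_mask, grid):
    # Retain generated pixels only in masked patches; the rest of the image
    # is retrieved from the original, as in Liu's image decoder strategy.
    h, w = grid
    out = image.copy()
    for idx, p in enumerate(np.flatnonzero(patch_mask)):
        i, j = divmod(p, w)
        out[i*16:(i+1)*16, j*16:(j+1)*16] = patches[idx]
    return out

image = np.ones((64, 64, 3))
mask = np.zeros((64, 64), dtype=bool)
mask[0:16, 0:16] = True                      # mask covering exactly one patch
latent, grid = encode(image)
subset, patch_mask = trim_to_masked_subset(latent, mask, grid)
patches = decode_masked(subset)
result = composite(image, patches, patch_mask, grid)
```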
Regarding claim 2, in which claim 1 is incorporated, Liu discloses wherein generating the latent feature vector comprises utilizing the encoder neural network to extract a plurality of tokens representing patches of the digital image (Fig. 2, “Patch feature tokens”; Page 3, §3.1 Image Encoder, The input RGB image X of size 𝐻im ×𝑊im × 3 is processed by a convolutional encoder resulting in a feature map F of size 𝐻im/16 ×𝑊im/16 ×𝑑 … The feature map F is subsequently quantized following VQGAN with the help of a learned codebook Z, i.e., each feature map entry f𝑖, 𝑗 at position (𝑖, 𝑗 ) in F is mapped to the closest codebook entry ˆf𝑖, 𝑗 = arg minz𝜅 ∈Z ||f𝑖, 𝑗 − z𝜅 ||, where {z𝜅 }|Z|𝜅=1 are codebook entries
with dimensionality 𝑑. Examiner considers the convolutional encoder to extract 16x16 pixel patches and mapping of patches to codebook entries as “extracting tokens”) to encode global context information from the digital image into each of the plurality of tokens (Page 5, Eqn. 2, §Transformer encoder, Specifically, for each position in the sequence, three 𝑑-dimensional learned embeddings are produced: (i) an image embedding 𝐸im(𝑥𝑙 ) representing the token 𝑥𝑙 at position 𝑙 in our sequence and in turn the corresponding RGB image region, (ii) an embedding 𝐸map(𝑝𝑙 ) of the semantic token 𝑝𝑙 at the same position, and finally (iii) a positional embedding 𝐸pos(𝑙 ) for that position 𝑙. Examiner considers the semantic tokens and positional embeddings as global context information).
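The VQGAN-style quantization quoted above (mapping each feature-map entry to its closest codebook entry) can be sketched as follows. The dimensions (a 32-entry codebook, d = 8, a 4x4 feature map) are Examiner-supplied toy values, not values from Liu.

```python
import numpy as np

rng = np.random.default_rng(1)
codebook = rng.normal(size=(32, 8))          # |Z| = 32 learned entries of dimension d = 8
features = rng.normal(size=(4, 4, 8))        # feature map F of size H/16 x W/16 x d

# Map each feature-map entry f_ij to the index of the closest codebook entry,
# i.e., the arg-min over ||f_ij - z_k|| quoted in the mapping above.
flat = features.reshape(-1, 8)
dists = np.linalg.norm(flat[:, None, :] - codebook[None, :, :], axis=-1)
tokens = dists.argmin(axis=1).reshape(4, 4)  # one discrete token per image patch
quantized = codebook[tokens]                 # quantized feature map, same shape as F
```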
Regarding claim 3, in which claim 1 is incorporated, Liu discloses wherein determining the modified latent feature vector comprises: determining a subset of patches of the digital image corresponding to the masked portion of the digital image (Fig. 2, Edited semantic map; Page 3, §3.1 Image encoder, We also create a 𝐻im ×𝑊im binary mask indicating image regions
that must be replaced according to the semantic map edits… The codebook indices of the edited regions, as indicated by the binary mask, are replaced with a special [MASK] token. Examiner considers the binary mask to indicate which patches of the images correspond to the masked portion); and [trimming] tokens corresponding to the latent feature vector (Page 4, right column, first paragraph, The feature map F is subsequently quantized following VQGAN [Esser et al. 2021a] with the help of a learned codebook Z, i.e., each feature map entry f𝑖, 𝑗 at position (𝑖, 𝑗 ) in F is mapped to the closest codebook entry. Examiner considers the codebook entries as tokens) [to a subset of tokens representing the subset of patches].
However, Liu fails to explicitly disclose trimming to a subset representing the subset of patches.
Ren teaches trimming to a subset representing the subset of patches (Col. 8, lines 5-10, The image synthesis process with mask embedding represented in FIG. 2 may sample a mask constraint point in a lowest resolution manifold, e.g., by locating a correct partition via mask embedding input 208 and sampling a point within that partition via a latent features vector 210. Examiner considers the sampling of the latent features via the mask embedding as “trimming to a feature subset”).
Both Liu and Ren are analogous to the claimed invention because they both use encoder-decoder neural networks for image processing. It would have been obvious to a person of ordinary skill before the effective filing date of the claimed invention to incorporate the trimmed subset of Ren into the image editing pipeline of Liu. The suggestion/motivation for doing so would have been for improved projection efficiency, as suggested by Ren at Col. 8, lines 52-53, These observations indicate that incorporating mask embedding input significantly improves the features projection efficiency.
This method of improving Liu was within the ordinary ability of one of ordinary skill in the art based on the teachings of Ren.
Therefore, it would have been obvious to one of ordinary skill in the art, before the effective filing date, to modify Liu with the teachings of Ren to obtain the invention as specified in claim 3.
Regarding claim 4, in which claim 3 is incorporated, Liu discloses wherein determining the subset of patches corresponding to the masked portion comprises determining one or more patches of the digital image including the masked portion of the digital image (Page 3, §3.1 Image encoder, We also create a 𝐻im ×𝑊im binary mask indicating image regions
that must be replaced according to the semantic map edits. Examiner considers the creation of the binary mask to imply determining which patches include the masked portion).
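The determination relied on above, reducing a pixel-level binary mask to the set of patches that include the masked portion, can be sketched as follows. The patch size of 16 mirrors the H/16 x W/16 feature map in Liu; the mask coordinates are Examiner-supplied toy values.

```python
import numpy as np

# A patch belongs to the subset if any of its pixels are masked.
mask = np.zeros((64, 64), dtype=bool)
mask[10:30, 50:60] = True                    # an edit region straddling two patches

patch = 16
h, w = mask.shape[0] // patch, mask.shape[1] // patch
patch_hits = mask.reshape(h, patch, w, patch).any(axis=(1, 3))
subset = list(zip(*np.nonzero(patch_hits)))  # (row, col) of patches to regenerate
```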
Regarding claim 6, in which claim 1 is incorporated, Liu discloses wherein generating the digital image data corresponding to the masked portion comprises: determining a generative prompt comprising an indication of digital content to insert into the digital image (Page 3, §3 METHOD, Our method synthesizes images guided by user input in the form of an edited label map … the user paints some desired changes on the label map); and generating the digital image data according to the modified latent feature vector and the generative prompt (Page 3, §3 METHOD, Since there exist several possible output images reflecting the input edits, our method generates a diverse set of outputs allowing the user to select the most preferable one).
Regarding claim 8, in which claim 1 is incorporated, Liu discloses wherein generating the modified digital image comprises: generating a latent composite image by inserting the digital image data into the digital image in a latent image domain at a location corresponding to the masked portion of the digital image (Page 5, §Transformer Decoder, We note that the tokens corresponding to unmasked image regions (i.e., image regions to be preserved) are set to the original image codebook indices. We predict the distributions only for positions corresponding to the edited image regions); and generating the modified digital image by utilizing a latent decoder neural network on the latent composite image (Page 5, §3.3 Image decoder, The image decoder takes as input the quantized feature map and decodes an RGB image; Page 13, Table 7. Table 7 lists the neural network structure of the decoder).
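The latent compositing relied on for claim 8, predicted codebook indices at edited positions and original indices everywhere else, can be sketched as follows. The 4x4 token grid and index values are Examiner-supplied toy values.

```python
import numpy as np

rng = np.random.default_rng(2)
original_tokens = rng.integers(0, 32, size=(4, 4))   # codebook indices of the input image
generated_tokens = rng.integers(0, 32, size=(4, 4))  # indices predicted for edited regions
edit_mask = np.zeros((4, 4), dtype=bool)
edit_mask[1:3, 1:3] = True                           # patches to be regenerated

# Latent composite: predicted indices only at edited positions, original
# indices everywhere else, matching Liu's transformer decoder behavior.
composite_tokens = np.where(edit_mask, generated_tokens, original_tokens)
```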
Regarding claim 10, Liu discloses a system comprising: one or more memory devices comprising a digital image (Page 10, §Comparison with full attention, Based on an NVIDIA A100 (40GB VRAM)); and one or more processors coupled to the one or more memory devices (Page 10, §Comparison with full attention, Based on an NVIDIA A100 (40GB VRAM). The A100 is a GPU which is a processor) that cause the system to perform operations comprising: generating, utilizing a transformer-based encoder neural network (Page 4, §3.2 Autoregressive transformer, The transformer encoder captures bi-directional context of the image; Page 5, §Transformer encoder), a latent feature vector corresponding to a plurality of tokens representing patches of a digital image (Fig. 2, “Patch feature tokens”; Page 3, §3.1 Image Encoder, The input RGB image X of size 𝐻im ×𝑊im × 3 is processed by a convolutional encoder resulting in a feature map F of size 𝐻im/16 ×𝑊im/16 ×𝑑 … The feature map F is subsequently quantized following VQGAN with the help of a learned codebook Z, i.e., each feature map entry f𝑖, 𝑗 at position (𝑖, 𝑗 ) in F is mapped to the closest codebook entry ˆf𝑖, 𝑗 = arg minz𝜅 ∈Z ||f𝑖, 𝑗 − z𝜅 ||, where {z𝜅 }|Z|𝜅=1 are codebook entries with dimensionality 𝑑. Examiner considers the convolutional encoder to extract 16x16 pixel patches and the codebook entries as “tokens”) to encode global context information of the digital image into the latent feature vector (Page 5, Eqn. 
2, §Transformer encoder, Specifically, for each position in the sequence, three 𝑑-dimensional learned embeddings are produced: (i) an image embedding 𝐸im(𝑥𝑙 ) representing the token 𝑥𝑙 at position 𝑙 in our sequence and in turn the corresponding RGB image region, (ii) an embedding 𝐸map(𝑝𝑙 ) of the semantic token 𝑝𝑙 at the same position, and finally (iii) a positional embedding 𝐸pos(𝑙 ) for that position 𝑙; §Transformer Decoder, with the help of the global context obtained through the transformer encoder); [determining a modified latent feature vector by trimming the latent feature vector to a feature subset representing a subset of patches of the digital image corresponding to a masked portion of the digital image]; and generating a modified digital image by: generating, utilizing a transformer-based generative decoder neural network on the modified latent feature vector (Page 4, §3.2 Autoregressive transformer, used by the transformer decoder to generate new codebook indices autoregressively; Page 5, §Transformer Decoder), digital image data for the subset of patches corresponding to the masked portion of the digital image (Page 5, §3.3 Image Decoder, The image decoder takes as input the quantized feature map and decodes an RGB image… retain only the generated pixels in the masked regions); and combining the digital image data generated for the subset of patches with an additional subset of patches of the digital image outside the masked portion of the digital image (Page 5, §3.3 Image Decoder, To avoid this, we follow the same strategy as SESAME and retain only the generated pixels in the masked regions while the rest of the image is retrieved from the original image).
However, Liu fails to explicitly disclose determining a modified latent feature vector by trimming the latent feature vector to a feature subset representing a subset of patches of the digital image corresponding to a masked portion of the digital image.
Ren teaches determining a modified latent feature vector by trimming the latent feature vector to a feature subset representing a subset of patches of the digital image corresponding to a masked portion of the digital image (Col. 8, lines 5-10, The image synthesis process with mask embedding represented in FIG. 2 may sample a mask constraint point in a lowest resolution manifold, e.g., by locating a correct partition via mask embedding input 208 and sampling a point within that partition via a latent features vector 210. Examiner considers the sampling of the latent features via the mask embedding as “trimming to a feature subset”).
Both Liu and Ren are analogous to the claimed invention because they both use encoder-decoder neural networks for image processing. It would have been obvious to a person of ordinary skill before the effective filing date of the claimed invention to incorporate the trimmed subset of Ren into the image editing pipeline of Liu. The suggestion/motivation for doing so would have been for improved projection efficiency, as suggested by Ren at Col. 8, lines 52-53, These observations indicate that incorporating mask embedding input significantly improves the features projection efficiency.
This method of improving Liu was within the ordinary ability of one of ordinary skill in the art based on the teachings of Ren.
Therefore, it would have been obvious to one of ordinary skill in the art, before the effective filing date, to modify Liu with the teachings of Ren to obtain the invention as specified in claim 10.
Regarding claim 11, in which claim 10 is incorporated, Liu discloses wherein determining the modified latent feature vector comprises: determining an image mask indicating the masked portion of the digital image (Fig. 2; Page 3, §3.1 Image encoder, We also create a 𝐻im ×𝑊im binary mask indicating image regions that must be replaced according to the semantic map edits. Examiner considers the binary mask and semantic map to indicate the “masked portion of the digital image”); and determining, from the image mask, the subset of patches of the digital image corresponding to the masked portion by determining one or more patches within a boundary of the masked portion (Page 3, §3.1 Image encoder, We also create a 𝐻im ×𝑊im binary mask indicating image regions that must be replaced according to the semantic map edits. Examiner considers the creation of the binary mask to imply determining which patches include the masked portion).
Regarding claim 17, Liu discloses a non-transitory computer readable medium (Claim 16, A non-transitory computer readable medium) storing instructions thereon that, when executed by at least one processor (Page 10, §Comparison with full attention, Based on an NVIDIA A100 (40GB VRAM). The A100 is a GPU which is a processor), cause the at least one processor to perform operations comprising: generating, utilizing an encoder neural network, a latent feature vector of a digital image by encoding global context information of the digital image into the latent feature vector (Page 5, Eqn. 2, §Transformer encoder, Specifically, for each position in the sequence, three 𝑑-dimensional learned embeddings are produced: (i) an image embedding 𝐸im(𝑥𝑙 ) representing the token 𝑥𝑙 at position 𝑙 in our sequence and in turn the corresponding RGB image region, (ii) an embedding 𝐸map(𝑝𝑙 ) of the semantic token 𝑝𝑙 at the same position, and finally (iii) a positional embedding 𝐸pos(𝑙 ) for that position 𝑙; §Transformer Decoder, with the help of the global context obtained through the transformer encoder); [determining a modified latent feature vector by trimming the latent feature vector to a feature subset corresponding to a masked portion of the digital image; generating, utilizing a generative decoder neural network on the modified latent feature vector, digital image data corresponding to the masked portion of the digital image]; and generating a modified digital image including the digital image data corresponding to the masked portion combined with additional portions of the digital image (Page 5, §3.3 Image decoder, To avoid this, we follow the same strategy as SESAME and retain only the generated pixels in the
masked regions while the rest of the image is retrieved from the original image).
However, Liu fails to explicitly disclose determining a modified latent feature vector by trimming the latent feature vector to a feature subset corresponding to a masked portion of the digital image; generating, utilizing a generative decoder neural network on the modified latent feature vector, digital image data corresponding to the masked portion of the digital image.
Ren teaches determining a modified latent feature vector by trimming the latent feature vector to a feature subset corresponding to a masked portion of the digital image (Col. 8, lines 5-10, The image synthesis process with mask embedding represented in FIG. 2 may sample a mask constraint point in a lowest resolution manifold, e.g., by locating a correct partition via mask embedding input 208 and sampling a point within that partition via a latent features vector 210. Examiner considers the sampling of the latent features via the mask embedding as “trimming to a feature subset”); generating, utilizing a generative decoder neural network on the modified latent feature vector (Fig. 3, #300; Col. 8, line 55 – Col. 9, line 20, FIG. 3 is a diagram illustrating an example generator 300… The latent features vector may be a 100-dimensional vector thus the input of the latent projection path may be a 132-dimensional vector; Examiner considers the latent projection path in Fig. 3 as the “generative decoder neural network”), digital image data corresponding to the masked portion of the digital image (Col. 9, lines 28-29, The output of the projection layer may be the synthesized image).
Both Liu and Ren are analogous to the claimed invention because they both use encoder-decoder neural networks for image processing. It would have been obvious to a person of ordinary skill before the effective filing date of the claimed invention to incorporate the trimmed latent feature vector and masked image output of Ren into the image editing pipeline of Liu. The suggestion/motivation for including the latent features of Ren would have been for improved projection efficiency, as suggested by Ren at Col. 8, lines 52-53, These observations indicate that incorporating mask embedding input significantly improves the features projection efficiency. The suggestion/motivation for combining the output of the masked image region of Ren with the original image in Liu would have been for preventing minor edits in the unmasked region, as suggested by Liu at Page 5, §3.3 Image decoder, the reconstruction of the encoder-decoder pair is not perfect and leads to minor changes in the areas that are not edited. To avoid this, we … retain only the generated pixels in the masked regions while the rest of the image is retrieved from the original image.
This method of improving Liu was within the ordinary ability of one of ordinary skill in the art based on the teachings of Ren.
Therefore, it would have been obvious to one of ordinary skill in the art, before the effective filing date, to modify Liu with the teachings of Ren to obtain the invention as specified in claim 17.
Regarding claim 18, in which claim 17 is incorporated, Liu discloses generating the latent feature vector comprises utilizing a transformer-based encoder neural network to extract a plurality of tokens representing patches of the digital image (Fig. 2, “Patch feature tokens”; Page 3, §3.1 Image Encoder, The input RGB image X of size 𝐻im ×𝑊im × 3 is processed by a convolutional encoder resulting in a feature map F of size 𝐻im/16 ×𝑊im/16 ×𝑑 … The feature map F is subsequently quantized following VQGAN with the help of a learned codebook Z, i.e., each feature map entry f𝑖, 𝑗 at position (𝑖, 𝑗 ) in F is mapped to the closest codebook entry ˆf𝑖, 𝑗 = arg minz𝜅 ∈Z ||f𝑖, 𝑗 − z𝜅 ||, where {z𝜅 }|Z|𝜅=1 are codebook entries
with dimensionality 𝑑. Examiner considers the convolutional encoder to extract 16x16 pixel patches and mapping of patches to codebook entries as “extracting tokens”); and [determining the modified latent feature vector comprises trimming the latent feature vector to a set] of tokens (Codebook entries as described above) [representing patches corresponding to the masked portion of the digital image].
However, Liu fails to explicitly disclose determining the modified latent feature vector comprises trimming the latent feature vector to a set representing patches corresponding to the masked portion of the digital image.
Ren teaches determining the modified latent feature vector comprises trimming the latent feature vector to a set representing patches corresponding to the masked portion of the digital image (Col. 8, lines 5-10, The image synthesis process with mask embedding represented in FIG. 2 may sample a mask constraint point in a lowest resolution manifold, e.g., by locating a correct partition via mask embedding input 208 and sampling a point within that partition via a latent features vector 210. Examiner considers the sampling of the latent features via the mask embedding as “trimming to a feature subset”).
Both Liu and Ren are analogous to the claimed invention because they both use encoder-decoder neural networks for image processing. It would have been obvious to a person of ordinary skill before the effective filing date of the claimed invention to incorporate the trimmed subset of Ren into the image editing pipeline of Liu. The suggestion/motivation for doing so would have been for improved projection efficiency, as suggested by Ren at Col. 8, lines 52-53, These observations indicate that incorporating mask embedding input significantly improves the features projection efficiency.
This method of improving Liu was within the ordinary ability of one of ordinary skill in the art based on the teachings of Ren.
Therefore, it would have been obvious to one of ordinary skill in the art, before the effective filing date, to modify Liu with the teachings of Ren to obtain the invention as specified in claim 18.
Regarding claim 19, in which claim 17 is incorporated, Liu discloses wherein generating the digital image data comprises generating, utilizing a transformer-based decoder neural network (Page 4, §3.2 Autoregressive transformer, used by the transformer decoder to generate new codebook indices autoregressively; Page 5, §Transformer Decoder), a modified feature set from the feature subset corresponding to the masked portion of the digital image (Page 5, §Transformer Decoder, Specifically, the decoder predicts 𝑝(X𝑙 |{𝜒<𝑙 }), where X𝑙 is a categorical random variable representing a codebook index to be generated at position
𝑙 in the sequence and {𝜒<𝑙 } are all indices of the previous steps. We note that the tokens corresponding to unmasked image regions (i.e., image regions to be preserved) are set to the original image codebook indices. We predict the distributions only for positions corresponding to the edited image regions. Examiner considers the predicted distribution to be the “modified feature set”).
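The autoregressive prediction quoted above, distributions predicted only at edited positions with unmasked positions held at their original codebook indices, can be sketched as follows. The random "logit" vector is an Examiner-supplied stand-in for the transformer decoder's p(X_l | x_<l), and k = 4 here mirrors the top-k sampling Liu performs with k = 100.

```python
import numpy as np

rng = np.random.default_rng(4)
vocab = 32
tokens = rng.integers(0, vocab, size=16)     # flattened token sequence
original = tokens.copy()
edited = np.zeros(16, dtype=bool)
edited[5:8] = True                           # positions to be regenerated

k = 4
for pos in np.flatnonzero(edited):
    logits = rng.normal(size=vocab)              # stand-in for the decoder's prediction
    topk = np.argpartition(logits, -k)[-k:]      # keep the k most likely indices
    probs = np.exp(logits[topk]) / np.exp(logits[topk]).sum()
    tokens[pos] = rng.choice(topk, p=probs)      # sample a new codebook index
```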
Claims 7 and 14-15 are rejected under 35 U.S.C. 103 as being unpatentable over Liu et al. (Liu, Difan, et al. "Asset: autoregressive semantic scene editing with transformers at high resolutions." ACM Transactions on Graphics (TOG) 41.4 (2022): 1-12) (hereafter, “Liu”) in view of Ren et al. (US 11,580,673) (hereafter, “Ren”) as applied to claims 1-4, 6, 8, 10-11, and 17-19 above, and further in view of Issenhuth et al. (Issenhuth, Thibaut, et al. "Edibert, a generative model for image editing." arXiv preprint arXiv:2111.15264 (2021)) (hereafter, “Issenhuth”).
Regarding claim 7, Liu in view of Ren discloses the computer-implemented method of claim 1.
However, Liu fails to explicitly disclose wherein determining the modified latent feature vector comprises: generating noise features representing an input noise comprising a size and a shape corresponding to the masked portion of the digital image; and generating the digital image data utilizing the generative decoder neural network based on the noise features representing the input noise with the modified latent feature vector.
Issenhuth teaches wherein determining the modified latent feature vector comprises: generating noise features representing an input noise comprising a size and a shape corresponding to the masked portion of the digital image (Page 10, To delete the information
contained in the mask, all the tokens within the mask are given random values. Examiner considers the tokens as the entries of the “latent feature vector” and giving the tokens random values as “generating input noise”. Since the noise is only within the mask, it must have the size and shape of the mask); and generating the digital image data utilizing the generative decoder neural network based on the noise features representing the input noise with the modified latent feature vector (Page 11, Fig. 6; Page 12, Table 2; Page 12, Table 2 caption, Ablation study on the components of EdiBERT sampling algorithm. Fig. 6 and table 2 show results of ablation studies, including the randomization component which refers to the mask shaped noise. Examiner considers the ablation study to imply image data generation).
Liu, Ren, and Issenhuth are analogous to the claimed invention because they are all in the field of applying encoder-decoder models for image processing. It would have been obvious to a person of ordinary skill before the effective filing date of the claimed invention to incorporate the noise features of Issenhuth into the image editing pipeline of Liu and the trimmed latent feature vector and masked image output of Ren. The suggestion/motivation for doing so would have been for flexibility in training data sets, as suggested by Issenhuth at Page 12, §5 Discussions, One of the key elements of the proposed method is that it does not require having access to paired datasets.
This method of improving Liu was within the ordinary ability of one of ordinary skill in the art based on the teachings of Ren and Issenhuth.
Therefore, it would have been obvious to one of ordinary skill in the art, before the effective filing date, to modify Liu and the teachings of Ren with the teachings of Issenhuth to obtain the invention as specified in claim 7.
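The randomization relied on from Issenhuth, giving all tokens within the mask random values so the noise necessarily has the size and shape of the mask, can be sketched as follows. The grid size and codebook size are Examiner-supplied toy values.

```python
import numpy as np

rng = np.random.default_rng(3)
codebook_size = 32
tokens = rng.integers(0, codebook_size, size=(4, 4))  # token grid of the input image
edit_mask = np.zeros((4, 4), dtype=bool)
edit_mask[0:2, 0:2] = True                            # masked region to be resampled

# Only masked positions are randomized, so the noise has the size and shape
# of the mask; tokens outside the mask are left untouched.
noised = tokens.copy()
noised[edit_mask] = rng.integers(0, codebook_size, size=edit_mask.sum())
```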
Regarding claim 14, in which claim 10 is incorporated, Liu discloses wherein generating the digital image data comprises generating, utilizing the transformer-based generative decoder neural network (Page 4, §3.2 Autoregressive transformer, used by the transformer decoder to generate new codebook indices autoregressively; Page 5, §Transformer Decoder), a set of modified tokens corresponding to the masked portion of the digital image (Page 5, §Transformer Decoder, Specifically, the decoder predicts 𝑝(X𝑙 |{𝜒<𝑙 }), where X𝑙 is a categorical random variable representing a codebook index to be generated at position 𝑙 in the sequence and {𝜒<𝑙 } are all indices of the previous steps. We note that the tokens corresponding to unmasked image regions (i.e., image regions to be preserved) are set to the original image codebook indices. We predict the distributions only for positions corresponding to the edited image regions. Examiner considers the predicted distribution to be the “set of modified tokens”) based on [the feature subset of the modified latent feature vector with noise features corresponding to the masked portion].
However, Liu fails to explicitly disclose the feature subset of the modified latent feature vector with noise features corresponding to the masked portion.
Issenhuth teaches the feature subset of the modified latent feature vector with noise features corresponding to the masked portion (Page 10: "To delete the information contained in the mask, all the tokens within the mask are given random values." Examiner considers the tokens as the entries of the "latent feature vector" and the tokens with random values as the "noise features").
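For illustration only, the token-randomization mechanism cited above can be sketched as follows. This is an illustrative example with hypothetical names and shapes; it is not code from any cited reference. Tokens under the edit mask are overwritten with random codebook indices (the "noise features"), while unmasked tokens are preserved.

```python
import numpy as np

def randomize_masked_tokens(tokens, mask, codebook_size, seed=None):
    """Replace every token under the mask with a random codebook index,
    deleting the original information in the masked region."""
    rng = np.random.default_rng(seed)
    out = tokens.copy()
    out[mask] = rng.integers(0, codebook_size, size=int(mask.sum()))
    return out

# 4x4 grid of codebook indices; mask the upper-left 2x2 block for editing.
tokens = np.arange(16).reshape(4, 4)
mask = np.zeros((4, 4), dtype=bool)
mask[:2, :2] = True
noised = randomize_masked_tokens(tokens, mask, codebook_size=1024, seed=0)
# Unmasked entries are preserved; masked entries become noise features.
assert np.array_equal(noised[~mask], tokens[~mask])
```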
Liu, Ren, and Issenhuth are analogous to the claimed invention because they are all in the field of applying encoder-decoder models for image processing. It would have been obvious to a person of ordinary skill before the effective filing date of the claimed invention to incorporate the noise features of Issenhuth into the image editing pipeline of Liu, as modified by the trimmed latent feature vector and masked image output of Ren. The suggestion/motivation for doing so would have been flexibility in training data sets, as suggested by Issenhuth at Page 12, §5 Discussions: "One of the key elements of the proposed method is that it does not require having access to paired datasets."
This method of improving Liu was within the ordinary ability of one of ordinary skill in the art based on the teachings of Ren and Issenhuth.
Therefore, it would have been obvious to one of ordinary skill in the art, before the effective filing date, to modify Liu and the teachings of Ren with the teachings of Issenhuth to obtain the invention as specified in claim 14.
Regarding claim 15, in which claim 14 is incorporated, Liu discloses wherein combining the digital image data with the additional subset of patches comprises: determining an additional set of tokens corresponding to the additional subset of patches of the digital image from the latent feature vector in a latent image space (Page 5, §Transformer Decoder, We note that the tokens corresponding to unmasked image regions… Examiner considers the unmasked image regions the "additional subset of patches"); determining a latent composite image by combining the set of modified tokens with the additional set of tokens in the latent image space (Page 5, §Transformer Decoder, set to the original image codebook indices. We predict the distributions only for positions corresponding to the edited image regions… Based on the predicted distribution of codebook indices, we use top-k sampling [Esser et al. 2021a; Holtzman et al. 2019] (𝑘 = 100 in our experiments) to create multiple candidate output sequences, each of which can be mapped to a new image by the image decoder. Since the final sequence of tokens can be used to generate an image, Examiner considers this to imply the combination of the masked and unmasked token sets); and generating the modified digital image utilizing a latent decoder neural network on the latent composite image (Page 5, §3.3 Image decoder, The image decoder takes as input the quantized feature map and decodes an RGB image).
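For illustration only, the combination implied above, in which predicted tokens fill the edited region while the original codebook indices fill the preserved region, can be sketched as follows. This is a minimal example with hypothetical names; it is not code from any cited reference.

```python
import numpy as np

def compose_latent(original_tokens, modified_tokens, mask):
    """Latent composite: modified tokens inside the edited (masked) region,
    original codebook indices everywhere outside it."""
    composite = original_tokens.copy()
    composite[mask] = modified_tokens[mask]
    return composite

original = np.arange(16).reshape(4, 4)   # tokens of the input image
modified = np.full((4, 4), 99)           # tokens sampled by the decoder
mask = np.zeros((4, 4), dtype=bool)
mask[1:3, 1:3] = True                    # edited image region
composite = compose_latent(original, modified, mask)
# Preserved region keeps original indices; edited region takes new tokens.
assert np.array_equal(composite[~mask], original[~mask])
assert (composite[mask] == 99).all()
```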
Claims 9, 16, and 20 are rejected under 35 U.S.C. 103 as being unpatentable over Liu et al. (Liu, Difan, et al. "Asset: autoregressive semantic scene editing with transformers at high resolutions." ACM Transactions on Graphics (TOG) 41.4 (2022): 1-12) (hereafter, “Liu”) in view of Ren et al. (US 11,580,673) (hereafter, “Ren”) as applied to claims 1-4, 6, 8, 10-11, and 17-19 above, and further in view of Chen et al. (Chen, Shoufa, et al. "GenTron: Delving Deep into Diffusion Transformers for Image and Video Generation." arXiv preprint arXiv:2312.04557v1 (2023)) (hereafter, “Chen”).
Regarding claim 9, in which claim 8 is incorporated, Liu discloses generating the digital image data comprises generating a set of modified tokens representing an object for the masked portion (Page 5, §Transformer Decoder, Specifically, the decoder predicts 𝑝(X𝑙 |{𝜒<𝑙 }), where X𝑙 is a categorical random variable representing a codebook index to be generated at position 𝑙 in the sequence and {𝜒<𝑙 } are all indices of the previous steps. We note that the tokens corresponding to unmasked image regions (i.e., image regions to be preserved) are set to the original image codebook indices. We predict the distributions only for positions corresponding to the edited image regions. Examiner considers each X𝑙 as a token and, therefore, the predicted distribution as a set of modified tokens); [and generating the latent composite image comprises mapping the set of modified tokens into the latent image domain utilizing a linear neural network layer].
However, Liu fails to explicitly disclose generating the latent composite image comprises mapping the set of modified tokens into the latent image domain utilizing a linear neural network layer.
Chen teaches generating the latent composite image comprises mapping the set of modified tokens into the latent image domain utilizing a linear neural network layer (Page 4, left column, first paragraph, Finally, a standard linear decoder is applied to convert these image tokens into latent space).
Liu, Ren, and Chen are analogous to the claimed invention because they are all in the field of applying encoder-decoder models for image processing. It would have been obvious to a person of ordinary skill before the effective filing date of the claimed invention to incorporate the linear decoder of Chen into the image editing pipeline of Liu, as modified by the trimmed latent feature vector and masked image output of Ren. The suggestion/motivation for doing so would have been to enhance visual quality, as suggested by Chen at Page 8, §5. Conclusion: "This innovative approach has demonstrably enhanced the visual quality of generated videos."
This method of improving Liu was within the ordinary ability of one of ordinary skill in the art based on the teachings of Ren and Chen.
Therefore, it would have been obvious to one of ordinary skill in the art, before the effective filing date, to modify Liu and the teachings of Ren with the teachings of Chen to obtain the invention as specified in claim 9.
Regarding claim 16, in which claim 10 is incorporated, Liu discloses wherein generating the digital image data comprises: [determining a text prompt] indicating an object to generate within the masked portion of the digital image (Page 3, §3 METHOD, the user paints some desired changes on the label map, e.g., replace mountain regions with water); and determining, utilizing the transformer-based generative decoder neural network (Page 4, §3.2 Autoregressive transformer, used by the transformer decoder to generate new codebook indices autoregressively; Page 5, §Transformer Decoder), the modified latent feature vector, based on the feature subset representing the subset of patches of the digital image (Page 5, §Transformer Decoder, Specifically, the decoder predicts 𝑝(X𝑙 |{𝜒<𝑙 }), where X𝑙 is a categorical random variable representing a codebook index to be generated at position 𝑙 in the sequence and {𝜒<𝑙 } are all indices of the previous steps. We note that the tokens corresponding to unmasked image regions (i.e., image regions to be preserved) are set to the original image codebook indices. We predict the distributions only for positions corresponding to the edited image regions. Examiner considers the predicted distribution to be the "modified latent feature vector". Only predicting tokens in the edited image regions corresponds to the subset of patches referred to by the user input prompt) and [the text prompt].
However, Liu fails to explicitly disclose determining a text prompt.
Chen teaches determining a text prompt (Page 4, §Text encoder model, Current advancements in T2I (Text 2 Image) diffusion techniques employ a variety of language models, each with its unique strengths and limitations. To thoroughly assess which model best complements transformer based diffusion methods, we have integrated several models into GenTron. Examiner considers utilizing a text encoder model implies determining a text prompt for input to the model).
Liu, Ren, and Chen are analogous to the claimed invention because they are all in the field of applying encoder-decoder models for image processing. It would have been obvious to a person of ordinary skill before the effective filing date of the claimed invention to incorporate the text prompt determination of Chen into the image editing pipeline of Liu, as modified by the trimmed latent feature vector and masked image output of Ren. The suggestion/motivation for doing so would have been to enhance visual quality, as suggested by Chen at Page 8, §5. Conclusion: "This innovative approach has demonstrably enhanced the visual quality of generated videos."
This method of improving Liu was within the ordinary ability of one of ordinary skill in the art based on the teachings of Ren and Chen.
Therefore, it would have been obvious to one of ordinary skill in the art, before the effective filing date, to modify Liu and the teachings of Ren with the teachings of Chen to obtain the invention as specified in claim 16.
Regarding claim 20, in which claim 19 is incorporated, Liu discloses wherein generating the modified digital image comprises: [mapping the modified feature set into a latent image domain utilizing a linear neural network layer]; generating a latent composite image by combining the modified feature set in the latent image domain with an additional feature set corresponding to a portion of the digital image outside the masked portion (Page 5, §Transformer Decoder, set to the original image codebook indices. We predict the distributions only for positions corresponding to the edited image regions… Based on the predicted distribution of codebook indices, we use top-k sampling [Esser et al. 2021a; Holtzman et al. 2019] (𝑘 = 100 in our experiments) to create multiple candidate output sequences, each of which can be mapped to a new image by the image decoder. Since the final sequence of tokens can be used to generate an image, Examiner considers this to imply the combination of the masked and unmasked token sets); and generating, utilizing a latent decoder neural network, the modified digital image from the latent composite image (Page 5, §3.3 Image decoder, The image decoder takes as input the quantized feature map and decodes an RGB image).
However, Liu fails to explicitly disclose mapping the modified feature set into a latent image domain utilizing a linear neural network layer.
Chen teaches mapping the modified feature set into a latent image domain utilizing a linear neural network layer (Page 4, left column, first paragraph, Finally, a standard linear decoder is applied to convert these image tokens into latent space).
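For illustration only, the linear mapping relied on from Chen, i.e., a single affine layer converting per-token features into the latent image domain, can be sketched as follows. Shapes and names are hypothetical; this is not code from any cited reference.

```python
import numpy as np

rng = np.random.default_rng(0)
num_tokens, token_dim, latent_dim = 16, 8, 4

token_feats = rng.standard_normal((num_tokens, token_dim))  # modified token set
W = rng.standard_normal((token_dim, latent_dim))            # linear layer weights
b = np.zeros(latent_dim)                                    # linear layer bias

# "Standard linear decoder": one affine map applied to every token,
# producing a per-token vector in the latent image domain.
latent = token_feats @ W + b
assert latent.shape == (num_tokens, latent_dim)
```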
Liu, Ren, and Chen are analogous to the claimed invention because they are all in the field of applying encoder-decoder models for image processing. It would have been obvious to a person of ordinary skill before the effective filing date of the claimed invention to incorporate the linear decoder of Chen into the image editing pipeline of Liu, as modified by the trimmed latent feature vector and masked image output of Ren. The suggestion/motivation for doing so would have been to enhance visual quality, as suggested by Chen at Page 8, §5. Conclusion: "This innovative approach has demonstrably enhanced the visual quality of generated videos."
This method of improving Liu was within the ordinary ability of one of ordinary skill in the art based on the teachings of Ren and Chen.
Therefore, it would have been obvious to one of ordinary skill in the art, before the effective filing date, to modify Liu and the teachings of Ren with the teachings of Chen to obtain the invention as specified in claim 20.
Allowable Subject Matter
Claims 5, 12, and 13 are objected to as being dependent upon a rejected base claim, but would be allowable if rewritten in independent form including all of the limitations of the base claim and any intervening claims.
The following is a statement of reasons for the indication of allowable subject matter:
Regarding claim 5, Ren discloses trimming the latent feature vector based on a mask (Col. 8, lines 5-10, The image synthesis process with mask embedding represented in FIG. 2 may sample a mask constraint point in a lowest resolution manifold, e.g., by locating a correct partition via mask embedding input 208 and sampling a point within that partition via a latent features vector 210), but does not disclose determining the latent feature vector based on additional patches that comprise additional contextual information.
Regarding claim 12, Liu discloses determining whether patches are within the masked region (Page 3, §3.1 Image encoder, We also create a 𝐻im ×𝑊im binary mask indicating image regions that must be replaced according to the semantic map edits), but does not disclose determining which portion comprises additional contextual information.
Regarding claim 13, claim 13 depends on objected-to claim 12. Therefore, by virtue of that dependency, claim 13 is also objected to as containing allowable subject matter.
Conclusion
The prior art made of record and not relied upon is considered pertinent to applicant's disclosure.
Park et al. (Park, Dong Huk, et al. "Shape-guided diffusion with inside-outside attention." Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision. 2024) discloses an image editing neural network based on attention masks and transformer architecture (Fig. 3; Page 4188, §3. Shape-Guided Diffusion, We build upon Stable Diffusion (SD)).
Steiner et al. (US 2025/0117893) discloses using partial subsets of tokens corresponding to a mask in a network model for image editing (¶0293, For example, a representation can be aggregated over a subset of the sequence of tokens. For instance, a representation can be aggregated over a subset of tokens that correspond to a center of an input image).
Dupont et al. (US 2025/0168368) discloses using a masked latent feature set in an encoder-decoder neural network (¶0024, The causal subset may, for example, be determined by applying a causal mask to the latent values of the grid).
Any inquiry concerning this communication or earlier communications from the examiner should be directed to XIAOMAO DING whose telephone number is (571)272-7237. The examiner can normally be reached Mon-Fri 8:00-4:00.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Henok Shiferaw can be reached at (571) 272-4637. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.
/XIAOMAO DING/Examiner, Art Unit 2676
/Henok Shiferaw/Supervisory Patent Examiner, Art Unit 2676