Prosecution Insights
Last updated: April 19, 2026
Application No. 18/749,438

IMAGE STYLE TRANSFER

Non-Final OA: §102, §103, §112
Filed: Jun 20, 2024
Examiner: BEUTEL, WILLIAM A
Art Unit: 2616
Tech Center: 2600 — Communications
Assignee: BEIJING BAIDU NETCOM SCIENCE TECHNOLOGY CO., LTD.
OA Round: 1 (Non-Final)
Grant Probability: 70% (Favorable)
Expected OA Rounds: 1-2
Time to Grant: 2y 7m
With Interview: 90%

Examiner Intelligence

Career Allow Rate: 70%, above average (328 granted / 469 resolved; +7.9% vs TC avg)
Interview Lift: +20.4% across resolved cases with an interview (strong)
Typical Timeline: 2y 7m average prosecution; 28 applications currently pending
Career History: 497 total applications across all art units
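
A quick sanity check of how these headline figures fit together (a minimal sketch; it assumes the dashboard computes the allow rate as grants over resolved cases and reports deltas in percentage points):

    # Sketch only: assumes the figures above are derived as shown.
    granted, resolved, total_apps = 328, 469, 497

    allow_rate = granted / resolved          # 0.699... -> displayed as 70%
    tc_avg_estimate = allow_rate - 0.079     # "+7.9% vs TC avg" implies a ~62% Tech Center average
    pending = total_apps - resolved          # 497 - 469 = 28 currently pending

    print(f"{allow_rate:.1%} allow rate, ~{tc_avg_estimate:.0%} TC avg, {pending} pending")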

Statute-Specific Performance

§101: 9.9% (-30.1% vs TC avg)
§103: 49.8% (+9.8% vs TC avg)
§102: 10.7% (-29.3% vs TC avg)
§112: 22.0% (-18.0% vs TC avg)
Tech Center averages are estimates. Based on career data from 469 resolved cases.

Office Action

§102 §103 §112
DETAILED ACTION

Notice of Pre-AIA or AIA Status

The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA. In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis (i.e., changing from AIA to pre-AIA) for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.

Claim Rejections - 35 USC § 112

The following is a quotation of 35 U.S.C. 112(b):

(b) CONCLUSION.—The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the inventor or a joint inventor regards as the invention.

The following is a quotation of 35 U.S.C. 112 (pre-AIA), second paragraph:

The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the applicant regards as his invention.

Claims 1-20 are rejected under 35 U.S.C. 112(b) or 35 U.S.C. 112 (pre-AIA), second paragraph, as being indefinite for failing to particularly point out and distinctly claim the subject matter which the inventor or a joint inventor (or, for applications subject to pre-AIA 35 U.S.C. 112, the applicant) regards as the invention.

Regarding claim 1, the claim recites "calculating a first cross-attention feature of a first image feature and the text feature, wherein [...] the first image feature in each of a second time step and subsequent time steps is a result image feature generated in a previous time step" in lines 10-13. The claim then recites "generating a result image feature of the time step based on the third cross-attention feature and the text feature" in lines 18-19 of the claim. Because "a result image feature" is introduced in multiple steps, the claim is rendered indefinite, as there is no clear understanding of how the parts are related. By using the same term in both places, the claim is indefinite as to which "result image feature" is used in line 12 and how it relates to the subsequently introduced "result image feature" in line 18. These could be the same result image feature or different result image features, and the claim furthermore does not clearly link the steps so as to identify whether the result image feature generated in line 18 is intended to be used in a subsequent time step, or is merely an additional result image feature, with some other result image feature intended for use in the calculating step. The language requires clarification in order to meet the definiteness requirement of 35 U.S.C. 112(b). For purposes of interpretation, the recited features are interpreted as any image feature.

Further regarding claim 1, the claim recites "obtaining a second cross-attention feature of a second image feature of the reference image and the text feature" in lines 14-15 of the claim. The limitation "a second image feature of the reference image" implies that the method uses a first image feature of the reference image, but the claim does not recite as such. Instead, the claim states that the first image feature is of an initial image. It is common for image diffusion models to take a noisy image as input and then, over a series of steps, reduce the noise in that same image by iterating the diffusion model, where the output is a reduced-noise version of the noisy image that is then used again as input (which also appears to be what applicant's specification is directed to; see ¶19 of the PG-Pub of applicant's specification). As this claim is currently drafted, it seems to imply that the "initial image" could be a noisy image as opposed to the reference image. Accordingly, the claim is indefinite as it is unclear whether the initial image is the same as the reference image, in which case the claim should use the same language (i.e., replace "initial image" with "reference image"), or whether two different images are intended, namely an initial image and a reference image from which two separate image features are obtained and used. For purposes of interpretation, the language is interpreted as covering any images, the same or different. Furthermore, in "obtaining a second cross-attention feature of a second image of the reference image and the text image," it is unclear whether the "second cross-attention feature" is only "of a second image of the reference image," or is intended as something related to "of a second image ... and the text image." Accordingly, claim 1 is rendered indefinite.

Independent claims 10 and 19 include the same indefinite language as recited by claim 1 and are therefore rejected based on the same rationale as claim 1 set forth above. Furthermore, claims 2-9, 11-18 and 20 depend from claims 1, 10 and 19, respectively, incorporating the same indefinite language by reference without further clarifying the scope of the claims. As such, the dependent claims are rejected based on the same rationale as their parent claims (i.e., for the reasons set forth for claim 1 above).

Claim Rejections - 35 USC § 102

The following is a quotation of the appropriate paragraphs of 35 U.S.C. 102 that form the basis for the rejections under this section made in this Office action:

A person shall be entitled to a patent unless – (a)(2) the claimed invention was described in a patent issued under section 151, or in an application for patent published or deemed published under section 122(b), in which the patent or application, as the case may be, names another inventor and was effectively filed before the effective filing date of the claimed invention.

Claim(s) 1, 8, 10, 17, and 19 is/are rejected under 35 U.S.C. 102(a)(2) as being anticipated by Zhou et al. (US 2025/0329079 A1).

Regarding claim 1, Zhou discloses: An electronic device (Zhou, Abstract and ¶45 and ¶81), comprising: A processor (Zhou, ¶45: processing unit; ¶49: system including a processor executing a set of codes to control functional elements of an apparatus); and A memory communicatively connected to the processor, (Zhou, Fig.
7 and ¶45: memory; ¶83: processor unit 705 is configured to execute computer-readable instructions stored in a memory to perform various functions; Also ¶86 disclosing memory unit as RAM/ROM and storing computer-readable, computer-executable software including instructions that, when executed, cause a processor to perform various functions) Wherein the memory stores instructions executable by the processor, and the instructions, when executed by the processor, cause the processor to perform operations comprising: (Zhou, ¶83: processor unit 705 is configured to execute computer-readable instructions stored in a memory to perform various functions; Also ¶86 disclosing memory unit as RAM/ROM and storing computer-readable, computer-executable software including instructions that, when executed, cause a processor to perform various functions) Obtaining a reference image and a description text, (Zhou, Fig. 3 and ¶58: user provides an input image and a text prompt; also ¶159) wherein the description text comprises a content description text describing content of the reference image and a style description text describing a style of a target image to be generated; (Zhou, Fig. 3 and ¶58: text prompt states “I want to generate something creative with this person, with post-impressionistic painting style, can you help me?” – i.e. content description = “person” and style description = “post-impressionistic painting style”; ¶102: machine learning model 720 obtains an input image and a text prompt. In some examples, machine learning model 720 obtains a reference image, where the synthetic image is generated based on the reference image) Extracting a text feature of the description text (Zhou, ¶62: machine learning model 410 analyzes text prompt 405 and input image 400 to obtain useful information, i.e. 
a text embedding of text prompt 405; ¶72: system extracts semantic features from the input image and the textual information from the text prompt to generate an output embedding; ¶147: system converts text prompt 960 (or other guidance) into a conditional guidance vector or other multi-dimensional representation); and Performing the following operations based on a pre-trained diffusion model to generate the target image: (Zhou, ¶106: image generation model 730 includes a diffusion model; ¶151: neural network trained for reverse diffusion process) in each time step of the diffusion model: Calculating a first cross-attention feature of a first image feature and the text feature, wherein the first image feature in a first time step is an image feature of an initial image, and the first image feature in each of a second time step and subsequent time steps is a result image feature generated in a previous time step; (Zhou, ¶128: language model 840 generates two output embeddings: a first output embedding e; and a second output embedding s; based on the projected image embedding and the text embedding, two output embeddings include an inference of a user intention based on input image 805 and text prompt 820; ¶140: guidance feature 970 can be combined with the noisy feature 935 using a cross-attention block within the reverse diffusion process 940; ¶¶141-143 disclosing cross-attention attending to multiple parts of an input sequence, capturing the interactions and dependencies between different elements; ¶146 discloses using input features to combine with intermediate features; ¶151: At each step t−1, the reverse diffusion process 940 takes xt, such as the first intermediate image, and t as input, and reverse diffusion process 940 outputs xt-1, such as the second intermediate image iteratively until xT is reverted back to x0 the original image 905 – i.e. 
iterative on each image from a first image as in well-known standard diffusion models that reduce noise of an input image until a final image is obtained) Obtaining a second cross-attention feature of a second image feature of the reference image and the text feature (Zhou, ¶128: language model 840 generates two output embeddings: a first output embedding e; and a second output embedding s; based on the projected image embedding and the text embedding, where the two output embeddings include an inference of a user intention based on input image 805 and text prompt 820, and where the second output embedding sj is provided to guidance projection layer 845 of language generation model 825 to generate guidance embedding 855; ¶¶141-143 disclosing cross-attention attending to multiple parts of an input sequence, capturing the interactions and dependencies between different elements; note: multiple includes at least a second); Editing the first cross-attention feature based on the second cross-attention feature to obtain a third cross-attention feature (Zhou, ¶129: image generation model 860 generates synthetic image 865 based on image embedding 815 and guidance embedding 855, where image embedding 815 captures fine-grained detailed features such as edges, textures, colors, shapes, patterns, contours, and intensity gradients of the object (e.g., corgi) depicted in input image 805, and guidance embedding 855 captures high-level semantic information of input image 805 and text prompt 820, and by combining image embedding 815 and guidance embedding 855, synthetic image 865 includes both high-level information of input image 805 and low-level details (such as identity) of input image 805; ¶¶142-143: The cross-attention block calculates attention scores by measuring the similarity between each query representation and the key representations, where a higher similarity indicates that more attention is given to a key element, and the cross-attention block then normalizes the attention scores to obtain attention weights (for example, using a softmax function), where the attention weights determine how much information from each value element is incorporated into the final attended representation – i.e. weighting based on the obtained cross-attention scores of multiple parts); and Decoding a result image feature of a last time step to generate the target image (Zhou, ¶70: synthetic image preserves the identity of the object (e.g., a person, animal, etc.) from the input image with modifications described by the text prompt; ¶138: image decoder 950 decodes the denoised image feature 945 to obtain an output image 955 in pixel space 910).

Regarding claim 10, the electronic device of claim 1 performs the method of claim 10, and as such claim 10 is rejected based on the same rationale as claim 1 set forth above.

Regarding claim 19, Zhou discloses: A non-transitory computer readable storage medium storing computer instructions, wherein the computer instructions are configured to enable a computer to perform operations (Zhou, Abstract; ¶83: processor unit 705 is configured to execute computer-readable instructions stored in a memory to perform various functions; also ¶86 disclosing the memory unit as RAM/ROM storing computer-readable, computer-executable software including instructions that, when executed, cause a processor to perform various functions). Further regarding claim 19, the operations perform the method of claim 1, and as such claim 19 is further rejected based on the same rationale as claim 1 set forth above.
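
For orientation, the passages cited above describe cross-attention as query-key similarity scores that are softmax-normalized into weights over values, applied once per time step of the diffusion model. The sketch below is a generic illustration of that mechanism and of one consistent reading of the claimed per-time-step flow (the result image feature of one step becomes the first image feature of the next); the function names, shapes, and the "editing" step are assumptions chosen for illustration, not code from Zhou or from the application.

    import numpy as np

    def cross_attention(query_feat, key_value_feat, d=64):
        """Generic scaled dot-product cross-attention (illustrative only)."""
        rng = np.random.default_rng(0)
        Wq = rng.standard_normal((query_feat.shape[-1], d))
        Wk = rng.standard_normal((key_value_feat.shape[-1], d))
        Wv = rng.standard_normal((key_value_feat.shape[-1], d))
        q, k, v = query_feat @ Wq, key_value_feat @ Wk, key_value_feat @ Wv
        scores = q @ k.T / np.sqrt(d)                       # query-key similarity
        weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
        weights /= weights.sum(axis=-1, keepdims=True)      # softmax over key positions
        return weights @ v                                  # weighted sum of values

    def generate(initial_image_feat, reference_image_feat, text_feat, num_steps=4):
        """Hypothetical per-time-step loop mirroring the claim language (schematic only)."""
        first_image_feat = initial_image_feat                                 # first time step
        for _ in range(num_steps):
            first_ca = cross_attention(first_image_feat, text_feat)           # "first cross-attention feature"
            second_ca = cross_attention(reference_image_feat, text_feat)      # "second cross-attention feature"
            third_ca = first_ca + 0.5 * (second_ca.mean(axis=0) - first_ca)   # assumed "editing" step
            result_image_feat = third_ca                                      # "result image feature" of this step
            first_image_feat = result_image_feat                              # reused in the next time step
        return result_image_feat                                              # would be decoded into the target image

    # Toy shapes only; real features would come from image and text encoders.
    rng = np.random.default_rng(1)
    out = generate(rng.standard_normal((16, 32)), rng.standard_normal((16, 32)), rng.standard_normal((8, 48)))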
Regarding claim 17, Zhou further discloses: wherein the calculating the first cross-attention feature of the first image feature and the text feature comprises: calculating a self-attention feature of the first image feature; generating a fourth image feature based on the self-attention feature and the first image feature; and calculating a first cross-attention feature of the fourth image feature and the text feature. (Zhou, ¶101: The term "self-attention" refers to a machine learning model in which representations of the input interact with each other to determine attention weights for the input, where self-attention can be distinguished from other attention models because the attention weights are determined at least in part by the input itself; ¶128: language model 840 generates two output embeddings: a first output embedding e; and a second output embedding s; based on the projected image embedding and the text embedding, where the two output embeddings include an inference of a user intention based on input image 805 and text prompt 820, and where the second output embedding sj is provided to guidance projection layer 845 of language generation model 825 to generate guidance embedding 855; ¶143: The cross-attention block then normalizes the attention scores to obtain attention weights (for example, using a softmax function), where the attention weights determine how much information from each value element is incorporated into the final attended representation, and where, by attending to different parts of the key-value sequence simultaneously, the cross-attention block captures relationships and dependencies across the input sequences, allowing the machine learning model to understand the context and generate more accurate and contextually relevant outputs; ¶145: This process is repeated multiple times, and then the process is reversed. For example, the down-sampled features are up-sampled using the up-sampling process to obtain up-sampled features. The up-sampled features can be combined with intermediate features having a same resolution and number of channels via a skip connection. These inputs are processed using a final neural network layer to produce output features.)

Regarding claim 8, the electronic device of claim 8 performs operations corresponding to the method of claim 17, and as such claim 8 is rejected based on the same rationale as claim 17 set forth above.

Claim Rejections - 35 USC § 103

The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:

A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

Claim(s) 5 and 14 is/are rejected under 35 U.S.C. 103 as being unpatentable over Zhou et al. (US 2025/0329079 A1) in view of Datta et al. (Datta, Siddhartha, et al., "Prompt Expansion for Adaptive Text-to-Image Generation", Computer Vision and Pattern Recognition, arXiv:2312.16720 [cs.CV], Dec. 27, 2023).

Regarding claim 14, the limitations included from claim 10 are rejected based on the same rationale as claim 10 set forth above.
Further regarding claim 14, Zhou does not explicitly disclose wherein the extracting the text feature of the description text comprises: encoding the content description text to obtain a first text feature of the content description text; introducing information of the reference image into the style description text to obtain an extended style description text; and encoding the extended style description text to obtain a second text feature of the extended style description text, wherein the text feature comprises the first text feature and the second text feature. However, the technique of using information of a reference image to obtain an extended description text as claimed is a known technique. Datta discloses: wherein the extracting the text feature of the description text comprises: encoding the content description text to obtain a first text feature of the content description text; introducing information of the reference image into the style description text to obtain an extended style description text; and encoding the extended style description text to obtain a second text feature of the extended style description text, wherein the text feature comprises the first text feature and the second text feature. (Datta, p. 5, section 3, ¶1: The Prompt Expansion framework requires a model to take a user text query as input and return N text prompts as output, where "we invert the images to a closely corresponding prompt that includes alt-text jargon (which we refer to as flavors, refer Sec 3.2). Finally, we map the inverted text to a range of high-level queries that more closely correspond to user input (refer Sec 3.3). These queries are paired with the prompts from the second step to form the {query:prompt} dataset."; p. 6, section 3.3, ¶1: the final step in dataset preparation is to compute a range of potential user queries that are suitable to map to the inverted text (prompt), where the few-shot prompts are prepended before the prompt as context, and a corresponding query is generated by the text-to-text model; see Fig. 6 on p. 7.)

Both Zhou and Datta are directed to text-to-image generation models. It would have been obvious to one of ordinary skill in the art, before the effective filing date and with a reasonable expectation of success, to modify the system and method for text-to-image generation using diffusion models for artificial intelligence processing of images as provided by Zhou with the text-to-image prompt expansion technique provided by Datta, using known electronic interfacing and programming techniques. The modification merely applies a known technique of text prompt expansion to a base device of text-to-image generation ready for improvement to yield predictable results. The teachings of Datta provide a technique to improve the text prompts of a system that uses text prompts for text-to-image generation as provided by Zhou, and one of ordinary skill in the art would have recognized that applying such prompt expansion improves the image generation modeling (see Datta, p. 4, "Contributions" section, which discloses the improvements of image quality and diversity, aesthetics and text-image alignment), yielding the predictable result of using an expanded text prompt within a system that uses text as a prompt. The modification also provides an improved image generation model that uses text as a prompt by providing more prompt diversity (see Datta, p. 4, "Contributions" section, which discloses the improvements of image quality and diversity, aesthetics and text-image alignment) and reduces tedious human interaction that would otherwise require a human operator to come up with a variety of prompts.

Regarding claim 5, the electronic device of claim 5 performs operations corresponding to the method of claim 14, and as such claim 5 is rejected based on the same rationale as claim 14 set forth above.

Claim(s) 9 and 18 is/are rejected under 35 U.S.C. 103 as being unpatentable over Zhou et al. (US 2025/0329079 A1) in view of Choi et al. (US 2025/0356171 A1).

Regarding claim 18, the limitations included from claim 17 are rejected based on the same rationale as claim 17 set forth above. Further regarding claim 18, Zhou does not explicitly disclose wherein the reference image is any image frame except a first image frame in a reference video, and wherein the generating the fourth image feature comprises: adjusting the self-attention feature based on a historical self-attention feature corresponding to the self-attention feature to obtain an adjusted self-attention feature, wherein the historical self-attention feature is an attention feature obtained by performing style transfer on a historical image frame of the reference image by using the diffusion model and located at a same location as the self-attention feature; and generating the fourth image feature based on the adjusted self-attention feature and the first image feature. However, Choi discloses these limitations (Choi, Figs. 4A and 4B and ¶49: video generation using a generative artificial intelligence model; ¶50: an input prompt defining a video output to be generated by a generative artificial intelligence model may be processed using a two-stage framework in which an input prompt is processed by the spatial component 125 and the temporal component 130 based on a cross-attention map 405 used to mask the generation of various attention outputs in both the spatial component 125 and the temporal component 130; ¶55: motion depicted by the subject of the generated video output is generated based on cross-attention maps, wherein prior frames and information about the text prompt defining the video to be generated are used as input into the temporal component, processed by the temporal self-attention block and temporal adapter, where "The output of the temporal self-attention block 420 may be combined with the output of the temporal adapter 422 and the cross-attention map 405 generated based on an output of the previous inferencing round to generate a temporal attention map which, as illustrated, may be provided as input into a temporal feedforward network 424 for projection into an output frame of the generative artificial intelligence model"). In other words, Choi discloses a known technique for video generation using a generative artificial intelligence model in which previous video frames provide attention outputs or features for cross-attention mapping, generating additional frames that incorporate the text prompt input features along with past video frame features within the model. This known technique of generative artificial intelligence video modeling, combined with the text- and image-to-image style transfer generative modeling provided by Zhou, teaches the limitations of the claim.
The modification merely incorporates known iterative machine modeling, using model weightings from past video images to generate new images from past images plus a text prompt, within a system that generates new images using both an image input and a text prompt. This yields an improved text-to-image style transfer system that allows the processing of video instead of merely single images (i.e., processing a plurality of static images), while maintaining consistency between video images by guiding the machine learning with the past video frames for a more consistent and aesthetically improved machine-made video.

Regarding claim 9, the electronic device of claim 9 performs operations corresponding to the method of claim 18, and as such claim 9 is rejected based on the same rationale as claim 18 set forth above.

Conclusion

The prior art made of record and not relied upon is considered pertinent to applicant's disclosure. He et al. (He, Feihong, et al., "FreeStyle: Free Lunch for Text-guided Style Transfer using Diffusion Models", Computer Vision and Pattern Recognition (cs.CV), arXiv:2401.15636 [cs.CV], version 1, published Jan. 28, 2024) is directed to a similar text-guided style transfer system and method including obtaining a reference image and a description text (He, p. 3, Fig. 2: style text prompt and content image x0); extracting a text feature of the description text (He, p. 2, Figure 2: "Our dual-stream encoder generates the content feature fc guided by the input content image x0 and the style feature fs guided by the input style text prompt and noisy image"; p. 4, section 3.2, ¶1: FreeStyle includes embedding to obtain image features, equation (4), embedding of the style text prompt); performing operations based on a pre-trained diffusion model to generate the target image in each time step of the diffusion model (He, p. 4, section 3.2: Model Structure of FreeStyle, ¶1 discloses a diffusion model including skip connections); and decoding a result image feature of a last time step to generate the target image (He, p. 2, Figure 2 discloses a single-stream decoder to obtain a result image based on the style prompt and content image; p. 4, section 3.2, ¶1: "It consists of an encoder and a decoder, along with skip connections that facilitate information exchange between corresponding layers of the encoder and decoder.").

Park et al. (US 12,524,937 B2) is related art directed to systems and methods for image generation including obtaining a text prompt, generating a style vector based on the text prompt, and generating an image corresponding to the text prompt based on the adaptive convolution filter (Park, Abstract). Park further discloses use of a diffusion model to obtain output images (Park, [3:17-28]), including use of a machine learning model, text prompt, global vector, local vectors, latent code, style vector, and feature map, where the machine learning model includes a text encoder network, mapping network, and image generation network, the text encoder network including a pretrained encoder and a learned encoder, and the image generation network including a convolution block, self-attention block, and cross-attention block (Park, [16:47-58]). Park fails to teach the particulars of the interactions of the cross-attention feature calculations as claimed, and further does not include use of a reference image as a prompt.
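
As a rough illustration of the claim 14 technique addressed in the §103 rejection above (encode the content description text, extend the style description text with information taken from the reference image, then encode the extended text), the following sketch may help. It is schematic only; encode_text and describe_image are hypothetical placeholders, not the applicant's method or Datta's Prompt Expansion model.

    # Schematic only; encode_text and describe_image are hypothetical placeholders.
    def extract_text_features(content_text, style_text, reference_image, encode_text, describe_image):
        first_text_feature = encode_text(content_text)            # encode the content description text
        image_info = describe_image(reference_image)              # information drawn from the reference image
        extended_style_text = f"{style_text}, {image_info}"       # "extended style description text"
        second_text_feature = encode_text(extended_style_text)    # encode the extended style text
        return first_text_feature, second_text_feature            # together form the claimed "text feature"

    # Toy usage with trivial stand-ins (prompt wording taken from the Zhou example cited above).
    f1, f2 = extract_text_features(
        "person", "post-impressionistic painting style", reference_image=None,
        encode_text=lambda s: [len(w) for w in s.split()],
        describe_image=lambda img: "warm palette, single subject")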
Any inquiry concerning this communication or earlier communications from the examiner should be directed to WILLIAM A BEUTEL whose telephone number is (571)272-3132. The examiner can normally be reached Monday-Friday 9:00 AM - 5:00 PM (EST). Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice. If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, DANIEL HAJNIK can be reached at 571-272-7642. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300. Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000. /WILLIAM A BEUTEL/Primary Examiner, Art Unit 2616

Prosecution Timeline

Jun 20, 2024
Application Filed
Feb 04, 2026
Non-Final Rejection — §102, §103, §112 (current)

Precedent Cases

Applications granted by this same examiner with similar technology

Patent 12581262
AUGMENTED REALITY INTERACTION METHOD AND ELECTRONIC DEVICE
2y 5m to grant • Granted Mar 17, 2026
Patent 12572258
APPARATUS AND METHOD WITH IMAGE PROCESSING USER INTERFACE
2y 5m to grant • Granted Mar 10, 2026
Patent 12566531
CONFIGURING A 3D MODEL WITHIN A VIRTUAL CONFERENCING SYSTEM
2y 5m to grant • Granted Mar 03, 2026
Patent 12561927
MEDIA RESOURCE DISPLAY METHOD AND APPARATUS, DEVICE, AND STORAGE MEDIUM
2y 5m to grant • Granted Feb 24, 2026
Patent 12554384
SYSTEMS AND METHODS FOR IMPROVED CONTENT EDITING AT A COMPUTING DEVICE
2y 5m to grant • Granted Feb 17, 2026
Study what changed to get past this examiner. Based on 5 most recent grants.

Prosecution Projections

Expected OA Rounds: 1-2
Grant Probability: 70%
With Interview: 90% (+20.4%)
Median Time to Grant: 2y 7m
PTA Risk: Low
Based on 469 resolved cases by this examiner. Grant probability is derived from the career allow rate.
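
A one-line check of the with-interview figure, assuming (per the note above) that grant probability is the career allow rate and that the interview lift is additive in percentage points:

    base, lift = 0.70, 0.204                                 # career allow rate and interview lift
    print(f"with interview: {min(base + lift, 1.0):.0%}")    # -> 90%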
