Prosecution Insights
Last updated: April 19, 2026
Application No. 18/492,572

HARDWARE-AWARE EFFICIENT ARCHITECTURES FOR TEXT-TO-IMAGE DIFFUSION MODELS

Final Rejection — §103

Filed: Oct 23, 2023
Examiner: KY, KEVIN
Art Unit: 2671
Tech Center: 2600 — Communications
Assignee: Qualcomm Incorporated
OA Round: 2 (Final)

Grant Probability: 76% (Favorable)
Expected OA Rounds: 3-4
Time to Grant: 2y 6m
Grant Probability With Interview: 99%

Examiner Intelligence

Career Allow Rate: 76% — above average (420 granted / 549 resolved; +14.5% vs TC average)
Interview Lift: +25.3% — strong (resolved cases with an interview vs. without)
Typical Timeline: 2y 6m average prosecution; 33 applications currently pending
Career History: 582 total applications across all art units
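The headline figures above are simple ratios of the career counts. As a sanity check, the allow rate and the implied Tech Center average can be recomputed; the rounding convention and the derivation of the TC average are assumptions, since the page does not state them:

```python
# Recompute the examiner's headline statistics from the raw career counts.
# Rounding and the TC-average derivation are assumptions, not stated by the page.
granted = 420
resolved = 549

allow_rate = granted / resolved            # career allow rate (~0.765, shown as 76%)
tc_delta_points = 14.5                     # reported lift vs Tech Center average
tc_average = allow_rate * 100 - tc_delta_points  # implied TC average allow rate

print(f"allow rate: {allow_rate:.1%}")
print(f"implied TC average: {tc_average:.1f}%")
```

Note the dashboard appears to truncate 76.5% down to 76%.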

Statute-Specific Performance

§101: 17.6% (-22.4% vs TC avg)
§103: 46.5% (+6.5% vs TC avg)
§102: 20.8% (-19.2% vs TC avg)
§112: 9.9% (-30.1% vs TC avg)
Tech Center averages are estimates. Based on career data from 549 resolved cases.

Office Action (§103)
DETAILED ACTION

Claim Interpretation

The following is a quotation of 35 U.S.C. 112(f):

(f) Element in Claim for a Combination. – An element in a claim for a combination may be expressed as a means or step for performing a specified function without the recital of structure, material, or acts in support thereof, and such claim shall be construed to cover the corresponding structure, material, or acts described in the specification and equivalents thereof.

The following is a quotation of pre-AIA 35 U.S.C. 112, sixth paragraph:

An element in a claim for a combination may be expressed as a means or step for performing a specified function without the recital of structure, material, or acts in support thereof, and such claim shall be construed to cover the corresponding structure, material, or acts described in the specification and equivalents thereof.

The claims in this application are given their broadest reasonable interpretation using the plain meaning of the claim language in light of the specification as it would be understood by one of ordinary skill in the art. The broadest reasonable interpretation of a claim element (also commonly referred to as a claim limitation) is limited by the description in the specification when 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph, is invoked.

As explained in MPEP § 2181, subsection I, claim limitations that meet the following three-prong test will be interpreted under 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph:

(A) the claim limitation uses the term “means” or “step” or a term used as a substitute for “means” that is a generic placeholder (also called a nonce term or a non-structural term having no specific structural meaning) for performing the claimed function;
(B) the term “means” or “step” or the generic placeholder is modified by functional language, typically, but not always, linked by the transition word “for” (e.g., “means for”) or another linking word or phrase, such as “configured to” or “so that”; and
(C) the term “means” or “step” or the generic placeholder is not modified by sufficient structure, material, or acts for performing the claimed function.

Use of the word “means” (or “step”) in a claim with functional language creates a rebuttable presumption that the claim limitation is to be treated in accordance with 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph. The presumption that the claim limitation is interpreted under 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph, is rebutted when the claim limitation recites sufficient structure, material, or acts to entirely perform the recited function.

Absence of the word “means” (or “step”) in a claim creates a rebuttable presumption that the claim limitation is not to be treated in accordance with 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph. The presumption that the claim limitation is not interpreted under 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph, is rebutted when the claim limitation recites function without reciting sufficient structure, material, or acts to entirely perform the recited function.

Claim limitations in this application that use the word “means” (or “step”) (e.g., claims 19-24) are being interpreted under 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph, except as otherwise indicated in an Office action.
Conversely, claim limitations in this application that do not use the word “means” (or “step”) are not being interpreted under 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph, except as otherwise indicated in an Office action.

Referring to the specification as filed, the apparatus in claims 19-24 corresponds to Fig. 1, a system-on-a-chip (SOC) 100, which may include a central processing unit (CPU) 102 or a multi-core CPU configured for text-to-image diffusion models. ¶30 further discloses “the general-purpose processor 102 may include means for receiving, means for generating, and means for training”.

Claim Rejections - 35 USC § 103

The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:

A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

Claims 1-4, 7-10, 13-16, and 19-22 are rejected under 35 U.S.C. 103 as being unpatentable over Park et al. (US 20240281924) in view of Korviakov et al. (US 20230394285), in further view of Karpman et al. (US 11995803 B1).

Regarding claim 1, Park discloses an apparatus (Fig. 5 apparatus 500), comprising:

at least one memory (¶32 One or more embodiments of the apparatus include at least one processor; at least one memory storing instructions executable by the at least one processor); and

at least one processor coupled to the at least one memory, the at least one processor configured to (¶32 One or more embodiments of the apparatus include at least one processor; at least one memory storing instructions executable by the at least one processor):

receive a text-semantic input (Fig. 10 text prompt 1005) at a first stage of a neural network, the first stage including a first convolutional block (¶125 As an example shown in FIG. 10, the layer at 16-pixel includes 5 blocks of the interleaved attention and convolutional layers. Here, 16-pixel means 16-by-16 pixels. The layer at 32-pixel includes 5 blocks of the interleaved attention and convolutional layers. Similarly, layers at 64-pixel, 128-pixel, and 256-pixel include 5 blocks of the interleaved attention and convolutional layers) and no attention layers;

receive, at a second stage, the first output from the first stage, the second stage comprising a first down sampling block including a first attention layer and a second convolutional block (¶125 As an example shown in FIG. 10, the layer at 16-pixel includes 5 blocks of the interleaved attention and convolutional layers. Here, 16-pixel means 16-by-16 pixels. The layer at 32-pixel includes 5 blocks of the interleaved attention and convolutional layers. Similarly, layers at 64-pixel, 128-pixel, and 256-pixel include 5 blocks of the interleaved attention and convolutional layers; image generation network 1045 includes downsampling residual blocks and then upsampling residual blocks, where a layer of the downsampling residual blocks is connected to a layer of the upsampling residual blocks by a skip connection in a U-net architecture.);

receive, at a third stage, a second output from the second stage, the third stage comprising a first up sampling block including a second attention layer and a first set of convolutional blocks (¶125 As an example shown in FIG. 10, the layer at 16-pixel includes 5 blocks of the interleaved attention and convolutional layers. Here, 16-pixel means 16-by-16 pixels. The layer at 32-pixel includes 5 blocks of the interleaved attention and convolutional layers. Similarly, layers at 64-pixel, 128-pixel, and 256-pixel include 5 blocks of the interleaved attention and convolutional layers; image generation network 1045 includes downsampling residual blocks and then upsampling residual blocks, where a layer of the downsampling residual blocks is connected to a layer of the upsampling residual blocks by a skip connection in a U-net architecture.);

receive, at a fourth stage, the first output from the first stage and a third output from the third stage, the fourth stage comprising a second up sampling block including no attention layers and a second set of convolutional blocks (¶125 In some cases, skip connections 1050 in the asymmetric U-Net architecture exist between layers at the same resolution. For example, image generation network 1045 includes downsampling residual blocks and then upsampling residual blocks, where a layer of the downsampling residual blocks is connected to a layer of the upsampling residual blocks by a skip connection in a U-net architecture); and

generate an image at the fourth stage, based on the text-semantic input (¶124 where the input low-resolution image 1015 (64-pixel image) passes through 3 downsampling residual blocks and then 6 upsampling residual blocks with attention layers to generate the high-resolution image 1055 (512-pixel image)).

Park fails to specifically teach, but Korviakov teaches, generating a first output comprising a feature map at the first stage (¶17 CNN is a deep learning neural network, wherein one or more building blocks are based on a convolution operation; ¶18 The input data may be related to any kind of data, for example, image data, text data, voice data, etc.; ¶19 the device may perform a convolution operation, which may be, for example, an operation that transforms input feature maps having the first number of channels into output feature maps); and fails to specifically teach, but Karpman teaches, the first stage including a first convolutional block and no attention layers (col 5 lines 15-36 & 45-50 the base image diffusion model 120 defines a deep learning network (e.g., a convolutional neural network, a residual neural network, etc.) configured (e.g., through the training described) to generate images from random (e.g., Gaussian) noise based on text prompts and/or descriptions. The base image diffusion model 120 can include a U-net architecture (e.g., Efficient U-Net) defined from residual and multi-head attention blocks that enable the base image diffusion model 120 to progressively denoise (e.g., infill, generate, augment) image data according to cross-attention inputs based on the text prompt; self-attention layers in the base diffusion model architecture can be omitted to improve memory efficiency and inference time), and the fourth stage comprising a second up sampling block including no attention layers (col 5 lines 15-36 & 45-50 system can then pass the base image to the set of high-resolution diffusion models 116 for upsampling and output; self-attention layers in the base diffusion model architecture can be omitted to improve memory efficiency and inference time).

Therefore, it would have been obvious to one with ordinary skill in the art before the effective filing date of the invention to have implemented the teaching of generating a first output comprising a feature map at the first stage from Korviakov, and the teaching of the first stage including a first convolutional block and no attention layers and the fourth stage comprising a second up sampling block including no attention layers from Karpman, into the apparatus as disclosed by Park. The motivation for doing so is to improve training neural networks to perform tasks and further to improve memory efficiency and inference time.

Regarding claim 2, the combination of Park, Korviakov and Karpman discloses the apparatus of claim 1, in which the neural network comprises a text-to-image diffusion-based generative model (Karpman col. 2 lines 51-55 Text-to-image diffusion model 112 may be a probabilistic generative model used to generate image data). The motivation to combine the references is discussed above in the rejection of claim 1.
Regarding claim 3, the combination of Park, Korviakov and Karpman discloses the apparatus of claim 1, in which the neural network comprises a UNet (Park ¶124 As an example shown in FIG. 10, image generation network 1045 is rearranged to an asymmetric U-Net architecture).

Regarding claim 4, the combination of Park, Korviakov and Karpman discloses the apparatus of claim 1, in which the first stage comprises a first additional convolutional block, the second stage comprises a second additional convolutional block, the third stage comprises a third additional convolutional block, and the fourth stage comprises a fourth additional convolutional block (Park ¶125 As an example shown in FIG. 10, the layer at 16-pixel includes 5 blocks of the interleaved attention and convolutional layers. Here, 16-pixel means 16-by-16 pixels. The layer at 32-pixel includes 5 blocks of the interleaved attention and convolutional layers. Similarly, layers at 64-pixel, 128-pixel, and 256-pixel include 5 blocks of the interleaved attention and convolutional layers; image generation network 1045 includes downsampling residual blocks and then upsampling residual blocks, where a layer of the downsampling residual blocks is connected to a layer of the upsampling residual blocks by a skip connection in a U-net architecture.).

Regarding claims 7-10 (drawn to a method): The proposed combination of Park, Korviakov and Karpman, explained in the rejection of apparatus claims 1-4, renders obvious the steps of the method of claims 7-10 because these steps occur in the operation of the proposed combination as discussed above. Thus, the arguments presented above for claims 1-4 are equally applicable to claims 7-10.
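The dispute over claim 1 centers on where attention layers sit in the claimed four-stage U-Net: attention only in the middle down-/up-sampling stages, none in the first or fourth. A minimal structural sketch of that topology (the `Stage` record and all names are illustrative, not taken from the application or the cited art):

```python
from dataclasses import dataclass

@dataclass
class Stage:
    name: str
    kind: str        # "conv", "down", or "up" (illustrative labels)
    conv_blocks: int
    attention: bool

# Hypothetical rendering of the claim 1 topology: attention appears only in
# the middle sampling stages; the first and fourth stages have none, and the
# fourth stage additionally takes a skip connection from the first.
CLAIMED_PIPELINE = [
    Stage("stage1", "conv", 1, attention=False),  # first convolutional block, no attention
    Stage("stage2", "down", 1, attention=True),   # down sampling block + first attention layer
    Stage("stage3", "up", 1, attention=True),     # up sampling block + second attention layer
    Stage("stage4", "up", 1, attention=False),    # up sampling block, no attention layers
]

def attention_stage_names(pipeline):
    """Names of the stages whose blocks contain an attention layer."""
    return [s.name for s in pipeline if s.attention]
```

Under this sketch, `attention_stage_names(CLAIMED_PIPELINE)` yields only the middle stages, which is the asymmetry the applicant argues over Park's interleaved-attention layers at every resolution.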
Regarding claims 13-16 (drawn to a CRM): The proposed combination of Park, Korviakov and Karpman, explained in the rejection of apparatus claims 1-4, renders obvious the steps of the computer-readable medium of claims 13-16 because these steps occur in the operation of the proposed combination as discussed above. Thus, the arguments presented above for claims 1-4 are equally applicable to claims 13-16. See Park ¶81-83.

Regarding claims 19-22 (drawn to an apparatus): The proposed combination of Park, Korviakov and Karpman, explained in the rejection of apparatus claims 1-4, renders obvious the steps of the system of claims 19-22 because these steps occur in the operation of the proposed combination as discussed above. Thus, the arguments presented above for claims 1-4 are equally applicable to claims 19-22. See Park ¶81-83.

Claims 5, 11, 17, and 23 are rejected under 35 U.S.C. 103 as being unpatentable over the combination of Park, Korviakov and Karpman as applied to claims 4, 10, 16 and 22 above, and further in view of Guo et al. (US 20230351185).

Regarding claim 5, the combination of Park, Korviakov and Karpman discloses the apparatus of claim 4, but fails to teach, where Guo teaches, in which the at least one processor is further configured to: train the neural network to obtain a converged neural network (¶44 Referring to FIG. 2, in response to each pruning algorithm pruning the neural network, the processor 130 retrains the pruned neural network (step S220). Specifically, after each pruning, the processor 130 may retrain the pruned neural network. When the neural network (model) converges, the processor 130 may use another pruning algorithm to prune the pruned neural network); and train a pruned neural network based on the converged neural network (¶44 Specifically, after each pruning, the processor 130 may retrain the pruned neural network. When the neural network (model) converges, the processor 130 may use another pruning algorithm to prune the pruned neural network. For example, the neural network is retrained after channel pruning, and when the neural network converges, weight pruning is then performed).

Therefore, it would have been obvious to one with ordinary skill in the art before the effective filing date of the invention to have implemented the teaching of in which the at least one processor is further configured to: train the neural network to obtain a converged neural network, and train a pruned neural network based on the converged neural network, from Guo into the apparatus as disclosed by the combination of Park, Korviakov and Karpman. The motivation for doing so is to improve techniques for optimizing neural networks.

Regarding claim 11 (drawn to a method): The proposed combination of Park, Korviakov, Karpman, and Guo, explained in the rejection of apparatus claim 5, renders obvious the steps of the method of claim 11 because these steps occur in the operation of the proposed combination as discussed above. Thus, the arguments presented above for claim 5 are equally applicable to claim 11.

Regarding claim 17 (drawn to a CRM): The proposed combination of Park, Korviakov, Karpman and Guo, explained in the rejection of apparatus claim 5, renders obvious the steps of the computer-readable medium of claim 17 because these steps occur in the operation of the proposed combination as discussed above. Thus, the arguments presented above for claim 5 are equally applicable to claim 17. See Park ¶81-83.
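Guo's cycle, as characterized in the rejection of claim 5, alternates a pruning pass with retraining to convergence before the next pass. A toy sketch of that loop, assuming simple magnitude pruning over a flat weight list (function names and the weight representation are illustrative, not Guo's):

```python
def prune_smallest(weights, fraction):
    """Zero out the smallest-magnitude `fraction` of weights (magnitude pruning)."""
    k = int(len(weights) * fraction)
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    pruned = list(weights)
    for i in order[:k]:
        pruned[i] = 0.0
    return pruned

def prune_then_retrain(weights, schedule, retrain):
    """Alternate pruning and retraining, per the prune/retrain-to-convergence
    cycle the rejection attributes to Guo: each pruning pass is followed by
    retraining until the model converges before the next pass begins."""
    for fraction in schedule:
        weights = prune_smallest(weights, fraction)
        weights = retrain(weights)  # stand-in for retraining to convergence
    return weights
```

For example, `prune_smallest([0.5, -0.1, 0.3, 0.05], 0.5)` zeros the two smallest-magnitude entries, leaving `[0.5, 0.0, 0.3, 0.0]`.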
Regarding claim 23 (drawn to an apparatus): The proposed combination of Park, Korviakov, Karpman, and Guo, explained in the rejection of apparatus claim 5, renders obvious the steps of the system of claim 23 because these steps occur in the operation of the proposed combination as discussed above. Thus, the arguments presented above for claim 5 are equally applicable to claim 23. See Park ¶81-83.

Claims 6, 12, 18, and 24 are rejected under 35 U.S.C. 103 as being unpatentable over the combination of Park, Korviakov, Karpman, and Guo as applied to claims 5, 11, 17, and 23 above, and further in view of Fukuda et al. (US 20200034702).

Regarding claim 6, the combination of Park, Korviakov, Karpman, and Guo discloses the apparatus of claim 5, in which the converged neural network comprises a teacher neural network (Guo Fig. 7, e.g., trained neural network) and the pruned neural network comprises a student neural network (Guo Fig. 7, e.g., pruned neural network), but fails to teach, where Fukuda teaches, the at least one processor is further configured to train the student neural network based on a block-wise error calculation for each stage of the student neural network relative to a same stage of the teacher neural network (¶56 At block 340, a student training section may train a student neural network with a teacher input data and the corresponding soft label output obtained at the most recent iteration of block 330. For example, in the embodiment of FIG. 4, at the first iteration, the student training section may train the student neural network, at block 340, with Input Data 1 and a soft label output that the Teacher NN1 has output in response to receiving Input Data 1. In an embodiment, the student training section, at block 340, may train the student neural network such that soft label errors between (1) a soft label output generated by the student neural network in response to receiving the teacher input data (e.g., Input Data 1) and (2) the soft label output generated by the selected teacher neural network (e.g., Teacher NN1) in response to receiving the same teacher input data, is minimized).

Therefore, it would have been obvious to one with ordinary skill in the art before the effective filing date of the invention to have implemented the teaching of the at least one processor is further configured to train the student neural network based on a block-wise error calculation for each stage of the student neural network relative to a same stage of the teacher neural network, from Fukuda into the apparatus as disclosed by the combination of Park, Korviakov, Karpman, and Guo. The motivation for doing so is to improve training a student neural network with a teacher neural network.

Regarding claim 12 (drawn to a method): The proposed combination of Park, Korviakov, Karpman, Guo, and Fukuda, explained in the rejection of apparatus claim 6, renders obvious the steps of the method of claim 12 because these steps occur in the operation of the proposed combination as discussed above. Thus, the arguments presented above for claim 6 are equally applicable to claim 12.

Regarding claim 18 (drawn to a CRM): The proposed combination of Park, Korviakov, Karpman, Guo, and Fukuda, explained in the rejection of apparatus claim 6, renders obvious the steps of the computer-readable medium of claim 18 because these steps occur in the operation of the proposed combination as discussed above. Thus, the arguments presented above for claim 6 are equally applicable to claim 18. See Park ¶81-83.
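The block-wise error that claim 6 recites can be sketched as a per-stage loss summed over matching student/teacher feature maps. Note this is a generic distillation sketch of the *claim* language, not of Fukuda, whose quoted ¶56 minimizes soft-label errors at the output rather than intermediate block errors; all names and the list-of-lists feature representation are illustrative:

```python
def blockwise_distillation_loss(student_feats, teacher_feats):
    """Sum of per-stage mean-squared errors between the student's and
    teacher's feature maps at the same stage (block-wise error calculation)."""
    assert len(student_feats) == len(teacher_feats), "stage counts must match"
    total = 0.0
    for s, t in zip(student_feats, teacher_feats):
        assert len(s) == len(t), "feature maps at a stage must align"
        total += sum((si - ti) ** 2 for si, ti in zip(s, t)) / len(s)
    return total
```

For a single two-element stage, `blockwise_distillation_loss([[1.0, 2.0]], [[1.0, 4.0]])` is ((0)² + (2)²) / 2 = 2.0. The applicant may find this stage-wise/output-wise distinction useful when arguing Fukuda's soft-label matching does not reach the claimed per-stage calculation.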
Regarding claim 24 (drawn to an apparatus): The proposed combination of Park, Korviakov, Karpman, Guo, and Fukuda, explained in the rejection of apparatus claim 6, renders obvious the steps of the system of claim 24 because these steps occur in the operation of the proposed combination as discussed above. Thus, the arguments presented above for claim 6 are equally applicable to claim 24. See Park ¶81-83.

Response to Arguments

Applicant’s arguments with respect to claims 1-24 have been considered but are moot because the new ground of rejection does not rely on any reference applied in the prior rejection of record for any teaching or matter specifically challenged in the argument.

Conclusion

Applicant's amendment necessitated the new ground(s) of rejection presented in this Office action. Accordingly, THIS ACTION IS MADE FINAL. See MPEP § 706.07(a). Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a).

A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action. In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any nonprovisional extension fee (37 CFR 1.17(a)) pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action. In no event, however, will the statutory period for reply expire later than SIX MONTHS from the mailing date of this final action.

Any inquiry concerning this communication or earlier communications from the examiner should be directed to KEVIN KY whose telephone number is (571) 272-7648. The examiner can normally be reached Monday-Friday, 9 AM-5 PM.
Examiner interviews are available via telephone, in person, and video conferencing using a USPTO-supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.

If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Vincent Rudolph, can be reached at 571-272-8243. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.

Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.

/KEVIN KY/
Primary Examiner, Art Unit 2671

Prosecution Timeline

Oct 23, 2023: Application Filed
Oct 15, 2025: Non-Final Rejection — §103
Dec 10, 2025: Examiner Interview Summary
Dec 10, 2025: Applicant Interview (Telephonic)
Dec 11, 2025: Response Filed
Mar 09, 2026: Final Rejection — §103 (current)

Precedent Cases

Applications granted by this same examiner with similar technology

Patent 12597158: POSE ESTIMATION (granted Apr 07, 2026; 2y 5m to grant)
Patent 12597291: IMAGE ANALYSIS FOR PERSONAL INTERACTION (granted Apr 07, 2026; 2y 5m to grant)
Patent 12586393: KNOWLEDGE-DRIVEN SCENE PRIORS FOR SEMANTIC AUDIO-VISUAL EMBODIED NAVIGATION (granted Mar 24, 2026; 2y 5m to grant)
Patent 12586559: METHOD AND APPARATUS FOR GENERATING SPEECH OUTPUTS IN A VEHICLE (granted Mar 24, 2026; 2y 5m to grant)
Patent 12579382: NATURAL LANGUAGE GENERATION USING KNOWLEDGE GRAPH INCORPORATING TEXTUAL SUMMARIES (granted Mar 17, 2026; 2y 5m to grant)

Based on the 5 most recent grants; study what changed in these applications to get past this examiner.


Prosecution Projections

Expected OA Rounds: 3-4
Grant Probability: 76%
With Interview: 99% (+25.3% lift)
Median Time to Grant: 2y 6m
PTA Risk: Moderate
Based on 549 resolved cases by this examiner. Grant probability derived from career allow rate.
