DETAILED ACTION
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.
Claims 1, 6, 10, 11, 19, 24, 28, 29, 37, and 40 are rejected under 35 U.S.C. 103 as being unpatentable over Horesh et al. (U.S. Patent No. 11,972,333) in view of Kim et al. (U.S. Patent Publication 2024/0311405).
Concerning independent claims 1 and 19, Horesh et al. discloses a system and method for generative artificial intelligence, comprising:
“at least one memory having executable instructions stored thereon; and one or more processors configured to execute the executable instructions to cause the processing system to:” – system 100 includes a processor 130 and a memory 135 coupled to processor 130 (column 5, lines 12 to 24: Figure 1); processor 130 may be capable of executing scripts or instructions of one or more software programs stored in memory 135 (column 6, lines 30 to 33: Figure 1);
“generate, based on an input query and a first generative artificial intelligence model, a sequence of tokens corresponding to a candidate response to the input query” – first generative AI model 170 generates a first output based on a user input provided via a prompt (column 8, lines 45 to 50: Figure 1); model input 202 is input provided to a generative AI model to prompt the generative AI model to generate an output answering the input; system 100 may provide an input prompt to a user to provide a natural language input to be received by first generative AI model 210 (column 12, lines 47 to 52: Figure 2); BERT model 240 is trained to tokenize first output 212 to generate a first output vector 242 of tokens and to tokenize second output 222 to generate a second output vector 244 of tokens (“a sequence of tokens corresponding to a candidate response”) (column 13, lines 48 to 57: Figure 2); through management of another generative AI model, a number of irrelevant answers to a query may be reduced (column 24, lines 16 to 27: Figure 1); model input 202, as natural language input from a user, then, can be “an input query”;
“output, to a second generative artificial intelligence model, the sequence of tokens and the input query for verification” – managing a generative AI model includes using a second generative AI model to generate outputs from similar inputs and comparing the outputs of the generative AI models to determine their similarity (Abstract); to ensure that outputs generated and provided by a generative AI model are relevant (“for verification”), a second generative model is implemented to generate outputs based on the same prompts provided to the generative AI model being managed (column 4, lines 53 to 62); first generative model 170 generates a first output based on the input, and second generative model 140 generates a second output based on the input; classification model 150 receives the first output and the second output, compares the outputs, and generates a similarity indication based on the comparison (column 8, lines 48 to 53: Figure 1); system 100 may receive model input 202, which may be provided to second generative AI model 220 for the second generative AI model 220 to generate a second output 222 (“output, to a second generative artificial intelligence model . . . the input query”) (column 12, lines 56 to 59: Figure 2); classification model 230 compares the outputs from first generative AI model 210 with the outputs of second generative AI model 220 based on a same input to models 210 and 220 (column 13, lines 32 to 47: Figure 2); Figure 2 illustrates that second generative AI model 220 and classification model 230 receive output from first generative AI model 210 and model input 202; here, second generative model 220, classification model 230, and policy engine 270, as a unit, may be construed as “a second generative artificial intelligence model” that receives model input 202, i.e., “the input query”, and model output of first generative AI model 210, i.e., “the sequence of tokens”;
“receive, from the second generative artificial intelligence model, one or more first guidance signals for the generated sequence of tokens” – an output of the second generative AI model is compared to an output of the managed generative AI model by a classification model to determine a relevance of the output from the managed generative AI model, and to perform suitable policies to optimize performance of the managed generative AI model, e.g., providing alternative outputs or preventing providing the output (Abstract); NN classifier 250 generates similarity indication 252 indicating the similarity between the first output and the second output; similarity indication 252 may be used to execute various policies for managing first generative AI model 210 (column 14, lines 47 to 51: Figure 2); policy engine 270 is implemented to execute one or more policies for managing first generative AI model 210 based on a relevancy of first output 212 to model input 202; policy engine 270 receives identification signal 262 or similarity signal 252 regarding the similarity of first output 212 and second output 222 (column 15, lines 21 to 27: Figure 2); policy engine 270, then, provides “one or more first guidance signals” for managing output by first generative AI model 210; that is, policy engine 270 manages (‘guides’) first generative AI model 210 by ‘signaling’ first generative AI model 210 to provide alternative outputs or to prevent providing the outputs;
“revise the candidate response to the input query based on the generated sequence of tokens and the one or more first guidance signals” – first generative model 170 is the model to be managed in system 100; managing the generative AI model may include any action to manage the outputs of the generative AI model, including adjusting (‘revising’) the output to be provided or providing an alternative output (column 7, lines 1 to 10: Figure 1); an output of a classification model is used to perform various suitable policies to optimize the performance of the managed generative AI model, e.g., providing alternate outputs or preventing providing the outputs (column 1, lines 54 to 58); policy engine 160 may instruct system 100 to provide the input again to first generative AI model 170, and first generative AI model 170 generates a third output that is different than the first output generated previously based on the input (column 10, lines 54 to 59: Figure 1); here, providing an alternate output for a response is equivalent to “revise the candidate response”;
“output the revised candidate response as a response to the input query” – interface 110 may provide outputs from one or more generative AI models included in system 100 (column 5, lines 25 to 34: Figure 1); if the outputs are similar during a second comparison, e.g., a second similarity indication is greater than a defined threshold, policy engine 160 may instruct system 100 to output the third output for use by a user (column 11, lines 6 to 11: Figure 1); policy engine 270 may instruct system 100 to output first output 212 for use in response to identifying that first output 212 is similar to second output 222, or policy engine 270 may instruct system 100 to output second output 222 from second generative AI model 220 (column 15, lines 39 to 53: Figure 2).
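For purposes of illustration only, the compare-and-policy flow of Horesh et al. mapped above may be sketched as follows; this is a non-limiting sketch, and all function and variable names are hypothetical rather than taken from the reference:

```python
# Hypothetical sketch of the Horesh et al. flow mapped above: a first
# generative model drafts a candidate response, a second model answers the
# same input for verification, a classification model compares the two, and
# a policy engine either outputs the candidate or requests an alternative.
def manage_generation(model_input, first_model, second_model, classifier,
                      similarity_threshold=0.8, max_iterations=3):
    candidate = first_model(model_input)        # first output (candidate response)
    reference = second_model(model_input)       # second output (same input query)
    for _ in range(max_iterations):
        similarity = classifier(candidate, reference)   # similarity indication
        if similarity >= similarity_threshold:
            return candidate                    # output the candidate as the response
        # Policy: instruct the first model to provide an alternate output.
        candidate = first_model(model_input)    # revised candidate response
    return reference                            # fall back to the second model's output
```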
Concerning independent claims 1 and 19, Horesh et al. discloses all of the limitations with the exception of “wherein the first generative artificial intelligence model is a smaller version of the second generative artificial intelligence model”. Here, Horesh et al. does not disclose anything about the relative sizes of the first and second generative AI models 210, 220, but describes an embodiment of a plurality of generative AI models being trained on various time ranges of years from ten years of training data. (Column 20, Line 54 to Column 21, Line 53: Figure 5) However, Kim et al. teaches dynamic selection from among multiple candidate generative models with different computational efficiencies. (Abstract) Many generative models can be of a very large size including billions of parameters, and smaller size counterparts can be separately trained with fewer parameters or pruned and/or quantized by applying one or more pruning techniques and/or one or more quantization techniques. Due to the large size of a generative model, there can be significant resource utilization and latency, but smaller size models can be less robust and/or less accurate than their larger size counterparts. (¶[0002] - ¶[0003]) An implementation can dynamically select between at least a smaller LLM and a larger LLM on a request-by-request basis to achieve reduced latency and/or improved computational efficiency while mitigating occurrences of any inaccurate and/or under-specified responses. (¶[0007]) Selection between smaller and larger LLMs can be made based on first and second measures that characterize a probability of generating a correct response to a request, where a first measure characterizes generating a correct response using the smaller LLM and a second measure characterizes generating a correct response using the larger LLM. (¶[0008]) A trained machine learning (ML) model can be used in selecting from among multiple candidate generative models, e.g., from between at least a smaller LLM and a larger LLM. (¶[0012]) Candidate generative models 150 can include LLM 150A with less than 100 billion parameters, LLM 150B with between 100 billion and 250 billion parameters, and LLM 150C with over 250 billion parameters. (¶[0027]: Figure 1) Selection engine 126 utilizes request features to select which of multiple candidate generative models 150 should be utilized in responding to a request. (¶[0042]: Figure 1) Kim et al., then, teaches “wherein the first generative artificial intelligence model is a smaller version of the second generative artificial intelligence model” because LLM 150A is smaller than LLM 150B or LLM 150C. An objective is to reduce latency and/or conserve computational resources while mitigating occurrences of a generated response being inaccurate or under-specified. (¶[0004]) It would have been obvious to one having ordinary skill in the art to provide a first generative artificial intelligence model that is a smaller version of a second generative artificial intelligence model, as taught by Kim et al., as the first and second generative AI models of Horesh et al. for a purpose of conserving computational resources and mitigating occurrences of a generated response being inaccurate.
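By way of a hedged illustration of the selection Kim et al. teaches, a request-by-request choice between a smaller and a larger LLM may be sketched as follows; the names and the threshold are hypothetical assumptions, not disclosures of the reference:

```python
# Hypothetical sketch of request-by-request model selection per Kim et al.:
# a measure characterizing the probability of a correct response from the
# smaller LLM decides whether the smaller or larger LLM serves the request.
def select_model(request_features, smaller_llm, larger_llm, measure,
                 confidence_threshold=0.7):
    # First measure: probability the smaller LLM generates a correct response.
    p_small = measure(request_features)
    # Prefer the smaller LLM when it is likely enough to be correct,
    # reducing latency and conserving computational resources.
    return smaller_llm if p_small >= confidence_threshold else larger_llm
```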
Concerning claims 6 and 24, Horesh et al. discloses that policy engine 160 instructs system 100 to generate a third output that is different from the first output by applying the input again to first generative AI model 170. (Column 10, Lines 41 to 59: Figure 1) Policy engine 160, then, provides “a guidance signal” to first generative AI model 170 based on output from second generative AI model 140.
Concerning claims 10 and 28, Horesh et al. discloses that an input 202 is provided to a second generative artificial intelligence model 220 in Figure 2, and this input may represent a query. Consequently, an output of a classification model may be used to manage a first artificial intelligence model to provide alternative outputs, prevent providing the output, adjust the output, or retrain the model. (Abstract; Column 1, Lines 54 to 58; Column 7, Lines 1 to 10: Figure 1) Broadly, providing alternative outputs, preventing providing the output, adjusting the output, and retraining the model are “a list of actions to be performed by the first generative artificial intelligence model to generate the candidate response”.
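Purely as a hypothetical sketch of how the enumerated actions could be selected and performed (the action names and the retrain() hook below are assumptions, not disclosures of Horesh et al.):

```python
# Hypothetical dispatch over the list of management actions enumerated above.
def apply_policy(action, first_model, model_input, current_output):
    if action == "provide_alternative":
        return first_model(model_input)    # provide an alternative output
    if action == "prevent_output":
        return None                        # prevent providing the output
    if action == "adjust_output":
        return current_output.strip()      # placeholder adjustment of the output
    if action == "retrain_model":
        first_model.retrain()              # assumes a hypothetical retrain() hook
        return first_model(model_input)
    raise ValueError(f"unknown policy action: {action}")
```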
Concerning claims 11 and 29, Horesh et al. discloses first generative AI model 170 and second generative AI model 140, and that another system may host the first generative AI model if the model is not included in system 100; an externally developed generative AI model may be hosted by a system associated with the model’s developer, and a user may interface with system 100 in order to use the externally developed generative model (column 5, lines 37 to 58: Figure 1). Generally, Horesh et al. discloses that at least one of the first and second generative AI models may be hosted externally (“on a system remote from the processing system”), and it would be an obvious expedient to host one of the first and second generative AI models locally and one remotely under principles of distributed processing in a client/server architecture.
Concerning claims 37 and 40, Kim et al. teaches that generative models can be of a very large size including billions of parameters, and smaller size counterparts can be separately trained with fewer parameters or pruned and/or quantized by applying one or more pruning techniques and/or one or more quantization techniques. (¶[0003]) A smaller LLM can be a quantized and/or pruned version of the larger LLM (“wherein the first generative artificial intelligence model is a pruned version of the second generative artificial intelligence model”). (¶[0005])
Claims 2 to 5 and 20 to 23 are rejected under 35 U.S.C. 103 as being unpatentable over Horesh et al. (U.S. Patent No. 11,972,333) in view of Kim et al. (U.S. Patent Publication 2024/0311405) as applied to claims 1 and 19 above, and further in view of Madaan et al. (U.S. Patent Publication 2022/0261535).
Concerning claims 2, 4, 20, and 22, Horesh et al. discloses generating tokens for output and providing an alternative output or adjusting an output of a first generative artificial intelligence model, but does not expressly “identify a token location within the candidate response at which at least one incorrect token is to be replaced” or replace “at least one incorrect token”. That is, Horesh et al. generally discloses adjusting a response, but is not specifically directed to replacing individual tokens at a location in an adjusted response. However, Madaan et al. teaches automatically modifying responses from generative models by replacing at least a portion of proposed words with at least a portion of alternate words. (Abstract; ¶[0002]) Step 308 includes modifying at least a portion of the words proposed by the at least one automated conversation exchange software program in connection with the at least one conversation by replacing at least a portion of the one or more identified words with at least a portion of one or more alternate words. (¶[0024]: Figure 3: Step 308) Implicitly, the words being replaced correspond to “tokens”, and a word being replaced necessarily occupies some specific “location” within a response of Madaan et al. An objective is to improve text generation in automated conversation exchange software programs, or chatbots, to reduce human biases. (¶[0001]) It would have been obvious to one having ordinary skill in the art to generate an alternate response in Horesh et al. by identifying a location of a token to be replaced as taught by Madaan et al. for a purpose of improving text generation in chatbots to reduce issues of human bias.
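To illustrate the replacement Madaan et al. teaches as applied to tokens, a minimal sketch follows; the function name and the example are hypothetical:

```python
# Hypothetical sketch: replace an incorrect token at an identified location
# within a candidate response, as the combination above contemplates.
def replace_token(response_tokens, location, alternate_token):
    # "location" is the index within the candidate response at which the
    # incorrect token is to be replaced with an alternate token.
    revised = list(response_tokens)
    revised[location] = alternate_token
    return revised

# Example: replace_token(["the", "answr", "is", "42"], 1, "answer")
# returns ["the", "answer", "is", "42"].
```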
Concerning claims 3 and 21, Horesh et al. discloses an iterative procedure in which policy engine 160 instructs first generative AI model 170 to generate a third output that is different than the first output previously generated based on the input (“generate a second candidate response”), second generative AI model 140 then generates a new output to compare to the third output (“output, to the second generative artificial intelligence model, tokens corresponding to the second candidate response”), and the outputs are compared during the second comparison to determine if the third output is relevant (“receive one or more second guidance signals for the tokens corresponding to the second candidate response”). If the outputs are similar, policy engine 160 may then instruct system 100 to output the third output (“generate a third candidate response based on the tokens corresponding to the second candidate response and the one or more second guidance signals”). (Column 10, Line 41 to Column 11, Line 11: Figure 1) Policy engine 270 may instruct system 100 to perform any number of iterations of checking the outputs of first generative AI model 210, and first generative AI model 210 may generate a new output to model input 202 for a defined number of iterations. (Column 15, Line 54 to Column 16, Line 15: Figure 2)
Concerning claims 5 and 23, Madaan et al. teaches modifying responses from a generative model using artificial intelligence to reduce human bias in conversations with a chatbot, e.g., bias directed to nationality. (¶[0001]) Bias-neutral BERT embeddings are used to predict if a word belongs to a predetermined category, e.g., an age-related category, a gender-related category, a race-related category, and a nationality-related category. (¶[0021]) Madaan et al. replaces words having predetermined categories of bias with words that are more “semantically acceptable” because the replacement words reduce the bias of the words. That is, replacement words are more “semantically acceptable” because they have more acceptable meanings with less bias.
Claims 7 to 8 and 25 to 26 are rejected under 35 U.S.C. 103 as being unpatentable over Horesh et al. (U.S. Patent No. 11,972,333) in view of Kim et al. (U.S. Patent Publication 2024/0311405) as applied to claims 1 and 19 above, and further in view of Zorn et al. (U.S. Patent Publication 2023/0418815).
Horesh et al. discloses “instructing the first artificial intelligence model to generate the revised candidate response according to one or more rules.” A policy engine instructs a first generative artificial intelligence model to provide an alternate response based on a similarity and a defined number of iterations, and prevents all of the generated outputs from being provided if a new output is not sufficiently similar. (Column 15, Line 54 to Column 16, Line 15: Figure 2) That is, “one or more rules” can be construed as determining to revise output or prevent output based on a similarity and a predetermined number of iterations. However, Horesh et al. does not disclose characteristics of a guidance prompt as “one or more structured grammar commands” or “a natural language command”.
Still, Zorn et al. teaches that it is known that prompts to artificial intelligence models may be generated in natural language or in a structured grammar. Specifically, Zorn et al. teaches that conventional large language models can receive natural language text and generate an appropriate response using a prompt in the form of a natural language description of what imperative code should be able to do. (¶[0002]) A task prompt may be expressed using natural language, but this need not be the case; the task prompt could instead be expressed using a particular language for expressing task prompts, including some query language such as Structured Query Language (SQL). (¶[0038]) Here, structured query language may be construed as “one or more structured grammar commands”. Zorn et al., then, teaches that a prompt to a large language model can be provided as “one or more structured grammar commands”, e.g., in SQL, or as “a natural language command”. An objective is to generate declarative code to expand a utility of language models to aid in the generation of additional declarative code. (¶[0019]) It would have been obvious to one having ordinary skill in the art to generate an alternate response in Horesh et al. from a natural language prompt or a structured grammar prompt as taught by Zorn et al. for a purpose of expanding a utility of language models to aid in the generation of declarative code.
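For illustration, the same task may be expressed either as a natural language command or as a structured grammar command such as SQL, consistent with ¶[0038] of Zorn et al.; the prompt contents below are hypothetical examples:

```python
# Hypothetical prompts expressing one task in the two forms Zorn et al. teaches.
natural_language_prompt = (
    "List the names of all customers who placed an order in 2023."
)
structured_grammar_prompt = (  # structured grammar command (SQL)
    "SELECT name FROM customers "
    "WHERE id IN (SELECT customer_id FROM orders "
    "WHERE order_date BETWEEN '2023-01-01' AND '2023-12-31');"
)
```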
Claims 9 and 27 are rejected under 35 U.S.C. 103 as being unpatentable over Horesh et al. (U.S. Patent No. 11,972,333) in view of Kim et al. (U.S. Patent Publication 2024/0311405) as applied to claims 1 and 19 above, and further in view of Callegari et al. (U.S. Patent Publication 2024/0362422).
Horesh et al. discloses that policy engine 270 may instruct system 100 to perform any number of iterations of checking the outputs of first generative AI model 210, and that first generative AI model 210 may be configured to generate a new output to model input 202 for a defined number of iterations. If the defined number of iterations is reached with all of the new outputs identified as not being similar to the second output, policy engine 270 may instruct system 100 to prevent all of the generated outputs from first generative AI model 210 from being output for use. (Column 15, Line 54 to Column 16, Line 15: Figure 2) Horesh et al., then, discloses “to revise the candidate response to the input query, the one or more processors are configured to cause the processing system to determine that a threshold number of revisions have been performed with respect to a response generated by the first generative artificial intelligence model to the input query”. However, Horesh et al. only prevents output if a defined number of iterations is reached, and does not clearly disclose “output the candidate response as the response to the input query based on determining that the threshold number of revisions have been performed.”
Still, Callegari et al. teaches revising large language model prompts to generate a revised prompt in response to second input for the large language model to revise the prompt in view of an assessment report, generate a final response to the revised prompt, and output the final response. (Abstract) If a user is not satisfied with the next response generated based on the revised prompt, if a predetermined number of iterations have not been performed, or if the assessment report 34 for a current response has not met a predefined assessment threshold, then the assessment and revision are repeated for at least another iteration. However, if the current response is acceptable either to the user or under predefined criteria, then revised response 36 is output to the user. (¶[0018]: Figure 3) Assessment and revision may be iterated once or a number of times, and the revised response of the final iteration is referred to as the final response 56. (¶[0030]: Figure 5) Callegari et al., then, teaches outputting a final response after a predetermined number of iterations are performed instead of preventing output of the final response as in Horesh et al. An objective is to address the problem that the usefulness of a response is greatly influenced by the quality of a prompt, and the technical challenge of crafting the right prompt in order for a large language model to respond with a level of detail and precision that a user desires. (¶[0003]) It would have been obvious to one having ordinary skill in the art to output a revised candidate response based on a threshold number of revisions as taught by Callegari et al. in a generative artificial intelligence model of Horesh et al. for a purpose of enabling a large language model to respond with a level of detail and precision that a user desires.
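A minimal sketch contrasting the two behaviors discussed above may be helpful; the names and the flag are hypothetical, with Horesh et al. preventing output once the defined number of iterations is reached and Callegari et al. outputting the final response:

```python
# Hypothetical sketch: revise up to a threshold number of times, then either
# output the current candidate (Callegari et al.) or prevent output (Horesh et al.).
def respond_with_revision_limit(model, verify, model_input, max_revisions=3,
                                output_after_limit=True):
    response = model(model_input)              # candidate response
    for _ in range(max_revisions):
        if verify(model_input, response):
            return response                    # acceptable before the threshold
        response = model(model_input)          # revise the candidate response
    # Threshold number of revisions has been performed:
    # True  -> output the candidate response anyway (Callegari et al.)
    # False -> prevent providing the output (Horesh et al.)
    return response if output_after_limit else None
```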
Claims 38 and 41 are rejected under 35 U.S.C. 103 as being unpatentable over Horesh et al. (U.S. Patent No. 11,972,333) in view of Kim et al. (U.S. Patent Publication 2024/0311405) as applied to claims 1 and 19 above, and further in view of Wang et al. (U.S. Patent Publication 2020/0134506).
Kim et al. teaches first and second generative LLMs, with a smaller LLM that can in one embodiment be derived from a larger LLM by pruning or quantization, but does not provide “wherein the first generative artificial intelligence model and the second generative artificial intelligence model have similar probability distributions.” However, Wang et al. teaches training a student model corresponding to a teacher model. (Abstract) Once training of a complex network model is completed, a simplified model may be extracted from the complex model through knowledge distillation, which includes forcing the smaller neural network to output the same result. The small neural network is referred to as a ‘student’ model and the large neural network is referred to as a ‘teacher’ model. A difference between output of the teacher model and output of the student model may be indicated by a loss function. Logit loss indicates a difference between probability distributions generated by the teacher model and the student model. (¶[0046] - ¶[0048]) The student model is trained by iteratively decreasing the total loss. (¶[0069]) Wang et al., then, teaches “wherein the first generative artificial intelligence model and the second generative artificial intelligence model have similar probability distributions” because a student model is forced to have a ‘similar’ probability distribution to a teacher model by minimizing a logit loss in knowledge distillation. An objective is to increase a robustness of a trained student model without retraining the teacher model by knowledge distillation. (¶[0011]) It would have been obvious to one having ordinary skill in the art to train a smaller student model so that it has a similar probability distribution to a larger teacher model as taught by Wang et al. for the smaller LLM and larger LLM of Kim et al. in order to increase robustness of a student model in generating a same result as a teacher model.
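The logit loss described by Wang et al. may be illustrated by a short sketch; the temperature scaling below is a common distillation convention assumed for illustration, not a disclosure of Wang et al.:

```python
import torch.nn.functional as F

# Hypothetical sketch of a logit loss: a KL divergence between the probability
# distributions generated by a teacher model and a student model; minimizing
# it forces the student's distribution to become similar to the teacher's.
def logit_loss(student_logits, teacher_logits, temperature=2.0):
    student_log_probs = F.log_softmax(student_logits / temperature, dim=-1)
    teacher_probs = F.softmax(teacher_logits / temperature, dim=-1)
    return F.kl_div(student_log_probs, teacher_probs,
                    reduction="batchmean") * temperature ** 2

# Training iteratively decreases this loss, e.g.:
#   loss = logit_loss(student(x), teacher(x)); loss.backward(); optimizer.step()
```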
Claims 39 and 42 are rejected under 35 U.S.C. 103 as being unpatentable over Horesh et al. (U.S. Patent No. 11,972,333) in view of Kim et al. (U.S. Patent Publication 2024/0311405) as applied to claims 1 and 19 above, and further in view of Sridhar et al. (U.S. Patent Publication 2021/0279595).
Kim et al. teaches first and second generative LLMs with a smaller LLM that can in one embodiment be derived from a larger LLM by pruning or quantization, and that a smaller LLM has fewer parameters than a larger LLM, but does not provide “wherein the first generative artificial intelligence model was trained using a smaller dataset than the second generative artificial intelligence model.” However, Sridhar et al. teaches training one or more student-teacher modules as part of teacher neural network training. (Abstract) Knowledge distillation (KD) is a compression technique used to transfer knowledge of a bigger neural network with many learned parameters to a smaller neural network with fewer learned parameters. Output of a larger neural network is a ‘soft target’ used as a supervision signal for training a smaller model called the student sub-network. The student sub-network receives both soft targets and hard targets as supervision signals to enable a student model to be trained using a smaller training dataset (“wherein the first generative artificial intelligence model was trained using a smaller dataset than the second generative artificial intelligence model”), as the soft targets provide higher entropy and less variation, e.g., better generalization, than the hard targets. (¶[0003] - ¶[0004]) Integrated system 300 uses a cascade of supervised signals and provides a soft target for knowledge distillation training of student sub-networks. (¶[0058]: Figure 2B) An objective is to improve feature representations of student sub-networks in distillation frameworks and enable sharing computation to eliminate redundant computation and achieve faster inference. (¶[0027]) It would have been obvious to one having ordinary skill in the art to train a smaller student model with a smaller dataset than a larger teacher model as taught by Sridhar et al. for the smaller LLM and larger LLM of Kim et al. in order to improve feature representations in knowledge distillation and achieve faster inference.
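The combination of soft and hard supervision signals described by Sridhar et al. may likewise be sketched; the weighting factor alpha and the temperature are assumptions for illustration:

```python
import torch.nn.functional as F

# Hypothetical sketch: the student sub-network receives both hard targets
# (ground-truth labels) and soft targets (the teacher's output distribution)
# as supervision signals; the higher-entropy soft targets are what enable
# training on a smaller dataset.
def student_loss(student_logits, teacher_logits, labels,
                 temperature=2.0, alpha=0.5):
    hard_loss = F.cross_entropy(student_logits, labels)       # hard targets
    soft_loss = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),      # soft targets
        reduction="batchmean",
    ) * temperature ** 2
    return alpha * soft_loss + (1.0 - alpha) * hard_loss
```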
Response to Arguments
Applicants’ arguments filed 23 February 2026 have been considered but are moot in view of new grounds of rejection as necessitated by amendment.
Applicants amend independent claims 1 and 19 to set forth a new limitation of “wherein the first generative artificial intelligence model is a smaller version of the second generative artificial intelligence model”, and add new dependent claims 37 to 42. Applicants present arguments traversing the prior rejection of the independent claims as being obvious under 35 U.S.C. §103 over Horesh et al. (U.S. Patent No. 11,972,333) in view of Jain et al. (U.S. Patent Publication 2024/0256582). Mainly, Applicants argue that Horesh et al. does not disclose the new limitation, and that this deficiency is not cured by Jain et al. Applicants state that first generative artificial intelligence model 210 is not a ‘version’ of a lumped second generative model 220, classification model 230, and policy engine 270 of Horesh et al. Applicants alternatively state that Horesh et al. discloses second generative AI model 140 as being the same basic model as first generative AI model 170, but trained using a second training set instead of a first training set, such that first generative AI model 170 is still not a smaller version of second generative model 140.
Applicants’ amendments overcome the objections to the title and to the Specification.
Applicants’ amendments to the independent claims necessitate new grounds of rejection under 35 U.S.C. §103 over Horesh et al. (U.S. Patent No. 11,972,333) in view of Kim et al. (U.S. Patent Publication 2024/0311405). The rejection no longer relies upon Jain et al., and Kim et al. is being substituted for Jain et al. in the rejection of the independent claims. The rejection of certain dependent claims as being further obvious under 35 U.S.C. §103 continues to rely upon Madaan et al. (U.S. Patent Publication 2022/0261535), Zorn et al. (U.S. Patent Publication 2023/0418815), and Callegari et al. (U.S. Patent Publication 2024/0362422). New grounds of rejection are set forth as directed to newly added dependent claims being obvious under 35 U.S.C. §103 further in view of Wang et al. (U.S. Patent Publication 2020/0134506) and Sridhar et al. (U.S. Patent Publication 2021/0279595). All of these new grounds of rejection are necessitated by amendment.
Generally, Kim et al. is maintained to teach the new limitation of “wherein the first generative artificial intelligence model is a smaller version of the second generative artificial intelligence model”. Specifically, Kim et al. teaches deriving a smaller generative model from a larger generative model by pruning or quantization. (¶[0003]) An implementation can dynamically select between at least a smaller LLM and a larger LLM on a request-by-request basis to achieve reduced latency and/or improved computational efficiency while mitigating occurrences of any inaccurate and/or under-specified responses. (¶[0007]) Selection between smaller and larger LLMs can be made based on first and second measures that characterize a probability of generating a correct response to a request, where a first measure characterizes generating a correct response using the smaller LLM and a second measure characterizes generating a correct response using the larger LLM. (¶[0008]) A trained machine learning (ML) model can be used in selecting from among multiple candidate generative models, e.g., from between at least a smaller LLM and a larger LLM. (¶[0012]) Candidate generative models 150 can include LLM 150A with less than 100 billion parameters, LLM 150B with between 100 billion and 250 billion parameters, and LLM 150C with over 250 billion parameters. (¶[0027]: Figure 1) Selection engine 126 utilizes request features to select which of multiple candidate generative models 150 should be utilized in responding to a request. (¶[0042]: Figure 1) Moreover, Wang et al. and Sridhar et al. likewise teach a larger teacher model and a smaller student model that is derived from the teacher model by knowledge distillation. Mainly, a larger generative model might produce a more accurate response than a smaller model, but a larger model might require more resource utilization and have greater latency in generating a response.
Horesh et al. is maintained to disclose producing “first guidance signals” from a second generative artificial intelligence model for verification of tokens generated from a first generative artificial intelligence model, and Kim et al. teaches selecting a response to a request from a smaller first generative model and a larger second generative model. Kim et al., then, teaches that a first generative artificial intelligence model can be a smaller generative model and a second generative artificial intelligence model can be a larger generative model in Horesh et al. Selecting responses between a larger generative model and a smaller generative model has relative advantages: less resource utilization for a smaller generative model at the expense of potentially decreased accuracy, and increased accuracy for a larger generative model at the expense of increased resource utilization. Obviousness can be premised on a rationale of KSR International Co. v. Teleflex Inc. (KSR), 550 U.S. 398, 82 USPQ2d 1385 (2007): (B) Simple substitution of one known element for another to obtain predictable results. See MPEP §2141. It would be a simple substitution to use the large generative model and the small generative model of Kim et al. as the first generative AI model and the second generative AI model of Horesh et al. It would be predictable to provide first and second generative models of different sizes so that, if a smaller model does not produce a sufficiently accurate response, a larger model could be substituted to produce a better response.
Horesh et al. broadly provides equivalent “first guidance signals” from a second generative artificial intelligence model to revise a candidate response. By comparing a similarity of a first output of a first generative AI model to a second output of a second generative AI model, policy engine 160 produces “first guidance signals” to determine if “a candidate response to the input query” produced by a first generative artificial intelligence model is to be revised to output “a revised candidate response” produced by a second generative artificial intelligence model. Similarly, Kim et al. teaches a selection engine 126 to select which of multiple candidate generative models 150 should be used in responding to a request. Selection between output of a smaller first generative model 150A and a larger second generative model 150B is based on first and second measures that characterize a probability of generating a correct response. Selection between a response from a smaller generative model and a response from a larger generative model by selection engine 126 based on these first and second measures, then, is equivalent to “first guidance signals” for determining if a candidate response from a first of the generative models should be revised to a revised candidate response from a second of the generative models. Applicants’ claim language does not distinguish a nature of these “guidance signals” over what is being performed by Horesh et al. and Kim et al. The Specification, ¶[0023], states:
These guidance signals may include, for example, corrections to the candidate response generated by the draft model, information identifying that portions of a candidate response are incorrect and information that the draft model can use to generate a corrected response, information identifying tasks for the draft model to perform in generating a corrected response, or the like.
Broadly, Horesh et al. and Kim et al. provide “guidance signals” that are at least “corrections to the candidate response generated by the draft model”. That is, if it is determined by policy engine 160 that the outputs from the two generative models are not the same, then policy engine 160 produces “guidance signals” that a candidate response should be revised to a revised candidate response in Horesh et al. Similarly, if a smaller candidate LLM 150A produces a first measure that characterizes a probability of generating a correct response as being low, then selection engine 126 produces “first guidance signals” to instead generate a revised candidate response with a larger candidate LLM 150B.
Applicants’ amendments necessitate the new grounds of rejection set forth above, and Applicants’ arguments are therefore moot. Accordingly, this rejection is properly made FINAL.
Conclusion
The prior art made of record and not relied upon is considered pertinent to Applicants’ disclosure.
Kennel et al. and Lee et al. disclose related prior art.
Applicants’ amendment necessitated the new grounds of rejection presented in this Office Action. Accordingly, THIS ACTION IS MADE FINAL. See MPEP §706.07(a). Applicants are reminded of the extension of time policy as set forth in 37 CFR 1.136(a).
A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action. In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any nonprovisional extension fee (37 CFR 1.17(a)) pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action. In no event, however, will the statutory period for reply expire later than SIX MONTHS from the mailing date of this final action.
Any inquiry concerning this communication or earlier communications from the examiner should be directed to MARTIN LERNER whose telephone number is (571) 272-7608. The examiner can normally be reached Monday-Thursday 8:30 AM-6:00 PM.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Richemond Dorvil can be reached at (571) 272-7602. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.
/MARTIN LERNER/Primary Examiner
Art Unit 2658 March 12, 2026