DETAILED ACTION
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Information Disclosure Statement
The Information Disclosure Statement (IDS) submitted on 31 December 2025 cites two publications of non-patent literature that were already cited in prior Information Disclosure Statements filed by Applicants on 04 January 2024 and 18 July 2024. Specifically, Applicants provide a duplicative citation of Chen et al. (“Accelerating Large Language Model Decoding with Speculative Sampling”), which is the basis of the current rejection. Accordingly, these references are being lined through on the Information Disclosure Statement submitted on 31 December 2025 to indicate that they were already considered.
Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.
Claims 1 to 2, 6 to 10, 12 to 13, 17 to 21, 23 to 24, 26 to 28, and 30 are rejected under 35 U.S.C. 103 as being unpatentable over Chen et al. (“Accelerating Large Language Model Decoding with Speculative Sampling”) in view of Olabiyi et al. (U.S. Patent Publication 2021/0027023).
Concerning independent claims 1, 12, 23, and 30, Chen et al. discloses an algorithm that is implicitly performed by a computer program executing as a method on a computer system comprising: “generating, based on [an input query and] a first generative model, a first plurality of sets of tokens, each set of tokens in the first plurality of sets of tokens corresponding to a first portion [of a candidate response to the input query]” – a short draft of length K is attained with either a parallel model or by calling a faster, auto-regressive model K times; this model is referred to as the draft model (“a first generative model”) (Page 1); Algorithm 2 provides code for generating a sample draft auto-regressively from a prompt sequence x1, . . . , xt with drafts x1, . . . , xK (Page 3); here, drafts x1, . . . , xK are “a first plurality of sets of tokens” for a set of tokens comprising x1, . . . , xt;
“outputting to a second generative model, the first plurality of sets of tokens for verification” – a draft is scored using a larger, more powerful model referred to as the target model (“a second generative model”); a target model can accept (‘verify’) a subset of tokens (Page 1); a target model, then, is accepting (‘verifying’) a subset of preliminary output from a draft model;
“while waiting to receive, from the second generative model, an indication of a selected set of tokens from the first plurality of sets of tokens, speculatively generating a second plurality of sets of tokens, each set of tokens in the second plurality of sets of tokens corresponding to a second portion [of the candidate response to the input query]” – speculative sampling enables generation of multiple tokens for each transformer call by parallel scoring of short continuations generated by a faster but less powerful draft model (Abstract; Page 1); here, speculative sampling operates in parallel between a draft model and a target model, so that a target model is processing tokens from the draft model while the draft model processes additional sets of tokens x1, . . . , xK (“speculatively generating a second plurality of sets of tokens”) (Page 3);
“receiving, from the second generative model, the indication of the selected set of tokens from the first plurality of sets of tokens” – using a modified rejection sampling scheme, a subset of the K draft tokens are accepted by the target model (Page 1); a rejection sampling scheme of the drafted tokens is performed given a sequence of tokens x1, . . . , xn and K draft tokens xn+1, . . . , xn+K (Page 4); here, a set of tokens that is not rejected is selected (“receiving . . . the indication of the selected set of tokens”);
“outputting, to the second generative model, tokens from the second plurality of sets of tokens associated with the selected set of tokens for verification” – a short draft of length K is attained with either a parallel model or by calling a faster, auto-regressive model K times; this model is referred to as the draft model (“a first generative model”) (Page 1); Algorithm 2 provides code for generating a sample draft auto-regressively from a prompt sequence x1, . . . , xt with drafts x1, . . . , xK (Page 3); here, additional drafts x1, . . . , xK are “the second plurality of sets of tokens” for a set of tokens comprising x1, . . . , xt as an algorithm iterates with parallel operation of a draft model and a target model;
“outputting the selected set of tokens [as a response to the input query]” – a subset of K draft tokens are accepted (Page 1); if all tokens are accepted, a sample extra token is set (Page 3); a natural language summarization task is performed (Page 6); implicitly, a natural language summarization task outputs a summary from decoded tokens.
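For clarity of the record, the draft-then-verify procedure mapped above may be sketched as follows. This is a minimal illustration only, not code from Chen et al.; the names speculative_step, draft_dist, and target_dist, and the toy distributions, are hypothetical:

```python
import random

def speculative_step(prefix, draft_dist, target_dist, K=4):
    """One draft-then-verify iteration (a simplified sketch of
    speculative sampling).  draft_dist(seq) and target_dist(seq)
    each return a dict mapping a next token to its probability."""
    # The faster draft model proposes K tokens autoregressively.
    drafts, seq = [], list(prefix)
    for _ in range(K):
        p = draft_dist(seq)
        tok = max(p, key=p.get)  # greedy draft, for determinism
        drafts.append(tok)
        seq.append(tok)

    # The target model verifies the drafts left to right: each draft
    # token is kept with probability min(1, q/p); the first rejection
    # ends the accepted run.
    accepted, seq = [], list(prefix)
    for tok in drafts:
        p = draft_dist(seq)[tok]
        q = target_dist(seq).get(tok, 0.0)
        if random.random() < min(1.0, q / p):
            accepted.append(tok)
            seq.append(tok)
        else:
            break
    return accepted
```

In this sketch the draft model proposes K tokens cheaply, and the target model's probabilities are consulted left to right, mirroring the left-to-right acceptance of a subset of the K draft tokens described on Pages 1 and 3 of Chen et al.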
Concerning independent claims 1, 12, 23, and 30, Chen et al. discloses a general concept of providing a draft model and a target model in parallel operation for processing sets of tokens, but does not include an application to generating a response to an input query in the limitations of “based on an input query . . . corresponding to a first portion of a candidate response to the input query” and outputting the selected set of tokens “as a response to the input query.” However, Olabiyi et al. teaches multi-turn dialogue response generation that includes a forward model that may be used to generate an output sequence based on previously generated output sequences and a backward model that may reevaluate previously generated output sequences. (Abstract) Systems may use transformer-based machine classifiers to perform a variety of natural language understanding tasks including question answering. (¶[0004]) A backward model may be used to reevaluate previously generated output sequences and a forward model may be used to generate an output sequence based on previously generated output sequences. (¶[0008]) The application may use autoregressive models. (¶[0048]) Olabiyi et al., then, teaches an application of autoregressive models to a multi-turn dialogue that generates responses as answers to questions (“based on an input query” and “a response to the input query”). An objective is to provide transformer-based machine classifiers that are trained to more accurately identify and generate relevant and interesting responses, saving processing time and processing resources. (¶[0024]) It would have been obvious to one having ordinary skill in the art to apply accelerated large language model decoding with speculative sampling of Chen et al. to a natural language task of question answering to generate a response to a query as taught by Olabiyi et al.
for a purpose of accurately identifying and generating relevant and interesting responses, thereby saving processing time and processing resources.
Concerning claims 2, 13, and 24, Chen et al. discloses that a subset of the K draft tokens are accepted from left to right upon iteration of Algorithm 2 (Pages 1 and 3); that is, an algorithm operates iteratively and in parallel between a draft model and a target model so that a target model accepts a set of tokens from a draft model (“receive an indication of a second selected set of tokens from the second plurality of sets of tokens associated with the selected set of tokens”), and this set of tokens is added to the output (“output the second selected set of tokens as another portion”).
Concerning claims 6, 17, and 26, Chen et al. discloses a draft model (“the first generative model”) that is trained with four billion parameters and optimized for sampling latency with a relatively small number of layers to achieve a fast sampling speed (Page 5); a draft model, then, generates sets of tokens x1, . . . , xK (“each respective set of tokens in the first plurality of sets of tokens”) using “a unique instance of the first generative model and unique parameters as input” corresponding to the parameters of the trained draft model.
Concerning claims 7, 18, and 27, Chen et al. discloses an iterative procedure with a target model that produces “a refined subsequent set of tokens” that is accepted (‘verified’) and output, while a draft model provides a parallel process to “speculatively generate a third plurality of sets of tokens”. Mainly, these claims simply iterate from a second plurality of tokens produced by a draft model to a third plurality of tokens produced by the draft model. Specifically, p(xn+1| x1, . . . , xn) is the probability of a token of the draft model conditioned on context so far, and accepted tokens are set for xn+1. (Page 4)
Concerning claims 8 and 19, Olabiyi et al. teaches appending random paddings before and/or after the input data to reduce syntactic redundancies in the input data (¶[0007]); encoder and decoder sequences may be padded, and prepended, appended, or randomly inserted within the encoder sequences as appropriate (“wherein the sets of tokens in the subsequent plurality of sets of tokens includes padding accounting for a number of tokens in the selected set of tokens being less than a maximum number of tokens”) (¶[0052] - ¶[0053]: Figure 5).
Concerning claims 9, 20, and 28, Chen et al. discloses a draft model and a target model with speculative decoding that generates tokens and can be designated as “the first generative model in a speculative decoding pipeline” and “the second generative model in the speculative decoding pipeline”. (Page 1)
Concerning claims 10 and 21, Chen et al. discloses that a method recovers the distribution of the target model from samples of the draft model with q(xn+1| x1, . . . , xn) and p(xn+1| x1, . . . , xn) as the probabilities of the target and draft models (Page 4); a portion of the activations of the target model is used as an input to the draft model, and the draft model is trained with this input; a smaller version of the target language model is used as the draft model (Page 5). The draft model, then, is trained to have a similar probability distribution to the target model (“wherein the draft model comprises a model trained to have a probability distribution that approximates a corresponding probability distribution for the target model”); a token is accepted according to min(1, q(xn+1| x1, . . . , xn)/p(xn+1| x1, . . . , xn)) (Page 4); p(xn+1| x1, . . . , xn), then, ‘approximates’ q(xn+1| x1, . . . , xn) because their ratio is ideally equal to one.
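The acceptance criterion cited above may be illustrated concretely. The following is a minimal sketch, not code from the reference; the name accept_draft_token and the toy probabilities are hypothetical:

```python
import random

def accept_draft_token(q_prob, p_prob, rng=random.random):
    """Modified rejection test: a token drawn from the draft
    distribution p is kept with probability min(1, q/p), where q is
    the target model's probability for the same token.  When the
    draft approximates the target well, q/p is near one and nearly
    every draft token is kept."""
    return rng() < min(1.0, q_prob / p_prob)

# If the target assigns at least as much probability as the draft
# (q >= p), the acceptance probability is exactly one.
always_kept = accept_draft_token(q_prob=0.30, p_prob=0.20)
```

This illustrates why a draft model trained to approximate the target distribution yields a high acceptance rate: as p approaches q, min(1, q/p) approaches one.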
Claims 3 to 4 and 14 to 15 are rejected under 35 U.S.C. 103 as being unpatentable over Chen et al. (“Accelerating Large Language Model Decoding with Speculative Sampling”) in view of Olabiyi et al. (U.S. Patent Publication 2021/0027023) as applied to claims 1 and 12 above, and further in view of Araki (U.S. Patent Publication 2023/0316001).
Chen et al. discloses that a draft model (“the first generative model”) outputs sets of tokens x1, . . . , xK according to a probability distribution p(xn+1| x1, . . . , xn), but does not clearly disclose that these sets of tokens represent “the highest probabilities within a probability distribution with the first generative model over a universe of tokens” and that “each set of tokens in the first plurality of sets of tokens comprises a group of tokens selected based on a sum of probabilities associated with tokens in the group of tokens, the sum exceeding a threshold probability.” However, Araki teaches selecting an entity type from among a set of entity types, and outputting a selected candidate. (Abstract) Specifically, Araki teaches selecting a correct answer based on a confidence for single token or multi-token phrases. A machine learning system uses a likelihood score which is the sum of the log probabilities of each predicted token conditioned on the other tokens. A candidate generator is configured to select a set of candidates with a final prediction having a predetermined number of predictions with the highest confidence scores. The candidate generator may select the set of candidates based on a predetermined number of candidates less than or equal to the beam size with the highest confidence scores, and may select a candidate to be in the set of candidates if that candidate has a confidence score that is above a threshold value and/or if that candidate satisfies other threshold criteria. (¶[0025] - ¶[0027]: Figure 2) Araki, then, teaches selecting “a group of tokens having the highest probabilities” and “a group of tokens selected based on a sum of probabilities associated with the tokens in the group of tokens, the sum exceeding a threshold probability.” An objective is to provide a question answering system in virtual assistants. (¶[0019]) It would have been obvious to one having ordinary skill in the art to select tokens in Chen et al. 
according to a procedure of Araki that selects tokens with the highest probabilities, where a sum of probabilities of tokens in a group exceeds a threshold, for a purpose of enabling a virtual assistant to answer questions.
Claims 5, 16, and 25 are rejected under 35 U.S.C. 103 as being unpatentable over Chen et al. (“Accelerating Large Language Model Decoding with Speculative Sampling”) in view of Olabiyi et al. (U.S. Patent Publication 2021/0027023) as applied to claims 1, 12, and 23 above, and further in view of Zhang et al. (“Planning With Large Language Models for Code Generation”).
Chen et al. does not disclose that tokens “are represented as a tree data structure” and “a root node of the tree data structure corresponds to the input query” and “each path through the tree data structure corresponds to a set of tokens from the first plurality of tokens.” However, Zhang et al. teaches an analogous procedure of generating code using a tree search algorithm with a tree structure having a root node. (Pages 4 to 5: Figure 2) A transformer tree search algorithm builds a tree structure, which can be used by future iterations of a search. Only b partial programs are kept in a beam, and only b nodes are expanded at each level, and other nodes are dropped. (Page 6: Figure 3) Figures 2 and 3 illustrate paths through the tree with sets of tokens. An objective is to use a tree search to perform lookahead planning to guide a search process with a transformer beam search algorithm. (Page 5) It would have been obvious to one having ordinary skill in the art to represent sets of tokens of Chen et al. as a tree structure with a root node as taught by Zhang et al. for a purpose of performing lookahead planning to guide a search process with a transformer beam search algorithm.
Claims 11, 22, and 29 are rejected under 35 U.S.C. 103 as being unpatentable over Chen et al. (“Accelerating Large Language Model Decoding with Speculative Sampling”) in view of Olabiyi et al. (U.S. Patent Publication 2021/0027023) as applied to claims 1, 12, and 23 above, and further in view of Mukherjee et al. (U.S. Patent Publication 2023/0316000).
Chen et al. briefly notes a distributed serving setting for speculative decoding. (Page 2) Similarly, Olabiyi et al. teaches an operating environment 100 that may include at least one client device 110 and at least one server system 130. (¶[0025]: Figure 1) Arguably, it would have been obvious to allocate a draft model to a local client device and a target model to a remote server device according to principles of distributed processing. However, Mukherjee et al. teaches generation of conversational responses using neural networks to determine an answer to an input query using a trained first neural network to extract the answer from a corpus of information and a second trained neural network to generate a formulation of the input query combined with the answer. (Abstract) A query may be extracted from input 102 and provided to EQA model 104, and AE model 108 may generate response 110. AE model 108 may be a generative model. (¶[0022] - ¶[0025]: Figure 1) Environment 200 may include one or more processing units which may be locally hosted or part of one or more distributed systems. Input 202 may be provided at a local client that is communicatively coupled to one or more remote servers. (¶[0028]: Figure 2) One or more components may be hosted locally on a local client or may be accessible via one or more networks at a remote server. (¶[0038]: Figure 5) Mukherjee et al., then, suggests that a first EQA model 104 may be provided “on a local system” and a second AE model 108 may be provided “on a remote system”. An objective is to provide a pipeline utilizing an extractive question answering model to retrieve an answer responsive to an input query. (¶[0015]) It would have been obvious to one having ordinary skill in the art to provide a draft model and a target model of Chen et al. on a local system and a remote system as taught by Mukherjee et al.
for a purpose of retrieving an answer responsive to an input query in a pipeline utilizing an extractive question answering model.
Response to Arguments
Applicants’ arguments filed 09 February 2026 have been fully considered but they are not persuasive.
Applicants provide a minor amendment to the independent claims, but mainly present arguments traversing the prior rejection of obviousness under 35 U.S.C. §103 over Chen et al. (“Accelerating Large Language Model Decoding with Speculative Sampling”) in view of Olabiyi et al. (U.S. Patent Publication 2021/0027023). Applicants’ argument is that Chen et al. fails to disclose the limitation of “while waiting to receive, from the second generative model, an indication of a selected set of tokens from the first plurality of sets of tokens, speculatively generate a second plurality of sets of tokens”. Applicants argue that Chen et al. instead describes (1) receiving the selected draft tokens from a sampling algorithm, but not the target model, or (2) generating a second set of draft tokens once the resulting tokens are received, but not before they are received. Applicants consider the code of Algorithm 2 of Chen et al., which they characterize as receiving an initial prompt sequence (Lines 2 to 3), generating a draft set of tokens via sampling of the prompt sequence with a draft model, p(.|.) (Lines 6 to 7), generating, in parallel, a set of logits from the draft set of tokens via a target model, q(.|.) (Lines 9 to 10), performing sampling to remove draft tokens which do not correspond to a minimum probability (Lines 12 to 16), and iteratively repeating the process with the prompt sequence and the selected draft tokens (Lines 2 to 30). Applicants conclude that if the draft model is considered to be “a first generative model”, the target model is considered to be “a second generative model”, and the remaining selected draft tokens are considered to be “the indication of a selected set of tokens”, then there is no equivalent to “while waiting to receive, from the second generative model, an indication of a selected set of tokens from the first plurality of sets of tokens, speculatively generate a second plurality of sets of tokens”.
Applicants’ amendment overcomes the objections to the Specification.
Applicants’ amendment overcomes the rejection for non-patent-eligible subject matter under 35 U.S.C. §101.
Applicants’ arguments have been carefully considered, but the rejection of the independent claims as obvious under 35 U.S.C. §103 over Chen et al. (“Accelerating Large Language Model Decoding with Speculative Sampling”) in view of Olabiyi et al. (U.S. Patent Publication 2021/0027023) is maintained as proper. Applicants’ arguments are not persuasive because they do not consider the autoregressive nature of the algorithm, and do not evidence an understanding of what is performed by sampling of a distribution. Moreover, Applicants only consider the code of Algorithm 2 of Chen et al., and do not consider what is disclosed by the remainder of the reference. Given that Applicants’ claim language and Chen et al. are both directed to “speculative decoding”, Applicants would appear to face a high hurdle in arguing that the reference fails to disclose the limitation “while waiting to receive, from the second generative model, an indication of a selected set of tokens from the first plurality of sets of tokens, speculatively generating a second plurality of sets of tokens”. That is, it is the entire purpose of a draft model to speculatively generate tokens that are ‘edited’ by the target model so that tokens selected by the target model from the draft tokens are fed back to the draft model. Tokens generated by the target model are fed back as the correct tokens to the draft model, which relies upon these tokens to iteratively generate new draft tokens. Consequently, a draft model continues to generate a next speculative draft token, x̃i, “while waiting to receive” “an indication of a selected set of tokens”, a token xn+t, from a target model. This token xn+t from a target model is a reliable token for a draft model to estimate a next token in a sequence, x̃i.
Generally, Chen et al. discloses that a faster but less powerful draft model produces tokens that are scored by the target model. (Abstract) The draft tokens produced by the draft model are scored with the larger, more powerful target model, and sampling produces a subset of the K draft tokens. (Introduction, Page 1) Speculative sampling is performed autoregressively by a draft model and a target model. Here, the autoregressive nature of the algorithm provides that a draft model speculatively generates a sampled draft token x̃t from a probability distribution p(.|.) that depends on all of the prior tokens x1, . . . , xt and on speculative draft tokens x̃1, . . . , x̃t-1. The target model then generates a reliable next token xn+t by considering these draft tokens with a probability distribution q(.|.) that depends upon all of the prior tokens x1, . . . , xt and on the draft tokens x̃1, . . . , x̃K. Chen et al. describes this by the expression xn+1 ← x̃n+1 at the bottom of Page 4. This reliable next token xn+t is then fed back autoregressively to the draft model to estimate the next draft token x̃t in an iterative process. Given that a next draft token is being prospectively generated by a draft model while the draft model is waiting to receive a confirmation by the target model that one of the tokens generated by the draft model is correct, Chen et al. reasonably discloses the limitation of “while waiting to receive, from the second generative model, an indication of a selected set of tokens from the first plurality of sets of tokens, speculatively generating a second plurality of sets of tokens”. That is, a faster draft model ‘speculatively generates’ a prospective “second plurality of sets of tokens” while waiting to receive confirmation that one of “a first plurality of sets of tokens” generated as draft tokens by the draft model is correct.
Looking at Algorithm 2 of Chen et al., a faster draft model produces K speculative tokens x̃1, . . . , x̃K for a prompt sequence that iterates K times over t = 1 to K as “a first plurality of sets of tokens” for every next token xn+t selected by a target model. While a target model is considering these K speculative tokens by applying them to probability distributions q(x| x1, . . . , xn), q(x| x1, . . . , xn, x̃1), . . . , q(x| x1, . . . , xn, x̃1, . . . , x̃K), a draft model has already gone ahead and estimated “a second plurality of sets of tokens” that incorporate a selected next token xn+t from the target model.
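The parallel scoring described above, in which all K + 1 conditional distributions of the target model are obtained at once, may be sketched as follows. This is an illustrative sketch only; the names parallel_target_scores and target_forward are hypothetical and not taken from Chen et al.:

```python
def parallel_target_scores(target_forward, prefix, drafts):
    """All K + 1 conditional distributions of the target model --
    q(. | prefix), q(. | prefix, draft 1), ..., q(. | prefix, all K
    drafts) -- are read out of ONE call over the concatenated
    sequence, because a causal model yields a next-token distribution
    at every position of a single forward pass."""
    seq = list(prefix) + list(drafts)
    per_position = target_forward(seq)  # single forward pass
    n = len(prefix)
    # The distribution conditioned on the prefix alone sits at index
    # n - 1; each additional draft token shifts the index by one.
    return per_position[n - 1 : n + len(drafts)]
```

This is why scoring K draft continuations costs roughly one target-model call rather than K: the distributions needed to verify each draft token are already present at successive positions of the same forward pass.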
Applicants’ argument appears to misconstrue the nature of sampling as disclosed by Chen et al. Algorithm 2 of Chen et al. discloses that sampling is performed on output of both the draft model and the target model. Generally, a probability distribution does not produce only one unique output, but generates a plurality of outputs comprising a ranked set of possibilities, so that an output of the distribution must be ‘sampled’ to generate a predicted token. That is, there must be some way of selecting the next word from a probabilistic prediction of next words, and that way is called ‘sampling’. See, e.g., “Algorithms for Sampling from Language Models”, jhu.edu, 09-10.sampling-algorithms.pdf, for explanation of various ways of sampling (one of these is top-k sampling). Applicants’ argument that Chen et al. (1) describes receiving the selected draft tokens from a sampling algorithm and not the target model is accordingly not accurate because a probability distribution of a target model is sampled to autoregressively feed back a next accurate token xn+t from the target model to the draft model. Similarly, Applicants’ argument that Chen et al. (2) describes generating a second set of draft tokens once the resulting tokens are received, but not before they are received, is not accurate because a draft model continues to generate tokens x̃t while the target model is working out a next accurate token xn+t from among the tokens received in a prior iteration from the draft model. Logically, Chen et al. can receive the tokens from a sampling algorithm and from a target model, as these are not mutually exclusive, and can be generating additional draft tokens while waiting to receive a next token from a target model even if it has received next tokens from a target model in prior rounds of iterations.
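By way of illustration, one such sampling scheme, top-k sampling, may be sketched as follows. This is an illustrative sketch with a toy vocabulary; the name top_k_sample is hypothetical:

```python
import random

def top_k_sample(probs, k, rng=random.random):
    """Top-k sampling: keep only the k highest-probability tokens,
    renormalize their mass, and draw one of them.  This is one
    concrete way of selecting a next token from a probabilistic
    prediction over a vocabulary."""
    top = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:k]
    r = rng() * sum(p for _, p in top)
    for tok, p in top:
        r -= p
        if r < 0:
            return tok
    return top[-1][0]  # guard against floating-point rounding
```

The point for the record is that the sampling algorithm is not a source of tokens separate from the model; it is the mechanism by which a token is selected from the model’s probability distribution.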
Applicants’ arguments, then, are not persuasive. There are no new grounds of rejection. Accordingly, this rejection is properly FINAL.
Conclusion
The prior art made of record and not relied upon is considered pertinent to Applicants’ disclosure.
THIS ACTION IS MADE FINAL. Applicants are reminded of the extension of time policy as set forth in 37 CFR 1.136(a).
A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action. In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any nonprovisional extension fee (37 CFR 1.17(a)) pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action. In no event, however, will the statutory period for reply expire later than SIX MONTHS from the mailing date of this final action.
Any inquiry concerning this communication or earlier communications from the examiner should be directed to MARTIN LERNER whose telephone number is (571) 272-7608. The examiner can normally be reached Monday-Thursday 8:30 AM-6:00 PM.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Richemond Dorvil can be reached at (571) 272-7602. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.
/MARTIN LERNER/Primary Examiner
Art Unit 2658 February 19, 2026