DETAILED ACTION
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Information Disclosure Statement
The Information Disclosure Statement (IDS) submitted on 31 December 2025 cites two publications of non-patent literature that were already cited in prior Information Disclosure Statements filed by Applicants on 04 January 2024 and 18 July 2024. Specifically, Applicants provide a duplicative citation of Chen et al. (“Accelerating Large Language Model Decoding with Speculative Sampling”), which is the basis of the current rejection. Accordingly, these references are being lined through on the Information Disclosure Statement submitted on 31 December 2025 to indicate that they were already considered.
Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.
Claims 1 to 2, 6 to 10, 12 to 13, 17 to 21, 23 to 24, 26 to 28, and 30 are rejected under 35 U.S.C. 103 as being unpatentable over Chen et al. (“Accelerating Large Language Model Decoding with Speculative Sampling”) in view of Olabiyi et al. (U.S. Patent Publication 2021/0027023).
Concerning independent claims 1, 12, 23, and 30, Chen et al. discloses an algorithm that is implicitly performed by a computer program executing as a method on a computer system comprising: “generating, based on [an input query and] a first generative model, a first plurality of sets of tokens, each set of tokens in the first plurality of sets of tokens corresponding to a first portion [of a candidate response to the input query]” – a short draft of length K is attained with either a parallel model or by calling a faster, auto-regressive model K times; this model is referred to as the draft model (“a first generative model”) (Page 1); Algorithm 2 provides code for generating a sample draft auto-regressively from a prompt sequence x1, . . . , xt with drafts x1, . . . , xK (Page 3); here, drafts x1, . . . , xK are “a first plurality of sets of tokens” for a set of tokens comprising x1, . . . , xt;
“outputting to a second generative model, the first plurality of sets of tokens for verification” – a draft is scored using a larger, more powerful model referred to as the target model (“a second generative model”); a target model can accept (‘verify’) a subset of tokens (Page 1); a target model, then, is accepting (‘verifying’) a subset of preliminary output from a draft model;
“while waiting to receive, from the second generative model, an indication of a selected set of tokens from the first plurality of sets of tokens, speculatively generating a second plurality of sets of tokens, each set of tokens in the second plurality of sets of tokens corresponding to a second portion [of the candidate response to the input query]” – speculative sampling enables generation of multiple tokens for each transformer call by parallel scoring of short continuations generated by a faster but less powerful draft model (Abstract; Page 1); here, speculative sampling operates in parallel between a draft model and a target model, so that a target model is processing tokens from the draft model while the draft model processes additional sets of tokens x1, . . . , xK (“speculatively generating a second plurality of sets of tokens”) (Page 3);
“receiving, from the second generative model, the indication of the selected set of tokens from the first plurality of sets of tokens” – using a modified rejection sampling scheme, a subset of the K draft tokens are accepted by the target model (Page 1); a rejection sampling scheme of the drafted tokens is performed given a sequence of tokens x1, . . . , xn and K draft tokens xn+1, . . . , xn+K (Page 4); here, a set of tokens that is not rejected is selected (“receiving . . . the indication of the selected set of tokens”);
“outputting, to the second generative model, tokens from the second plurality of sets of tokens associated with the selected set of tokens for verification” – a short draft of length K is attained with either a parallel model or by calling a faster, auto-regressive model K times; this model is referred to as the draft model (“a first generative model”) (Page 1); Algorithm 2 provides code for generating a sample draft auto-regressively from a prompt sequence x1, . . . , xt with drafts x1, . . . , xK (Page 3); here, additional drafts x1, . . . , xK are “the second plurality of sets of tokens” for a set of tokens comprising x1, . . . , xt as an algorithm iterates with parallel operation of a draft model and a target model;
“outputting the selected set of tokens [as a response to the input query]” – a subset of K draft tokens are accepted (Page 1); if all tokens are accepted, a sample extra token is set (Page 3); a natural language summarization task is performed (Page 6); implicitly, a natural language summarization task outputs a summary from decoded tokens.
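For clarity of the record, the draft-then-verify procedure mapped above may be sketched as follows. This is a minimal illustration only, not code from Chen et al.; the names speculative_step, draft_dist, and target_dist, and the toy distributions, are hypothetical:

```python
import random

def speculative_step(prefix, draft_dist, target_dist, K=4):
    """One draft-then-verify iteration (a simplified sketch of
    speculative sampling).  draft_dist(seq) and target_dist(seq)
    each return a dict mapping a next token to its probability."""
    # The faster draft model proposes K tokens autoregressively.
    drafts, seq = [], list(prefix)
    for _ in range(K):
        p = draft_dist(seq)
        tok = max(p, key=p.get)  # greedy draft, for determinism
        drafts.append(tok)
        seq.append(tok)

    # The target model verifies the drafts left to right: each draft
    # token is kept with probability min(1, q/p); the first rejection
    # ends the accepted run.
    accepted, seq = [], list(prefix)
    for tok in drafts:
        p = draft_dist(seq)[tok]
        q = target_dist(seq).get(tok, 0.0)
        if random.random() < min(1.0, q / p):
            accepted.append(tok)
            seq.append(tok)
        else:
            break
    return accepted
```

In this sketch the draft model proposes K tokens cheaply, and the target model's probabilities are consulted left to right, mirroring the left-to-right acceptance of a subset of the K draft tokens described on Pages 1 and 3 of Chen et al.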
Concerning independent claims 1, 12, 23, and 30, Chen et al. discloses a general concept of providing a draft model and a target model in parallel operation for processing sets of tokens, but does not include an application to generating a response to an input query in the limitations of “based on an input query . . . corresponding to a first portion of a candidate response to the input query” and outputting the selected set of tokens “as a response to the input query.” However, Olabiyi et al. teaches multi-turn dialogue response generation that includes a forward model that may be used to generate an output sequence based on previously generated output sequences and a backward model that may reevaluate previously generated output sequences. (Abstract) Systems may use transformer-based machine classifiers to perform a variety of natural language understanding tasks including question answering. (¶[0004]) A backward model may be used to reevaluate previously generated output sequences and a forward model may be used to generate an output sequence based on previously generated output sequences. (¶[0008]) The application may use autoregressive models. (¶[0048]) Olabiyi et al., then, teaches an application of autoregressive models to a multi-turn dialogue that generates responses as answers to questions (“based on an input query” and “a response to the input query”). An objective is to provide transformer-based machine classifiers that are trained to more accurately identify and generate relevant and interesting responses, saving processing time and processing resources. (¶[0024]) It would have been obvious to one having ordinary skill in the art to apply accelerated large language model decoding with speculative sampling of Chen et al. to a natural language task of question answering to generate a response to a query as taught by Olabiyi et al.
for a purpose of accurately identifying and generating relevant and interesting responses, thereby saving processing time and processing resources.
Concerning claims 2, 13, and 24, Chen et al. discloses that a subset of the K draft tokens are accepted from left to right upon iteration of Algorithm 2 (Pages 1 and 3); that is, an algorithm operates iteratively and in parallel between a draft model and a target model so that a target model accepts a set of tokens from a draft model (“receive an indication of a second selected set of tokens from the second plurality of sets of tokens associated with the selected set of tokens”), and this set of tokens is added to the output (“output the second selected set of tokens as another portion”).
Concerning claims 6, 17, and 26, Chen et al. discloses a draft model (“the first generative model”) that is trained with four billion parameters and optimized for sampling latency with a relatively small number of layers to achieve a fast sampling speed (Page 5); a draft model, then, generates sets of tokens x1, . . . , xK (“each respective set of tokens in the first plurality of sets of tokens”) using “a unique instance of the first generative model and unique parameters as input” corresponding to the parameters of the trained draft model.
Concerning claims 7, 18, and 27, Chen et al. discloses an iterative procedure with a target model that produces “a refined subsequent set of tokens” that is accepted (‘verified’) and output, while a draft model provides a parallel process to “speculatively generate a third plurality of sets of tokens”. Mainly, these claims simply iterate from a second plurality of tokens produced by a draft model to a third plurality of tokens produced by the draft model. Specifically, p(xn+1| x1, . . . , xn) is the probability of a token of the draft model conditioned on context so far, and accepted tokens are set for xn+1. (Page 4)
Concerning claims 8 and 19, Olabiyi et al. teaches appending random paddings before and/or after the input data to reduce syntactic redundancies in the input data (¶[0007]); encoder and decoder sequences may be padded, and prepended, appended, or randomly inserted within the encoder sequences as appropriate (“wherein the sets of tokens in the subsequent plurality of sets of tokens includes padding accounting for a number of tokens in the selected set of tokens being less than a maximum number of tokens”) (¶[0052] - ¶[0053]: Figure 5).
Concerning claims 9, 20, and 28, Chen et al. discloses a draft model and a target model with speculative decoding that generates tokens and can be designated as “the first generative model in a speculative decoding pipeline” and “the second generative model in the speculative decoding pipeline”. (Page 1)
Concerning claims 10 and 21, Chen et al. discloses that a method recovers the distribution of the target model from samples of the draft model with q(xn+1| x1, . . . , xn) and p(xn+1| x1, . . . , xn) as the probabilities of the target and draft models (Page 4); a portion of the activations of the target model is used as an input to the draft model, and the draft model is trained with this input; a smaller version of the target language model is used as the draft model (Page 5). The draft model, then, is trained to have a similar probability distribution to the target model (“wherein the draft model comprises a model trained to have a probability distribution that approximates a corresponding probability distribution for the target model”); a token is accepted according to min(1, q(xn+1| x1, . . . , xn)/p(xn+1| x1, . . . , xn)) (Page 4); p(xn+1| x1, . . . , xn), then, ‘approximates’ q(xn+1| x1, . . . , xn) because their ratio is ideally equal to one.
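The acceptance criterion cited above may be illustrated concretely. The following is a minimal sketch, not code from the reference; the name accept_draft_token and the toy probabilities are hypothetical:

```python
import random

def accept_draft_token(q_prob, p_prob, rng=random.random):
    """Modified rejection test: a token drawn from the draft
    distribution p is kept with probability min(1, q/p), where q is
    the target model's probability for the same token.  When the
    draft approximates the target well, q/p is near one and nearly
    every draft token is kept."""
    return rng() < min(1.0, q_prob / p_prob)

# If the target assigns at least as much probability as the draft
# (q >= p), the acceptance probability is exactly one.
always_kept = accept_draft_token(q_prob=0.30, p_prob=0.20)
```

This illustrates why a draft model trained to approximate the target distribution yields a high acceptance rate: as p approaches q, min(1, q/p) approaches one.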
Claims 3 to 4 and 14 to 15 are rejected under 35 U.S.C. 103 as being unpatentable over Chen et al. (“Accelerating Large Language Model Decoding with Speculative Sampling”) in view of Olabiyi et al. (U.S. Patent Publication 2021/0027023) as applied to claims 1 and 12 above, and further in view of Araki (U.S. Patent Publication 2023/0316001).
Chen et al. discloses that a draft model (“the first generative model”) outputs sets of tokens x1, . . . , xK according to a probability distribution p(xn+1| x1, . . . , xn), but does not clearly disclose that these sets of tokens represent “the highest probabilities within a probability distribution with the first generative model over a universe of tokens” and that “each set of tokens in the first plurality of sets of tokens comprises a group of tokens selected based on a sum of probabilities associated with tokens in the group of tokens, the sum exceeding a threshold probability.” However, Araki teaches selecting an entity type from among a set of entity types, and outputting a selected candidate. (Abstract) Specifically, Araki teaches selecting a correct answer based on a confidence for single token or multi-token phrases. A machine learning system uses a likelihood score which is the sum of the log probabilities of each predicted token conditioned on the other tokens. A candidate generator is configured to select a set of candidates with a final prediction having a predetermined number of predictions with the highest confidence scores. The candidate generator may select the set of candidates based on a predetermined number of candidates less than or equal to the beam size with the highest confidence scores, and may select a candidate to be in the set of candidates if that candidate has a confidence score that is above a threshold value and/or if that candidate satisfies other threshold criteria. (¶[0025] - ¶[0027]: Figure 2) Araki, then, teaches selecting “a group of tokens having the highest probabilities” and “a group of tokens selected based on a sum of probabilities associated with the tokens in the group of tokens, the sum exceeding a threshold probability.” An objective is to provide a question answering system in virtual assistants. (¶[0019]) It would have been obvious to one having ordinary skill in the art to select tokens in Chen et al. 
according to a procedure of Araki that selects tokens with the highest probabilities, where a sum of probabilities of tokens in a group exceeds a threshold, for a purpose of enabling a virtual assistant to answer questions.
Claims 5, 16, and 25 are rejected under 35 U.S.C. 103 as being unpatentable over Chen et al. (“Accelerating Large Language Model Decoding with Speculative Sampling”) in view of Olabiyi et al. (U.S. Patent Publication 2021/0027023) as applied to claims 1, 12, and 23 above, and further in view of Zhang et al. (“Planning With Large Language Models for Code Generation”).
Chen et al. does not disclose that tokens “are represented as a tree data structure” and “a root node of the tree data structure corresponds to the input query” and “each path through the tree data structure corresponds to a set of tokens from the first plurality of tokens.” However, Zhang et al. teaches an analogous procedure of generating code using a tree search algorithm with a tree structure having a root node. (Pages 4 to 5: Figure 2) A transformer tree search algorithm builds a tree structure, which can be used by future iterations of a search. Only b partial programs are kept in a beam, and only b nodes are expanded at each level, and other nodes are dropped. (Page 6: Figure 3) Figures 2 and 3 illustrate paths through the tree with sets of tokens. An objective is to use a tree search to perform lookahead planning to guide a search process with a transformer beam search algorithm. (Page 5) It would have been obvious to one having ordinary skill in the art to represent sets of tokens of Chen et al. as a tree structure with a root node as taught by Zhang et al. for a purpose of performing lookahead planning to guide a search process with a transformer beam search algorithm.
Claims 11, 22, and 29 are rejected under 35 U.S.C. 103 as being unpatentable over Chen et al. (“Accelerating Large Language Model Decoding with Speculative Sampling”) in view of Olabiyi et al. (U.S. Patent Publication 2021/0027023) as applied to claims 1, 12, and 23 above, and further in view of Mukherjee et al. (U.S. Patent Publication 2023/0316000).
Chen et al. briefly notes a distributed serving setting for speculative decoding. (Page 2) Similarly, Olabiyi et al. teaches an operating environment 100 that may include at least one client device 110 and at least one server system 130. (¶[0025]: Figure 1) Arguably, it would have been obvious to allocate a draft model to a local client device and a target model to a remote server device according to principles of distributed processing. However, Mukherjee et al. teaches generation of conversational responses using neural networks to determine an answer to an input query using a trained first neural network to extract the answer from a corpus of information and a second trained neural network to generate a formulation of the input query combined with the answer. (Abstract) A query may be extracted from input 102 and provided to EQA model 104, and AE model 108 may generate response 110. AE model 108 may be a generative model. (¶[0022] - ¶[0025]: Figure 1) Environment 200 may include one or more processing units which may be locally hosted or part of one or more distributed systems. Input 202 may be provided at a local client that is communicatively coupled to one or more remote servers. (¶[0028]: Figure 2) One or more components may be hosted locally on a local client or may be accessible via one or more networks at a remote server. (¶[0038]: Figure 5) Mukherjee et al., then, suggests that a first EQA model 104 may be provided “on a local system” and a second AE model 108 may be provided “on a remote system”. An objective is to provide a pipeline utilizing an extractive question answering model to retrieve an answer responsive to an input query. (¶[0015]) It would have been obvious to one having ordinary skill in the art to provide a draft model and a target model of Chen et al. on a local system and a remote system as taught by Mukherjee et al.
for a purpose of retrieving an answer responsive to an input query in a pipeline utilizing an extractive question answering model.
Response to Arguments
Applicants’ arguments filed 09 February 2026 have been fully considered but they are not persuasive.
Applicants provide a minor amendment to the independent claims, but mainly present arguments traversing the prior rejection of obviousness under 35 U.S.C. §103 over Chen et al. (“Accelerating Large Language Model Decoding with Speculative Sampling”) in view of Olabiyi et al. (U.S. Patent Publication 2021/0027023). Applicants’ argument is that Chen et al. fails to disclose the limitation of “while waiting to receive, from the second generative model, an indication of a selected set of tokens from the first plurality of sets of tokens, speculatively generate a second plurality of sets of tokens”. Applicants argue that Chen et al. instead describes (1) receiving the selected draft tokens from a sampling algorithm, but not the target model, or (2) generating a second set of draft tokens once the resulting tokens are received, but not before they are received. Applicants consider the code of Algorithm 2 of Chen et al., which they characterize as receiving an initial prompt sequence (Lines 2 to 3), generating a draft set of tokens via sampling of the prompt sequence with a draft model, p(.|.) (Lines 6 to 7), generating, in parallel, a set of logits from the draft set of tokens via a target model, q(.|.) (Lines 9 to 10), performing sampling to remove draft tokens which do not correspond to a minimum probability (Lines 12 to 16), and iteratively repeating the process with the prompt sequence and the selected draft tokens (Lines 2 to 30). Applicants conclude that if the draft model is considered to be “a first generative model”, the target model is considered to be “a second generative model”, and the remaining selected draft tokens are considered to be “the indication of a selected set of tokens”, then there is no equivalent to “while waiting to receive, from the second generative model, an indication of a selected set of tokens from the first plurality of sets of tokens, speculatively generate a second plurality of sets of tokens”.
Applicants’ amendment overcomes the objections to the Specification.
Applicants’ amendment overcomes the rejection for non-patent-eligible subject matter under 35 U.S.C. §101.
Applicants’ arguments have been carefully considered, but the rejection of the independent claims as obvious under 35 U.S.C. §103 over Chen et al. (“Accelerating Large Language Model Decoding with Speculative Sampling”) in view of Olabiyi et al. (U.S. Patent Publication 2021/0027023) is maintained as proper. Applicants’ arguments are not persuasive because they do not consider the autoregressive nature of the algorithm, and do not evidence an understanding of what is performed by sampling of a distribution. Moreover, Applicants only consider the code of Algorithm 2 of Chen et al., and do not consider what is disclosed by the remainder of the reference. Given that Applicants’ claim language and Chen et al. are both directed to “speculative decoding”, Applicants would appear to face a high hurdle in arguing that the reference fails to disclose the limitation “while waiting to receive, from the second generative model, an indication of a selected set of tokens from the first plurality of sets of tokens, speculatively generating a second plurality of sets of tokens”. That is, it is the entire purpose of a draft model to speculatively generate tokens that are ‘edited’ by the target model so that tokens selected by the target model from the draft tokens are fed back to the draft model. Tokens generated by the target model are fed back as the correct tokens to the draft model, which relies upon these tokens to iteratively generate new draft tokens. Consequently, a draft model continues to generate a next speculative draft token, x̃i, “while waiting to receive” “an indication of a selected set of tokens”, a token xn+t, from a target model. This token xn+t from a target model is a reliable token for a draft model to estimate a next token in a sequence, x̃i.
Generally, Chen et al. discloses that a faster but less powerful draft model produces tokens that are scored by the target model. (Abstract) The draft tokens produced by the draft model are scored with the larger, more powerful target model, and sampling produces a subset of the K draft tokens. (Introduction, Page 1) Speculative sampling is performed autoregressively by a draft model and a target model. Here, the autoregressive nature of the algorithm provides that a draft model speculatively generates a sampled draft token x̃t from a probability distribution p(.|.) that depends on all of the prior tokens x1, . . . , xt and on speculative draft tokens x̃1, . . . , x̃t-1. The target model then generates a reliable next token xn+t by considering these draft tokens with a probability distribution q(.|.) that depends upon all of the prior tokens x1, . . . , xt and on the draft tokens x̃1, . . . , x̃K. Chen et al. describes this by the expression xn+1 ← x̃n+1 at the bottom of Page 4. This reliable next token xn+t is then fed back autoregressively to the draft model to estimate the next draft token x̃t in an iterative process. Given that a next draft token is being prospectively generated by a draft model while the draft model is waiting to receive a confirmation by the target model that one of the tokens generated by the draft model is correct, Chen et al. reasonably discloses the limitation of “while waiting to receive, from the second generative model, an indication of a selected set of tokens from the first plurality of sets of tokens, speculatively generating a second plurality of sets of tokens”. That is, a faster draft model ‘speculatively generates’ a prospective “second plurality of sets of tokens” while waiting to receive confirmation that one of “a first plurality of sets of tokens” generated as draft tokens by the draft model is correct.
Looking at Algorithm 2 of Chen et al., a faster draft model produces K speculative tokens x̃1, . . . , x̃K for a prompt sequence that iterates K times over t = 1 to K as “a first plurality of sets of tokens” for every next token xn+t selected by a target model. While a target model is considering these K speculative tokens by applying them to probability distributions q(x| x1, . . . , xn), q(x| x1, . . . , xn, x̃1), . . . , q(x| x1, . . . , xn, x̃1, . . . , x̃K), a draft model has already gone ahead and estimated “a second plurality of sets of tokens” that incorporate a selected next token xn+t from the target model.
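The parallel scoring described above, in which all K + 1 conditional distributions of the target model are obtained at once, may be sketched as follows. This is an illustrative sketch only; the names parallel_target_scores and target_forward are hypothetical and not taken from Chen et al.:

```python
def parallel_target_scores(target_forward, prefix, drafts):
    """All K + 1 conditional distributions of the target model --
    q(. | prefix), q(. | prefix, draft 1), ..., q(. | prefix, all K
    drafts) -- are read out of ONE call over the concatenated
    sequence, because a causal model yields a next-token distribution
    at every position of a single forward pass."""
    seq = list(prefix) + list(drafts)
    per_position = target_forward(seq)  # single forward pass
    n = len(prefix)
    # The distribution conditioned on the prefix alone sits at index
    # n - 1; each additional draft token shifts the index by one.
    return per_position[n - 1 : n + len(drafts)]
```

This is why scoring K draft continuations costs roughly one target-model call rather than K: the distributions needed to verify each draft token are already present at successive positions of the same forward pass.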
Applicants’ argument appears to misconstrue the nature of sampling as disclosed by Chen et al. Algorithm 2 of Chen et al. discloses that sampling is performed on output of both the draft model and the target model. Generally, a probability distribution does not produce only one unique output, but generates a plurality of outputs comprising a ranked set of possibilities, so that an output of the distribution must be ‘sampled’ to generate a predicted token. That is, there must be some way of selecting the next word from a probabilistic prediction of next words, and that way is called ‘sampling’. See, e.g., “Algorithms for Sampling from Language Models”, jhu.edu, 09-10.sampling-algorithms.pdf, for explanation of various ways of sampling (one of these is top-k sampling). Applicants’ argument that Chen et al. (1) describes receiving the selected draft tokens from a sampling algorithm and not the target model is accordingly not accurate because a probability distribution of a target model is sampled to autoregressively feed back a next accurate token xn+t from the target model to the draft model. Similarly, Applicants’ argument that Chen et al. (2) describes generating a second set of draft tokens once the resulting tokens are received, but not before they are received, is not accurate because a draft model continues to generate tokens x̃t while the target model is working out a next accurate token xn+t from among the tokens received in a prior iteration from the draft model. Logically, Chen et al. can receive the tokens from a sampling algorithm and from a target model, as these are not mutually exclusive, and can be generating additional draft tokens while waiting to receive a next token from a target model even if it has received next tokens from a target model in prior rounds of iterations.
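By way of illustration, one such sampling scheme, top-k sampling, may be sketched as follows. This is an illustrative sketch with a toy vocabulary; the name top_k_sample is hypothetical:

```python
import random

def top_k_sample(probs, k, rng=random.random):
    """Top-k sampling: keep only the k highest-probability tokens,
    renormalize their mass, and draw one of them.  This is one
    concrete way of selecting a next token from a probabilistic
    prediction over a vocabulary."""
    top = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:k]
    r = rng() * sum(p for _, p in top)
    for tok, p in top:
        r -= p
        if r < 0:
            return tok
    return top[-1][0]  # guard against floating-point rounding
```

The point for the record is that the sampling algorithm is not a source of tokens separate from the model; it is the mechanism by which a token is selected from the model’s probability distribution.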
Applicants’ arguments, then, are not persuasive. There are no new grounds of rejection. Accordingly, this rejection is properly FINAL.
Conclusion
The prior art made of record and not relied upon is considered pertinent to Applicants’ disclosure.
THIS ACTION IS MADE FINAL. Applicants are reminded of the extension of time policy as set forth in 37 CFR 1.136(a).
A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action. In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any nonprovisional extension fee (37 CFR 1.17(a)) pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action. In no event, however, will the statutory period for reply expire later than SIX MONTHS from the mailing date of this final action.
Any inquiry concerning this communication or earlier communications from the examiner should be directed to MARTIN LERNER whose telephone number is (571) 272-7608. The examiner can normally be reached Monday-Thursday 8:30 AM-6:00 PM.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Richemond Dorvil can be reached at (571) 272-7602. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.
/MARTIN LERNER/Primary Examiner
Art Unit 2658 February 19, 2026