DETAILED ACTION
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Claim Objections
Claims 4 and 14 are objected to because of the following informalities: the claims recite the limitation "the input tokens," for which the intended reference is unclear. Appropriate correction is required.
Claims 2 and 12 are objected to because of the following informalities: the claims recite the limitation "all the input tokens," for which the intended reference is unclear. Appropriate correction is required.
Claim Rejections - 35 USC § 112
The following is a quotation of 35 U.S.C. 112(b):
(b) CONCLUSION.—The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the inventor or a joint inventor regards as the invention.
The following is a quotation of 35 U.S.C. 112 (pre-AIA), second paragraph:
The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the applicant regards as his invention.
Claims 1 and 11 recite the limitation "the next token." There is insufficient antecedent basis for this limitation in the claims.
Claim Rejections - 35 USC § 101
35 U.S.C. 101 reads as follows:
Whoever invents or discovers any new and useful process, machine, manufacture, or composition of matter, or any new and useful improvement thereof, may obtain a patent therefor, subject to the conditions and requirements of this title.
Claims 1-20 are rejected under 35 U.S.C. 101 because the claimed invention is directed to an abstract idea without significantly more.
Claims 1 and 11 are directed to the abstract idea of generating outputs, computing and maintaining a “BoS cache,” applying quantization, and selecting a next token based on computed outputs or input content. These elements recite mental processes and mathematical concepts (data transformations, algorithmic selection, quantization math, token prediction), which are judicial exceptions to patentable subject matter. See MPEP § 2106.04(a)(2)(III) (mental processes) and MPEP § 2106.04(a)(2)(I) (mathematical concepts); Alice Corp. v. CLS Bank Int’l, 573 U.S. 208 (2014); Electric Power Group, LLC v. Alstom S.A., 830 F.3d 1350 (Fed. Cir. 2016).
In particular, because the claim language does not require any specific machine-implemented data structures or hardware and recites high-level steps of computing and selecting tokens (operations that can be expressed as mental steps or pen-and-paper calculations), the claims read on performance in the human mind or by generic computation. Thus the claims recite a judicially recognized abstract idea comprising mental processes and mathematical concepts.
The claims merely apply the abstract idea using generic computer components and conventional machine-learning techniques, such as “a machine learning model,” “quantized model,” “cache,” and “inference,” without specifying any particular model architecture, data structure, memory organization, or hardware-level optimization. The use of quantization and caching is described at a high level of abstraction and reflects well-understood practices in machine learning for efficiency, rather than a specific improvement to computer functionality itself. The claims do not recite how quantization is technically performed, how the cache is structured in memory, or how these steps improve processing at a hardware or system level.
The claims do not include additional elements that are sufficient to amount to significantly more than the judicial exception because the claims are (i) mere instructions to implement the idea on a computer, and/or (ii) recitations of generic computer structure that serves to perform generic computer functions that are well-understood, routine, and conventional activities previously known to the pertinent industry. Viewed as a whole, these additional claim elements do not provide meaningful limitations to transform the abstract idea into a patent-eligible application such that the claims amount to significantly more than the abstract idea itself. Therefore, the claims are rejected under 35 U.S.C. 101 as being directed to non-statutory subject matter. There is further no improvement to the functioning of the computing device.
Dependent claims 2-10 and 12-20 further recite the abstract idea and do not amount to significantly more than the abstract idea, as they do not provide steps other than what is conventionally known in machine learning models:
Claims 2 and 12: an abstract information labeling/organizing step.
Claims 3 and 13: an abstract parameter selection/data-handling rule for a mathematical model.
Claims 4 and 14: an abstract mathematical classification/quantization algorithm on data.
Claims 5 and 15: an abstract data clustering rule based on numeric similarity.
Claims 6 and 16: an abstract rule-based parameter selection for model execution.
Claims 7 and 17: an abstract choice of numeric representation for performing the model’s calculations.
Claims 8 and 18: an abstract model configuration choice without a specific technical improvement.
Claims 9 and 19: an abstract mathematical number format conversion for model parameters.
Claim 10: an abstract functional description of a prediction algorithm (next token depends on prior tokens).
Claim 20: an abstract data sequence definition/formatting limitation.
Claim Rejections - 35 USC § 103
In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis (i.e., changing from AIA to pre-AIA) for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.
Claims 1-2, 7, 9-12, 17, and 19-20 are rejected under 35 U.S.C. 103 as being unpatentable over Yan et al. (US 2022/0318601) in view of Sheng et al. (“FlexGen: High-Throughput Generative Inference of Large Language Models with a Single GPU”; July 3, 2023).
Claims 1 and 11,
Yan teaches an execution method of a machine learning model ([Abstract] attention mechanism 102), comprising: generating output and a begin of sentence (BoS) cache of a BoS token using the machine learning model ([Figs. 1-2] [0031-0032] [0038-0039] a token that designates a start of a sequence; upon predicting an output token, the decoder system adds that token to the end of the sentence that is fed as input information into the decoder system; the decoder system is tasked with responsibility of caching the head-specific key information and the head-specific value information during self-attention; and the input sequence explicitly begins with the BoS token, meaning the cache includes the BoS token’s state)
during the inference, inputting the next token following the BoS token as a first input token ([0038] “Jack”; example input sequence shows the BoS token followed by the next token), wherein the next token is based on the output of the BoS token or based on an input content ([0038] “Jack” (given input token after BoS); output-based case (decoder output information predicts one or more candidate tokens that follow)).
The difference between the prior art and the claimed invention is that Yan does not explicitly teach before or after performing model quantization on the machine learning model to generate a quantized model; and executing inference based on the quantized model, and the BoS cache into the quantized model to generate output and cache of the next token.
Sheng teaches before or after performing model quantization on the machine learning model to generate a quantized model ([5.] [Group-wise Quantization] 4-bit means using group-wise quantization to compress both weights and KV cache into 4-bit integers); and
executing inference based on the quantized model ([3.] [Generative Inference] inference using the quantized representations: prefill and decoding stages operate with compressed (quantized) weight and KV cache), and
the BoS cache into the quantized model to generate output and cache of the next token ([3.] [Generative Inference] [Section 3. Equations] during the decode phase, the inference computation needs to update the KV cache (Concat, the decode equations), indicating use of the previously generated cache when processing the next token).
Therefore, it would have been obvious to one of ordinary skill in the art, before the effective filing date of the claimed invention, to modify the teachings of Yan with the teachings of Sheng by modifying the resource-efficient attention in a neural network as taught by Yan to include before or after performing model quantization on the machine learning model to generate a quantized model; and executing inference based on the quantized model, and the BoS cache into the quantized model to generate output and cache of the next token, for the benefit of producing high-throughput LLM inference using limited resources (Sheng [Abstract]).
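As general technical background for the mapping above (the following sketch is illustrative only; it is not code from Yan or Sheng, and all function and variable names are hypothetical), autoregressive decoding that seeds a KV cache with the BoS token’s state and reuses it for each subsequent token can be sketched as:

```python
# Illustrative sketch of KV-cache reuse during autoregressive decoding.
# All names are hypothetical; this is not code from any cited reference.

def fake_model_step(token_id, kv_cache):
    """Stand-in forward pass: derives a 'key/value' entry for the current
    token and predicts the next token id from the cache length."""
    kv_entry = (token_id, token_id * 2)           # stand-in for real K/V tensors
    next_token = (token_id + len(kv_cache)) % 50  # stand-in for argmax over logits
    return next_token, kv_entry

BOS = 1
kv_cache = []

# Prefill: run the BoS token once and cache its K/V state (the "BoS cache").
next_tok, bos_entry = fake_model_step(BOS, kv_cache)
kv_cache.append(bos_entry)

# Decode: each step feeds the next token plus the existing cache,
# then concatenates the new token's K/V entry onto the cache.
for _ in range(3):
    next_tok, entry = fake_model_step(next_tok, kv_cache)
    kv_cache.append(entry)

print(len(kv_cache))  # cache holds the BoS entry plus one entry per decoded token
```

The point of the sketch is structural: the cache is created once at the BoS (prefill) step and then grown by concatenation at every decode step, which is the behavior the Concat terms in Sheng’s decode equations express.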
Claims 2 and 12,
Yan further teaches the method of claim 1, wherein the BoS cache is used as information to indicate the BoS token is at the beginning of all the input tokens ([0032] a token that designates a start of a sequence).
Claims 7 and 17,
Sheng further teaches the method of claim 1, wherein generating the output and the BoS cache of the BoS token using the machine learning model comprises: generating the output and the BoS cache of the BoS token by using the machine learning model which is based on float data type ([5.] [Group-wise Quantization] dequantize the tensors back to FP16 before computation).
Claims 9 and 19,
Sheng further teaches the method of claim 1, wherein performing the model quantization on the machine learning model to generate the quantized model comprises: performing the model quantization on the machine learning model to generate the quantized model which is based on integer data type ([6.2 Approximations] to compress both weights and KV cache into 4-bit integers).
Claim 10,
Sheng further teaches the method of claim 1, wherein the machine learning model is an autoregressive language model ([Abstract] LLM).
Claim 20,
Yan further teaches the method of claim 11, wherein the fixed sequence of tokens comprises a begin of sentence (BoS) token and at least one other token ([0032] example of a token sequence beginning with a start token and followed by another token).
Claims 3 and 13 are rejected under 35 U.S.C. 103 as being unpatentable over Yan et al. (US 2022/0318601) in view of Sheng et al. (“FlexGen: High-Throughput Generative Inference of Large Language Models with a Single GPU”; July 3, 2023) and further in view of Yao et al. (“ZeroQuant: Efficient and Affordable Post-Training Quantization for Large-Scale Transformers”; 2022).
Claims 3 and 13,
Yan and Sheng teach all the limitations of claim 1. The difference between the prior art and the claimed invention is that neither Yan nor Sheng explicitly teaches wherein the BoS token is omitted during the model quantization.
Yao teaches the method of claim 1, wherein the BoS token is omitted during the model quantization ([4.1] [Token-wise Quantization for Activations] adopts token-wise quantization and dynamically calculates the min/max activation range for each token during inference).
Therefore, it would have been obvious to one of ordinary skill in the art, before the effective filing date of the claimed invention, to modify the teachings of Yan and Sheng with the teachings of Yao by modifying the resource-efficient attention in a neural network as taught by Yan to include wherein the BoS token is omitted during the model quantization as taught by Yao, for the benefit of achieving better efficiency with LLMs (Yao [Abstract]).
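For context on Yao’s token-wise scheme as cited above (the following is illustrative only; it is not ZeroQuant code, and all names are hypothetical), computing a separate min/max quantization range per token dynamically at inference can be sketched as:

```python
# Illustrative sketch of token-wise activation quantization in the style Yao
# describes: the min/max range is computed per token, dynamically at inference.
# All names are hypothetical.

def quantize_token_activations(activations, bits=8):
    """Quantize each token's activation vector with its own min/max range."""
    out = []
    for token_acts in activations:                # one row of activations per token
        lo, hi = min(token_acts), max(token_acts)
        scale = (hi - lo) / (2**bits - 1) or 1.0  # per-token scale
        out.append(([round((a - lo) / scale) for a in token_acts], scale, lo))
    return out

acts = [[0.1, 0.9, 0.4], [2.0, -1.0, 0.5]]        # two tokens, three channels each
quantized = quantize_token_activations(acts)
print([entry[1] for entry in quantized])          # each token carries its own scale
```

Because each token carries its own scale, an outlier token does not distort the quantization ranges used for the other tokens.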
Claims 8 and 18 are rejected under 35 U.S.C. 103 as being unpatentable over Yan et al. (US 2022/0318601) in view of Sheng et al. (“FlexGen: High-Throughput Generative Inference of Large Language Models with a Single GPU”; July 3, 2023) and further in view of Xiao et al. (“SmoothQuant: Accurate and Efficient Post-Training Quantization for Large Language Models”; July 3, 2023).
Claims 8 and 18,
Yan and Sheng teach all the limitations of claim 1. The difference between the prior art and the claimed invention is that neither Yan nor Sheng explicitly teaches generating the output and the BoS cache of the BoS token by using the machine learning model which is an un-quantized model.
Xiao teaches the method of claim 1, wherein generating the output and the BoS cache of the BoS token is performed by using the machine learning model which is an un-quantized model ([4. SmoothQuant] quantization is derived from running calibration samples through the model offline (before quantized execution): the scale of activation channels is estimated using calibration samples, the smoothing factor s is obtained on calibration samples, and the entire transformation is performed offline).
Therefore, it would have been obvious to one of ordinary skill in the art, before the effective filing date of the claimed invention, to modify the teachings of Yan and Sheng with the teachings of Xiao by modifying the resource-efficient attention in a neural network as taught by Yan to include generating the output and the BoS cache of the BoS token by using the machine learning model which is an un-quantized model as taught by Xiao, for the benefit of reducing hardware costs and democratizing LLMs (Xiao [Abstract]).
Allowable Subject Matter
Claims 4-6 and 14-16 are objected to as being dependent upon a rejected base claim, but would be allowable if rewritten in independent form including all of the limitations of the base claim and any intervening claims, and if the rejection under 35 U.S.C. 101 (abstract idea) is overcome.
The following is a statement of reasons for the indication of allowable subject matter: Yan in view of Sheng teaches all the limitations of claim 1. Yao et al. (“ZeroQuant”) teaches token-wise activation quantization (per-token min/max ranges) and group-wise weight quantization, but not grouping tokens into N clusters and then generating N parameter types for those clusters.
Conclusion
The prior art made of record and not relied upon is considered pertinent to applicant's disclosure.
Lefebvre et al. (US 10,599,645) – A speech recognition and natural language understanding system performs insertion, deletion, and replacement edits of tokens at positions with low probabilities according to both a forward and a backward statistical language model (SLM) to produce rewritten token sequences. Multiple rewrites can be produced with scores depending on the probabilities of tokens according to the SLMs. The rewritten token sequences can be parsed according to natural language grammars to produce further weighted scores. Token sequences can be rewritten iteratively using a graph-based search algorithm to find the best rewrite. Mappings of input token sequences to rewritten token sequences can be stored in a cache, and searching for a best rewrite can be bypassed by using cached rewrites when present. Analysis of various initial token sequences that produce the same new rewritten token sequence can be useful to improve natural language grammars.
Any inquiry concerning this communication or earlier communications from the examiner should be directed to SHREYANS A PATEL whose telephone number is (571)270-0689. The examiner can normally be reached Monday-Friday 8am-5pm PST.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Pierre Desir can be reached at 571-272-7799. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.
SHREYANS A. PATEL
Primary Examiner
Art Unit 2653
/SHREYANS A PATEL/ Examiner, Art Unit 2659
/PIERRE LOUIS DESIR/ Supervisory Patent Examiner, Art Unit 2659