DETAILED ACTION
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
This action is responsive to the application filed on 28 February 2025. Claims 1-20 are pending in the case. Claims 1 and 11 are the independent claims. This action is non-final.
Information Disclosure Statement
The information disclosure statement (IDS) submitted on February 28, 2025 is in compliance with the provisions of 37 CFR 1.97. Accordingly, the information disclosure statement is being considered by the examiner.
Claim Rejections - 35 USC § 103
In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.
The factual inquiries set forth in Graham v. John Deere Co., 383 U.S. 1, 148 USPQ 459 (1966), that are applied for establishing a background for determining obviousness under 35 U.S.C. 103 are summarized as follows:
1. Determining the scope and contents of the prior art.
2. Ascertaining the differences between the prior art and the claims at issue.
3. Resolving the level of ordinary skill in the pertinent art.
4. Considering objective evidence present in the application indicating obviousness or nonobviousness.
This application currently names joint inventors. In considering patentability of the claims the examiner presumes that the subject matter of the various claims was commonly owned as of the effective filing date of the claimed invention(s) absent any evidence to the contrary. Applicant is advised of the obligation under 37 CFR 1.56 to point out the inventor and effective filing dates of each claim that was not commonly owned as of the effective filing date of the later invention in order for the examiner to consider the applicability of 35 U.S.C. 102(b)(2)(C) for any potential 35 U.S.C. 102(a)(2) prior art against the later invention.
Claims 1-20 are rejected under 35 U.S.C. 103 as being unpatentable over Hensley et al. (US 2025/0251932 A1) in view of Wang et al. (US 2025/0005365 A1).
Regarding claim 1, Hensley teaches a method, comprising:
obtaining, by a large language model in a computer, an input sequence of text comprising a plurality of elements of text (see Hensley, Paragraph [0133], “the search engine 43 may include an encoder 46 and a vector database 45. In some embodiments, the encoder 46 may be operative to receive as input unstructured natural language text, such as an intermediate representation or a query received from user computer 14.” [An unstructured natural language text (i.e., an input sequence of text comprising a plurality of elements of text) may be received (i.e., obtaining).]);
converting, by the computer, each element in the plurality of elements of text into a token, resulting in a plurality of tokens (see Hensley, Paragraph [0133], “The encoder may be configured and trained to transform that input into one or more vectors in an embedding space. In some embodiments input text or characters are tokenized according to one or more tokenizing schemes associated with one or more embedding models used to transform directly token streams to vectors in an embedding space.” [The input is transformed into tokens.]);
transforming, by the computer, each token in the plurality of tokens into a query vector, key vector, and value vector, resulting in a plurality of query vectors, key vectors, and value vectors (see Hensley, Paragraph [0135], “The token representations may then be processed through multiple layers of self-attention mechanisms, where each token attends to all other tokens in the sequence. In each self-attention layer, query, key, and value vectors may be computed” [Query, key, and value vectors may be computed (i.e., transformed) based on the tokens.]);
However, Hensley does not explicitly teach:
obtaining, by the computer, a plurality of quantized key vectors by applying product quantization to the plurality of key vectors;
Wang teaches:
obtaining, by the computer, a plurality of quantized key vectors by applying product quantization to the plurality of key vectors (see Wang, Paragraph [0030], “Product quantization (PQ) is an effective vector quantization technique used for dataset compression.” [Product quantization may be applied in order to obtain quantized vectors (i.e., quantized key vectors).]);
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have combined Hensley (teaching pre-computation for intermediate-representation infused and retrieval augmented generation) with Wang (teaching neural network inference based on table lookup), and arrived at a method that incorporates product quantization. One of ordinary skill in the art would have been motivated to make such a combination for the purpose of achieving better model accuracy (see Wang, Paragraph [0002]). In addition, both references (Hensley and Wang) are analogous art directed to the same field of endeavor, namely machine learning. The close relation between the references strongly suggests a reasonable expectation of success.
The combination of Hensley and Wang further teaches:
and for each query vector: compressing, by the computer, a lookup table into at least one single instruction multiple data (SIMD) register of a plurality of SIMD registers in a computer processor to produce a compressed lookup table, wherein the lookup table is based on the query vector and the plurality of quantized key vectors (see Wang, Paragraph [0084], “Since the lookup table may be quantized into integers in a certain range (e.g., INT8 in some examples), leverage shuffle instructions, which are widely supported in the instruction set, e.g., in the Single Instruction Multiple Data (SIMD) instruction set, can be leveraged to achieve parallel and efficient table read. The implementation of table read using shuffle instructions is shown in FIG. 7. In a vectorized table lookup step 716, the shuffle instruction permutes each byte of a vector based on an index vector and stores the shuffled bytes in the result vector register in each clock cycle. In an example, on 128-bit wide SIMD, a vectorized table read instruction handles 16 sub-vector lookups (128/8=16) on 16 results (128/8=16) simultaneously, greatly simplifying table read and reducing overheads.” [The lookup table may be quantized (i.e., compressed) into integers in a certain range in the SIMD register, which may be based on the query vectors and quantized key vectors.]);
performing, by the computer, a parallel processing operation using a plurality of SIMD instructions and the lookup table in the SIMD register to determine a plurality of attention scores (see Wang, Paragraphs [0084]-[0085], “Since the lookup table may be quantized into integers in a certain range (e.g., INT8 in some examples), leverage shuffle instructions, which are widely supported in the instruction set, e.g., in the Single Instruction Multiple Data (SIMD) instruction set, can be leveraged to achieve parallel and efficient table read. The implementation of table read using shuffle instructions is shown in FIG. 7. In a vectorized table lookup step 716, the shuffle instruction permutes each byte of a vector based on an index vector and stores the shuffled bytes in the result vector register in each clock cycle. In an example, on 128-bit wide SIMD, a vectorized table read instruction handles 16 sub-vector lookups (128/8=16) on 16 results (128/8=16) simultaneously, greatly simplifying table read and reducing overheads. … The number of parallel processing units within a SIMD instruction is called SIMD lanes. In some implementations, the number of centroids may be set based on the number of SIMD lanes. For example, the number of centroids may be set to 16 (K=16) to maximize the utilization of all SIMD instruction lanes.” [The table reads may be performed in parallel (i.e., performing a parallel processing operation) in order to determine a plurality of results (i.e., a plurality of attention scores).]);
and generating, by the large language model, a predicted sequence of text based on the plurality of attention scores, wherein the large language model determines the predicted sequence of text using the plurality of attention scores by accessing the compressed lookup tables in the at least one SIMD register (see Hensley, Paragraphs [0072], [0143], [0259], “Some implementations may apply SIMD (single instruction, multiple data) operations or graphics processing unit (GPU) acceleration to enhance distance computations, particularly when working with large-scale embedding spaces. … a transformer architecture-based LLM may be used. … Subsequently, the system may construct separate encoder and decoder layers, each consisting of components like multi-headed self-attention and feed-forward neural networks. Encoder only, decoder only, or encoder-decoder LLMs may be used. Some embodiments may implement a multi-headed attention mechanism using scaled dot-product attention. This, in some embodiments, involves linear transformations and partitioning of queries, keys, and values across multiple heads. … For inference tasks, various sampling strategies, such as greedy, beam search, or top-k sampling may be implemented. The temperature parameter in softmax functions might be adjusted to influence the randomness in the model's predictions.” [A large language model may be used to predict a sequence of text based on the multi-headed attention mechanism using scaled dot-product attention (i.e., wherein the large language model determines the predicted sequence of text using the plurality of attention scores by accessing the compressed lookup tables in the at least one SIMD register).]).
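For context on the scaled dot-product attention mechanism recited in claim 1 and described in Hensley at Paragraphs [0135] and [0259], the following is a minimal illustrative sketch in pure Python. All function names, dimensions, and values are hypothetical and are not drawn from either reference:

```python
import math

def scaled_dot_product_attention(Q, K, V):
    """Toy scaled dot-product attention over lists of vectors.

    Q, K, V: lists of equal-length float vectors (one per token).
    Returns one context vector per query vector.
    """
    d = len(K[0])
    out = []
    for q in Q:
        # Attention scores: dot(q, k) / sqrt(d) against every key vector.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in K]
        # Softmax over the scores (numerically stabilized).
        m = max(scores)
        exps = [math.exp(s - m) for s in scores]
        total = sum(exps)
        weights = [e / total for e in exps]
        # Weighted sum of the value vectors.
        out.append([sum(w * v[i] for w, v in zip(weights, V))
                    for i in range(len(V[0]))])
    return out
```

The claimed method replaces the explicit dot products in the `scores` line with reads from a precomputed, compressed lookup table; the sketch shows only the baseline computation being approximated.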
Regarding claim 2, Hensley in view of Wang teaches all the limitations of claim 1. Wang further teaches:
dividing, by the computer, the plurality of key vectors into a plurality of sub-vectors; performing, by the computer, clustering on a set of a plurality of training vectors selected from the plurality of sub-vectors resulting in a plurality of cluster centroids; and for each sub-vector, using a sub-quantizer to: quantize, by the computer, the sub-vector based on the nearest cluster centroid of the plurality of cluster centroids, resulting in a codebook, comprising plurality of codes, for each sub-quantizer; wherein each code in the codebook represents a set of indices to the cluster centroids that are nearest to each sub-vector (see Wang, Paragraphs [0030], [0051], “Product quantization (PQ) is an effective vector quantization technique used for dataset compression. It compresses the dataset by clustering the vectors in the dataset first, and then learning the centroids to represent vectors in each cluster, which can reduce the cardinality of the dataset. The set of centroids is called a codebook. For vector quantization, the dataset is composed of D-dimension vectors. The vector quantizer can encode each vector to a centroid in the codebook. As such, PQ can decompose the high-dimensional vector space into the Cartesian product of sub-vector spaces and then quantize these sub-vector spaces separately. For example, the D-dimension vector may be split into C distinct V-dimension sub-vectors (D=C·V). The sub-vectors are quantized separately using C codebooks. The quantization result of the vector is the concatenation of the C centroids. … For centroid learning, it is expected that each centroid can represent a cluster of sub-vectors in the features 430, which have matched (or similar) feature information.” [Applying product quantization includes clustering vectors, and learning the centroids (i.e., resulting in a plurality of cluster centroids). The set of centroids is a codebook. 
The vectors are split into sub-vectors, and quantized using the codebook, which is based on the cluster centroids. The centroids may represent a cluster of sub-vectors (i.e., resulting in a codebook, comprising plurality of codes, for each sub-quantizer; wherein each code in the codebook represents a set of indices to the cluster centroids that are nearest to each sub-vector).]).
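For illustration of the product-quantization training recited in claim 2 (splitting vectors into sub-vectors, clustering to learn centroids, and encoding each sub-vector by its nearest centroid index), the following is a minimal pure-Python sketch. The k-means routine, parameter values, and function names are hypothetical and are not drawn from Wang:

```python
import random

def kmeans(points, k, iters=20, seed=0):
    """Minimal k-means on a list of equal-length tuples; returns k centroids."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            j = min(range(k),
                    key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])))
            clusters[j].append(p)
        # Recompute each centroid as its cluster mean (keep old one if empty).
        centroids = [tuple(sum(xs) / len(xs) for xs in zip(*cl)) if cl else centroids[j]
                     for j, cl in enumerate(clusters)]
    return centroids

def train_pq(vectors, num_sub, k):
    """Learn one codebook (k centroids) per sub-vector space."""
    dim = len(vectors[0]) // num_sub
    return [kmeans([tuple(v[s * dim:(s + 1) * dim]) for v in vectors], k)
            for s in range(num_sub)]

def encode_pq(vector, codebooks):
    """Encode a vector as one nearest-centroid index (code) per sub-quantizer."""
    dim = len(vector) // len(codebooks)
    codes = []
    for s, cb in enumerate(codebooks):
        sub = vector[s * dim:(s + 1) * dim]
        codes.append(min(range(len(cb)),
                         key=lambda c: sum((a - b) ** 2 for a, b in zip(sub, cb[c]))))
    return codes
```

Each codebook corresponds to one sub-quantizer, and the code emitted by `encode_pq` is the index of the nearest centroid, matching the claimed codebook-of-indices structure.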
Regarding claim 3, Hensley in view of Wang teaches all the limitations of claim 2. Wang further teaches:
wherein each code in the codebook uses 8 bits of memory (see Wang, Paragraph [0086], “It would be appreciated that the byte lengths of the integers (e.g., INT8, INT16, INT32) illustrated in FIG. 7 are provided as examples, and any other suitable length may be configured.” [The byte lengths may be INT8 (i.e., 8 bits of memory).]).
Regarding claim 4, Hensley in view of Wang teaches all the limitations of claim 2. Wang further teaches:
wherein the lookup table for each query vector includes a plurality of dot product values computed, by the computer, between the query vector and the plurality of sub-vectors using the codes in the codebook of each sub-quantizer (see Wang, Paragraphs [0035]-[0036], “For a matrix multiplication A×B^T, a and b are the rows of A and B, respectively, and can be considered as two vectors. For a layer of a neural network, the matrix A may be the input of this layer (which may be the model input for the first layer or an input feature from a preceding layer), and the matrix B may be the weights for the layer. Various types of layers in the neural network, including the convolution layer, can be considered as matrix multiplication. Since the matrix B is constant in the context of model inference, the centroid codebooks P 220 are prepared for the matrix A and the multiplication of all the centroids and the matrix B 230 containing the weights may be precomputed to construct a lookup table 240, as shown in FIG. 2. The c-th codebook corresponding to a sub-vector Â_c contains K centroids P_0^c, P_1^c, …, P_{K-1}^c. The table construction function h_c(b_c) for a weight sub-vector b_c corresponding to an input sub-vector a_c is shown in Eq. (3): h_c(b_c) = [P_0^c·b_c, P_1^c·b_c, …, P_{K-1}^c·b_c] (3) … The matrix multiplication can then be approximated by looking up and aggregating the results of the nearest centroids in the precomputed lookup table, formulated in Eq. (4): a·b = Σ_c a_c·b_c ≈ Σ_c g_c(a_c)·h_c(b_c) (4)” [The construction of the lookup table involves the multiplication (i.e., dot product) between the matrices using the codebook (i.e., using the codes in the codebook of each sub-quantizer).]).
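For illustration of the lookup-table construction recited in claim 4, the following is a minimal pure-Python sketch that precomputes, per sub-quantizer, the dot product of the query sub-vector with every centroid in that sub-quantizer's codebook. The function name and data layout are hypothetical and not drawn from Wang:

```python
def build_lookup_table(query, codebooks):
    """Per sub-quantizer, precompute dot(query_sub, centroid) for every centroid.

    query: flat float vector; codebooks: one list of centroid tuples per
    sub-quantizer. Returns table[s][c] = dot product for sub-quantizer s,
    centroid c.
    """
    dim = len(query) // len(codebooks)
    table = []
    for s, cb in enumerate(codebooks):
        q_sub = query[s * dim:(s + 1) * dim]
        table.append([sum(a * b for a, b in zip(q_sub, cent)) for cent in cb])
    return table
```

A quantized key's approximate score can then be assembled by indexing this table with the key's codes instead of recomputing any dot product.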
Regarding claim 5, Hensley in view of Wang teaches all the limitations of claim 4. Wang further teaches:
wherein for each sub-quantizer, the dot product values between each query vector and cluster centroid in the codebook are dynamically compressed based on maximum and minimum dot product values (see Wang, Paragraph [0072], “the third mechanism for the differentiable centroid learning is to apply scalar quantization during centroid learning. Lookup tables are generally the main disk and memory cost. In some implementations, the table size may be reduced by scalar quantization (e.g., FP32 to INT8). In some examples, the classic range-based linear quantization is used. The formula is r=s(q−z), in which r is the real value, s is the scaling factor, q is the quantized value, and z is the zero point. In some examples, the symmetric quantization is used, so z may be set as 0, and the quantized range is [−2^(n−1), 2^(n−1)−1]. The scaling factor s is calculated as the max absolute value in the lookup table divided by half of the range, i.e., s = max(value)/(2^(n−1)−1), where “value” represents the max absolute value.” [The range-based linear quantization technique is implemented to dynamically compress the dot product values based on maximum and minimum dot product values.]).
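For illustration of the dynamic range-based compression recited in claim 5, the following is a minimal pure-Python sketch of min/max (asymmetric) linear quantization of a float lookup table to 8-bit integers. Wang's quoted passage describes the symmetric variant (zero point 0); this sketch uses the min/max form to match the claim language, and all names are hypothetical:

```python
def quantize_table(values):
    """Range-based linear quantization of float values to 8-bit integers.

    The scale is derived dynamically from the table's own min and max, so
    each lookup table gets its own quantization range.
    Returns (quantized ints in [0, 255], scale, zero offset).
    """
    lo, hi = min(values), max(values)
    scale = (hi - lo) / 255 or 1.0  # guard against a constant table
    q = [round((v - lo) / scale) for v in values]
    return q, scale, lo

def dequantize(q, scale, lo):
    """Recover approximate float values from the 8-bit representation."""
    return [x * scale + lo for x in q]
```

Because the scale is recomputed from each table's extremes, the full 8-bit range is used regardless of the magnitude of the dot-product values in that table.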
Regarding claim 6, Hensley in view of Wang teaches all the limitations of claim 4. Wang further teaches:
wherein determining a plurality of attention scores comprises: retrieving dot product values stored in the lookup table for each query vector using codes in the codebook of each sub-quantizer, and accumulating the retrieved dot product values across the sub-quantizers (see Wang, Paragraph [0083], “An output corresponding to the input for Layer_i is then determined based on aggregation of the respective target computation results. The table read and accumulation stage may include a table lookup step to read the pre-computed results from the corresponding lookup table through indices of the target centroids (the closest centroids) and an accumulation step to accumulate the pre-computed results by an accumulation operation. For example, convolution operators directly read out filter's outputs from lookup tables and accumulate each input channel's result for output channels.” [The pre-computed results (i.e., dot product values) are read from the corresponding lookup table through indices (i.e., codes in the codebook of each sub-quantizer) of the target centroids, and the pre-computed results (i.e., dot product values) are accumulated across the sub-quantizers.]).
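For illustration of the table-read-and-accumulate step recited in claim 6, the following is a minimal pure-Python sketch. Here `table[s][c]` holds the precomputed dot product between the query's s-th sub-vector and centroid c, and each key is represented by one code per sub-quantizer; the names are hypothetical:

```python
def accumulate_scores(table, key_codes):
    """Approximate attention scores by table read and accumulation.

    table: table[s][c] = precomputed dot(query_sub_s, centroid_c).
    key_codes: one list of per-sub-quantizer centroid codes per key.
    For each key, read the table entry selected by each code and sum
    the retrieved values across the sub-quantizers.
    """
    return [sum(table[s][c] for s, c in enumerate(codes)) for codes in key_codes]
```

No dot product is computed at score time; each score is assembled purely from table reads, which is what makes the SIMD lookup described in claims 1 and 10 applicable.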
Regarding claim 7, Hensley in view of Wang teaches all the limitations of claim 2. Hensley further teaches:
wherein: the large language model uses multiple attention heads; and the codebook cluster centroids are learned through performing clustering separately on each attention head (see Hensley, Paragraph [0259], “the system may construct separate encoder and decoder layers, each consisting of components like multi-headed self-attention and feed-forward neural networks. Encoder only, decoder only, or encoder-decoder LLMs may be used. Some embodiments may implement a multi-headed attention mechanism using scaled dot-product attention.” Also, see Wang, Paragraph [0030], “Product quantization (PQ) is an effective vector quantization technique used for dataset compression.” [The large language model may use multiple attention heads, in which the codebook cluster centroids may be learned separately for each attention head.]).
Regarding claim 8, Hensley in view of Wang teaches all the limitations of claim 6. Wang further teaches:
wherein compressing the lookup table into the SIMD register comprises dynamically quantizing the dot product values as 8-bit representations (see Wang, Paragraph [0084], “Since the lookup table may be quantized into integers in a certain range (e.g., INT8 in some examples), leverage shuffle instructions, which are widely supported in the instruction set, e.g., in the Single Instruction Multiple Data (SIMD) instruction set, can be leveraged to achieve parallel and efficient table read.” [The lookup table may be quantized into integers in an INT8 range (i.e., 8-bit representations).]).
Regarding claim 9, Hensley in view of Wang teaches all the limitations of claim 1. Wang further teaches:
wherein the plurality of quantized key vectors are stored in a transposed block format (see Wang, Paragraph [0085], “The number of parallel processing units within a SIMD instruction is called SIMD lanes. In some implementations, the number of centroids may be set based on the number of SIMD lanes. For example, the number of centroids may be set to 16 (K=16) to maximize the utilization of all SIMD instruction lanes.” [The utilization of all SIMD instruction lanes may be maximized, indicating that the quantized vectors are stored in a transposed block format.]).
Regarding claim 10, Hensley in view of Wang teaches all the limitations of claim 1. Wang further teaches:
wherein the plurality of SIMD instructions include a shuffle command and an add command (see Wang, Paragraphs [0084]-[0085], “Since the lookup table may be quantized into integers in a certain range (e.g., INT8 in some examples), leverage shuffle instructions, which are widely supported in the instruction set, e.g., in the Single Instruction Multiple Data (SIMD) instruction set, can be leveraged to achieve parallel and efficient table read. … the accumulation throughput may be maximized by a mixed precision accumulation step 718. It first accumulates computation results in INT16 to utilize more SIMD lanes and then gathers computation results in INT16 to computation results in INT32 to avoid overflow.” [The SIMD includes shuffling and accumulation (i.e., a shuffle command and an add command).]).
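For illustration of the shuffle-and-add pattern recited in claim 10, the following is a pure-Python emulation of what a 16-lane byte-shuffle instruction (e.g., the PSHUFB-style shuffle Wang describes at Paragraph [0084]) accomplishes, followed by lane-wise accumulation. This is not actual SIMD code; the lane widths and function names are illustrative only:

```python
def shuffle_bytes(table16, indices):
    """Emulate a 16-lane byte shuffle: output lane i receives
    table16[indices[i] & 0x0F], i.e., a parallel 16-entry table read."""
    return [table16[i & 0x0F] for i in indices]

def simd_like_table_read_accumulate(tables, code_rows):
    """For each sub-quantizer, 'shuffle' its 16-entry quantized table by
    that sub-quantizer's lane of key codes, then add lane-wise into a
    wider accumulator (mirroring Wang's INT16/INT32 accumulation)."""
    acc = [0] * len(code_rows[0])
    for table16, codes in zip(tables, code_rows):
        looked_up = shuffle_bytes(table16, codes)   # shuffle command
        acc = [a + v for a, v in zip(acc, looked_up)]  # add command
    return acc
```

Setting the number of centroids to 16, as in Wang's K=16 example, is what lets one shuffle instruction service every lane's table read in a single cycle.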
Regarding claims 11-20, Hensley in view of Wang teaches all of the limitations of claims 1-10, in non-transitory computer-readable medium form rather than in method form. Hensley also discloses a non-transitory computer-readable medium (Paragraph [0265]). Therefore, the supporting rationale of the rejection of claims 1-10 applies equally to the corresponding elements of claims 11-20.
Conclusion
Any inquiry concerning this communication or earlier communications from the examiner should be directed to HUSAM TURKI SAMARA whose telephone number is (571)272-6803. The examiner can normally be reached Monday through Thursday and on alternate Fridays.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Apu Mofiz, can be reached on (571)272-4080. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of an application may be obtained from the Patent Application Information Retrieval (PAIR) system. Status information for published applications may be obtained from either Private PAIR or Public PAIR. Status information for unpublished applications is available through Private PAIR only. For more information about the PAIR system, see http://pair-direct.uspto.gov. Should you have questions on access to the Private PAIR system, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative or access to the automated information system, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.
/HUSAM TURKI SAMARA/Examiner, Art Unit 2161
/APU M MOFIZ/Supervisory Patent Examiner, Art Unit 2161