DETAILED ACTION
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
This action is responsive to the application filed on 28 February 2025. Claims 1-20 are pending in the case. Claims 1 and 11 are the independent claims. This action is non-final.
Information Disclosure Statement
The information disclosure statement (IDS) submitted on February 28, 2025 is in compliance with the provisions of 37 CFR 1.97. Accordingly, the information disclosure statement is being considered by the examiner.
Claim Rejections - 35 USC § 103
In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.
The factual inquiries set forth in Graham v. John Deere Co., 383 U.S. 1, 148 USPQ 459 (1966), that are applied for establishing a background for determining obviousness under 35 U.S.C. 103 are summarized as follows:
1. Determining the scope and contents of the prior art.
2. Ascertaining the differences between the prior art and the claims at issue.
3. Resolving the level of ordinary skill in the pertinent art.
4. Considering objective evidence present in the application indicating obviousness or nonobviousness.
This application currently names joint inventors. In considering patentability of the claims the examiner presumes that the subject matter of the various claims was commonly owned as of the effective filing date of the claimed invention(s) absent any evidence to the contrary. Applicant is advised of the obligation under 37 CFR 1.56 to point out the inventor and effective filing dates of each claim that was not commonly owned as of the effective filing date of the later invention in order for the examiner to consider the applicability of 35 U.S.C. 102(b)(2)(C) for any potential 35 U.S.C. 102(a)(2) prior art against the later invention.
Claims 1-20 are rejected under 35 U.S.C. 103 as being unpatentable over Hensley et al. (US 2025/0251932 A1) in view of Wang et al. (US 2025/0005365 A1).
Regarding claim 1, Hensley teaches a method, comprising:
obtaining, by a large language model in a computer, an input sequence of text comprising a plurality of elements of text (see Hensley, Paragraph [0133], “the search engine 43 may include an encoder 46 and a vector database 45. In some embodiments, the encoder 46 may be operative to receive as input unstructured natural language text, such as an intermediate representation or a query received from user computer 14.” [An unstructured natural language text (i.e., an input sequence of text comprising a plurality of elements of text) may be received (i.e., obtaining).]);
converting, by the computer, each element in the plurality of elements of text into a token, resulting in a plurality of tokens (see Hensley, Paragraph [0133], “The encoder may be configured and trained to transform that input into one or more vectors in an embedding space. In some embodiments input text or characters are tokenized according to one or more tokenizing schemes associated with one or more embedding models used to transform directly token streams to vectors in an embedding space.” [The input is transformed into tokens.]);
transforming, by the computer, each token in the plurality of tokens into a query vector, key vector, and value vector, resulting in a plurality of query vectors, key vectors, and value vectors (see Hensley, Paragraph [0135], “The token representations may then be processed through multiple layers of self-attention mechanisms, where each token attends to all other tokens in the sequence. In each self-attention layer, query, key, and value vectors may be computed” [Query, key, and value vectors may be computed (i.e., transformed) based on the tokens.]);
However, Hensley does not explicitly teach:
obtaining, by the computer, a plurality of quantized key vectors by applying product quantization to the plurality of key vectors;
Wang teaches:
obtaining, by the computer, a plurality of quantized key vectors by applying product quantization to the plurality of key vectors (see Wang, Paragraph [0030], “Product quantization (PQ) is an effective vector quantization technique used for dataset compression.” [Product quantization may be applied in order to obtain quantized vectors (i.e., quantized key vectors).]);
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have combined Hensley (teaching pre-computation for intermediate-representation infused and retrieval augmented generation) with Wang (teaching neural network inference based on table lookup), and arrived at a method that incorporates product quantization. One of ordinary skill in the art would have been motivated to make such a combination for the purpose of achieving better model accuracy (see Wang, Paragraph [0002]). In addition, both references (Hensley and Wang) are analogous art directed to the same field of endeavor, namely machine learning. The close relation between the references strongly suggests a reasonable expectation of success.
The combination of Hensley and Wang further teaches:
and for each query vector: compressing, by the computer, a lookup table into at least one single instruction multiple data (SIMD) register of a plurality of SIMD registers in a computer processor to produce a compressed lookup table, wherein the lookup table is based on the query vector and the plurality of quantized key vectors (see Wang, Paragraph [0084], “Since the lookup table may be quantized into integers in a certain range (e.g., INT8 in some examples), leverage shuffle instructions, which are widely supported in the instruction set, e.g., in the Single Instruction Multiple Data (SIMD) instruction set, can be leveraged to achieve parallel and efficient table read. The implementation of table read using shuffle instructions is shown in FIG. 7. In a vectorized table lookup step 716, the shuffle instruction permutes each byte of a vector based on an index vector and stores the shuffled bytes in the result vector register in each clock cycle. In an example, on 128-bit wide SIMD, a vectorized table read instruction handles 16 sub-vector lookups (128/8=16) on 16 results (128/8=16) simultaneously, greatly simplifying table read and reducing overheads.” [The lookup table may be quantized (i.e., compressed) into integers in a certain range in the SIMD register, which may be based on the query vectors and quantized key vectors.]);
performing, by the computer, a parallel processing operation using a plurality of SIMD instructions and the lookup table in the SIMD register to determine a plurality of attention scores (see Wang, Paragraphs [0084]-[0085], “Since the lookup table may be quantized into integers in a certain range (e.g., INT8 in some examples), leverage shuffle instructions, which are widely supported in the instruction set, e.g., in the Single Instruction Multiple Data (SIMD) instruction set, can be leveraged to achieve parallel and efficient table read. The implementation of table read using shuffle instructions is shown in FIG. 7. In a vectorized table lookup step 716, the shuffle instruction permutes each byte of a vector based on an index vector and stores the shuffled bytes in the result vector register in each clock cycle. In an example, on 128-bit wide SIMD, a vectorized table read instruction handles 16 sub-vector lookups (128/8=16) on 16 results (128/8=16) simultaneously, greatly simplifying table read and reducing overheads. … The number of parallel processing units within a SIMD instruction is called SIMD lanes. In some implementations, the number of centroids may be set based on the number of SIMD lanes. For example, the number of centroids may be set to 16 (K=16) to maximize the utilization of all SIMD instruction lanes.” [The table reads may be performed in parallel (i.e., performing a parallel processing operation) in order to determine a plurality of results (i.e., a plurality of attention scores).]);
and generating, by the large language model, a predicted sequence of text based on the plurality of attention scores, wherein the large language model determines the predicted sequence of text using the plurality of attention scores by accessing the compressed lookup tables in the at least one SIMD register (see Hensley, Paragraphs [0072], [0143], [0259], “Some implementations may apply SIMD (single instruction, multiple data) operations or graphics processing unit (GPU) acceleration to enhance distance computations, particularly when working with large-scale embedding spaces. … a transformer architecture-based LLM may be used. … Subsequently, the system may construct separate encoder and decoder layers, each consisting of components like multi-headed self-attention and feed-forward neural networks. Encoder only, decoder only, or encoder-decoder LLMs may be used. Some embodiments may implement a multi-headed attention mechanism using scaled dot-product attention. This, in some embodiments, involves linear transformations and partitioning of queries, keys, and values across multiple heads. … For inference tasks, various sampling strategies, such as greedy, beam search, or top-k sampling may be implemented. The temperature parameter in softmax functions might be adjusted to influence the randomness in the model's predictions.” [A large language model may be used to predict a sequence of text based on the multi-headed attention mechanism using scaled dot-product attention (i.e., wherein the large language model determines the predicted sequence of text using the plurality of attention scores by accessing the compressed lookup tables in the at least one SIMD register).]).
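For context on the scaled dot-product attention mechanism recited in claim 1 and described in Hensley at Paragraphs [0135] and [0259], the following is a minimal illustrative sketch in pure Python. All function names, dimensions, and values are hypothetical and are not drawn from either reference:

```python
import math

def scaled_dot_product_attention(Q, K, V):
    """Toy scaled dot-product attention over lists of vectors.

    Q, K, V: lists of equal-length float vectors (one per token).
    Returns one context vector per query vector.
    """
    d = len(K[0])
    out = []
    for q in Q:
        # Attention scores: dot(q, k) / sqrt(d) against every key vector.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in K]
        # Softmax over the scores (numerically stabilized).
        m = max(scores)
        exps = [math.exp(s - m) for s in scores]
        total = sum(exps)
        weights = [e / total for e in exps]
        # Weighted sum of the value vectors.
        out.append([sum(w * v[i] for w, v in zip(weights, V))
                    for i in range(len(V[0]))])
    return out
```

The claimed method replaces the explicit dot products in the `scores` line with reads from a precomputed, compressed lookup table; the sketch shows only the baseline computation being approximated.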
Regarding claim 2, Hensley in view of Wang teaches all the limitations of claim 1. Wang further teaches:
dividing, by the computer, the plurality of key vectors into a plurality of sub-vectors; performing, by the computer, clustering on a set of a plurality of training vectors selected from the plurality of sub-vectors resulting in a plurality of cluster centroids; and for each sub-vector, using a sub-quantizer to: quantize, by the computer, the sub-vector based on the nearest cluster centroid of the plurality of cluster centroids, resulting in a codebook, comprising plurality of codes, for each sub-quantizer; wherein each code in the codebook represents a set of indices to the cluster centroids that are nearest to each sub-vector (see Wang, Paragraphs [0030], [0051], “Product quantization (PQ) is an effective vector quantization technique used for dataset compression. It compresses the dataset by clustering the vectors in the dataset first, and then learning the centroids to represent vectors in each cluster, which can reduce the cardinality of the dataset. The set of centroids is called a codebook. For vector quantization, the dataset is composed of D-dimension vectors. The vector quantizer can encode each vector to a centroid in the codebook. As such, PQ can decompose the high-dimensional vector space into the Cartesian product of sub-vector spaces and then quantize these sub-vector spaces separately. For example, the D-dimension vector may be split into C distinct V-dimension sub-vectors (D=C·V). The sub-vectors are quantized separately using C codebooks. The quantization result of the vector is the concatenation of the C centroids. … For centroid learning, it is expected that each centroid can represent a cluster of sub-vectors in the features 430, which have matched (or similar) feature information.” [Applying product quantization includes clustering vectors, and learning the centroids (i.e., resulting in a plurality of cluster centroids). The set of centroids is a codebook. 
The vectors are split into sub-vectors, and quantized using the codebook, which is based on the cluster centroids. The centroids may represent a cluster of sub-vectors (i.e., resulting in a codebook, comprising plurality of codes, for each sub-quantizer; wherein each code in the codebook represents a set of indices to the cluster centroids that are nearest to each sub-vector).]).
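For illustration of the product-quantization training recited in claim 2 (splitting vectors into sub-vectors, clustering to learn centroids, and encoding each sub-vector by its nearest centroid index), the following is a minimal pure-Python sketch. The k-means routine, parameter values, and function names are hypothetical and are not drawn from Wang:

```python
import random

def kmeans(points, k, iters=20, seed=0):
    """Minimal k-means on a list of equal-length tuples; returns k centroids."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            j = min(range(k),
                    key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])))
            clusters[j].append(p)
        # Recompute each centroid as its cluster mean (keep old one if empty).
        centroids = [tuple(sum(xs) / len(xs) for xs in zip(*cl)) if cl else centroids[j]
                     for j, cl in enumerate(clusters)]
    return centroids

def train_pq(vectors, num_sub, k):
    """Learn one codebook (k centroids) per sub-vector space."""
    dim = len(vectors[0]) // num_sub
    return [kmeans([tuple(v[s * dim:(s + 1) * dim]) for v in vectors], k)
            for s in range(num_sub)]

def encode_pq(vector, codebooks):
    """Encode a vector as one nearest-centroid index (code) per sub-quantizer."""
    dim = len(vector) // len(codebooks)
    codes = []
    for s, cb in enumerate(codebooks):
        sub = vector[s * dim:(s + 1) * dim]
        codes.append(min(range(len(cb)),
                         key=lambda c: sum((a - b) ** 2 for a, b in zip(sub, cb[c]))))
    return codes
```

Each codebook corresponds to one sub-quantizer, and the code emitted by `encode_pq` is the index of the nearest centroid, matching the claimed codebook-of-indices structure.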
Regarding claim 3, Hensley in view of Wang teaches all the limitations of claim 2. Wang further teaches:
wherein each code in the codebook uses 8 bits of memory (see Wang, Paragraph [0086], “It would be appreciated that the byte lengths of the integers (e.g., INT8, INT16, INT32) illustrated in FIG. 7 are provided as examples, and any other suitable length may be configured.” [The byte lengths may be INT8 (i.e., 8 bits of memory).]).
Regarding claim 4, Hensley in view of Wang teaches all the limitations of claim 2. Wang further teaches:
wherein the lookup table for each query vector includes a plurality of dot product values computed, by the computer, between the query vector and the plurality of sub-vectors using the codes in the codebook of each sub-quantizer (see Wang, Paragraphs [0035]-[0036], “For a matrix multiplication A×B^T, a and b are the rows of A and B, respectively, and can be considered as two vectors. For a layer of a neural network, the matrix A may be the input of this layer (which may be the model input for the first layer or an input feature from a preceding layer), and the matrix B may be the weights for the layer. Various types of layers in the neural network, including the convolution layer, can be considered as matrix multiplication. Since the matrix B is constant in the context of model inference, the centroid codebooks P 220 are prepared for the matrix A and the multiplication of all the centroids and the matrix B 230 containing the weights may be precomputed to construct a lookup table 240, as shown in FIG. 2. The c-th codebook corresponding to a sub-vector Â_c contains K centroids P_0^c, P_1^c, …, P_{K-1}^c. The table construction function h_c(b_c) for a weight sub-vector b_c corresponding to an input sub-vector a_c is shown in Eq. (3): h_c(b_c) = [P_0^c·b_c, P_1^c·b_c, …, P_{K-1}^c·b_c] (3) … The matrix multiplication can then be approximated by looking up and aggregating the results of the nearest centroids in the precomputed lookup table, formulated in Eq. (4): a·b = Σ_c a_c·b_c ≈ Σ_c g_c(a_c)·h_c(b_c) (4)” [The construction of the lookup table involves the multiplication (i.e., dot product) between the matrices using the codebook (i.e., using the codes in the codebook of each sub-quantizer).]).
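For illustration of the lookup-table construction recited in claim 4, the following is a minimal pure-Python sketch that precomputes, per sub-quantizer, the dot product of the query sub-vector with every centroid in that sub-quantizer's codebook. The function name and data layout are hypothetical and not drawn from Wang:

```python
def build_lookup_table(query, codebooks):
    """Per sub-quantizer, precompute dot(query_sub, centroid) for every centroid.

    query: flat float vector; codebooks: one list of centroid tuples per
    sub-quantizer. Returns table[s][c] = dot product for sub-quantizer s,
    centroid c.
    """
    dim = len(query) // len(codebooks)
    table = []
    for s, cb in enumerate(codebooks):
        q_sub = query[s * dim:(s + 1) * dim]
        table.append([sum(a * b for a, b in zip(q_sub, cent)) for cent in cb])
    return table
```

A quantized key's approximate score can then be assembled by indexing this table with the key's codes instead of recomputing any dot product.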
Regarding claim 5, Hensley in view of Wang teaches all the limitations of claim 4. Wang further teaches:
wherein for each sub-quantizer, the dot product values between each query vector and cluster centroid in the codebook are dynamically compressed based on maximum and minimum dot product values (see Wang, Paragraph [0072], “the third mechanism for the differentiable centroid learning is to apply scalar quantization during centroid learning. Lookup tables are generally the main disk and memory cost. In some implementations, the table size may be reduced by scalar quantization (e.g., FP32 to INT8). In some examples, the classic range-based linear quantization is used. The formula is r=s(q−z), in which r is the real value, s is the scaling factor, q is the quantized value, and z is the zero point. In some examples, the symmetric quantization is used, so z may be set as 0, and the quantized range is [−2^(n−1), 2^(n−1)−1]. The scaling factor s is calculated as the max absolute value in the lookup table divided by half of the range, i.e., s = max(value)/(2^(n−1)−1), where “value” represents the max absolute value.” [The range-based linear quantization technique is implemented to dynamically compress the dot product values based on maximum and minimum dot product values.]).
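For illustration of the dynamic range-based compression recited in claim 5, the following is a minimal pure-Python sketch of min/max (asymmetric) linear quantization of a float lookup table to 8-bit integers. Wang's quoted passage describes the symmetric variant (zero point 0); this sketch uses the min/max form to match the claim language, and all names are hypothetical:

```python
def quantize_table(values):
    """Range-based linear quantization of float values to 8-bit integers.

    The scale is derived dynamically from the table's own min and max, so
    each lookup table gets its own quantization range.
    Returns (quantized ints in [0, 255], scale, zero offset).
    """
    lo, hi = min(values), max(values)
    scale = (hi - lo) / 255 or 1.0  # guard against a constant table
    q = [round((v - lo) / scale) for v in values]
    return q, scale, lo

def dequantize(q, scale, lo):
    """Recover approximate float values from the 8-bit representation."""
    return [x * scale + lo for x in q]
```

Because the scale is recomputed from each table's extremes, the full 8-bit range is used regardless of the magnitude of the dot-product values in that table.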
Regarding claim 6, Hensley in view of Wang teaches all the limitations of claim 4. Wang further teaches:
wherein determining a plurality of attention scores comprises: retrieving dot product values stored in the lookup table for each query vector using codes in the codebook of each sub-quantizer, and accumulating the retrieved dot product values across the sub-quantizers (see Wang, Paragraph [0083], “An output corresponding to the input for Layer_i is then determined based on aggregation of the respective target computation results. The table read and accumulation stage may include a table lookup step to read the pre-computed results from the corresponding lookup table through indices of the target centroids (the closest centroids) and an accumulation step to accumulate the pre-computed results by an accumulation operation. For example, convolution operators directly read out filter's outputs from lookup tables and accumulate each input channel's result for output channels.” [The pre-computed results (i.e., dot product values) are read from the corresponding lookup table through indices (i.e., codes in the codebook of each sub-quantizer) of the target centroids, and the pre-computed results (i.e., dot product values) are accumulated across the sub-quantizers.]).
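For illustration of the table-read-and-accumulate step recited in claim 6, the following is a minimal pure-Python sketch. Here `table[s][c]` holds the precomputed dot product between the query's s-th sub-vector and centroid c, and each key is represented by one code per sub-quantizer; the names are hypothetical:

```python
def accumulate_scores(table, key_codes):
    """Approximate attention scores by table read and accumulation.

    table: table[s][c] = precomputed dot(query_sub_s, centroid_c).
    key_codes: one list of per-sub-quantizer centroid codes per key.
    For each key, read the table entry selected by each code and sum
    the retrieved values across the sub-quantizers.
    """
    return [sum(table[s][c] for s, c in enumerate(codes)) for codes in key_codes]
```

No dot product is computed at score time; each score is assembled purely from table reads, which is what makes the SIMD lookup described in claims 1 and 10 applicable.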
Regarding claim 7, Hensley in view of Wang teaches all the limitations of claim 2. Hensley further teaches:
wherein: the large language model uses multiple attention heads; and the codebook cluster centroids are learned through performing clustering separately on each attention head (see Hensley, Paragraph [0259], “the system may construct separate encoder and decoder layers, each consisting of components like multi-headed self-attention and feed-forward neural networks. Encoder only, decoder only, or encoder-decoder LLMs may be used. Some embodiments may implement a multi-headed attention mechanism using scaled dot-product attention.” Also, see Wang, Paragraph [0030], “Product quantization (PQ) is an effective vector quantization technique used for dataset compression.” [The large language model may use multiple attention heads, in which the codebook cluster centroids may be learned separately for each attention head.]).
Regarding claim 8, Hensley in view of Wang teaches all the limitations of claim 6. Wang further teaches:
wherein compressing the lookup table into the SIMD register comprises dynamically quantizing the dot product values as 8-bit representations (see Wang, Paragraph [0084], “Since the lookup table may be quantized into integers in a certain range (e.g., INT8 in some examples), leverage shuffle instructions, which are widely supported in the instruction set, e.g., in the Single Instruction Multiple Data (SIMD) instruction set, can be leveraged to achieve parallel and efficient table read.” [The lookup table may be quantized into integers in an INT8 range (i.e., 8-bit representations).]).
Regarding claim 9, Hensley in view of Wang teaches all the limitations of claim 1. Wang further teaches:
wherein the plurality of quantized key vectors are stored in a transposed block format (see Wang, Paragraph [0085], “The number of parallel processing units within a SIMD instruction is called SIMD lanes. In some implementations, the number of centroids may be set based on the number of SIMD lanes. For example, the number of centroids may be set to 16 (K=16) to maximize the utilization of all SIMD instruction lanes.” [The utilization of all SIMD instruction lanes may be maximized, indicating that the quantized vectors are stored in a transposed block format.]).
Regarding claim 10, Hensley in view of Wang teaches all the limitations of claim 1. Wang further teaches:
wherein the plurality of SIMD instructions include a shuffle command and an add command (see Wang, Paragraphs [0084]-[0085], “Since the lookup table may be quantized into integers in a certain range (e.g., INT8 in some examples), leverage shuffle instructions, which are widely supported in the instruction set, e.g., in the Single Instruction Multiple Data (SIMD) instruction set, can be leveraged to achieve parallel and efficient table read. … the accumulation throughput may be maximized by a mixed precision accumulation step 718. It first accumulates computation results in INT16 to utilize more SIMD lanes and then gathers computation results in INT16 to computation results in INT32 to avoid overflow.” [The SIMD includes shuffling and accumulation (i.e., a shuffle command and an add command).]).
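For illustration of the shuffle-and-add pattern recited in claim 10, the following is a pure-Python emulation of what a 16-lane byte-shuffle instruction (e.g., the PSHUFB-style shuffle Wang describes at Paragraph [0084]) accomplishes, followed by lane-wise accumulation. This is not actual SIMD code; the lane widths and function names are illustrative only:

```python
def shuffle_bytes(table16, indices):
    """Emulate a 16-lane byte shuffle: output lane i receives
    table16[indices[i] & 0x0F], i.e., a parallel 16-entry table read."""
    return [table16[i & 0x0F] for i in indices]

def simd_like_table_read_accumulate(tables, code_rows):
    """For each sub-quantizer, 'shuffle' its 16-entry quantized table by
    that sub-quantizer's lane of key codes, then add lane-wise into a
    wider accumulator (mirroring Wang's INT16/INT32 accumulation)."""
    acc = [0] * len(code_rows[0])
    for table16, codes in zip(tables, code_rows):
        looked_up = shuffle_bytes(table16, codes)   # shuffle command
        acc = [a + v for a, v in zip(acc, looked_up)]  # add command
    return acc
```

Setting the number of centroids to 16, as in Wang's K=16 example, is what lets one shuffle instruction service every lane's table read in a single cycle.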
Regarding claims 11-20, Hensley in view of Wang teaches all of the limitations of claims 1-10, in non-transitory computer-readable medium form rather than in method form. Hensley also discloses a non-transitory computer-readable medium (Paragraph [0265]). Therefore, the supporting rationale of the rejection of claims 1-10 applies equally to the corresponding elements of claims 11-20.
Conclusion
Any inquiry concerning this communication or earlier communications from the examiner should be directed to HUSAM TURKI SAMARA whose telephone number is (571)272-6803. The examiner can normally be reached Monday through Thursday and on alternate Fridays.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Apu Mofiz, can be reached on (571)272-4080. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of an application may be obtained from the Patent Application Information Retrieval (PAIR) system. Status information for published applications may be obtained from either Private PAIR or Public PAIR. Status information for unpublished applications is available through Private PAIR only. For more information about the PAIR system, see http://pair-direct.uspto.gov. Should you have questions on access to the Private PAIR system, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative or access to the automated information system, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.
/HUSAM TURKI SAMARA/Examiner, Art Unit 2161
/APU M MOFIZ/Supervisory Patent Examiner, Art Unit 2161