DETAILED ACTION
This Office Action is in response to communications filed on December 31, 2025 for Application No. 17/740,497, in which claims 1-20 are presented for examination. The amendments filed on December 31, 2025 have been entered; claims 1 and 6 are amended.
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Continued Examination Under 37 CFR 1.114
A request for continued examination under 37 CFR 1.114, including the fee set forth in 37 CFR 1.17(e), was filed in this application after final rejection. Since this application is eligible for continued examination under 37 CFR 1.114, and the fee set forth in 37 CFR 1.17(e) has been timely paid, the finality of the previous Office action has been withdrawn pursuant to 37 CFR 1.114. Applicant's submission filed on 12/31/2025 has been entered.
Claim Rejections - 35 USC § 101
35 U.S.C. 101 reads as follows:
Whoever invents or discovers any new and useful process, machine, manufacture, or composition of matter, or any new and useful improvement thereof, may obtain a patent therefor, subject to the conditions and requirements of this title.
Claims 1-20 are rejected under 35 U.S.C. 101 because the claimed invention is directed to abstract ideas without significantly more.
Regarding Claim 1:
Step 1: Claim 1 is a method claim. Therefore, Claims 1-6 are directed to a statutory category of eligible subject matter.
Step 2A Prong 1: If a claim limitation, under its broadest reasonable interpretation, covers performance of the limitation in the mind but for the recitation of generic computer components, then it falls within the "Mental Processes" grouping of abstract ideas. Here, steps of the claimed method are mental processes. Specifically, the claim recites
“identifying a row or a column in an attention matrix with an importance score for a task that is above a threshold importance score, wherein the importance score is generated” (mental process – amounts to exercising judgment based on observed information to form an opinion, which may be aided by pen and paper);
“including the row or the column in an adaptive attention pattern” (mental process – amounts to exercising judgment based on an observed column or row to form an opinion on an attention pattern, which may be aided by pen and paper); and
“in response to an input, generating a task-specific inference for the input” (mental process – amounts to exercising judgment based on observed data, which may be aided by pen and paper).
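For illustration only, the recited identifying and including steps can be sketched as a simple thresholding procedure; all function names, score values, and the threshold below are hypothetical and are not drawn from the claims or the cited art:

```python
# Hypothetical sketch: score rows of an attention matrix against a threshold
# and collect the selected rows (and matching columns) into a pattern.
import numpy as np

def select_important_rows(attention_matrix: np.ndarray, threshold: float) -> list[int]:
    """Return indices of rows whose mean attention weight exceeds the threshold."""
    row_scores = attention_matrix.mean(axis=1)  # one importance score per row
    return [i for i, score in enumerate(row_scores) if score > threshold]

def build_adaptive_pattern(size: int, selected: list[int]) -> np.ndarray:
    """Boolean mask marking the selected rows and matching columns as attended."""
    pattern = np.zeros((size, size), dtype=bool)
    for idx in selected:
        pattern[idx, :] = True   # full row receives attention
        pattern[:, idx] = True   # matching column receives attention
    return pattern

attn = np.array([[0.9, 0.8, 0.7],
                 [0.1, 0.1, 0.1],
                 [0.2, 0.3, 0.1]])
selected = select_important_rows(attn, threshold=0.5)   # only row 0 qualifies
pattern = build_adaptive_pattern(3, selected)
```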
Step 2A Prong 2: This judicial exception is not integrated into a practical application.
The claim recites the additional element:
“generating a task-specific machine-learning model by training a generic machine learning model on task specific training data . . . only during the training of the task-specific machine-learning model with the task-specific training data and without pre-training the model on generic training data to adapt to the attention matrix . . . used with the task-specific machine-learning model having a self-attention operation . . . using the task-specific machine-learning model with the adaptive attention pattern” (amounts to mere instructions to apply the judicial exception on generic and unspecialized computer components, which do not impose any meaningful limits on practicing the abstract idea).
Step 2B: The claim does not include additional elements considered individually and in combination that are sufficient to amount to significantly more than the judicial exception.
The claim recites the additional element:
“generating a task-specific machine-learning model by training a generic machine learning model on task specific training data . . . only during the training of the task-specific machine-learning model with the task-specific training data and without pre-training the model on generic training data to adapt to the attention matrix . . . used with the task-specific machine-learning model having a self-attention operation . . . using the task-specific machine-learning model with the adaptive attention pattern” (mere instructions to apply the exception using generic computer components does not provide an inventive concept).
For the reasons above, Claim 1 is rejected as being directed to an abstract idea without significantly more. This rejection applies equally to dependent claims 2-6. The additional limitations of the dependent claims are addressed below.
Regarding Claim 2:
Step 2A Prong 1: See the rejection of Claim 1 above, which Claim 2 depends on. Here, the claim recites additional elements that are mental processes. Specifically, the claim recites
“wherein the adaptive attention pattern is for a single layer” (mental process – amounts to exercising judgment to form an opinion on an attention pattern for a single layer of a model, which may be aided by pen and paper).
Step 2A Prong 2: This judicial exception is not integrated into a practical application.
The claim recites the additional element:
“of the machine-learning model” (amounts to mere instructions to apply the judicial exception on generic and unspecialized computer components, which do not impose any meaningful limits on practicing the abstract idea).
Step 2B: The claim does not include additional elements considered individually and in combination that are sufficient to amount to significantly more than the judicial exception.
The claim recites the additional element:
“of the machine-learning model” (mere instructions to apply the exception using generic computer components does not provide an inventive concept).
Accordingly, Claim 2 is rejected as being directed to an abstract idea without significantly more.
Regarding Claim 3:
Step 2A Prong 1: See the rejection of Claim 1 above, which Claim 3 depends on. Here, the claim recites additional elements that are mental processes. Specifically, the claim recites
“wherein the adaptive attention pattern assigns global attention to tokens in the row or the column” (mental process – amounts to exercising judgment, based on evaluating global attention of selected tokens, to form an opinion on a pattern, which may be aided by pen and paper).
Step 2A Prong 2 & Step 2B: There are no elements left for consideration of implementation within a practical application or for consideration individually and in combination of significantly more.
Accordingly, Claim 3 is rejected as being directed to an abstract idea without significantly more.
Regarding Claim 4:
Step 2A Prong 1: See the rejection of Claim 1 above, which Claim 4 depends on. Here, the claim recites additional elements that are mental processes. Specifically, the claim recites
“wherein the adaptive attention pattern is a merger of the row or the column with a diagonal attention pattern” (mental process – merger of observed column or row with a diagonal pattern amounts to exercising judgment to form an opinion, which may be aided by pen and paper).
Step 2A Prong 2 & Step 2B: There are no elements left for consideration of implementation within a practical application or for consideration individually and in combination of significantly more.
Accordingly, Claim 4 is rejected as being directed to an abstract idea without significantly more.
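For illustration only, a merger of a selected row or column with a diagonal (banded) attention pattern, as recited in Claim 4, can be sketched as follows; the window size and the selected indices are hypothetical:

```python
# Hypothetical sketch: merge a row/column selection with a diagonal pattern.
import numpy as np

def diagonal_pattern(size: int, window: int = 1) -> np.ndarray:
    """Banded (diagonal) mask: each token attends to neighbors within `window`."""
    idx = np.arange(size)
    return np.abs(idx[:, None] - idx[None, :]) <= window

def merge_patterns(diag: np.ndarray, rows: list[int], cols: list[int]) -> np.ndarray:
    """Union of the diagonal band with the selected full rows and columns."""
    merged = diag.copy()
    merged[rows, :] = True
    merged[:, cols] = True
    return merged

merged = merge_patterns(diagonal_pattern(5), rows=[0], cols=[0])
```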
Regarding Claim 5:
Step 2A Prong 1: See the rejection of Claim 1 above, which Claim 5 depends on. Here, the claim recites additional elements that are mental processes. Specifically, the claim recites
“wherein the method further comprises controlling a sparsity of the adaptive attention pattern to a sparsity range” (mental process – amounts to making determinations about what to include in the adaptive attention pattern with reference to a known range, which may be aided by pen and paper).
Step 2A Prong 2 & Step 2B: There are no elements left for consideration of implementation within a practical application or for consideration individually and in combination of significantly more.
Accordingly, Claim 5 is rejected as being directed to an abstract idea without significantly more.
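For illustration only, controlling the sparsity of an attention pattern to a sparsity range, as recited in Claim 5, can be sketched by choosing a threshold from a quantile of the scores; the target range and the random scores below are hypothetical:

```python
# Hypothetical sketch: pick a threshold so the density of the kept pattern
# lands inside a requested range.
import numpy as np

def control_sparsity(scores: np.ndarray, lo: float, hi: float) -> np.ndarray:
    """Threshold the scores so the kept fraction falls inside [lo, hi]."""
    target = (lo + hi) / 2.0                       # aim for the middle of the range
    threshold = np.quantile(scores, 1.0 - target)  # keep roughly `target` of entries
    return scores >= threshold

rng = np.random.default_rng(0)
scores = rng.random((8, 8))
mask = control_sparsity(scores, lo=0.2, hi=0.3)
density = mask.mean()  # fraction of attended positions
```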
Regarding Claim 6:
Step 2A Prong 1: See the rejection of Claim 1 above, which Claim 6 depends on.
Step 2A Prong 2: This judicial exception is not integrated into a practical application.
The claim recites the additional element:
“wherein the task-specific machine-learning model having the self-attention operation is a transformer model” (amounts to mere instructions to apply the judicial exception on generic and unspecialized computer components, which do not impose any meaningful limits on practicing the abstract idea).
Step 2B: The claim does not include additional elements considered individually and in combination that are sufficient to amount to significantly more than the judicial exception.
The claim recites the additional element:
“wherein the machine-learning model having the self-attention operation is a transformer model” (mere instructions to apply the exception using generic computer components does not provide an inventive concept).
Accordingly, Claim 6 is rejected as being directed to an abstract idea without significantly more.
Regarding Claim 7:
Step 1: Claim 7 is a product claim. Therefore, Claims 7-13 are directed to a statutory category of eligible subject matter.
Step 2A Prong 1: If a claim limitation, under its broadest reasonable interpretation, covers performance of the limitation in the mind but for the recitation of generic computer components, then it falls within the "Mental Processes" grouping of abstract ideas. Here, the operations recited in the claim are mental processes. Specifically, the claim recites
“generating a sparse-attention model by adding a sparse attention pattern to a pre-trained machine-learning model having a self-attention operation” (mental process – apart from the models, amounts to observing a sparse-attention pattern) and
“generating a tuned sparse-attention model by fine tuning the sparse-attention model to perform a task with task-specific training . . . wherein the sparse-attention model is an adaptive attention pattern” (mental process – apart from the models, amounts to exercising judgment to alter adaptive attention evaluation procedures based on a formed opinion, which may be aided by pen and paper).
Step 2A Prong 2: This judicial exception is not integrated into a practical application.
The claim recites the additional element:
“[a] non-transitory computer-readable medium storing computer-executable instructions that, when executed by a processing device, cause the processing device to perform operations . . . sparse-attention model . . . a pre-trained machine-learning model having a self-attention operation . . . the sparse-attention model . . . tuned sparse-attention model by fine tuning the sparse-attention model . . . with task-specific training . . . wherein the adaptive attention pattern is learned during fine-tuning of the sparse-attention model using task-specific training data” (amounts to mere instructions to apply the judicial exception on generic and unspecialized computer components, which do not impose any meaningful limits on practicing the abstract idea) and
“storing the tuned sparse-attention model” (amounts to insignificant extra-solution activity, merely storing the model incidental to the process).
Step 2B: The claim does not include additional elements considered individually and in combination that are sufficient to amount to significantly more than the judicial exception.
The claim recites the additional element:
“[a] non-transitory computer-readable medium storing computer-executable instructions that, when executed by a processing device, cause the processing device to perform operations . . . sparse-attention model . . . a pre-trained machine-learning model having a self-attention operation . . . the sparse-attention model . . . tuned sparse-attention model by fine tuning the sparse-attention model . . . with task-specific training . . . wherein the adaptive attention pattern is learned during fine-tuning of the sparse-attention model using task-specific training data” (mere instructions to apply the exception using generic computer components does not provide an inventive concept) and
“storing the tuned sparse-attention model” (storing and retrieving information in memory is well-understood, routine, and conventional, see Versata Dev. Group, Inc. v. SAP Am., Inc., 793 F.3d 1306, 1334, 115 USPQ2d 1681, 1701 (Fed. Cir. 2015); OIP Techs., 788 F.3d at 1363, 115 USPQ2d at 1092-93; the limitation, which is recited with a high level of generality, remains insignificant extra-solution activity even upon reconsideration).
For the reasons above, Claim 7 is rejected as being directed to an abstract idea without significantly more. This rejection applies equally to dependent claims 8-13. The additional limitations of the dependent claims are addressed below.
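For illustration only, the effect of adding a sparse attention pattern to a self-attention operation, as recited in Claim 7, can be sketched by masking disallowed positions before the softmax; the score values and the diagonal-only pattern below are hypothetical:

```python
# Hypothetical sketch: apply a sparse attention pattern to raw attention
# scores so that disallowed positions receive zero attention weight.
import numpy as np

def masked_self_attention(scores: np.ndarray, pattern: np.ndarray) -> np.ndarray:
    """Disallowed positions get -inf before the softmax, yielding zero weight."""
    masked = np.where(pattern, scores, -np.inf)
    exp = np.exp(masked - masked.max(axis=-1, keepdims=True))
    return exp / exp.sum(axis=-1, keepdims=True)

scores = np.array([[1.0, 2.0, 3.0],
                   [1.0, 1.0, 1.0],
                   [0.0, 0.0, 5.0]])
pattern = np.eye(3, dtype=bool)  # illustrative: diagonal-only sparse pattern
weights = masked_self_attention(scores, pattern)
```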
Regarding Claim 8, the claim recites substantially the same limitations as Claim 2, in the form of a computer-readable medium. The claim is also directed to performing mental processes without integration into a practical application or significantly more.
Accordingly, Claim 8 is rejected under the same rationale.
Regarding Claim 9, the claim recites substantially the same limitations as Claim 5, in the form of a computer-readable medium. The claim is also directed to performing mental processes without integration into a practical application or significantly more.
Accordingly, Claim 9 is rejected under the same rationale.
Regarding Claim 10:
Step 2A Prong 1: See the rejection of Claim 7 above, which Claim 10 depends on. Here, the claim recites additional elements that are mental processes. Specifically, the claim recites
“wherein the adaptive attention pattern includes a row or a column in an attention matrix with a task-specific importance score that is above a threshold importance score” (mental process – amounts to exercising judgment to form opinions on the attention pattern with specific attributes, which may be aided by pen and paper).
Step 2A Prong 2 & Step 2B: There are no elements left for consideration of implementation within a practical application or for consideration individually and in combination of significantly more.
Accordingly, Claim 10 is rejected as being directed to an abstract idea without significantly more.
Regarding Claim 11:
Step 2A Prong 1: See the rejection of Claim 7 above, which Claim 11 depends on. Here, the claim recites additional elements that are mental processes. Specifically, the claim recites
“wherein the adaptive attention pattern assigns global attention to tokens in the row or the column” (mental process – amounts to exercising judgment to form opinions on the attention pattern with specific attributes, which may be aided by pen and paper).
Step 2A Prong 2 & Step 2B: There are no elements left for consideration of implementation within a practical application or for consideration individually and in combination of significantly more.
Accordingly, Claim 11 is rejected as being directed to an abstract idea without significantly more.
Regarding Claim 12:
Step 2A Prong 1: See the rejection of Claim 7 above, which Claim 12 depends on.
Step 2A Prong 2: This judicial exception is not integrated into a practical application.
The claim recites the additional element:
“wherein the pre-trained machine-learning model is trained on a generic task” (amounts to mere instructions to apply the judicial exception on generic and unspecialized computer components, which do not impose any meaningful limits on practicing the abstract idea).
Step 2B: The claim does not include additional elements considered individually and in combination that are sufficient to amount to significantly more than the judicial exception.
The claim recites the additional element:
“wherein the pre-trained machine-learning model is trained on a generic task” (mere instructions to apply the exception using generic computer components does not provide an inventive concept).
Accordingly, Claim 12 is rejected as being directed to an abstract idea without significantly more.
Regarding Claim 13:
Step 2A Prong 1: See the rejection of Claim 7 above, which Claim 13 depends on.
Step 2A Prong 2: This judicial exception is not integrated into a practical application.
The claim recites the additional element:
“wherein the sparse-attention model is not retrained on a generic task after adding the adaptive attention pattern to the sparse-attention model” (amounts to mere instructions to apply the judicial exception on generic and unspecialized computer components, which do not impose any meaningful limits on practicing the abstract idea).
Step 2B: The claim does not include additional elements considered individually and in combination that are sufficient to amount to significantly more than the judicial exception.
The claim recites the additional element:
“wherein the sparse-attention model is not retrained on a generic task after adding the adaptive attention pattern to the sparse-attention model” (mere instructions to apply the exception using generic computer components does not provide an inventive concept).
Accordingly, Claim 13 is rejected as being directed to an abstract idea without significantly more.
Regarding Claim 14:
Step 1: Claim 14 is an apparatus claim. Therefore, Claims 14-20 are directed to a statutory category of eligible subject matter.
Step 2A Prong 1: If a claim limitation, under its broadest reasonable interpretation, covers performance of the limitation in the mind but for the recitation of generic computer components, then it falls within the "Mental Processes" grouping of abstract ideas. Here, claim elements are mental processes. Specifically, the claim recites
“identifying . . . a row or a column in an attention matrix with a task-specific importance score that is above a threshold importance score” (mental process – amounts to exercising judgment based on observed information to form an opinion, which may be aided by pen and paper);
“including the row or the column in an adaptive attention pattern . . . to limit self-attention operations performed while making an inference” (mental process – amounts to exercising judgment based on an observed column or row to form an opinion on an attention pattern, which may be aided by pen and paper); and
“in response to an input, generating a task-specific inference for the input” (mental process – amounts to exercising judgment based on observed data, which may be aided by pen and paper).
Step 2A Prong 2: This judicial exception is not integrated into a practical application.
The claim recites the additional element:
“[a] system comprising: a memory component; and a processing device coupled to the memory component, the processing device to perform operations . . . during a task-specific fine tuning operation of a generically trained machine-learning model having a self-attention operation . . . used with the machine-learning model . . . using the machine-learning model with the adaptive attention pattern” (amounts to mere instructions to apply the judicial exception on generic and unspecialized computer components, which do not impose any meaningful limits on practicing the abstract idea).
Step 2B: The claim does not include additional elements considered individually and in combination that are sufficient to amount to significantly more than the judicial exception.
The claim recites the additional element:
“[a] system comprising: a memory component; and a processing device coupled to the memory component, the processing device to perform operations . . . during a task-specific fine tuning operation of a generically trained machine-learning model having a self-attention operation . . . used with the machine-learning model . . . using the machine-learning model with the adaptive attention pattern” (mere instructions to apply the exception using generic computer components does not provide an inventive concept).
For the reasons above, Claim 14 is rejected as being directed to an abstract idea without significantly more. This rejection applies equally to dependent claims 15-20. The additional limitations of the dependent claims are addressed below.
Regarding Claim 15:
Step 2A Prong 1: See the rejection of Claim 14 above, which Claim 15 depends on.
Step 2A Prong 2: This judicial exception is not integrated into a practical application.
The claim recites the additional element:
“wherein the machine-learning model is not retrained on a generic task after adding the adaptive attention pattern to the machine-learning model” (amounts to mere instructions to apply the judicial exception on generic and unspecialized computer components, which do not impose any meaningful limits on practicing the abstract idea).
Step 2B: The claim does not include additional elements considered individually and in combination that are sufficient to amount to significantly more than the judicial exception.
The claim recites the additional element:
“wherein the machine-learning model is not retrained on a generic task after adding the adaptive attention pattern to the machine-learning model” (mere instructions to apply the exception using generic computer components does not provide an inventive concept).
Accordingly, Claim 15 is rejected as being directed to an abstract idea without significantly more.
Regarding Claim 16, the claim recites substantially the same limitations as Claim 3, in the form of an apparatus. The claim is also directed to performing mental processes without integration into a practical application or significantly more.
Accordingly, Claim 16 is rejected under the same rationale.
Regarding Claim 17, the claim recites substantially the same limitations as Claim 2, in the form of an apparatus. The claim is also directed to performing mental processes without integration into a practical application or significantly more.
Accordingly, Claim 17 is rejected under the same rationale.
Regarding Claim 18:
Step 2A Prong 1: See the rejection of Claim 14 above, which Claim 18 depends on. Here, the claim recites additional elements that are mental processes. Specifically, the claim recites
“wherein the operations further comprise learning different adaptive attention patterns for different layers” (mental process – amounts to exercising judgment to form opinions on attention patterns multiple times).
Step 2A Prong 2: This judicial exception is not integrated into a practical application.
The claim recites the additional element:
“of the machine-learning model” (amounts to mere instructions to apply the judicial exception on generic and unspecialized computer components, which do not impose any meaningful limits on practicing the abstract idea).
Step 2B: The claim does not include additional elements considered individually and in combination that are sufficient to amount to significantly more than the judicial exception.
The claim recites the additional element:
“of the machine-learning model” (mere instructions to apply the exception using generic computer components does not provide an inventive concept).
Accordingly, Claim 18 is rejected as being directed to an abstract idea without significantly more.
Regarding Claim 19:
Step 2A Prong 1: See the rejection of Claim 14 above, which Claim 19 depends on. If a claim limitation, under its broadest reasonable interpretation, covers mathematical relationships, mathematical formulas or equations, or mathematical calculations, then it falls within the “Mathematical Concepts” grouping of abstract ideas. Here, the claim recites additional elements that are mathematical concepts and mental processes. Specifically, the claim recites
“wherein the operations further comprise: . . . generate an importance measure for individual tokens” (mental process – amounts to exercising judgment to form opinions on attention patterns multiple times) and
“providing the importance measure to a sigmoid function to generate the task-specific importance score for the row or the column” (mathematical concept – calculated using a mathematical equation, see Spec. Pg. 21, Equation 6).
Step 2A Prong 2: This judicial exception is not integrated into a practical application.
The claim recites the additional element:
“providing an output from a self-attention layer to a fully-connected layer” (amounts to mere instructions to apply the judicial exception on generic and unspecialized computer components, which do not impose any meaningful limits on practicing the abstract idea).
Step 2B: The claim does not include additional elements considered individually and in combination that are sufficient to amount to significantly more than the judicial exception.
The claim recites the additional element:
“providing an output from a self-attention layer to a fully-connected layer” (mere instructions to apply the exception using generic computer components does not provide an inventive concept).
Accordingly, Claim 19 is rejected as being directed to an abstract idea without significantly more.
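For illustration only, the recited sigmoid step of Claim 19 can be sketched as follows; the importance values are hypothetical, and Equation 6 of the specification is not reproduced here:

```python
# Hypothetical sketch: pass per-token importance measures through a sigmoid
# to obtain bounded importance scores, then threshold them.
import numpy as np

def sigmoid(x: np.ndarray) -> np.ndarray:
    """Standard logistic function, squashing inputs into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-x))

importance = np.array([-2.0, 0.0, 3.0])  # illustrative importance measures
scores = sigmoid(importance)             # task-specific importance scores
keep = scores > 0.5                      # thresholded to select rows/columns
```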
Regarding Claim 20:
Step 2A Prong 1: See the rejection of Claim 14 above, which Claim 20 depends on. Here, the claim recites additional elements that are mental processes. Specifically, the claim recites
“wherein the operations further comprise controlling a sparsity of the adaptive attention pattern to a sparsity range” (mental process – amounts to exercising judgment and evaluation to ensure a feature is within a range, which may be aided by pen and paper).
Step 2A Prong 2 & Step 2B: There are no elements left for consideration of implementation within a practical application or for consideration individually and in combination of significantly more.
Accordingly, Claim 20 is rejected as being directed to an abstract idea without significantly more.
Claim Rejections - 35 USC § 103
In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis (i.e., changing from AIA to pre-AIA) for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.
Claims 1, 3-7, 10-11, 14-16, 18 and 20 are rejected under 35 U.S.C. 103 as being unpatentable over Lu et al. (hereinafter Lu) (“Sanger: A Co-Design Framework for Enabling Sparse Attention using Reconfigurable Architecture”) in view of Beltagy et al. (hereinafter Beltagy) (“Longformer: The Long-Document Transformer”) and Liu et al. (hereinafter Liu) (“Transformer Acceleration with Dynamic Sparse Attention”).
Regarding Claim 1, Lu teaches a method comprising: generating a task-specific machine-learning model by training a generic machine learning model on task specific training data (Pg. 984, Col. 2, Para. 2, “Since our method does not require pre-training, we directly fine-tune a pre-trained checkpoint on downstream tasks”, where the “pre-trained checkpoint” is a generic model, see also Pg. 988, Col. 2, Para. 2, “Model: We use BERT-Base-Uncased (420.1MB), GPT2-Small (522.7MB) and BART-Base (532.1MB). Our scripts download pre-trained checkpoints from the Hugging Face Model Hub https://huggingface.co/models automatically” where at least the model “BERT” is well-established as a generically trained machine-learning model because it is trained on data that is not task-specific, which is used to generate a task-specific model by “fine-tun[ing]” on “downstream tasks”; Pg. 989, Col. 1, Para. 5, “Data sets. We evaluate models on three datasets, namely GLUE, SQuAD, and CLOTH. They correspond to three different NLP tasks . . . Our script automatically downloads the GLUE and SQuAD datasets before training”, where the “data sets” used for “training” correspond to three specific “different NLP tasks”);
identifying . . . [aspects] in an attention matrix (Pg. 981, Col. 1, Para. 6, “after obtaining a quantized approximation Sˆ of the attention matrix, we generate a binary attention mask M according to the sparsity pattern it exhibits”, where the “binary attention mask” identifies aspects of the “attention matrix” by masking unidentified values; for more information see Pg. 980, Figure 2 and Pg. 980, Algorithm 1)
with an importance score for a task (Pg. 978, Col. 1, Para. 2, “The first step of computing attention is to obtain a score matrix . . . often referred to as the attention matrix”; Pg. 979, Col. 1, Para. 1, “the score matrix . . . represents the importance of each input token when producing an output element”)
that is above a threshold importance score (Pg. 985, Col. 1, Para. 3, “We obtain the attention mask S by applying a binary threshold on a low-precision estimation Pˆ of the attention matrix”; Pg. 981, Equation [reproduced as an image in the original record; not shown here]),
. . . training of the task-specific machine-learning model with the task-specific training data (Pg. 988, Col. 2, Para. 2, “Model: We use BERT-Base-Uncased (420.1MB), GPT2-Small (522.7MB) and BART-Base (532.1MB). Our scripts download pre-trained checkpoints from the Hugging Face Model Hub https://huggingface.co/models automatically” where at least the model “BERT” is well-established as a generically trained machine-learning model because it is trained on data that is not task-specific, which is used to generate a task-specific model by “fine-tun[ing]” on “downstream tasks”; Pg. 989, Col. 1, Para. 5, “Data sets. We evaluate models on three datasets, namely GLUE, SQuAD, and CLOTH. They correspond to three different NLP tasks . . . Our script automatically downloads the GLUE and SQuAD datasets before training”, where the “data sets” used for “training” correspond to three specific “different NLP tasks”; Notably, training of the task-specific model is interpreted as the training that generated the task specific model)
and without pre-training the model on generic training data to adapt to the attention matrix (Pg. 984, Col. 2, Para. 2, “Since our method does not require pre-training, we directly fine-tune a pre-trained checkpoint on downstream tasks”);
including the . . . [aspect] in an adaptive attention pattern (Pg. 980, Col. 1, Para. 2, “The resulting attention mask exhibits an unstructured sparsity pattern”; Pg. 985, Col. 1, Para. 3, “the sparsity patterns we generate are conditioned on individual input samples. Such dynamic patterns”, where “dynamic patterns” “conditioned on individual input[s]”, which is within the broadest reasonable interpretation of adaptive; Pg. 979, Table 1, “Existing sparse attention patterns”, where the “sparsity pattern” is an attention pattern)
used with the task-specific machine-learning model (Pg. 989, Col. 1, Para. 8, “Experimental workflow . . . Train a model with Sanger sparse attention”, where, as discussed above, the “model” is a task-specific machine learning model, see Pg. 984, Col. 2, Para. 2, “Since our method does not require pre-training, we directly fine-tune a pre-trained checkpoint on downstream tasks”)
having a self-attention operation (Pg. 984, Col. 1, Para. 4, “Implementation Details The first stage of the attention mechanism is to calculate queries”, where the “Sanger” model is implemented with a “self-attention (abbreviated as attention)” operation, see Pg. 979, Col. 1, Para. 1, “The attention mechanism is the key operation in the Transformer models . . . Figure 1 (a) depicts the computation stages of the self-attention (abbreviated as attention) mechanism”); and
in response to an input, generating a task-specific inference for the input using the task-specific machine-learning model (Pg. 985, Table 3, where the model’s “accuracy”, which requires output in response to an input, for ten specific tasks is displayed; where the tasks are inferencing tasks, see Pg. 980, Col. 2, Para. 1, “In the inference phase”; and where, as discussed above, the “model” is a task-specific machine learning model, see Pg. 984, Col. 2, Para. 2, “Since our method does not require pre-training, we directly fine-tune a pre-trained checkpoint on downstream tasks”)
with the adaptive attention pattern (Pg. 989, Col. 1, Para. 8, “Experimental workflow . . . Train a model with Sanger sparse attention”).
Lu does not explicitly disclose . . . a row or column . . . wherein the importance score is generated only during . . . row or column . . .
However, Beltagy teaches . . . [identifying] a row or column [in an attention matrix, for inclusion of the] . . . row or column [in an attention pattern used by a machine learning model] (Pg. 3, figure 2(d), where the “Longformer” “attention pattern” includes selected rows and columns, identified by green shading; for more information see Pg. 3, Col. 2, Para. 2, “The original Transformer model has a self-attention component with O(n2) time and memory complexity . . . To address this challenge, we sparsify the full self-attention matrix according to an attention pattern specifying pairs of input locations”)
. . . [wherein the identifying and selection of a row or column occurs] only [prior to inference] (Beltagy, Pg. 4, Col. 1, Para. 5, “Accordingly, we add “global attention” on few pre-selected input locations”).
Before the effective filing date of the invention, it would have been obvious to one of ordinary skill in the art to combine the identifying, based on a threshold importance score, aspects of an attention matrix for inclusion in an adaptive attention pattern for training and use by a machine learning model of Lu, with the identifying rows or columns of the attention matrix for pre-selected inclusion in an attention pattern of Beltagy, in order to maintain performance improvements of both methods (Lu, Pg. 985, Table 3, where “Sanger” has improved “accuracy” and “sparsity”; Beltagy, Pg. 10, Col. 2, “pretrained, Longformer consistently outperforms RoBERTa on long document tasks and sets new state-of-the-art results on WikiHop and TriviaQA.”), while preserving the model’s ability to handle long input sequences (compare Beltagy, Pg. 10, Col. 1, Para. 2, “Longformer . . . perform[s] . . . NLP tasks without chunking/shortening the long input . . . while also scaling linearly with the sequence length” with Lu, Pg. 985, Col. 2, Para. 2, “While [longformer and bigbird] are originally proposed for processing long sequences (e.g., text length 4096), we scale them for standard benchmarks with shorter contexts”), and allowing rows or columns to be preselected, which allows for the easy and simple inclusion of inductive information (Beltagy, Pg. 4, Col. 1-2, Para. 5-1, “While specifying global attention is task specific, it is an easy way to add inductive bias to the model’s attention, and it is much simpler than existing task specific approaches that use complex architecture to combine information across smaller input chunks”).
Additionally, Liu teaches . . . wherein the importance score is generated . . . during [the training of the task-specific machine-learning model] . . . (Pg. 6, Col. 2, Para. 2-3, “we want the predictor to accurately capture dynamic sparse patterns . . . we use predictor to indicate the positions of the important attention weights”; Pg. 13, Col. 1, Para. 4, “During fine-tuning, parameters from both original model and the predictor are updated simultaneously”; Pg. 4, Col. 1, Para. 3, “Given a pre-trained model, our method jointly fine-tunes the model parameters and parameters of the prediction path”; see also Pg. 3, Col. 2, Para. 2, “From S~, we can predict sparse attention masks M”, where “M” corresponds to the attention pattern and “S~” corresponds to the importance scores used to learn the attention pattern mechanism, which occurs during fine-tuning, see Pg. 4, Col. 1, Para. 2, “we propose to fine-tune model parameters with dynamic sparse constraints . . . When training the model with loss function in Eq. 6, the gradient from LMSE will be passed to both the low-rank approximation S~ and the original attention score S . . . the joint optimization of LModel and LMSE implicitly learns a low-rank S with a learnable rank depending on the difficulty of the task”).
Before the effective filing date of the invention, it would have been obvious to one of ordinary skill in the art to combine the generation of a task-specific machine-learning model by training a generic model on task specific data, identifying a row or column in an attention matrix with an importance score above a threshold only prior to inference, and the inclusion of the row or column in an adaptive attention pattern for use in inferencing by the machine learning model having a self-attention operation of Lu in view of Beltagy with the generation of the importance score during training of the task-specific machine-learning model of Liu in order to improve accuracy and speedup on difficult tasks by optimizing aspects of the model prior to inference (Liu, Pg. 4, Col. 1, Para. 5, “joint optimization . . . can potentially achieve higher accuracy on difficult tasks and higher speedup on simple tasks compared with low-rank approximation methods using fixed rank”), while allowing the row or column to be selected only during training, which will allow for important rows or columns to be determined when not previously known (compare Beltagy, Pg. 4, Col. 1, Para. 5, “Accordingly, we add “global attention” on few pre-selected input locations”, where the selections must be known in advance, with Liu, Pg. 6, Col. 2, Para. 2-3, “we want the predictor to accurately capture dynamic sparse patterns . . . we use predictor to indicate the positions of the important attention weights”, where importance can be determined).
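For illustration only, the thresholding concept cited from Lu above, in which a binary attention mask is obtained by applying a threshold to a (low-precision) estimate of the attention matrix, can be sketched as follows; the function name, example values, and threshold are hypothetical and do not reproduce the reference's implementation:

```python
import numpy as np

def threshold_attention_mask(scores, t):
    # Binary attention mask M: keep entries of the score matrix S that
    # meet or exceed threshold t, as in the thresholding concept cited
    # from Lu; the example values and threshold are hypothetical.
    return (scores >= t).astype(np.int32)

S = np.array([[0.9, 0.05, 0.05],
              [0.1, 0.8, 0.1],
              [0.2, 0.2, 0.6]])
M = threshold_attention_mask(S, 0.5)
```

Raising the threshold prunes more connections in the attention mechanism, consistent with the cited description of the threshold T in Lu.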
Regarding Claim 3, Lu in view of Beltagy and Liu teach the method of claim 1, wherein the adaptive attention pattern (Lu, Pg. 985, Col. 1, Para. 3, “the sparsity patterns we generate are conditioned on individual input samples. Such dynamic patterns”; where, in view of Beltagy, the pattern is adaptive during training to determine rows and columns, see Beltagy, Pg. 3, figure 2(d))
assigns global attention to tokens in the row or the column (Beltagy, Pg. 4, Col. 1, Para. 5, “we add global attention on few pre-selected input locations . . . Fig. 2d shows an example of a sliding window attention with global attention at a few tokens at custom locations”; Beltagy, Pg. 3, Fig. 2d, where example rows and columns with “global attention” tokens are shaded green).
The reasons of obviousness have been noted in the rejection of Claim 1 above and remain applicable here.
Regarding Claim 4, Lu in view of Beltagy and Liu teach the method of claim 1, wherein the adaptive attention pattern is a merger of the row or the column with a diagonal attention pattern (Beltagy, Pg. 3, Fig. 2d, where the “attention pattern” includes rows and columns merged with “Sliding window attention”, which is a diagonal attention pattern; Beltagy, Pg. 2, Col. 1, Para. 3, “Longformer’s attention mechanism is a combination of a windowed local-context self-attention and an end task motivated global attention that encodes inductive bias about the task”).
The reasons of obviousness have been noted in the rejection of Claim 1 above and remain applicable here.
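For illustration only, the Beltagy Fig. 2(d) pattern discussed above, a sliding-window (diagonal) attention merged with global rows and columns, can be sketched as follows; the sequence length, window size, and global indices are hypothetical:

```python
import numpy as np

def longformer_style_mask(n, window, global_idx):
    # Sliding-window (diagonal band) attention merged with global
    # rows/columns, illustrating the Fig. 2(d) pattern described in
    # Beltagy; n, window, and global_idx are hypothetical parameters.
    mask = np.zeros((n, n), dtype=bool)
    for i in range(n):
        lo, hi = max(0, i - window), min(n, i + window + 1)
        mask[i, lo:hi] = True  # local diagonal band
    for g in global_idx:
        mask[g, :] = True  # global attention row
        mask[:, g] = True  # global attention column
    return mask

pattern = longformer_style_mask(6, 1, [0])
```

The merged mask attends locally along the diagonal while the pre-selected location attends to, and is attended by, every position.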
Regarding Claim 5, Lu in view of Beltagy and Liu teach the method of claim 1, wherein the method further comprises controlling a sparsity of the adaptive attention pattern to a sparsity range (Liu, Pg. 5, Col. 2, Para. 1, “Different percentage numbers indicate the sparsity ratio that we applied to the DSA models. For instance, DSA-90% means that we only keep 10% of the attention weights in each row of the attention matrix, while masking out all the other 90% of the weights”, where a sparsity ratio contains a range, such as 0.945 – 0.954 for “DSA-95%”, which varies based on level of precision and rounding practices; and where the adaptive attention pattern is based on the attention matrix, see Lu, Pg. 981, Col. 1, Para. 6).
Before the effective filing date of the invention, it would have been obvious to one of ordinary skill in the art to combine the identification of rows or columns in an attention matrix based on an importance threshold, inclusion of the rows or columns in an adaptive attention pattern, and use of the attention pattern by a machine learning model with self-attention of Lu in view of Beltagy and Liu, with the sparsity range controlling for the adaptive attention pattern in further view of Liu in order to balance sparsity ranges with accuracy requirements, which may vary by task (Liu, Pg. 5, Col. 2, Para. 2, “DSA delivers slightly higher performance with 90% and 95% sparsity ratio. Even with up to 99% of sparsity, DSA still demonstrates promising performance”; Liu, Pg. 5, Figure 3).
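For illustration only, the per-row sparsity-ratio control cited from Liu above (e.g., DSA-90% keeps only 10% of the weights in each row) can be sketched as follows; the function names and the top-k selection shown here are a simplified stand-in, not the reference's implementation:

```python
import numpy as np

def mask_with_row_sparsity(scores, keep_frac):
    # Keep the top keep_frac of weights in each row and mask the rest,
    # mirroring the DSA-style per-row sparsity ratio described in Liu
    # (e.g., a 90% sparsity ratio keeps 10% per row); a simplified
    # sketch, not the reference's implementation.
    n = scores.shape[1]
    k = max(1, int(round(n * keep_frac)))
    mask = np.zeros_like(scores, dtype=bool)
    for i, row in enumerate(scores):
        mask[i, np.argsort(row)[-k:]] = True
    return mask

def sparsity(mask):
    # Fraction of masked-out entries.
    return 1.0 - mask.mean()

scores = np.arange(40, dtype=float).reshape(4, 10)
m = mask_with_row_sparsity(scores, 0.1)  # DSA-90%-style: keep 10% per row
```

Because k is rounded to an integer per row, the realized sparsity falls in a range around the nominal ratio, consistent with the rounding observation above.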
Regarding Claim 6, Lu in view of Beltagy and Liu teach the method of claim 1, wherein the task-specific machine-learning model having the self-attention operation is a transformer model (Lu, Pg. 984, Col. 2, Para. 2, “Our code is based on the BERT implementation by NVIDIA [39] and the evaluation code is from Hugging Face’s Transformers library [60]”, where “BERT” is Bidirectional Encoder Representations from Transformers; Lu, Pg. 979, Col. 1, Para. 1, “The attention mechanism is the key operation in the Transformer models . . . Figure 1 (a) depicts the computation stages of the self-attention (abbreviated as attention) mechanism”, where, as discussed above, the “model” is a task-specific machine learning model, see Pg. 984, Col. 2, Para. 2, “Since our method does not require pre-training, we directly fine-tune a pre-trained checkpoint on downstream tasks”).
Regarding Claim 7, Lu in view of Beltagy and Liu teach a non-transitory computer-readable medium storing computer-executable instructions (Lu, Pg. 988, Col. 2, Section “A.2 Artifact check-list (meta-information)”, “How much disk space are required (approximately)?: The codebase and downloaded datasets take up about 1.5GB in total”, where “the codebase” is computer-executable instructions that must be downloaded to a “disk”, which is a non-transitory computer-readable medium)
that, when executed by a processing device, cause the processing device to perform operations (Lu, Pg. 988, Col. 2, Section “A.2 Artifact check-list (meta-information)”, “Hardware: NVIDIA Tesla V100-PCIE-16GB GPU, AMD Ryzen Threadripper 3970X CPU”) comprising:
generating a sparse-attention model by adding a sparse attention pattern (Lu, Pg. 989, Col. 1, Para. 8, “Train a model with Sanger sparse attention”)
to a pre-trained machine-learning model having a self-attention operation (Lu, Pg. 984, Col. 1, Para. 4, “Implementation Details The first stage of the attention mechanism is to calculate queries”, where the “Sanger” model is implemented with a “self-attention (abbreviated as attention)” operation, see Pg. 979, Col. 1, Para. 1, and is “pre-trained”, see Lu, Pg. 984, Col. 2, Para. 2);
generating a tuned sparse-attention model by fine tuning the sparse-attention model to perform a task with task-specific training (Lu, Pg. 984, Col. 2, Para. 2, “we directly fine-tune a pre-trained checkpoint on downstream tasks”; Beltagy, Pg. 9, Col. 1, Para. 1, “Longformer can learn to use long range context in task specific fine-tuning with large training datasets such as WikiHop”),
wherein the sparse attention pattern is an adaptive attention pattern (Lu, Pg. 980, Col. 1, Para. 2, “The resulting attention mask exhibits an unstructured sparsity pattern”; Lu, Pg. 985, Col. 1, Para. 3, “the sparsity patterns we generate are conditioned on individual input samples. Such dynamic patterns”, where “dynamic patterns” “conditioned on individual input[s]” is within the broadest reasonable interpretation of adaptive),
and wherein the adaptive attention pattern is learned during fine-tuning of the sparse-attention model using task-specific training data (Lu, Pg. 985, Col. 1, Para. 3, “the sparsity patterns we generate are conditioned on individual input samples. Such dynamic patterns”, where, in view of Liu, the adaptive attention pattern would be “conditioned on individual samples”, but learned during fine-tuning of the model, see Liu, Pg. 3, Col. 2, Para. 2, “From S~, we can predict sparse attention masks M”, where “M” corresponds to the attention pattern and “S~” corresponds to the importance scores used to learn the attention pattern mechanism, which occurs during fine-tuning, see Liu, Pg. 4, Col. 1, Para. 2, “we propose to fine-tune model parameters with dynamic sparse constraints . . . When training the model with loss function in Eq. 6, the gradient from LMSE will be passed to both the low-rank approximation S~ and the original attention score S . . . the joint optimization of LModel and LMSE implicitly learns a low-rank S with a learnable rank depending on the difficulty of the task”; Lu, Pg. 989, Col. 1, Para. 5, “Data sets. We evaluate models on three datasets, namely GLUE, SQuAD, and CLOTH. They correspond to three different NLP tasks . . . Our script automatically downloads the GLUE and SQuAD datasets before training”, where the “data sets” used for fine-tuning are task-specific because they correspond to three specific “different NLP tasks”); and
storing the tuned sparse-attention model (Lu, Pg. 988, Col. 2, Section “A.2 Artifact check-list (meta-information)”, “you need to make sure you have enough space for the checkpoints. Each checkpoint takes up about 500 MB of space”, where “checkpoints” include the tuned sparse-attention model, see Lu, Pg. 989, Col. 1, Para. 6, “Models . . . you can also download fine-tuned checkpoints and evaluate them directly”).
The reasons of obviousness have been noted in the rejection of Claim 1 above and remain applicable here.
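For illustration only, the Liu concept cited above, predicting a sparse attention mask M from a low-rank approximation S~ of the attention scores, can be sketched as follows; the random projection matrices and the threshold are hypothetical placeholders for parameters that would be learned during fine-tuning:

```python
import numpy as np

def low_rank_mask(Q, K, rank, t, rng=None):
    # Predict a sparse attention mask from a low-rank approximation of
    # the score matrix, loosely following the cited Liu description
    # (S~ predicts the mask M); the random projections here are
    # hypothetical placeholders for parameters learned during fine-tuning.
    rng = np.random.default_rng(0) if rng is None else rng
    d = Q.shape[1]
    Pq = rng.standard_normal((d, rank)) / np.sqrt(d)
    Pk = rng.standard_normal((d, rank)) / np.sqrt(d)
    S_approx = (Q @ Pq) @ (K @ Pk).T  # low-rank estimate of Q @ K^T
    return S_approx >= t

Q = np.ones((4, 8))
K = np.ones((5, 8))
mask = low_rank_mask(Q, K, rank=2, t=-1e9)  # permissive threshold keeps all
```

In the cited joint-optimization scheme, the projections (and hence the effective rank) are trained together with the model rather than fixed, as in this sketch.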
Regarding Claim 10, Lu in view of Beltagy and Liu teach the non-transitory computer-readable medium of claim 7, wherein the adaptive attention pattern includes a row or a column (Beltagy, Pg. 3, figure 2(d), where the “Longformer” “attention pattern” includes selected rows and columns, identified by green shading)
in an attention matrix (Lu, Pg. 981, Col. 1, Para. 6, “after obtaining a quantized approximation Sˆ of the attention matrix, we generate a binary attention mask M according to the sparsity pattern it exhibits”, where the “binary attention mask” identifies aspects of the “attention matrix” by masking unidentified values)
with a task-specific importance score (Lu, Pg. 978, Col. 1, Para. 2, “The first step of computing attention is to obtain a score matrix . . . often referred to as the attention matrix”; Lu, Pg. 979, Col. 1, Para. 1, “the score matrix . . . represents the importance of each input token when producing an output element”, which is task specific)
that is above a threshold importance score (Lu, Pg. 985, Col. 1, Para. 3, “We obtain the attention mask S by applying a binary threshold on a low-precision estimation Pˆ of the attention matrix”; Lu, Pg. 981, Equation [image: media_image1.png, greyscale]);
The reasons of obviousness have been noted in the rejection of Claim 1 above and remain applicable here.
Regarding Claim 11, the additional elements of the dependent claim are substantially the same as the limitations of Claim 3, therefore it is rejected under the same rationale.
Regarding Claim 14, Lu in view of Beltagy and Liu teach a system (Lu, Pg. 988, Col. 2, Section “A.2 Artifact check-list (meta-information)”, “Run-time environment: . . . Hardware: . . .”; for more information see Lu, Pg. 982 – 983, Section “Hardware Dataflow”) comprising:
a memory component (Lu, Pg. 984, Table 2, “Memory 128KB query buffer, 128KB key buffer, 128KB value buffer, 128KB output buffer”, where the “buffers” must be stored in a memory component); and
a processing device coupled to the memory component (Lu, Pg. 988, Col. 2, Section “A.2 Artifact check-list (meta-information)”, “Hardware: NVIDIA Tesla V100-PCIE-16GB GPU, AMD Ryzen Threadripper 3970X CPU”, which is known to be coupled to memory), the processing device to perform operations comprising:
identifying, during a task-specific fine tuning operation of a generically trained machine-learning model (Beltagy, Pg. 9, Col. 1, Para. 1, “Longformer can learn to use long range context in task specific fine-tuning with large training datasets such as WikiHop”; Liu, Pg. 13, Col. 1, Para. 4, “During fine-tuning, parameters from both original model and the predictor are updated simultaneously”, where the “fine-tuning” is of generically trained machine learning models, see Lu, Pg. 984, Col. 2, Para. 1, “We evaluate our method on BERT [16], GPT-2 [45], and BART [30]”, where at least the model “BERT” is well-established as a generically trained machine-learning model because it is trained on data that is not task-specific)
having a self-attention operation (Lu, Pg. 984, Col. 1, Para. 4, “Implementation Details The first stage of the attention mechanism is to calculate queries”, where the “Sanger” model is implemented with a “self-attention (abbreviated as attention)” operation, see Lu, Pg. 979, Col. 1, Para. 1),
a row or a column (Beltagy, Pg. 3, figure 2(d), where the “Longformer” “attention pattern” includes selected rows and columns, identified by green shading)
in an attention matrix (Lu, Pg. 981, Col. 1, Para. 6, “after obtaining a quantized approximation Sˆ of the attention matrix, we generate a binary attention mask M according to the sparsity pattern it exhibits”, where the “binary attention mask” identifies aspects of the “attention matrix” by masking unidentified values)
with a task-specific importance score (Lu, Pg. 978, Col. 1, Para. 2, “The first step of computing attention is to obtain a score matrix . . . often referred to as the attention matrix”; Lu, Pg. 979, Col. 1, Para. 1, “the score matrix . . . represents the importance of each input token when producing an output element”, which is task specific)
that is above a threshold importance score (Lu, Pg. 985, Col. 1, Para. 3, “We obtain the attention mask S by applying a binary threshold on a low-precision estimation Pˆ of the attention matrix”; Lu, Pg. 981, Equation [image: media_image1.png, greyscale]);
including the row or the column in an adaptive attention pattern (Beltagy, Pg. 3, figure 2(d), where the “Longformer” “attention pattern” includes selected rows and columns, identified by green shading)
used with the machine-learning model to limit self-attention operations performed while making an inference (Lu, Pg. 987, Col. 1, Para. 1, “we obtain the attention mask by applying a binary threshold T to the predicted attention matrix. Naturally, the larger T becomes, the more connections are pruned in the attention mechanism”, which is used for an inferencing task, see Lu, Pg. 980, Col. 2, Para. 1, “In the inference phase”); and
in response to an input, generating a task-specific inference for the input using the machine-learning model (Lu, Pg. 985, Table 3, where the model’s “accuracy”, which requires output in response to an input, for ten specific tasks is displayed; and where the tasks are inferencing tasks, see Lu, Pg. 980, Col. 2, Para. 1, “In the inference phase”)
with the adaptive attention pattern (Lu, Pg. 989, Col. 1, Para. 8, “Experimental workflow . . . Train a model with Sanger sparse attention”).
The reasons of obviousness have been noted in the rejection of Claim 1 above and remain applicable here.
Regarding Claim 15, Lu in view of Beltagy and Liu teach the system of claim 14, wherein the machine-learning model is not retrained on a generic task after adding the adaptive attention pattern to the machine-learning model (Lu, Pg. 984, Col. 2, Para. 2, “Since our method does not require pre-training, we directly fine-tune a pre-trained checkpoint on downstream tasks”).
Regarding Claim 16, the additional elements of the dependent claim are substantially the same as the limitations of Claim 3, therefore it is rejected under the same rationale.
Regarding Claim 18, Lu in view of Beltagy and Liu teach the system of claim 14, wherein the operations further comprise learning different adaptive attention patterns for different layers of the machine-learning model (Beltagy, Pg. 5, Col. 1, Para. 2-3, “we use differing window sizes across the layers. In particular, we use small window sizes for the lower layers and increase window sizes as we move to higher layers”; Beltagy, Pg. 15, Table 12, “Dilation (small model)”, where the window differs across layers, which is a known component of “attention patterns”, see Beltagy Pg. 3, Figure 2).
The reasons of obviousness have been noted in the rejection of Claim 1 above and remain applicable here.
Regarding Claim 20, the additional elements of the dependent claim are substantially the same as the limitations of Claim 5, therefore it is rejected under the same rationale.
Claims 2, 8-9, 13, and 17 are rejected under 35 U.S.C. 103 as being unpatentable over Lu in view of Beltagy, Liu, and Chang et al. (hereinafter Chang) (“End-to-End ASR with Adaptive Span Self-Attention”).
Regarding Claim 2, Lu in view of Beltagy and Liu teach the method of claim 1 . . . .
Lu in view of Beltagy and Liu do not teach . . . wherein the adaptive attention pattern is for a single layer of the machine-learning model.
However, Chang teaches [a method] . . . where an adaptive attention pattern is for a single layer of the machine-learning model (Pg. 3595, Abstract, “we propose to use a technique called adaptive span self-attention . . . [which] enables the network to learn an appropriate size and position of the window for each layer and head”, where “each” indicates a given attention pattern is for a single layer).
Before the effective filing date of the invention, it would have been obvious to one of ordinary skill in the art to combine the identification of an important row or column in an attention matrix, inclusion of the row or column in an adaptive attention pattern, and use of the adaptive attention pattern to generate an inference of Lu in view of Beltagy and Liu, with the adaptive attention pattern for a single layer of Chang, in order to account for behavioral differences across model layers, which impacts performance (Chang, Pg. 3596, Col. 2, Para. 5-6, “the behavior of each head at every layer is not necessarily the same, and using a single span size hyperparameter W (or Wl and Wr) for all the self-attention computations is not appropriate . . . the motivation is to learn the appropriate span size at each self-attention head and layer during training”; Chang, Pg. 3595, Abstract, “the proposed adaptive span methods consistently improved the performance from the conventional fixed span methods”).
Regarding Claim 8, the additional elements of the dependent claim are substantially the same as the limitations of Claim 2, therefore it is rejected under the same rationale.
Regarding Claim 9, the additional elements of the dependent claim are substantially the same as the limitations of Claim 5, therefore it is rejected under the same rationale.
Regarding Claim 13, Lu in view of Beltagy, Liu, and Chang teach the non-transitory computer-readable medium of claim 8, wherein the sparse-attention model is not retrained on a generic task after adding the adaptive attention pattern to the sparse-attention model (Lu, Pg. 984, Col. 2, Para. 2, “Since our method does not require pre-training, we directly fine-tune a pre-trained checkpoint on downstream tasks”, where, as discussed above, the model is a sparse-attention model, see Lu, Pg. 989, Col. 1, Para. 8, “Train a model with Sanger sparse attention”; see also Lu, Pg. 984, Col. 1, Para. 4, “Implementation Details The first stage of the attention mechanism is to calculate queries”, where the “Sanger” model is implemented with a “self-attention (abbreviated as attention)” operation).
Regarding Claim 17, the additional elements of the dependent claim are substantially the same as the limitations of Claim 2, therefore it is rejected under the same rationale.
Claim 12 is rejected under 35 U.S.C. 103 as being unpatentable over Lu in view of Beltagy, Liu, and Merle (“Effortless NLP using pre-trained Hugging Face pipelines (with just 3 lines of code!)”).
Regarding Claim 12, Lu in view of Beltagy and Liu teach the non-transitory computer-readable medium of claim 7, wherein the pre-trained machine-learning model is trained on a . . . task (Lu, Pg. 988, Col. 2, “Artifact check-list (meta-information)”, “Model”, “Our scripts download pre-trained checkpoints from the Hugging Face Model Hub”).
Lu in view of Beltagy and Liu do not specifically teach that the task should be . . . generic . . . .
However, Merle teaches [the] . . . task [is generic] (Pg. 4, para. 2, “Pre-training should be generic”; Pg. 7, Para. 1, “the Hugging Face model hub . . . I will use in this article”).
Before the effective filing date of the invention, it would have been obvious to one of ordinary skill in the art to further combine the pre-trained machine learning model trained on data from the Hugging Face Model Hub of Lu in view of Beltagy and Liu, with training on generic tasks on the Hugging Face Model Hub of Merle in order to obtain models that can be used for multiple tasks (Merle, Pg. 4, Para. 2, “Pre-training should be generic, in order to use the model for a wide range of objectives”).
Claim 19 is rejected under 35 U.S.C. 103 as being unpatentable over Lu in view of Beltagy, Liu, and Xu et al. (hereinafter Xu) (“Transformer Empowered CSI Feedback for Massive MIMO Systems”).
Regarding Claim 19, Lu in view of Beltagy and Liu teach the system of claim 14, wherein the operations further comprise: . . . generat[ing] an importance measure for individual tokens (Lu, Pg. 979, Col. 1, Para. 1, “the score matrix is calculated by multiplying the query matrix and key matrix, which represents the importance of each input token when producing an output element”, where “each input token” indicates the importance measure is for individual tokens); and
providing the importance measure to a . . . function (Lu, Pg. 979, Col. 1, Para. 1, “We then normalize the score matrix with a row-wise softmax function”)
to generate the task-specific importance score (Lu, Pg. 978, Col. 1, Para. 2, “The first step of computing attention is to obtain a score matrix . . . often referred to as the attention matrix”; Lu, Pg. 979, Col. 1, Para. 1, “the score matrix . . . represents the importance of each input token when producing an output element”, which is task specific)
for the row or the column (Beltagy, Pg. 3, figure 2(d), where the “Longformer” “attention pattern” includes selected rows and columns, identified by green shading).
The reasons of obviousness have been noted in the rejection of Claim 1 above and remain applicable here.
Lu in view of Beltagy and Liu do not teach . . . providing an output from a self-attention layer to a fully-connected layer to . . . sigmoid . . . .
However, Xu teaches . . . providing an output from a self-attention layer to a fully-connected layer to . . . [generate a matrix] (Pg. 158-159, Section “III. Proposed Schemes”, Para. 2, “the input and output of the self-attention layer are added together and subsequently normalized. The normalized data is then fed into a fully-connected layer for linear transformation”)
. . . [providing the output of the transformer layer to a] sigmoid [function] (Pg. 159, Section “III. Proposed Schemes”, Para. 3, “The output of the transformer layer is scaled to [0, 1] by a sigmoid function”).
Before the effective filing date of the invention, it would have been obvious to one of ordinary skill in the art to combine the generating of a matrix of importance measures for tokens and providing the matrix to a function to generate task-specific importance scores for rows and columns of Lu in view of Beltagy and Liu, with the generating of a matrix by passing self-attention layer output to a fully connected layer and scaling with a sigmoid function of Xu, in order to improve model performance (Xu, Pg. 161, Para. 1, “CsiTransformer . . . achieved significantly better performance than the original CNN-based CsiNet at all compression ratios we tested. In particular, our experiment results have suggested that the proposed CsiTransformer can achieve higher recovery accuracy”, where the above scheme was for “CsiTransformer”, see Pg. 158-159, Section “III. Proposed Schemes, A. CsiTransformer”).
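For illustration only, the importance computation cited from Lu (a score matrix from queries and keys, normalized by a row-wise softmax) and the sigmoid scaling cited from Xu can be sketched as follows; the function names are hypothetical and the sketch does not reproduce either reference's implementation:

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable row-wise softmax normalization.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def token_importance(q, k):
    # Score matrix Q @ K^T, row-normalized by softmax, as in the cited
    # description of per-token importance (illustrative only).
    return softmax(q @ k.T, axis=-1)

def sigmoid(x):
    # Scales values to (0, 1), as in the cited Xu description.
    return 1.0 / (1.0 + np.exp(-x))

q = np.eye(3)
k = np.eye(3)
importance = token_importance(q, k)  # each row sums to 1
```

The softmax rows sum to one, so each entry can be read as a relative importance of an input token when producing an output element.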
Response to Arguments
Applicant's arguments filed on 12/31/2025 have been fully considered. Each argument is addressed in detail below.
I. Applicant argues the rejections to the claims, under 35 U.S.C. § 112(b), should be withdrawn (Applicant’s Remarks, 12/31/2025, Pg. 7, Section “Rejections based on 35 U.S.C. § 112(b) or 35 U.S.C. § 112 (pre-AIA)”).
Applicant’s amendments to the claims have overcome each and every rejection to the claims, under 35 U.S.C. § 112(b), previously communicated in the 10/09/2025 Office Action. As a result, the rejections to the claims, under 35 U.S.C. § 112(b), have been withdrawn.
II. Applicant argues the rejections to the claims, under 35 U.S.C. § 101, should be withdrawn (Applicant’s Remarks, 12/31/2025, Pg. 7-11, Section “Rejections based on 35 U.S.C. § 101”).
First, Applicant argues the claimed steps cannot be performed in the mind or with the aid of pen and paper (Step 2A, Prong One). Specifically, Applicant argues the claims, as amended, are not directed to a mere abstract idea or mental process because they are directed to specific improvements to computer technology, wherein the asserted improvement is the training and deployment of machine learning models with adaptive sparse attention patterns that are learned only during fine-tuning with task-specific data.
In support of this position, Applicant references Federal Circuit decisions, such as McRO, Inc. v. Bandai Namco Games America Inc., 837 F.3d 1299, 1314-16 (Fed. Cir. 2016) and Enfish, LLC v. Microsoft Corp., 822 F.3d 1327, 1335-36 (Fed. Cir. 2016), which stand for the position that claims directed to specific improvements in the functionality of a computer are patent eligible.
Furthermore, Applicant argues the steps of identifying or including rows or columns in an attention matrix “require generating a task-specific machine-learning model by training a generic model on task-specific data, and identifying important rows or columns in an attention matrix only during fine-tuning with task-specific data, and without pre-training the model on generic data to adapt to the attention matrix”, which Applicant characterizes as high-dimensional structures and computational resources beyond human capability.
According to MPEP 2106.04(a)(2)(III)(C), “Claims can recite a mental process even if they are claimed as being performed on a computer . . . In evaluating whether a claim that requires a computer recites a mental process, examiners should carefully consider the broadest reasonable interpretation of the claim in light of the specification. For instance, examiners should review the specification to determine if the claimed invention is described as a concept that is performed in the human mind and applicant is merely claiming that concept performed 1) on a generic computer, or 2) in a computer environment, or 3) is merely using a computer as a tool to perform the concept.”
Additionally, according to MPEP 2106.05(a)(II), “The McRO court also noted that the claims at issue described a specific way (use of particular rules to set morph weights and transitions through phonemes) to solve the problem of producing accurate and realistic lip synchronization and facial expressions in animated characters, rather than merely claiming the idea of a solution or outcome, and thus were not directed to an abstract idea” (See also 837 F.3d at 1313, 120 USPQ2d at 1101).
Furthermore, according to MPEP 2145 (VI), “Although the claims are interpreted in light of the specification, limitations from the specification are not read into the claims” (See also In re Van Geuns, 988 F.2d 1181, 26 USPQ2d 1057 (Fed. Cir. 1993)).
Here, the claims recite abstract ideas such as identifying and including, which do require computer resources, such as model parameters and self-attention machine learning mechanisms, and high-dimensional data structures to be performed. However, these data structures and computer resources are described generically (see Claim 1, ln. 4 and 9, “an attention matrix” and “an adaptive attention pattern”; see also Claims 7 and 14, where the machines are described as comprising standard computer components) and function merely as a tool to perform the mental concepts. Additionally, the elements of the claims fail to particularly point out specific processes that rise to the level required to be comparable to the claims of McRO, where the rules for animation tasks were particularly described. Instead, the claims recite steps of identifying and generating at a high level, such as determining rows or columns for use in generating an adaptive attention pattern, that are within the abilities of the human mind. This remains the case for instances where the determining is based on importance scores generated during generic training processes.
As a result, the arguments are not persuasive.
Second, Applicant argues the claimed process is integrated into a practical application (Step 2A, Prong 2). Specifically, Applicant asserts the claims are directed to specific improvements in the operation of machine-learning models, such as “the creation and use of adaptive sparse attention patterns that are learned only during fine-tuning, which results in improved accuracy and efficiency for task-specific inference”.
In support of this position, Applicant cites excerpts from the specification, which are asserted to show that the approach allows for “the model to "customize" its attention pattern for each task, leading to better performance and reduced computational cost compared to prior art methods that use static or pre-defined attention patterns” (see Spec. [0003], [0015], [0017], [0019]).
Additionally, Applicant cited Federal Circuit decisions that stand for the position that claims are patent eligible if directed to a specific implementation of a solution in the software arts, such as DDR Holdings, LLC v. Hotels.com, L.P., 773 F.3d 1245, 1259 (Fed. Cir. 2014), to argue the claims recite “a specific process for generating and using adaptive attention patterns in a machine-learning model”.
Furthermore, Applicant cited USPTO’s Examples 39 and 40 in order to argue the claims analogously recite specific technological improvements. Specifically, “the present claims recite a specific improvement to the functioning of machine-learning models, namely, the ability to adapt attention patterns during fine-tuning based on task-specific importance scores” and “the claims similarly recite a method of training a neural network (a transformer model) using a novel technique: identifying important rows or columns in the attention matrix during fine-tuning, and using those to construct an adaptive attention pattern”, which Applicant argues contributes to increased accuracy, increased generalization to specific tasks, and reduced computational overhead.
According to MPEP 2106.04(d)(1), “A claim reciting a judicial exception is not directed to the judicial exception if it also recites additional elements demonstrating that the claim as a whole integrates the exception into a practical application. One way to demonstrate such integration is when the claimed invention improves the functioning of a computer or improves another technology or technical field . . . if the specification explicitly sets forth an improvement but in a conclusory manner (i.e., a bare assertion of an improvement without the detail necessary to be apparent to a person of ordinary skill in the art), the examiner should not determine the claim improves technology”.
According to MPEP 2106.05(f), “Another consideration when determining whether a claim integrates a judicial exception into a practical application in Step 2A Prong Two . . . is whether the additional elements amount to more than a recitation of the words "apply it" (or an equivalent) . . . A claim having broad applicability across many fields of endeavor may not provide meaningful limitations that integrate a judicial exception into a practical application”.
Here, the specification recites technological improvements in a conclusory manner. Specifically, the benefits of layer-specific customization and increased accuracy with reduced computational needs are asserted to be achievable benefits of the claimed subject matter. However, these conclusory statements are not accompanied by sufficient detail for the improvements to be fully apparent to one of ordinary skill in the art. Similarly, the additional elements recited in the claims, such as a task-specific machine-learning model, an attention matrix, and an attention pattern, have broad applicability across many transformer-based machine-learning approaches. This is substantially different from the purported improvements to the functioning of a computer technology in USPTO Examples 39 and 40, which provide for specific improvements over prior systems. Instead, these generic recitations of transformer-based computer components amount to merely reciting the words “apply it”, which does not sufficiently integrate the abstract ideas into a practical application.
As a result, the arguments are not persuasive.
Finally, Applicant argues the claims recite significantly more than the judicial exception (Step 2B). Specifically, Applicant argues the claims recite a specific sequence of steps for generating a task-specific model, identifying important rows or columns in an attention matrix during fine-tuning, and using the resulting adaptive attention pattern in a self-attention operation to generate a task-specific inference. As a result, Applicant asserts the claims recite a specific improvement of enabling more efficient and accurate adaptation of attention patterns for different tasks, which was not previously possible in the field of machine learning and artificial intelligence.
In support of this position, Applicant cites the USPTO's 2019 Revised Patent Subject Matter Eligibility Guidance (84 Fed. Reg. 50, Jan. 7, 2019), which recognizes that claims reciting an improvement to the functioning of a computer or another technology are patent eligible.
According to MPEP 2106.05(a)(I), “Examples that the courts have indicated may not be sufficient to show an improvement in computer-functionality [include] . . . Accelerating a process . . . when the increased speed comes solely from the capabilities of a general-purpose computer . . . Mere automation of manual processes, such as using a generic computer”.
Additionally, according to MPEP 2106.05(I)(B), “Limitations that the courts have found not to be enough to qualify as "significantly more" when recited in a claim with a judicial exception include: i. Adding the words "apply it" (or an equivalent) with the judicial exception, or mere instructions to implement an abstract idea on a computer . . . ii. Simply appending well-understood, routine, conventional activities previously known to the industry, specified at a high level of generality, to the judicial exception . . . iii. Adding insignificant extra-solution activity to the judicial exception, iv. Generally linking the use of the judicial exception to a particular technological environment or field of use”.
Here, as discussed above in regard to integration into a practical application, the additional elements of the claims do not amount to a technological improvement. This remains true upon reconsideration to determine whether the additional elements amount to significantly more than the judicial exception. Specifically, the asserted benefits of efficient and accurate adaptation of attention patterns for different tasks are achieved by the recited steps for adapting patterns during fine-tuning based on importance scores. These limitations, which recite the elements of fine-tuning and attention pattern generation at a high level, merely automate manual processes for determining attention patterns, using computer components that are described generically. Furthermore, the task-specific machine-learning models, attention matrix, and attention patterns are generic and do not provide the level of specificity required to constitute a specific improvement because they are generically used in transformer-based machine-learning applications.
Instead, the claims, as discussed above, recite the elements of fine-tuning and attention pattern generation at a high level. This amounts to mere recitation of generic computer components to apply the judicial exception. Alternatively, the requirement that the importance scores be generated during fine-tuning, recited at its current level of generality, could be interpreted as generally linking the use of the judicial exception to a particular technological environment or field of use.
As a result, the arguments are not persuasive.
III. Applicant argues the rejections to the independent claims, under 35 U.S.C. § 103, should be withdrawn (Applicant’s Remarks, 12/31/2025, Pg. 11-15, Section “Rejections based on 35 U.S.C. § 103”).
Specifically, Applicant asserts that the prior art of record fails to teach or suggest limitations of the independent claims.
Regarding Claim 1, Applicant asserts the prior art of record fails to teach or suggest “identifying a row or a column in an attention matrix with an importance score for a task that is above a threshold importance score, wherein the importance score is generated only during the training of the task-specific machine-learning model with the task-specific training data and without pre-training the model on generic training data to adapt to the attention matrix”.
As to Claim 7, Applicant asserts the prior art of record fails to teach or suggest "generating a tuned sparse-attention model by fine tuning the sparse-attention model to perform a task with task-specific training, wherein the sparse-attention model is an adaptive attention pattern, and wherein the adaptive attention pattern is learned during fine-tuning of the sparse-attention model using task-specific training data".
Regarding Claim 14, Applicant asserts the prior art of record fails to teach or suggest "generating a tuned sparse-attention model by fine tuning the sparse-attention model to perform a task with task-specific training, wherein the sparse-attention model is an adaptive attention pattern, and wherein the adaptive attention pattern is learned during fine-tuning of the sparse-attention model using task-specific training data".
In support of this assertion, Applicant argues 1) Liu fails to teach or suggest limitations that both the 10/09/2025 Office Action and this Office Action rely on it to teach or suggest and 2) the combination of Lu in view of Beltagy and Liu, as articulated in both the 10/09/2025 Office Action and this Office Action, is insufficient.
Each argument is addressed below.
1. Applicant argues Liu fails to teach or suggest limitations that both the 10/09/2025 Office Action and this Office Action rely on it to teach or suggest.
According to MPEP 2111, “During patent examination, the pending claims must be given their broadest reasonable interpretation consistent with the specification” (internal quotation marks omitted) (see also Phillips v. AWH Corp., 415 F.3d 1303, 1316, 75 USPQ2d 1321, 1329 (Fed. Cir. 2005)).
Additionally, according to MPEP 2111.01, “Under a broadest reasonable interpretation (BRI), words of the claim must be given their plain meaning, unless such meaning is inconsistent with the specification. The plain meaning of a term means the ordinary and customary meaning given to the term by those of ordinary skill in the art at the relevant time. The ordinary and customary meaning of a term may be evidenced by a variety of sources, including the words of the claims themselves, the specification, drawings, and prior art”).
Furthermore, according to MPEP 2111.01, “II. IT IS IMPROPER TO IMPORT CLAIM LIMITATIONS FROM THE SPECIFICATION Though understanding the claim language may be aided by explanations contained in the written description, it is important not to import into a claim limitations that are not part of the claim” (internal quotation marks omitted) (see also Superguide Corp. v. DirecTV Enterprises, Inc., 358 F.3d 870, 875, 69 USPQ2d 1865, 1868 (Fed. Cir. 2004)).
Finally, according to MPEP 2145, “One cannot show nonobviousness by attacking references individually where the rejections are based on combinations of references . . . Where a rejection of a claim is based on two or more references, a reply that is limited to what a subset of the applied references teaches or fails to teach, or that fails to address the combined teaching of the applied references may be considered to be an argument that attacks the reference(s) individually” (see also In re Keller, 642 F.2d 413, 208 USPQ 871 (CCPA 1981)).
Here, Liu is relied upon to teach “wherein the importance score is generated . . . during the training of the task-specific machine-learning model”. Applicant asserts Liu does not teach this limitation.
First, Applicant asserts that Liu never generates an importance score because the generated “S~” does not comprise importance scores. Specifically, Applicant argues the generated “S~” does not comprise importance scores because it is allegedly a low-rank and low-precision approximation of the full attention matrix S = QK^T, computed by applying a sparse random projection and learned projection matrices (W_Q' and W_K'). Additionally, Applicant alleges the goal of the prediction path is simply "to obtain an approximation of attention scores with low computational costs" (Liu, p. 3), where the predicted scores are computational surrogates that are not described as task-dependent metrics, relevance indicators, or importance measures.
However, even if Applicant’s characterization of the generated “S~” were assumed to be true, there are no positively recited limitations in the claims that would preclude the importance scores from being low-rank and low-precision approximations of an attention matrix, which is applied using projections. This is similarly the case for the goal of obtaining the score with low computational costs, where the scores serve only as computational surrogates for full attention values. Instead, the relevant inquiry is whether the scores comprised in the “S~” are within the broadest reasonable interpretation, consistent with the specification, of importance scores. Further, it is not required that the importance scores be described as task-dependent metrics, relevance indicators, or importance measures if they are within the ordinary and customary meaning given to importance scores. The scores comprised in the “S~” are within the broadest reasonable interpretation of importance scores because the scores are used to search for “important attention weights” for sparse attention masks, where “threshold values” and “top-k searching” can use the importance scores to determine importance (Pg. 3, Col. 2, Para. 3, “From ~ S, we can predict sparse attention masks M using thresholds, where the threshold values are either fixed by tuning from the validation set or determined by top-k searching”; see also Pg. 1, Col. 2, Para. 2, “The challenge is to efficiently search for sparse patterns close to oracle sparse patterns that keep all the important attention weights”).
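For illustration only, the threshold and top-k mask prediction described in the quoted passages of Liu can be sketched in simplified form. All array shapes, values, and variable names below are assumptions made for this sketch; they do not reproduce Liu's actual implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy approximate attention scores (standing in for Liu's "S~"): one n x n matrix.
n = 6
s_approx = rng.standard_normal((n, n))

# Fixed-threshold mask: keep entries whose approximate score exceeds tau.
tau = 0.5
mask_threshold = s_approx > tau

# Top-k mask: per query row, keep the k largest approximate scores.
k = 2
topk_idx = np.argsort(s_approx, axis=1)[:, -k:]
mask_topk = np.zeros_like(s_approx, dtype=bool)
np.put_along_axis(mask_topk, topk_idx, True, axis=1)
```

Either boolean mask could then be applied to restrict which attention weights are computed, which is the sense in which the approximate scores are used to locate "important attention weights".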
As a result, the argument is not persuasive.
Second, Applicant argues that “S~” is not generated during training of any kind because it is generated at inference. However, Liu is relied upon in combination with Beltagy to teach the “generated only during the training” (Claim 1) and Applicant has not argued any failures of Beltagy for its teaching of “only”. As a result, the question of whether Liu teaches generating the “S~” during inference is not relevant to the question of whether it teaches generating the “S~” during training.
As mentioned by the Applicant, “S~” is computed from input data (Pg. 10, Col. 1, Para. 3, “the sparse attention patterns are inherently dynamic depending on input sequences”; see also Pg. 3, Col. 2, Para. 1, “We construct a pair of approximate query and key transformations in the prediction path to compute for approximate score S~, as in Q~; K~ = XP W~Q;XPW~K”). Subsequently, the “S~” is used to “Optimiz[e] . . . the trainable parameters, W~Q and W~K, through minimizing the mean squared error (MSE)” (Pg. 3, Col. 2, Para. 5, “Optimization of Approximation. The random projection matrix P is constant after initialization and shared by two approximate transformations. We obtain the trainable parameters, W~Q and W~K, through minimizing the mean squared error (MSE) as the criterion to optimize for approximation:
[Equation reproduced as image in the original record (media_image2.png): MSE optimization criterion]
”; Pg. 4, Col. 1, Para. 5, “learns a low-rank S with a learnable rank depending on the difficulty of the task”). The plain meaning of training of a task-specific machine learning model is the process of inputting data into a model to generate an output, comparing the outputs with an expected result, and using the comparison information to adjust the model.
A person of ordinary skill in the art would understand Liu to teach that “S~” is generated during the training of the task-specific machine-learning model because it is generated in response to input, which occurs during the forward pass of model training, and it is used to generate the loss that is used to optimize trainable model parameters.
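The forward-pass generation of “S~” and its use in an MSE criterion, as understood above, can be sketched as follows. The dimensions, initializations, and projection details here are illustrative assumptions for this sketch and do not reproduce Liu's actual configuration:

```python
import numpy as np

rng = np.random.default_rng(1)
n, d, r = 8, 16, 4  # sequence length, model dim, projection rank (all illustrative)

X = rng.standard_normal((n, d))               # input sequence
W_Q = rng.standard_normal((d, d))             # full query weights
W_K = rng.standard_normal((d, d))             # full key weights
P = rng.standard_normal((d, r)) / np.sqrt(r)  # fixed random projection (constant after init)
W_Qa = rng.standard_normal((r, r)) * 0.1      # trainable approximate query weights
W_Ka = rng.standard_normal((r, r)) * 0.1      # trainable approximate key weights

S = (X @ W_Q) @ (X @ W_K).T                   # full attention scores S = Q K^T

# Forward pass: approximate scores are generated in response to the input X.
Q_a = (X @ P) @ W_Qa                          # approximate query Q~
K_a = (X @ P) @ W_Ka                          # approximate key K~
S_a = Q_a @ K_a.T                             # approximate scores S~

# MSE criterion: the loss that would be minimized to optimize W_Qa and W_Ka.
mse = np.mean((S_a - S) ** 2)
```

The sketch shows the relationship relied upon above: "S~" is computed from input during the forward pass, and the resulting error drives the optimization of the trainable parameters.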
As a result, the argument is not persuasive.
Third, Applicant alleges the training of Liu is misunderstood in the 10/09/2025 Office Action and this Office Action. Specifically, Applicant alleges Liu's training process does not produce a score that is used later to guide attention selection. Instead, Applicant argues the optimization of Liu is the use of combined loss values to adjust the weights in order to better approximate S. First, it is worth noting that the full claim limitation referenced by Applicant is taught by a combination of references. Therefore, Applicant’s attack on Liu alone, to argue that the limitations relating to use of the importance scores to guide attention selection are not taught, is necessarily unpersuasive. However, even if our analysis were constrained to Liu and Applicant’s assertions of the teachings of Liu were accepted as fact, the use of “S~” to optimize model parameters during training in order to improve the generation of “S~” during inference would be within the broadest reasonable interpretation of use of the scores to guide later attention selection.
Furthermore, Applicant alleges “no scoring function is produced, no task-dependent importance values are computed, and no thresholding of scores occurs during training”. However, the claim, as currently formulated, only requires generation during training and does not positively recite any limitations that would require a final output of a score instead of updated parameters, a scoring function, task-dependent importance values, or thresholding to occur during training. As a result, the question of whether Liu teaches these elements is not relevant to the claim as currently formulated.
As a result, the argument is not persuasive.
Fourth, Applicant argues the “S~” are not tied to task semantics or training signals and that the values do not arise from a supervised learning objective that specifically instills significance or importance. However, there are no positively recited limitations in the claims that would specifically require these elements to apply to the “S~”. Additionally, as discussed in detail above, even if it were conceded that “S~” is simply an approximate computation of QK^T and that the purpose was not explicitly described as evaluating importance, the scores comprised within “S~” would still be within the broadest reasonable interpretation of importance scores because the scores are used to search for “important attention weights” for sparse attention masks, where “threshold values” and “top-k searching” can use the importance scores to determine importance (Pg. 3, Col. 2, Para. 3, “From ~ S, we can predict sparse attention masks M using thresholds, where the threshold values are either fixed by tuning from the validation set or determined by top-k searching”; see also Pg. 1, Col. 2, Para. 2, “The challenge is to efficiently search for sparse patterns close to oracle sparse patterns that keep all the important attention weights”).
As a result, the argument is not persuasive.
2. Applicant argues the combination of Lu in view of Beltagy and Liu, as articulated in both the 10/09/2025 Office Action and this Office Action, is insufficient.
First, Applicant argues Liu teaches away from the combination because it disparages the static patterns and coarse adaptive patterns based on rows or columns of Beltagy. Specifically, Applicant argues Liu criticizes the static sparse patterns of Beltagy for "restricting viable attention connections" (Liu, p. 1) and argues static approaches cannot capture dynamic attention behavior. Therefore, Applicant concludes there cannot be a reasonable motivation to combine these references.
According to MPEP 2143, “The courts have made clear that the teaching, suggestion, or motivation test is flexible and an explicit suggestion to combine the prior art is not necessary. The motivation to combine may be implicit and may be found in the knowledge of one of ordinary skill in the art, or, in some cases, from the nature of the problem to be solved”.
Additionally, according to MPEP 2141.01(a), “A prior art reference must be considered in its entirety, i.e., as a whole, including portions that would lead away from the claimed invention . . . However, "the prior art’s mere disclosure of more than one alternative does not constitute a teaching away from any of these alternatives because such disclosure does not criticize, discredit, or otherwise discourage the solution claimed” (see also In re Fulton, 391 F.3d 1195, 1201, 73 USPQ2d 1141, 1146 (Fed. Cir. 2004)).
Here, as discussed in detail above, before the effective filing date of the invention, it would have been obvious to one of ordinary skill in the art to combine the generation of a task-specific machine-learning model by training a generic model on task-specific data, identifying a row or column in an attention matrix with an importance score above a threshold only prior to inference, and the inclusion of the row or column in an adaptive attention pattern for use in inferencing by the machine learning model having a self-attention operation of Lu in view of Beltagy with the generation of the importance score during training of the task-specific machine-learning model of Liu in order to improve accuracy and speedup on difficult tasks by optimizing aspects of the model prior to inference (Liu, Pg. 4, Col. 1, Para. 5, “joint optimization . . . can potentially achieve higher accuracy on difficult tasks and higher speedup on simple tasks compared with low-rank approximation methods using fixed rank”), while allowing the row or column to be selected only during training, which will allow for important rows or columns to be determined when not previously known (compare Beltagy, Pg. 4, Col. 1, Para. 5, “Accordingly, we add “global attention” on few pre-selected input locations”, where the selections must be known in advance, with Liu, Pg. 6, Col. 2, Para. 2-3, “we want the predictor to accurately capture dynamic sparse patterns . . . we use predictor to indicate the positions of the important attention weights”, where importance can be determined).
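The mechanism at issue in the combination (identifying a row or column whose importance score exceeds a threshold and including it in an adaptive attention pattern) can be sketched in simplified form. The row-sum importance score and the median threshold below are illustrative assumptions for this sketch only; they are not elements of the claims or of the cited references:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 6
A = np.abs(rng.standard_normal((n, n)))  # toy attention matrix observed during fine-tuning

# One illustrative importance score per row (assumed definition: row sum).
row_importance = A.sum(axis=1)
threshold = np.median(row_importance)    # assumed threshold choice

# Include each row scoring above the threshold in the adaptive pattern,
# along with its matching column (akin to a "global attention" position).
pattern = np.zeros((n, n), dtype=bool)
for i in np.flatnonzero(row_importance > threshold):
    pattern[i, :] = True
    pattern[:, i] = True
```

The resulting boolean pattern restricts the self-attention operation to the selected rows and columns, which is the sparse-attention behavior the claimed method is described as producing.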
In sum, the combination of Lu in view of Beltagy and Liu, as articulated in both the 10/09/2025 Office Action and this Office Action, is the combination of the identifying a row or column in an attention matrix with an importance score above a threshold only prior to inference, as taught by Beltagy, and the generation of the importance score during training of the task-specific machine-learning model, as taught by Liu. As a result, even if it were assumed that Liu disparages the overall approach of Beltagy, it would not follow that Liu criticized, discredited, or otherwise discouraged the solution claimed. Liu’s advocacy for the benefits of its approach over popular alternatives, such as Beltagy, amounts to establishing a preference. However, regardless of whether Liu contains a demonstrated preference for its adaptive approach to attention selection, the question is whether Liu criticizes, discredits, or otherwise discourages the solution claimed. Liu never states that its attention optimization approach to model training is incompatible with the selection of rows or columns in an attention matrix. Conversely, a person of ordinary skill in the art would understand that the optimization approach to attention selection could be used to supplement the approach of Beltagy so that the identification of rows or columns could be based, at least in part, on inputs, instead of solely domain knowledge (Pg. 1, Col. 2, Para. 3, “We observe important tokens that attract a large portion of attention weights from other tokens, similar to the global attention method (Beltagy et al., 2020; Zaheer et al., 2020). However, the positions of global tokens are input-dependent, and our method can effectively identify such varieties, instead of relying on domain knowledge to predetermine certain global tokens in fixed positions”).
As a result, the argument is not persuasive.
Second, Applicant argues a person of ordinary skill in the art would not be motivated to combine Liu and Beltagy because the references allegedly pursue different objectives and use incompatible paradigms. Specifically, Applicant draws a distinction between the static methods of Beltagy and the dynamic methods of Liu. As a result, Applicant argues their combination would undermine the advantages of each method and contradict the stated principles of Liu.
According to MPEP 2143.01, “If a proposed modification would render the prior art invention being modified unsatisfactory for its intended purpose, there may be no suggestion or motivation to make the proposed modification”.
Here, as discussed in detail above, the proposed modification is to utilize the optimization approach of Liu in order to allow the identification of rows or columns in an attention matrix, as taught by Beltagy, to be more accurate. As a result, other aspects of Liu, such as its dynamic selection methods that occur downstream of its optimization approach, are not combined with Lu in view of Beltagy. Instead, the fact that Liu's optimization approach is used for downstream attention selection is only relevant insofar as it establishes demonstrated compatibility with attention selection. As a result, the proposed modification would not render Lu in view of Beltagy unsatisfactory for its intended purpose.
As a result, the argument is not persuasive.
IV. Applicant argues the rejections to the dependent claims, under 35 U.S.C. § 103, should be withdrawn (Applicant’s Remarks, 12/31/2025, Pg. 14-17, Section “Rejections based on 35 U.S.C. § 103”).
Applicant argues that the dependent claims are allowable because they are dependent upon an allowable independent claim. Additionally, in instances where an additional reference is used to teach an additional element of a dependent claim, Applicant asserts that the additional reference fails to make up for alleged deficiencies with the combination of references relied upon in the rejections of the independent claims. Specifically, Chang (Pg. 16), Merle (Pg. 16), and Xu (Pg. 16-17) are discussed.
According to MPEP 2145, “One cannot show nonobviousness by attacking references individually where the rejections are based on combinations of references . . . Where a rejection of a claim is based on two or more references, a reply that is limited to what a subset of the applied references teaches or fails to teach, or that fails to address the combined teaching of the applied references may be considered to be an argument that attacks the reference(s) individually” (see also In re Keller, 642 F.2d 413, 208 USPQ 871 (CCPA 1981)).
Here, as discussed above, the arguments for the patentability of the independent claims were not persuasive. As a result, arguments for the patentability of the dependent claims that are based solely on the asserted patentability of the independent claims are not persuasive. Additionally, the alleged failures of Chang, Merle, and Xu to teach limitations of the independent claims amount to attacking references individually where the rejections are based on combinations of references. Notably, each of Chang, Merle, and Xu is only relied upon to teach additional limitations of the dependent claims. In contrast, as discussed in detail above, the combination of Lu in view of Beltagy and Liu fully teaches the elements of the independent claims.
As a result, the arguments are not persuasive.
Conclusion
Any inquiry concerning this communication or earlier communications from the examiner should be directed to MATTHEW BRYCE GOLAN whose telephone number is (571)272-5159. The examiner can normally be reached Monday through Friday, 8:00 AM to 5:00 PM ET.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Alexey Shmatov can be reached at (571) 270-3428. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.
/MATTHEW BRYCE GOLAN/Examiner, Art Unit 2123
/ALEXEY SHMATOV/Supervisory Patent Examiner, Art Unit 2123