Last updated: May 29, 2026
Application No. 17/841,577
ADAPTIVE TOKEN DEPTH ADJUSTMENT IN TRANSFORMER NEURAL NETWORKS

Non-Final OA §101 §103
Filed: Jun 15, 2022
Priority: Dec 09, 2021 (provisional 63/287,938)
Examiner: HALES, BRIAN J
Art Unit: 2125
Tech Center: 2100 (Computer Architecture & Software)
Assignee: Nvidia Corporation
OA Round: 2 (Non-Final)
Interview Optional

Examiner has a relatively high allowance rate (78%) and a +31.3% interview lift. A written response may suffice.
Based on 87 resolved cases, 2023–2026
Examiner Intelligence

HALES, BRIAN J
Career Allowance Rate: 78% (above average); 68 granted / 87 resolved; +23.2% vs TC avg
Interview Lift: +31.3% (strong), comparing resolved cases with and without interview
Avg Prosecution (typical timeline): 3y 10m
Currently pending: 12
Total Applications (career history, across all art units): 109
Statute-Specific Performance

§101: 29.4% (-10.6% vs TC avg)
§103: 56.2% (+16.2% vs TC avg)
§102: 4.5% (-35.5% vs TC avg)
§112: 10.0% (-30.0% vs TC avg)
Tech Center averages are estimates. Based on career data from 87 resolved cases.
Office Action

§101 §103
DETAILED ACTION
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
This action is in response to amendments and remarks filed on 10/28/2025. In the current amendments, claim 18 is amended. Claims 1-20 are pending and have been examined.
In response to amendments and remarks filed on 10/28/2025, the 35 U.S.C. 112(b) rejections made in the previous office action are withdrawn.
Claim Rejections - 35 USC § 101
35 U.S.C. 101 reads as follows:
Whoever invents or discovers any new and useful process, machine, manufacture, or composition of matter, or any new and useful improvement thereof, may obtain a patent therefor, subject to the conditions and requirements of this title.

Claims 1-20 are rejected under 35 U.S.C. 101 because the claimed invention is directed to an abstract idea without significantly more. 
Regarding Claim 1,
Claim 1 is rejected under 35 U.S.C. 101 because the claimed invention is directed to an abstract idea without significantly more.
Step 1 Analysis: Claim 1 is directed to a method, which is directed to a process, one of the statutory categories.
Step 2A Prong One Analysis: The limitations:
“computing a first set of halting scores for a first set of tokens that has been input into a first layer of the transformer neural network”
“determining that a first halting score included in the first set of halting scores exceeds a threshold value”
“in response to the first halting score exceeding the threshold value, causing a first token that is included in the first set of tokens and is associated with the first halting score not to be processed by one or more layers within the transformer neural network that are subsequent to the first layer”
As drafted, under their broadest reasonable interpretations, cover mental processes (concepts performed in the human mind (including an observation, evaluation, judgement, opinion)) and mathematical concepts (mathematical relationships, mathematical formulas or equations, mathematical calculations) but for the recitation of mere instructions to apply language (See MPEP 2106.05(f)). The above limitations in the context of this claim encompass computing a first set of halting scores for a first set of tokens input to the first layer of the transformer neural network (corresponds to mathematical calculations); determining that a first halting score of the set of halting scores exceeds a threshold value (corresponds to evaluation and judgement; in particular, a human, with the assistance of pen and paper, can determine if a first halting score exceeds a threshold value); and, if the first halting score exceeds the threshold value, causing the first token associated with the first halting score to not be processed in subsequent layers of the transformer neural network (corresponds to evaluation and judgement; in particular, a human, with the assistance of pen and paper, can cause the first token associated with the first halting score that exceeds a threshold to not be used for subsequent processing).
Step 2A Prong Two Analysis: The judicial exceptions are not integrated into a practical application. In particular, the claim recites additional elements that are mere instructions to apply (See MPEP 2106.05(f)) or insignificant extra-solution activity (See MPEP 2106.05(g)). The limitations:
“a computer”
“a transformer neural network”
As drafted, are additional elements that amount to no more than mere instructions to apply the exception for the abstract ideas. See MPEP 2106.05(f). Therefore, the additional elements do not integrate the abstract ideas into a practical application. 
Step 2B Analysis: The claim does not include additional elements that are sufficient to amount to significantly more than the judicial exception. As discussed above with respect to integration of the abstract idea into a practical application, all of the additional elements are “mere instructions to apply an exception” (i.e., the additional elements describe a generic computer and neural network for applying the abstract ideas). Mere instructions to apply an exception cannot provide an inventive concept. The claim is not patent eligible.
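For illustration only, the three limitations quoted for claim 1 (compute scores, compare against a threshold, withhold the token from later layers) can be sketched in Python as below. This is a minimal sketch, not the applicant's implementation: the helper names `update_active_mask` and `apply_layer`, the masking approach, the toy layer function, and the threshold of 0.9 are all assumptions, since the claim specifies neither a scoring rule nor a threshold value.

```python
def update_active_mask(halting_scores, active_mask, threshold):
    """Return a new mask: a token stays active only while its score <= threshold."""
    return [active and score <= threshold
            for score, active in zip(halting_scores, active_mask)]


def apply_layer(layer_fn, tokens, active_mask):
    """Apply layer_fn to active tokens only; halted tokens pass through unchanged."""
    return [layer_fn(tok) if active else tok
            for tok, active in zip(tokens, active_mask)]


# Hypothetical data: three scalar "tokens" with precomputed halting scores.
tokens = [1.0, 2.0, 3.0]
scores = [0.2, 0.95, 0.5]
mask = update_active_mask(scores, [True, True, True], threshold=0.9)
tokens = apply_layer(lambda x: x * 10, tokens, mask)
```

In this sketch the second token's score exceeds the threshold, so the subsequent layer leaves it untouched while the other two tokens are still processed.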

Regarding Claim 2,
Claim 2 is rejected under 35 U.S.C. 101 because the claimed invention is directed to an abstract idea without significantly more.
Step 1 Analysis: Claim 2 is directed to a method, which is directed to a process, one of the statutory categories.
Step 2A Prong One Analysis: The limitations:
“computing one or more losses based on a second set of halting scores computed for a second set of tokens”
“modifying at least one layer included in the transformer neural network based on the one or more losses as part of training the transformer neural network”
As drafted, under their broadest reasonable interpretations, cover mental processes (concepts performed in the human mind (including an observation, evaluation, judgement, opinion)) and mathematical concepts (mathematical relationships, mathematical formulas or equations, mathematical calculations) but for the recitation of mere instructions to apply language (See MPEP 2106.05(f)). The above limitations in the context of this claim encompass computing one or more losses based on a second set of halting scores computed for a second set of tokens (corresponds to mathematical calculations); and modifying a layer of the transformer neural network based on the one or more losses as part of neural network training (corresponds to evaluation and judgement; in particular, a human, with the assistance of pen and paper, can use the one or more losses to modify layers of the transformer neural network).
Step 2A Prong Two Analysis: The judicial exceptions are not integrated into a practical application. In particular, the claim recites additional elements that are mere instructions to apply (See MPEP 2106.05(f)). The additional elements of a generic computer and neural network recited in claim 1, as drafted, amount to no more than mere instructions to apply the exceptions. Therefore, the additional elements do not integrate the abstract ideas into a practical application.
Step 2B Analysis: The claim does not include additional elements that are sufficient to amount to significantly more than the judicial exception. As discussed above with respect to integration of the abstract idea into a practical application, all of the additional elements are “mere instructions to apply an exception” (i.e., the additional elements describe a generic computer and neural network for applying the abstract ideas). Mere instructions to apply an exception cannot provide an inventive concept. The claim is not patent eligible.

Regarding Claim 3,
Claim 3 is rejected under 35 U.S.C. 101 because the claimed invention is directed to an abstract idea without significantly more.
Step 1 Analysis: Claim 3 is directed to a method, which is directed to a process, one of the statutory categories.
Step 2A Prong One Analysis: The limitations:
“wherein computing the one or more losses comprises computing a ponder loss based on the second set of halting scores and a set of layers included in the transformer neural network associated with halting the second set of tokens”
As drafted, under their broadest reasonable interpretations, cover mental processes (concepts performed in the human mind (including an observation, evaluation, judgement, opinion)) and mathematical concepts (mathematical relationships, mathematical formulas or equations, mathematical calculations) but for the recitation of mere instructions to apply language (See MPEP 2106.05(f)). The above limitations in the context of this claim encompass computing a ponder loss based on the second set of halting scores and layers of the transformer neural network associated with halting the second set of tokens (corresponds to mathematical calculations).
Step 2A Prong Two Analysis: The judicial exceptions are not integrated into a practical application. In particular, the claim recites additional elements that are mere instructions to apply (See MPEP 2106.05(f)). The additional elements of a generic computer and neural network recited in claim 2, as drafted, amount to no more than mere instructions to apply the exceptions. Therefore, the additional elements do not integrate the abstract ideas into a practical application.
Step 2B Analysis: The claim does not include additional elements that are sufficient to amount to significantly more than the judicial exception. As discussed above with respect to integration of the abstract idea into a practical application, all of the additional elements are “mere instructions to apply an exception” (i.e., the additional elements describe a generic computer and neural network for applying the abstract ideas). Mere instructions to apply an exception cannot provide an inventive concept. The claim is not patent eligible.
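For illustration only, a ponder loss of the kind recited in claim 3 might be computed as below. The Office Action does not state a formula, so this sketch assumes the formulation used in adaptive computation time: a token's cost is the index of the layer at which its accumulated halting score reaches 1 - eps, plus the remaining probability mass at that layer. The names `ponder_cost` and `ponder_loss` and the value of `eps` are hypothetical.

```python
def ponder_cost(per_layer_scores, eps=0.01):
    """Assumed ACT-style cost for one token: halting layer index plus remainder."""
    if not per_layer_scores:
        raise ValueError("need at least one layer score")
    cumulative = 0.0
    last = len(per_layer_scores)
    for layer_index, score in enumerate(per_layer_scores, start=1):
        # Halt when the running total reaches 1 - eps, or at the final layer.
        if cumulative + score >= 1.0 - eps or layer_index == last:
            return layer_index + (1.0 - cumulative)
        cumulative += score


def ponder_loss(scores_per_token, eps=0.01):
    """Mean ponder cost over all tokens."""
    costs = [ponder_cost(scores, eps) for scores in scores_per_token]
    return sum(costs) / len(costs)


# Hypothetical per-layer halting scores for two tokens over three layers.
loss = ponder_loss([[0.5, 0.6, 0.1], [0.25, 0.25, 0.25]])
```

Under these assumptions, minimizing the loss encourages tokens to halt at earlier layers.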

Regarding Claim 4,
Claim 4 is rejected under 35 U.S.C. 101 because the claimed invention is directed to an abstract idea without significantly more.
Step 1 Analysis: Claim 4 is directed to a method, which is directed to a process, one of the statutory categories.
Step 2A Prong One Analysis: The limitations:
“wherein computing the one or more losses comprises: aggregating the second set of halting scores into a distribution of halting scores across a series of layers included in the transformer neural network”
“computing a distributional loss based on a divergence of the distribution of halting scores from a target distribution”
As drafted, under their broadest reasonable interpretations, cover mental processes (concepts performed in the human mind (including an observation, evaluation, judgement, opinion)) and mathematical concepts (mathematical relationships, mathematical formulas or equations, mathematical calculations) but for the recitation of mere instructions to apply language (See MPEP 2106.05(f)). The above limitations in the context of this claim encompass aggregating the second set of halting scores into a distribution across a series of layers of the transformer neural network (corresponds to evaluation and judgement; in particular, a human, with the assistance of pen and paper, can aggregate the second set of halting scores into a distribution of halting scores); and computing a distributional loss based on a divergence of the distribution of halting scores from a target distribution (corresponds to mathematical calculations).
Step 2A Prong Two Analysis: The judicial exceptions are not integrated into a practical application. In particular, the claim recites additional elements that are mere instructions to apply (See MPEP 2106.05(f)). The additional elements of a generic computer and neural network recited in claim 2, as drafted, amount to no more than mere instructions to apply the exceptions. Therefore, the additional elements do not integrate the abstract ideas into a practical application.
Step 2B Analysis: The claim does not include additional elements that are sufficient to amount to significantly more than the judicial exception. As discussed above with respect to integration of the abstract idea into a practical application, all of the additional elements are “mere instructions to apply an exception” (i.e., the additional elements describe a generic computer and neural network for applying the abstract ideas). Mere instructions to apply an exception cannot provide an inventive concept. The claim is not patent eligible.
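For illustration only, the two limitations quoted for claim 4 (aggregating halting scores into a distribution across layers, then measuring its divergence from a target distribution) might look like the sketch below. The choice of summing and normalizing per-layer scores, the use of KL divergence, the direction of the divergence, and the uniform target are all assumptions; the quoted claim language says only "a divergence".

```python
import math


def halting_distribution(scores_per_token):
    """Sum halting scores per layer over tokens, then normalize to sum to 1."""
    num_layers = len(scores_per_token[0])
    totals = [sum(tok[layer] for tok in scores_per_token)
              for layer in range(num_layers)]
    norm = sum(totals)
    return [t / norm for t in totals]


def kl_divergence(p, q):
    """KL(p || q); assumes q is strictly positive wherever p is positive."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


# Hypothetical scores: two tokens, two layers; hypothetical uniform target.
observed = halting_distribution([[0.1, 0.3], [0.1, 0.5]])
distribution_loss = kl_divergence(observed, [0.5, 0.5])
```

The loss is zero when the observed distribution matches the target and grows as the two diverge.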

Regarding Claim 5,
Claim 5 is rejected under 35 U.S.C. 101 because the claimed invention is directed to an abstract idea without significantly more.
Step 1 Analysis: Claim 5 is directed to a method, which is directed to a process, one of the statutory categories.
Step 2A Prong One Analysis: The limitations:
“wherein computing the one or more losses comprises computing a task loss associated with a prediction generated by a task network based on the second set of tokens”
As drafted, under their broadest reasonable interpretations, cover mental processes (concepts performed in the human mind (including an observation, evaluation, judgement, opinion)) and mathematical concepts (mathematical relationships, mathematical formulas or equations, mathematical calculations) but for the recitation of mere instructions to apply language (See MPEP 2106.05(f)). The above limitations in the context of this claim encompass computing a task loss associated with a prediction from a task network based on the second set of tokens (corresponds to mathematical calculations).
Step 2A Prong Two Analysis: The judicial exceptions are not integrated into a practical application. In particular, the claim recites additional elements that are mere instructions to apply (See MPEP 2106.05(f)). The additional elements of a generic computer and neural network recited in claim 2, as drafted, amount to no more than mere instructions to apply the exceptions. Therefore, the additional elements do not integrate the abstract ideas into a practical application.
Step 2B Analysis: The claim does not include additional elements that are sufficient to amount to significantly more than the judicial exception. As discussed above with respect to integration of the abstract idea into a practical application, all of the additional elements are “mere instructions to apply an exception” (i.e., the additional elements describe a generic computer and neural network for applying the abstract ideas). Mere instructions to apply an exception cannot provide an inventive concept. The claim is not patent eligible.

Regarding Claim 6,
Claim 6 is rejected under 35 U.S.C. 101 because the claimed invention is directed to an abstract idea without significantly more.
Step 1 Analysis: Claim 6 is directed to a method, which is directed to a process, one of the statutory categories.
Step 2A Prong One Analysis: The limitations:
“wherein modifying the at least one layer of the transformer neural network comprises updating parameters associated with the first layer and the one or more layers based on a weighted combination of the one or more losses”
As drafted, under their broadest reasonable interpretations, cover mental processes (concepts performed in the human mind (including an observation, evaluation, judgement, opinion)) and mathematical concepts (mathematical relationships, mathematical formulas or equations, mathematical calculations) but for the recitation of mere instructions to apply language (See MPEP 2106.05(f)). The above limitations in the context of this claim encompass updating parameters of the layers based on a weighted combination of the one or more losses (corresponds to evaluation and judgement; in particular, a human, with the assistance of pen and paper, can use a weighted combination of the one or more losses to update parameters associated with layers of the neural network).
Step 2A Prong Two Analysis: The judicial exceptions are not integrated into a practical application. In particular, the claim recites additional elements that are mere instructions to apply (See MPEP 2106.05(f)). The additional elements of a generic computer and neural network recited in claim 2, as drafted, amount to no more than mere instructions to apply the exceptions. Therefore, the additional elements do not integrate the abstract ideas into a practical application.
Step 2B Analysis: The claim does not include additional elements that are sufficient to amount to significantly more than the judicial exception. As discussed above with respect to integration of the abstract idea into a practical application, all of the additional elements are “mere instructions to apply an exception” (i.e., the additional elements describe a generic computer and neural network for applying the abstract ideas). Mere instructions to apply an exception cannot provide an inventive concept. The claim is not patent eligible.
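For illustration only, the parameter update recited in claim 6 can be sketched as a weighted sum of loss terms followed by a gradient step. The weights `alpha_p` and `alpha_d`, the learning rate, the plain gradient-descent rule, and the numeric values are hypothetical; the claim requires only that parameters be updated "based on a weighted combination of the one or more losses".

```python
def combined_loss(task_loss, ponder_loss, distribution_loss, alpha_p, alpha_d):
    """Weighted combination of the three loss terms named in claims 3 to 5."""
    return task_loss + alpha_p * ponder_loss + alpha_d * distribution_loss


def sgd_step(params, grads, lr):
    """One plain gradient-descent update: p <- p - lr * dL/dp."""
    return [p - lr * g for p, g in zip(params, grads)]


# Hypothetical loss values and weights.
total = combined_loss(1.0, 3.0, 0.2, alpha_p=0.5, alpha_d=0.1)
# Hypothetical gradients of the combined loss with respect to two parameters.
new_params = sgd_step([1.0, 2.0], [0.5, -1.0], lr=0.1)
```

In practice the gradients would come from automatic differentiation of the combined loss; they are supplied by hand here to keep the sketch self-contained.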

Regarding Claim 7,
Claim 7 is rejected under 35 U.S.C. 101 because the claimed invention is directed to an abstract idea without significantly more.
Step 1 Analysis: Claim 7 is directed to a method, which is directed to a process, one of the statutory categories.
Step 2A Prong One Analysis: The limitations:
“wherein computing the first set of halting scores comprises applying a nonlinear function to a dimension of a token”
As drafted, under their broadest reasonable interpretations, cover mental processes (concepts performed in the human mind (including an observation, evaluation, judgement, opinion)) and mathematical concepts (mathematical relationships, mathematical formulas or equations, mathematical calculations) but for the recitation of mere instructions to apply language (See MPEP 2106.05(f)). The above limitations in the context of this claim encompass applying a nonlinear function to a dimension of a token to compute the first set of halting scores (corresponds to mathematical calculations).
Step 2A Prong Two Analysis: The judicial exceptions are not integrated into a practical application. In particular, the claim recites additional elements that are mere instructions to apply (See MPEP 2106.05(f)). The additional elements of a generic computer and neural network recited in claim 1, as drafted, amount to no more than mere instructions to apply the exceptions. Therefore, the additional elements do not integrate the abstract ideas into a practical application.
Step 2B Analysis: The claim does not include additional elements that are sufficient to amount to significantly more than the judicial exception. As discussed above with respect to integration of the abstract idea into a practical application, all of the additional elements are “mere instructions to apply an exception” (i.e., the additional elements describe a generic computer and neural network for applying the abstract ideas). Mere instructions to apply an exception cannot provide an inventive concept. The claim is not patent eligible.

Regarding Claim 8,
Claim 8 is rejected under 35 U.S.C. 101 because the claimed invention is directed to an abstract idea without significantly more.
Step 1 Analysis: Claim 8 is directed to a method, which is directed to a process, one of the statutory categories.
Step 2A Prong One Analysis: The limitations:
“wherein computing the first set of halting scores comprises shifting and scaling the dimension prior to applying the nonlinear function”
As drafted, under their broadest reasonable interpretations, cover mental processes (concepts performed in the human mind (including an observation, evaluation, judgement, opinion)) and mathematical concepts (mathematical relationships, mathematical formulas or equations, mathematical calculations) but for the recitation of mere instructions to apply language (See MPEP 2106.05(f)). The above limitations in the context of this claim encompass computing the first set of halting scores by shifting and scaling the dimension of a token before applying a nonlinear function to the dimension (corresponds to mathematical calculations).
Step 2A Prong Two Analysis: The judicial exceptions are not integrated into a practical application. In particular, the claim recites additional elements that are mere instructions to apply (See MPEP 2106.05(f)). The additional elements of a generic computer and neural network recited in claim 7, as drafted, amount to no more than mere instructions to apply the exceptions. Therefore, the additional elements do not integrate the abstract ideas into a practical application.
Step 2B Analysis: The claim does not include additional elements that are sufficient to amount to significantly more than the judicial exception. As discussed above with respect to integration of the abstract idea into a practical application, all of the additional elements are “mere instructions to apply an exception” (i.e., the additional elements describe a generic computer and neural network for applying the abstract ideas). Mere instructions to apply an exception cannot provide an inventive concept. The claim is not patent eligible.
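For illustration only, the computation recited in claims 7 and 8 (shifting and scaling one dimension of a token, then applying a nonlinear function) can be sketched as below. The logistic sigmoid as the nonlinear function, the choice of dimension 0, and the parameter names `scale` and `shift` are assumptions; the quoted claim language names none of them.

```python
import math


def halting_score(token, dim=0, scale=1.0, shift=0.0):
    """Scale and shift one dimension of the token, then squash it into (0, 1)."""
    z = scale * token[dim] + shift
    return 1.0 / (1.0 + math.exp(-z))


# Hypothetical two-dimensional token; only dimension 0 feeds the score.
score_default = halting_score([0.0, 5.0])
score_shifted = halting_score([2.0, 5.0], scale=5.0, shift=-10.0)
```

Because the sigmoid output lies strictly between 0 and 1, scores computed this way can be compared directly against a threshold.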

Regarding Claim 9,
Claim 9 is rejected under 35 U.S.C. 101 because the claimed invention is directed to an abstract idea without significantly more.
Step 1 Analysis: Claim 9 is directed to a method, which is directed to a process, one of the statutory categories.
Step 2A Prong One Analysis: The limitations:
“wherein computing the first set of halting scores for the first set of tokens comprises aggregating a second set of halting scores computed for the first set of tokens by a layer included in the transformer neural network that precedes the first layer and a third set of halting scores computed for the first set of tokens by the first layer”
As drafted, under their broadest reasonable interpretations, cover mental processes (concepts performed in the human mind (including an observation, evaluation, judgement, opinion)) and mathematical concepts (mathematical relationships, mathematical formulas or equations, mathematical calculations) but for the recitation of mere instructions to apply language (See MPEP 2106.05(f)). The above limitations in the context of this claim encompass aggregating a second set of halting scores computed for the first set of tokens by a layer preceding the first layer and a third set of halting scores computed for the first set of tokens by the first layer (corresponds to mathematical calculations).
Step 2A Prong Two Analysis: The judicial exceptions are not integrated into a practical application. In particular, the claim recites additional elements that are mere instructions to apply (See MPEP 2106.05(f)). The additional elements of a generic computer and neural network recited in claim 1, as drafted, amount to no more than mere instructions to apply the exceptions. Therefore, the additional elements do not integrate the abstract ideas into a practical application.
Step 2B Analysis: The claim does not include additional elements that are sufficient to amount to significantly more than the judicial exception. As discussed above with respect to integration of the abstract idea into a practical application, all of the additional elements are “mere instructions to apply an exception” (i.e., the additional elements describe a generic computer and neural network for applying the abstract ideas). Mere instructions to apply an exception cannot provide an inventive concept. The claim is not patent eligible.
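For illustration only, the aggregation recited in claim 9 (combining scores from a preceding layer with scores from the current layer) can be sketched as a running sum per token. Treating "aggregating" as summation is an assumption, and the helper names `cumulative_scores` and `first_halting_layer` are hypothetical.

```python
from itertools import accumulate


def cumulative_scores(per_layer_scores):
    """Running total of one token's halting scores, layer by layer."""
    return list(accumulate(per_layer_scores))


def first_halting_layer(per_layer_scores, threshold):
    """1-based index of the first layer whose running total exceeds the threshold."""
    for index, total in enumerate(cumulative_scores(per_layer_scores), start=1):
        if total > threshold:
            return index
    return None  # the token never halts within the available layers


# Hypothetical per-layer scores for one token across three layers.
totals = cumulative_scores([0.25, 0.25, 0.5])
halt_at = first_halting_layer([0.25, 0.25, 0.5], threshold=0.75)
```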

Regarding Claim 10,
Claim 10 is rejected under 35 U.S.C. 101 because the claimed invention is directed to an abstract idea without significantly more.
Step 1 Analysis: Claim 10 is directed to a method, which is directed to a process, one of the statutory categories.
Step 2A Prong One Analysis: The limitations:
“wherein causing the first token to not be processed by the one or more layers comprises removing the first token from the first set of tokens prior to inputting the first set of tokens into the one or more layers that are subsequent to the first layer”
As drafted, under their broadest reasonable interpretations, cover mental processes (concepts performed in the human mind (including an observation, evaluation, judgement, opinion)) and mathematical concepts (mathematical relationships, mathematical formulas or equations, mathematical calculations) but for the recitation of mere instructions to apply language (See MPEP 2106.05(f)). The above limitations in the context of this claim encompass removing the first token from the first set of tokens prior to inputting the first set of tokens into subsequent layers (corresponds to evaluation and judgement; in particular, a human, with the assistance of pen and paper, can remove the first token from the first set of tokens).
Step 2A Prong Two Analysis: The judicial exceptions are not integrated into a practical application. In particular, the claim recites additional elements that are mere instructions to apply (See MPEP 2106.05(f)). The additional elements of a generic computer and neural network recited in claim 1, as drafted, amount to no more than mere instructions to apply the exceptions. Therefore, the additional elements do not integrate the abstract ideas into a practical application.
Step 2B Analysis: The claim does not include additional elements that are sufficient to amount to significantly more than the judicial exception. As discussed above with respect to integration of the abstract idea into a practical application, all of the additional elements are “mere instructions to apply an exception” (i.e., the additional elements describe a generic computer and neural network for applying the abstract ideas). Mere instructions to apply an exception cannot provide an inventive concept. The claim is not patent eligible.
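For illustration only, the removal step recited in claim 10 can be sketched as filtering the token list before each subsequent layer runs. The function name `run_with_removal`, the toy doubling layers, the scoring rule, and the threshold are hypothetical; the sketch also counts per-token layer evaluations to show the effect of removal.

```python
def run_with_removal(tokens, layers, score_fn, threshold):
    """Run tokens through layers, removing halted tokens before the next layer."""
    active = list(tokens)
    halted = []
    layer_calls = 0
    for layer in layers:
        active = [layer(tok) for tok in active]
        layer_calls += len(active)
        # Tokens whose score exceeds the threshold are removed from the set.
        halted += [tok for tok in active if score_fn(tok) > threshold]
        active = [tok for tok in active if score_fn(tok) <= threshold]
    return active, halted, layer_calls


# Hypothetical setup: three doubling layers, score = value / 10, threshold 0.5.
layers = [lambda x: x * 2] * 3
active, halted, layer_calls = run_with_removal(
    [1.0, 4.0], layers, score_fn=lambda x: x / 10, threshold=0.5)
```

In this toy run the two tokens need four layer evaluations in total, compared with six if both tokens passed through all three layers.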

Regarding Claim 11,
Claim 11 is rejected under 35 U.S.C. 101 because the claimed invention is directed to an abstract idea without significantly more.
Step 1 Analysis: Claim 11 is directed to non-transitory computer-readable media, which is directed to an article of manufacture, one of the statutory categories.
Step 2A Prong One Analysis: The limitations:
“computing a first set of halting scores for a first set of tokens that has been input into a first layer of a transformer neural network”
“determining that a first halting score included in the first set of halting scores exceeds a threshold value”
“in response to the first halting score exceeding the threshold value, causing a first token that is included in the first set of tokens and is associated with the first halting score not to be processed by one or more layers within the transformer neural network that are subsequent to the first layer”
As drafted, under their broadest reasonable interpretations, cover mental processes (concepts performed in the human mind (including an observation, evaluation, judgement, opinion)) and mathematical concepts (mathematical relationships, mathematical formulas or equations, mathematical calculations) but for the recitation of mere instructions to apply language (See MPEP 2106.05(f)). The above limitations in the context of this claim encompass computing a first set of halting scores for a first set of tokens input to the first layer of the transformer neural network (corresponds to mathematical calculations); determining that a first halting score of the set of halting scores exceeds a threshold value (corresponds to evaluation and judgement; in particular, a human, with the assistance of pen and paper, can determine if a first halting score exceeds a threshold value); and, if the first halting score exceeds the threshold value, causing the first token associated with the first halting score to not be processed in subsequent layers of the transformer neural network (corresponds to evaluation and judgement; in particular, a human, with the assistance of pen and paper, can cause the first token associated with the first halting score that exceeds a threshold to not be used for subsequent processing).
Step 2A Prong Two Analysis: The judicial exceptions are not integrated into a practical application. In particular, the claim recites additional elements that are mere instructions to apply (See MPEP 2106.05(f)) or insignificant extra-solution activity (See MPEP 2106.05(g)). The limitations:
“one or more processors”
“a transformer neural network”
As drafted, are additional elements that amount to no more than mere instructions to apply the exception for the abstract ideas. See MPEP 2106.05(f). Therefore, the additional elements do not integrate the abstract ideas into a practical application.
Step 2B Analysis: The claim does not include additional elements that are sufficient to amount to significantly more than the judicial exception. As discussed above with respect to integration of the abstract idea into a practical application, all of the additional elements are “mere instructions to apply an exception” (i.e., the additional elements describe generic processors and neural network for applying the abstract ideas). Mere instructions to apply an exception cannot provide an inventive concept. The claim is not patent eligible.

Regarding Claim 12,
Claim 12 is rejected under 35 U.S.C. 101 because the claimed invention is directed to an abstract idea without significantly more.
Step 1 Analysis: Claim 12 is directed to non-transitory computer-readable media, which is directed to an article of manufacture, one of the statutory categories.
Step 2A Prong One Analysis: The limitations:
“computing one or more losses based on a second set of halting scores for a second set of tokens”
“modifying at least one layer included in the transformer neural network based on the one or more losses as part of training the transformer neural network”
As drafted, under their broadest reasonable interpretations, cover mental processes (concepts performed in the human mind (including an observation, evaluation, judgement, opinion)) and mathematical concepts (mathematical relationships, mathematical formulas or equations, mathematical calculations) but for the recitation of mere instructions to apply language (See MPEP 2106.05(f)). The above limitations in the context of this claim encompass computing one or more losses based on a second set of halting scores computed for a second set of tokens (corresponds to mathematical calculations); and modifying a layer of the transformer neural network based on the one or more losses as part of neural network training (corresponds to evaluation and judgement; in particular, a human, with the assistance of pen and paper, can use the one or more losses to modify layers of the transformer neural network).
Step 2A Prong Two Analysis: The judicial exceptions are not integrated into a practical application. In particular, the claim recites additional elements that are mere instructions to apply (See MPEP 2106.05(f)). The limitation:
“the one or more processors”
As drafted, is an additional element that amounts to no more than mere instructions to apply the exception for the abstract ideas. See MPEP 2106.05(f). In addition, the additional elements of generic processors and neural network recited in claim 11, as drafted, amount to no more than mere instructions to apply the exceptions. Therefore, the additional elements do not integrate the abstract ideas into a practical application.
Step 2B Analysis: The claim does not include additional elements that are sufficient to amount to significantly more than the judicial exception. As discussed above with respect to integration of the abstract idea into a practical application, all of the additional elements are “mere instructions to apply an exception” (i.e., the additional elements describe generic processors and neural network for applying the abstract ideas). Mere instructions to apply an exception cannot provide an inventive concept. The claim is not patent eligible.

Regarding Claim 13,
Claim 13 is rejected under 35 U.S.C. 101 because the claimed invention is directed to an abstract idea without significantly more.
Step 1 Analysis: Claim 13 is directed to non-transitory computer-readable media, which is directed to an article of manufacture, one of the statutory categories.
Step 2A Prong One Analysis: The limitations:
“wherein computing the one or more losses comprises computing a ponder loss based on the second set of halting scores and a set of layers included in the transformer neural network associated with halting the second set of tokens”
As drafted, under their broadest reasonable interpretations, cover mental processes (concepts performed in the human mind (including an observation, evaluation, judgement, opinion)) and mathematical concepts (mathematical relationships, mathematical formulas or equations, mathematical calculations) but for the recitation of mere instructions to apply language (See MPEP 2106.05(f)). The above limitations in the context of this claim encompass computing a ponder loss based on the second set of halting scores and layers of the transformer neural network associated with halting the second set of tokens (corresponds to mathematical calculations).
Step 2A Prong Two Analysis: The judicial exceptions are not integrated into a practical application. In particular, the claim recites additional elements that are mere instructions to apply (See MPEP 2106.05(f)). The additional elements recited in claim 12, namely generic processors and a neural network, as drafted, amount to no more than mere instructions to apply the exceptions. Therefore, the additional elements do not integrate the abstract ideas into a practical application.
Step 2B Analysis: The claim does not include additional elements that are sufficient to amount to significantly more than the judicial exception. As discussed above with respect to integration of the abstract idea into a practical application, all of the additional elements are “mere instructions to apply an exception” (i.e., the additional elements describe generic processors and a neural network for applying the abstract ideas). Mere instructions to apply an exception cannot provide an inventive concept. The claim is not patent eligible.

Regarding Claim 14,
Claim 14 is rejected under 35 U.S.C. 101 because the claimed invention is directed to an abstract idea without significantly more.
Step 1 Analysis: Claim 14 is directed to non-transitory computer-readable media, which is directed to an article of manufacture, one of the statutory categories.
Step 2A Prong One Analysis: The limitations:
“wherein computing the one or more losses comprises: aggregating the second set of halting scores into a distribution of halting scores across a series of layers included in the transformer neural network”
“computing a distributional loss based on a Kullback-Leibler divergence of the distribution of halting scores from a target distribution”
As drafted, under their broadest reasonable interpretations, cover mental processes (concepts performed in the human mind (including an observation, evaluation, judgement, opinion)) and mathematical concepts (mathematical relationships, mathematical formulas or equations, mathematical calculations) but for the recitation of mere instructions to apply language (See MPEP 2106.05(f)). The above limitations in the context of this claim encompass aggregating the second set of halting scores into a distribution across a series of layers of the transformer neural network (corresponds to evaluation and judgement; in particular, a human, with the assistance of pen and paper, can aggregate the second set of halting scores into a distribution of halting scores); and computing a distributional loss based on a Kullback-Leibler divergence of the distribution of halting scores from a target distribution (corresponds to mathematical calculations).
Step 2A Prong Two Analysis: The judicial exceptions are not integrated into a practical application. In particular, the claim recites additional elements that are mere instructions to apply (See MPEP 2106.05(f)). The additional elements recited in claim 12, namely generic processors and a neural network, as drafted, amount to no more than mere instructions to apply the exceptions. Therefore, the additional elements do not integrate the abstract ideas into a practical application.
Step 2B Analysis: The claim does not include additional elements that are sufficient to amount to significantly more than the judicial exception. As discussed above with respect to integration of the abstract idea into a practical application, all of the additional elements are “mere instructions to apply an exception” (i.e., the additional elements describe generic processors and a neural network for applying the abstract ideas). Mere instructions to apply an exception cannot provide an inventive concept. The claim is not patent eligible.
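For illustration only, the two operations recited in claim 14 (aggregating halting scores into a distribution across layers, then computing a Kullback-Leibler divergence from a target distribution) can be sketched in Python. This is a minimal sketch under stated assumptions, not the applicant's or any cited reference's implementation: the function names, the per-layer averaging used for aggregation, the uniform target in the usage lines, and the direction KL(halting || target) are all hypothetical choices made for the example.

```python
import math


def halting_distribution(scores_per_layer):
    """Aggregate per-token halting scores into one distribution over layers.

    scores_per_layer[l] holds the halting scores of all tokens at layer l.
    Assumption: aggregation is a per-layer mean followed by normalization.
    """
    layer_means = [sum(layer) / len(layer) for layer in scores_per_layer]
    total = sum(layer_means)
    if total <= 0.0:
        raise ValueError("halting scores must have a positive sum")
    return [mean / total for mean in layer_means]


def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions of equal length."""
    total = 0.0
    for p_i, q_i in zip(p, q):
        if p_i == 0.0:
            continue  # 0 * log(0 / q) is taken as 0 by convention
        if q_i <= 0.0:
            raise ValueError("q must be positive wherever p is positive")
        total += p_i * math.log(p_i / q_i)
    return total


# Usage: three layers, two tokens each; the target is assumed uniform.
scores = [[0.1, 0.3], [0.4, 0.6], [0.5, 0.5]]
distribution = halting_distribution(scores)
target = [1.0 / 3.0] * 3
distributional_loss = kl_divergence(distribution, target)
```

The loss is zero when the aggregated distribution equals the target and positive otherwise.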

Regarding Claim 15,
Claim 15 is rejected under 35 U.S.C. 101 because the claimed invention is directed to an abstract idea without significantly more.
Step 1 Analysis: Claim 15 is directed to non-transitory computer-readable media, which is directed to an article of manufacture, one of the statutory categories.
Step 2A Prong One Analysis: The limitations:
“wherein computing the one or more losses comprises computing a task loss associated with a prediction generated by a task network based on a weighted sum of values of a class token included in the second set of tokens”
As drafted, under their broadest reasonable interpretations, cover mental processes (concepts performed in the human mind (including an observation, evaluation, judgement, opinion)) and mathematical concepts (mathematical relationships, mathematical formulas or equations, mathematical calculations) but for the recitation of mere instructions to apply language (See MPEP 2106.05(f)). The above limitations in the context of this claim encompass computing a task loss associated with a prediction from a task network based on a weighted sum of values of a class token included in the second set of tokens (corresponds to mathematical calculations).
Step 2A Prong Two Analysis: The judicial exceptions are not integrated into a practical application. In particular, the claim recites additional elements that are mere instructions to apply (See MPEP 2106.05(f)). The additional elements recited in claim 12, namely generic processors and a neural network, as drafted, amount to no more than mere instructions to apply the exceptions. Therefore, the additional elements do not integrate the abstract ideas into a practical application.
Step 2B Analysis: The claim does not include additional elements that are sufficient to amount to significantly more than the judicial exception. As discussed above with respect to integration of the abstract idea into a practical application, all of the additional elements are “mere instructions to apply an exception” (i.e., the additional elements describe generic processors and a neural network for applying the abstract ideas). Mere instructions to apply an exception cannot provide an inventive concept. The claim is not patent eligible.

Regarding Claim 16,
Claim 16 is rejected under 35 U.S.C. 101 because the claimed invention is directed to an abstract idea without significantly more.
Step 1 Analysis: Claim 16 is directed to non-transitory computer-readable media, which is directed to an article of manufacture, one of the statutory categories.
Step 2A Prong One Analysis: The limitations:
“wherein computing the first set of halting scores for the first set of tokens comprises applying a sigmoid function to a combination of a dimension of a token, a shifting parameter, and a scaling parameter”
As drafted, under their broadest reasonable interpretations, cover mental processes (concepts performed in the human mind (including an observation, evaluation, judgement, opinion)) and mathematical concepts (mathematical relationships, mathematical formulas or equations, mathematical calculations) but for the recitation of mere instructions to apply language (See MPEP 2106.05(f)). The above limitations in the context of this claim encompass computing the first set of halting scores by applying a sigmoid function to a combination of a dimension of a token, a shifting parameter, and a scaling parameter (corresponds to mathematical calculations).
Step 2A Prong Two Analysis: The judicial exceptions are not integrated into a practical application. In particular, the claim recites additional elements that are mere instructions to apply (See MPEP 2106.05(f)). The additional elements recited in claim 11, namely generic processors and a neural network, as drafted, amount to no more than mere instructions to apply the exceptions. Therefore, the additional elements do not integrate the abstract ideas into a practical application.
Step 2B Analysis: The claim does not include additional elements that are sufficient to amount to significantly more than the judicial exception. As discussed above with respect to integration of the abstract idea into a practical application, all of the additional elements are “mere instructions to apply an exception” (i.e., the additional elements describe generic processors and a neural network for applying the abstract ideas). Mere instructions to apply an exception cannot provide an inventive concept. The claim is not patent eligible.
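For illustration only, the calculation recited in claim 16 can be sketched in Python as a sigmoid applied to one scaled and shifted dimension of a token. This is a minimal sketch under stated assumptions: the function name, the parameter names `scale` and `shift`, the default choice of dimension 0, and the specific combination `scale * value + shift` are hypothetical and are not taken from the application or the cited references.

```python
import math


def halting_score(token, scale, shift, dim=0):
    """Return sigmoid(scale * token[dim] + shift), a value strictly between 0 and 1.

    token: sequence of floats (one token embedding).
    scale, shift: assumed scalar scaling and shifting parameters.
    dim: assumed index of the embedding dimension used for halting.
    """
    pre_activation = scale * token[dim] + shift
    return 1.0 / (1.0 + math.exp(-pre_activation))


# Usage: a token whose first dimension is 0.0 gives a score of 0.5 when shift is 0.
score = halting_score([0.0, 1.7, -0.4], scale=5.0, shift=0.0)
```

A larger value in the selected dimension yields a score closer to 1 when `scale` is positive.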

Regarding Claim 17,
Claim 17 is rejected under 35 U.S.C. 101 because the claimed invention is directed to an abstract idea without significantly more.
Step 1 Analysis: Claim 17 is directed to non-transitory computer-readable media, which is directed to an article of manufacture, one of the statutory categories.
Step 2A Prong One Analysis: The limitations:
“wherein computing the first set of halting scores for the first set of tokens comprises summing a second set of halting scores computed for the first set of tokens by a layer included in the transformer neural network that precedes the first layer and a third set of halting scores computed for the first set of tokens by the first layer”
As drafted, under their broadest reasonable interpretations, cover mental processes (concepts performed in the human mind (including an observation, evaluation, judgement, opinion)) and mathematical concepts (mathematical relationships, mathematical formulas or equations, mathematical calculations) but for the recitation of mere instructions to apply language (See MPEP 2106.05(f)). The above limitations in the context of this claim encompass computing the first set of halting scores by summing a second set of halting scores computed for the first set of tokens by a layer preceding the first layer and a third set of halting scores computed for the first set of tokens by the first layer (corresponds to mathematical calculations).
Step 2A Prong Two Analysis: The judicial exceptions are not integrated into a practical application. In particular, the claim recites additional elements that are mere instructions to apply (See MPEP 2106.05(f)). The additional elements recited in claim 11, namely generic processors and a neural network, as drafted, amount to no more than mere instructions to apply the exceptions. Therefore, the additional elements do not integrate the abstract ideas into a practical application.
Step 2B Analysis: The claim does not include additional elements that are sufficient to amount to significantly more than the judicial exception. As discussed above with respect to integration of the abstract idea into a practical application, all of the additional elements are “mere instructions to apply an exception” (i.e., the additional elements describe generic processors and a neural network for applying the abstract ideas). Mere instructions to apply an exception cannot provide an inventive concept. The claim is not patent eligible.
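For illustration only, the summation recited in claim 17 can be sketched in Python as an element-wise sum of the halting scores from a preceding layer and the halting scores from the first layer. This is a minimal sketch; the function and variable names are hypothetical, and the assumption that both inputs are ordered by the same token index is made only for the example.

```python
def accumulate_halting_scores(preceding_scores, current_scores):
    """Sum, token by token, the scores from a preceding layer and the current layer."""
    if len(preceding_scores) != len(current_scores):
        raise ValueError("both score lists must cover the same tokens")
    return [prev + cur for prev, cur in zip(preceding_scores, current_scores)]


# Usage: two tokens, with scores from a preceding layer and from the first layer.
cumulative = accumulate_halting_scores([0.25, 0.5], [0.25, 0.75])
```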

Regarding Claim 18,
Claim 18 is rejected under 35 U.S.C. 101 because the claimed invention is directed to an abstract idea without significantly more.
Step 1 Analysis: Claim 18 is directed to non-transitory computer-readable media, which is directed to an article of manufacture, one of the statutory categories.
Step 2A Prong One Analysis: The limitations:
“wherein causing the first token not to be processed by the one or more layers comprises omitting computation of one or more attention scores associated with the first token by the one or more layers that are subsequent to the first layer”
As drafted, under their broadest reasonable interpretations, cover mental processes (concepts performed in the human mind (including an observation, evaluation, judgement, opinion)) and mathematical concepts (mathematical relationships, mathematical formulas or equations, mathematical calculations) but for the recitation of mere instructions to apply language (See MPEP 2106.05(f)). The above limitations in the context of this claim encompass omitting the computation of one or more attention scores associated with the first token by subsequent layers (corresponds to evaluation and judgement; in particular, a human, with the assistance of pen and paper, can omit computations of attention scores associated with the first token).
Step 2A Prong Two Analysis: The judicial exceptions are not integrated into a practical application. In particular, the claim recites additional elements that are mere instructions to apply (See MPEP 2106.05(f)). The additional elements recited in claim 11, namely generic processors and a neural network, as drafted, amount to no more than mere instructions to apply the exceptions. Therefore, the additional elements do not integrate the abstract ideas into a practical application.
Step 2B Analysis: The claim does not include additional elements that are sufficient to amount to significantly more than the judicial exception. As discussed above with respect to integration of the abstract idea into a practical application, all of the additional elements are “mere instructions to apply an exception” (i.e., the additional elements describe generic processors and a neural network for applying the abstract ideas). Mere instructions to apply an exception cannot provide an inventive concept. The claim is not patent eligible.

Regarding Claim 19,
Claim 19 is rejected under 35 U.S.C. 101 because the claimed invention is directed to an abstract idea without significantly more.
Step 1 Analysis: Claim 19 is directed to non-transitory computer-readable media, which is directed to an article of manufacture, one of the statutory categories.
Step 2A Prong One Analysis: The limitations:
“converting a set of patches included in an input image into the first set of tokens”
As drafted, under their broadest reasonable interpretations, cover mental processes (concepts performed in the human mind (including an observation, evaluation, judgement, opinion)) and mathematical concepts (mathematical relationships, mathematical formulas or equations, mathematical calculations) but for the recitation of mere instructions to apply language (See MPEP 2106.05(f)). The above limitations in the context of this claim encompass converting a set of patches from an image into the first set of tokens (corresponds to evaluation and judgement; in particular, a human, with the assistance of pen and paper, can convert a set of patches into the first set of tokens).
Step 2A Prong Two Analysis: The judicial exceptions are not integrated into a practical application. In particular, the claim recites additional elements that are mere instructions to apply (See MPEP 2106.05(f)). The limitation:
“the one or more processors”
As drafted, is an additional element that amounts to no more than mere instructions to apply the exception for the abstract ideas. See MPEP 2106.05(f). In addition, the additional elements recited in claim 11, namely generic processors and a neural network, as drafted, amount to no more than mere instructions to apply the exceptions. Therefore, the additional elements do not integrate the abstract ideas into a practical application.
Step 2B Analysis: The claim does not include additional elements that are sufficient to amount to significantly more than the judicial exception. As discussed above with respect to integration of the abstract idea into a practical application, all of the additional elements are “mere instructions to apply an exception” (i.e., the additional elements describe generic processors and a neural network for applying the abstract ideas). Mere instructions to apply an exception cannot provide an inventive concept. The claim is not patent eligible.
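For illustration only, the conversion recited in claim 19 can be sketched in Python as splitting an image into non-overlapping square patches and flattening each patch into one token. This is a minimal sketch under stated assumptions: the function name, the nested-list image format, and the row-major flattening are hypothetical, and any learned projection of patches into embeddings that a real vision transformer might apply is deliberately omitted.

```python
def image_to_patch_tokens(image, patch_size):
    """Split an H x W image (nested lists) into flattened patch_size x patch_size tokens.

    Patches are visited left to right, then top to bottom; each token lists the
    pixel values of its patch in row-major order.
    """
    height, width = len(image), len(image[0])
    if height % patch_size or width % patch_size:
        raise ValueError("image dimensions must be multiples of patch_size")
    tokens = []
    for top in range(0, height, patch_size):
        for left in range(0, width, patch_size):
            patch = [
                image[row][col]
                for row in range(top, top + patch_size)
                for col in range(left, left + patch_size)
            ]
            tokens.append(patch)
    return tokens


# Usage: a 4 x 4 image with pixel values 0..15 becomes four tokens of four values each.
image = [[4 * row + col for col in range(4)] for row in range(4)]
tokens = image_to_patch_tokens(image, patch_size=2)
```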

Regarding Claim 20,
Claim 20 is rejected under 35 U.S.C. 101 because the claimed invention is directed to an abstract idea without significantly more.
Step 1 Analysis: Claim 20 is directed to a system, which is directed to a machine, one of the statutory categories.
Step 2A Prong One Analysis: The limitations:
“compute a first set of halting scores for a first set of tokens that has been input into a first layer of a transformer neural network”
“determine that a first halting score included in the first set of halting scores exceeds a threshold value”
“in response to the first halting score exceeding the threshold value, cause a first token that is included in the first set of tokens and is associated with the first halting score not to be processed by one or more layers within the transformer neural network that are subsequent to the first layer”
As drafted, under their broadest reasonable interpretations, cover mental processes (concepts performed in the human mind (including an observation, evaluation, judgement, opinion)) and mathematical concepts (mathematical relationships, mathematical formulas or equations, mathematical calculations) but for the recitation of mere instructions to apply language (See MPEP 2106.05(f)). The above limitations in the context of this claim encompass computing a first set of halting scores for a first set of tokens input to the first layer of the transformer neural network (corresponds to mathematical calculations); determining that a first halting score of the set of halting scores exceeds a threshold value (corresponds to evaluation and judgement; in particular, a human, with the assistance of pen and paper, can determine if a first halting score exceeds a threshold value); and, if the first halting score exceeds the threshold value, causing the first token associated with the first halting score not to be processed in subsequent layers of the transformer neural network (corresponds to evaluation and judgement; in particular, a human, with the assistance of pen and paper, can cause the first token associated with the first halting score that exceeds a threshold to not be used for subsequent processing).
Step 2A Prong Two Analysis: The judicial exceptions are not integrated into a practical application. In particular, the claim recites additional elements that are mere instructions to apply (See MPEP 2106.05(f)). The limitations:
“one or more memories that store instructions”
“one or more processors that are coupled to the one or more memories”
“a transformer neural network”
As drafted, are additional elements that amount to no more than mere instructions to apply the exception for the abstract ideas. See MPEP 2106.05(f). Therefore, the additional elements do not integrate the abstract ideas into a practical application.
Step 2B Analysis: The claim does not include additional elements that are sufficient to amount to significantly more than the judicial exception. As discussed above with respect to integration of the abstract idea into a practical application, all of the additional elements are “mere instructions to apply an exception” (i.e., the additional elements describe generic memories, processors, and a neural network for applying the abstract ideas). Mere instructions to apply an exception cannot provide an inventive concept. The claim is not patent eligible.
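For illustration only, the determining and causing limitations recited in claim 20 can be sketched in Python as a threshold comparison that separates tokens to be passed to subsequent layers from tokens that are halted. This is a minimal sketch under stated assumptions: the function name, the list-based token representation, and the strict greater-than comparison (chosen to mirror "exceeds a threshold value") are hypothetical and do not describe the applicant's implementation.

```python
def split_halted_tokens(tokens, halting_scores, threshold):
    """Partition tokens by whether their halting score exceeds the threshold.

    Returns (active, halted): active tokens would be passed to subsequent
    layers; halted tokens would be skipped by those layers.
    """
    if len(tokens) != len(halting_scores):
        raise ValueError("each token needs exactly one halting score")
    active, halted = [], []
    for token, score in zip(tokens, halting_scores):
        if score > threshold:
            halted.append(token)  # score exceeds threshold: stop processing
        else:
            active.append(token)  # score at or below threshold: keep processing
    return active, halted


# Usage: only the second token's score exceeds the assumed threshold of 0.9.
active, halted = split_halted_tokens(["t0", "t1", "t2"], [0.3, 0.95, 0.9], threshold=0.9)
```

A score exactly equal to the threshold does not exceed it, so that token remains active in this sketch.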
Claim Rejections - 35 USC § 103
In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis (i.e., changing from AIA to pre-AIA) for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

The factual inquiries for establishing a background for determining obviousness under 35 U.S.C. 103 are summarized as follows:
1. Determining the scope and contents of the prior art.
2. Ascertaining the differences between the prior art and the claims at issue.
3. Resolving the level of ordinary skill in the pertinent art.
4. Considering objective evidence present in the application indicating obviousness or nonobviousness.
This application currently names joint inventors. In considering patentability of the claims the examiner presumes that the subject matter of the various claims was commonly owned as of the effective filing date of the claimed invention(s) absent any evidence to the contrary.  Applicant is advised of the obligation under 37 CFR 1.56 to point out the inventor and effective filing dates of each claim that was not commonly owned as of the effective filing date of the later invention in order for the examiner to consider the applicability of 35 U.S.C. 102(b)(2)(C) for any potential 35 U.S.C. 102(a)(2) prior art against the later invention.
Claims 1-2, 6-8, 10-12, 16, 18, and 20 are rejected under 35 U.S.C. 103 as being unpatentable over Fayyaz et al. (US 2023/0153379 A1) in view of Elbayad et al. ("Depth-Adaptive Transformer").
Regarding Claim 1,
	Fayyaz et al. teaches a computer-implemented method for executing a transformer neural network ([0098]: "According to another illustrative aspect, another method (e.g., the processes 802 and 902) is described for processing a data item (e.g., the image 116) using a transformer neural network" teaches a method for using a transformer neural network to process a data item (execute a transformer neural network)), the method comprising: 
in response to the first halting score exceeding the threshold value, causing a first token that is included in the first set of tokens and is associated with the first halting score not to be processed by one or more layers within the transformer neural network that are subsequent to the first layer (Fig. 9; Fig. 10; [0071]: "FIGS. 9 and 10 together show a process 902 that explains one manner of operation of the transformer neural network 106 of FIG. 1. In block 904, the transformer neural network 106 receives embedding vectors 110 that represent a plurality of item tokens (e.g., image tokens 114) generated based on a data item (e.g., the image 116), and the classification token 118. In blocks 906, 908, and 1002, the transformer neural network 106 performs an attention operation. The attention operation includes, in block 906, generating original attention information 212 based on the embedding vectors 110, the original attention information 212 having a plurality of attention values, each attention value describing an importance that a particular token plays in an interpretation of another particular token. In block 908, the transformer neural network 106 generates score information 216 based on attention values in the original attention information 212 that pertain to the classification token 118. In block 1002, the transformer neural network 106 generates modified attention information 218 by removing attention values from the original attention information 212 based on the score information 216. In block 804, the transformer neural network 106 performs subsequent operations based on the modified attention information 218. The subsequent operations perform fewer operations by using the modified attention information 218 rather than the original attention information 212" teaches that based on score information (e.g. halting score) for a token (e.g. the score exceeding a threshold), attention information for the token is removed from processing subsequent operations for the transformer neural network (from use in subsequent layers)).
	Fayyaz et al. does not appear to explicitly teach computing a first set of halting scores for a first set of tokens that has been input into a first layer of the transformer neural network; and determining that a first halting score included in the first set of halting scores exceeds a threshold value.
	However, Elbayad et al. teaches computing a first set of halting scores for a first set of tokens that has been input into a first layer of the transformer neural network (Eq. 12; Section 3.2, first - third paragraphs: "The token-specific approach can choose a different exit at every time-step. We consider two options for the exit distribution $q_t$ at time-step t: a multinomial with a classifier conditioned on the first decoder hidden state $h_t^1$ and a geometric-like where an exit probability $X_t^n$ is estimated after each block based on the activations of the current block $h_t^n$. …

[Equation image: media_image1.png]

where, d is the dimension of the decoder states, $W_h \in \mathbb{R}^{N \times d}$ and $w_h \in \mathbb{R}^d$ are the weights of the halting mechanisms, and $b_h$ their biases. During inference the decoder exits when the halting signal $X_t^n$ exceeds a threshold $\tau_n$ which we tune on the valid set to achieve a better accuracy-speed trade-off. If thresholds $(\tau_n)_{1 \le n < N}$ have not been exceeded, then we default to exiting at block N" teaches that the halting signals/exit probabilities X (halting scores) are computed for the tokens (including first halting scores for first set of tokens). Section 1, fourth paragraph: "We encode the input sequence using a standard Transformer encoder to generate the output sequence with a varying amount of computation in the decoder network. Dynamic computation poses a challenge for self-attention because omitted layers in prior time-steps may be required in the future. We experiment with two approaches to address this and show that a simple approach works well (§2). Next, we investigate different mechanisms to control the amount of computation in the decoder network, either for the entire sequence or on a per-token basis" teaches that the input sequence input (encoded as tokens) to the transformer (input to first layer of transformer neural network) controls the amount of computation on a per-token basis (e.g. halting probability is computed for each token input to the transformer)); and
determining that a first halting score included in the first set of halting scores exceeds a threshold value (Eq. 12; Section 3.2, first - third paragraphs: "The token-specific approach can choose a different exit at every time-step. We consider two options for the exit distribution $q_t$ at time-step t: a multinomial with a classifier conditioned on the first decoder hidden state $h_t^1$ and a geometric-like where an exit probability $X_t^n$ is estimated after each block based on the activations of the current block $h_t^n$. …

[Equation image: media_image1.png]

where, d is the dimension of the decoder states, $W_h \in \mathbb{R}^{N \times d}$ and $w_h \in \mathbb{R}^d$ are the weights of the halting mechanisms, and $b_h$ their biases. During inference the decoder exits when the halting signal $X_t^n$ exceeds a threshold $\tau_n$ which we tune on the valid set to achieve a better accuracy-speed trade-off. If thresholds $(\tau_n)_{1 \le n < N}$ have not been exceeded, then we default to exiting at block N" teaches that the halting signals/exit probabilities X (halting scores) for the tokens are compared to a threshold to see if they exceed the threshold).
	Fayyaz et al. and Elbayad et al. are analogous to the claimed invention because they are directed to efficient implementation of transformer neural networks.
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to incorporate computing a first set of halting scores for a first set of tokens that has been input into a first layer of the transformer neural network; and determining that a first halting score included in the first set of halting scores exceeds a threshold value as taught by Elbayad et al. to the disclosed invention of Fayyaz et al.
One of ordinary skill in the art would have been motivated to make this modification to "adapt the number of layers to each input in order to achieve a good speed-accuracy trade off at inference time" (Elbayad et al. Section 1, second paragraph).
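The threshold test described in the quoted Elbayad et al. passage can be pictured with a minimal sketch. It assumes the per-block halting scores for one token are already available as plain floats. The function name `exit_block` and the sample numbers are hypothetical and are not taken from either reference.

```python
def exit_block(halting_scores, thresholds):
    """Return the 1-based block at which a token exits.

    halting_scores: scores for blocks 1..N-1 (hypothetical values).
    thresholds: per-block thresholds tau_n for blocks 1..N-1.
    If no score exceeds its threshold, the token exits at the last block N.
    """
    if len(halting_scores) != len(thresholds):
        raise ValueError("need one threshold per halting score")
    for n, (score, tau) in enumerate(zip(halting_scores, thresholds), start=1):
        if score > tau:
            return n
    # Default case from the quoted passage: exit at block N.
    return len(halting_scores) + 1


taus = [0.50, 0.50, 0.50]                          # thresholds for blocks 1..3
first_exit = exit_block([0.10, 0.35, 0.80], taus)  # third score exceeds 0.50
fallback = exit_block([0.10, 0.20, 0.30], taus)    # nothing exceeds, so block N = 4
```

In this sketch the comparison is strict (`>`), matching the "exceeds a threshold" wording; whether equality should also trigger an exit is not stated in the quoted text.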
Regarding Claim 2,
	Fayyaz et al. in view of Elbayad et al. teaches the computer-implemented method of claim 1.
In addition, Elbayad et al. further teaches further comprising: computing one or more losses based on a second set of halting scores computed for a second set of tokens (Eq. 6; Eq. 7; Section 3, second paragraph: " We model the distribution of exiting at time-step t with a parametric distribution qt where qt(n) is the probability of computing block1, . . . , blockn and then emitting a prediction with Cn. The parameters of qt are optimized to match an oracle distribution                         
$q_t^*$ with cross-entropy:

[Equation image: media_image2.png]

The exit loss (Lexit) is back-propagated to the encoder-decoder parameters. We simultaneously optimize the decoding loss (Eq. (4)) and the exit loss (Eq. (6)) balanced by a hyper-parameter α to ensure that the model maintains good generation accuracy. The final loss takes the form:

[Equation image: media_image3.png]
" teaches computing losses based on a parametric distribution qt. Eq. 12; Eq. 13; Section 3.2, first - third paragraphs: "The token-specific approach can choose a different exit at every time-step. We consider two options for the exit distribution qt at time-step t: a multinomial with a classifier conditioned on the first decoder hidden state                         
$h_t^1$ and a geometric-like where an exit probability $X_t^n$ is estimated after each block based on the activations of the current block $h_t^n$. …

[Equation image: media_image1.png]

where, d is the dimension of the decoder states, $W_h \in \mathbb{R}^{N \times d}$ and $w_h \in \mathbb{R}^d$ are the weights of the halting mechanisms, and $b_h$ their biases. During inference the decoder exits when the halting signal $X_t^n$ exceeds a threshold $\tau_n$ which we tune on the valid set to achieve a better accuracy-speed trade-off. If thresholds $(\tau_n)_{1 \le n < N}$ have not been exceeded, then we default to exiting at block N" teaches that the parametric distribution $q_t$ is based on halting signals/exit probabilities X (halting scores) that are computed for the tokens (including second halting scores for a second set of tokens)); and
modifying at least one layer included in the transformer neural network based on the one or more losses as part of training the transformer neural network (Eq. 4; Fig. 10; Algorithm 1; Appendix A, second paragraph: "Adding intermediate supervision at different levels of the decoder results in richer gradients for lower blocks compared to upper blocks. This is because earlier layers affect more loss terms in the compound loss of Eq. (4). To balance the gradients of each block in the decoder, we scale up the gradients of each loss term (− LLn) when it is updating the parameters of its associated block (blockn with parameters θn) and revert it back to its normal scale before back-propagating it to the previous blocks. Figure 10 and Algorithm 1 illustrate this gradient scaling procedure. The θn are updated with γn-amplified gradients from the block’s supervision and (N−n) gradients from the subsequent blocks" teaches updating parameters θ for the blocks (layers) of the transformer based on gradients of each loss term of the compound loss (weighted combination of the one or more losses) and performing back-propagation (training) on the blocks (layers) of the transformer).
	Fayyaz et al. and Elbayad et al. are analogous to the claimed invention because they are directed to efficient implementation of transformer neural networks.
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to incorporate computing one or more losses based on a second set of halting scores computed for a second set of tokens, and modifying at least one layer included in the transformer neural network based on the one or more losses as part of training the transformer neural network, as taught by Elbayad et al., into the disclosed invention of Fayyaz et al.
One of ordinary skill in the art would have been motivated to make this modification to "adapt the number of layers to each input in order to achieve a good speed-accuracy trade off at inference time" (Elbayad et al. Section 1, second paragraph).
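For orientation, the loss computation described in the quoted Section 3 passage can be sketched as follows. The equation images are not reproduced in this text, so the additive form (decoding loss plus α times exit loss) is an assumption based only on the phrase "balanced by a hyper-parameter α". The function names and the sample distributions are hypothetical.

```python
import math


def cross_entropy(oracle, predicted, eps=1e-12):
    """Cross-entropy between an oracle exit distribution and a predicted one."""
    return -sum(p * math.log(q + eps) for p, q in zip(oracle, predicted))


def total_loss(decoding_loss, oracle_dists, predicted_dists, alpha):
    """Return (combined loss, exit loss).

    Assumed form: combined = decoding_loss + alpha * exit_loss, where the
    exit loss sums the cross-entropy over time-steps.
    """
    exit_loss = sum(
        cross_entropy(o, q) for o, q in zip(oracle_dists, predicted_dists)
    )
    return decoding_loss + alpha * exit_loss, exit_loss


# One time-step, three possible exits; the oracle says "exit at block 2".
oracle = [[0.0, 1.0, 0.0]]
predicted = [[0.2, 0.5, 0.3]]
loss, exit_loss = total_loss(2.0, oracle, predicted, alpha=0.5)
```

Here the exit loss reduces to the negative log of the probability the model assigns to the oracle exit, and α sets how strongly that term weighs against the decoding loss.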
Regarding Claim 6,
	Fayyaz et al. in view of Elbayad et al. teaches the computer-implemented method of claim 2.
	In addition, Elbayad et al. further teaches wherein modifying the at least one layer of the transformer neural network comprises updating parameters associated with the first layer and the one or more layers based on a weighted combination of the one or more losses (Eq. 4; Fig. 10; Algorithm 1; Appendix A, second paragraph: "Adding intermediate supervision at different levels of the decoder results in richer gradients for lower blocks compared to upper blocks. This is because earlier layers affect more loss terms in the compound loss of Eq. (4). To balance the gradients of each block in the decoder, we scale up the gradients of each loss term (− LLn) when it is updating the parameters of its associated block (blockn with parameters θn) and revert it back to its normal scale before back-propagating it to the previous blocks. Figure 10 and Algorithm 1 illustrate this gradient scaling procedure. The θn are updated with γn-amplified gradients from the block’s supervision and (N−n) gradients from the subsequent blocks" teaches updating parameters θ for the blocks (layers) of the transformer based on gradients of each loss term of the compound loss (weighted combination of the one or more losses) and performing back-propagation on the blocks (layers) of the transformer).
	Fayyaz et al. and Elbayad et al. are analogous to the claimed invention because they are directed to efficient implementation of transformer neural networks.
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to incorporate the limitation wherein modifying the at least one layer of the transformer neural network comprises updating parameters associated with the first layer and the one or more layers based on a weighted combination of the one or more losses, as taught by Elbayad et al., into the disclosed invention of Fayyaz et al.
One of ordinary skill in the art would have been motivated to make this modification to "adapt the number of layers to each input in order to achieve a good speed-accuracy trade off at inference time" (Elbayad et al. Section 1, second paragraph).
Regarding Claim 7,
	Fayyaz et al. in view of Elbayad et al. teaches the computer-implemented method of claim 1.
In addition, Elbayad et al. further teaches wherein computing the first set of halting scores comprises applying a nonlinear function to a dimension of a token (Equation 12; Section 3.2, first - third paragraphs: "The token-specific approach can choose a different exit at every time-step. We consider two options for the exit distribution qt at time-step t: a multinomial with a classifier conditioned on the first decoder hidden state                         
$h_t^1$ and a geometric-like where an exit probability $X_t^n$ is estimated after each block based on the activations of the current block $h_t^n$. …

[Equation image: media_image1.png]

where, d is the dimension of the decoder states, $W_h \in \mathbb{R}^{N \times d}$ and $w_h \in \mathbb{R}^d$ are the weights of the halting mechanisms, and $b_h$ their biases. During inference the decoder exits when the halting signal $X_t^n$ exceeds a threshold $\tau_n$ which we tune on the valid set to achieve a better accuracy-speed trade-off. If thresholds $(\tau_n)_{1 \le n < N}$ have not been exceeded, then we default to exiting at block N" teaches that the halting signals/exit probabilities X (halting scores) for the tokens are computed based on applying a sigmoid function (nonlinear function) to a dimension of the token).
	Fayyaz et al. and Elbayad et al. are analogous to the claimed invention because they are directed to efficient implementation of transformer neural networks.
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to incorporate the limitation wherein computing the first set of halting scores comprises applying a nonlinear function to a dimension of a token, as taught by Elbayad et al., into the disclosed invention of Fayyaz et al.
One of ordinary skill in the art would have been motivated to make this modification to "adapt the number of layers to each input in order to achieve a good speed-accuracy trade off at inference time" (Elbayad et al. Section 1, second paragraph).
Regarding Claim 8,
	Fayyaz et al. in view of Elbayad et al. teaches the computer-implemented method of claim 7.
In addition, Elbayad et al. further teaches wherein computing the first set of halting scores comprises shifting and scaling the dimension prior to applying the nonlinear function (Equation 12; Section 3.2, first - third paragraphs: "The token-specific approach can choose a different exit at every time-step. We consider two options for the exit distribution qt at time-step t: a multinomial with a classifier conditioned on the first decoder hidden state                         
$h_t^1$ and a geometric-like where an exit probability $X_t^n$ is estimated after each block based on the activations of the current block $h_t^n$. …

[Equation image: media_image1.png]

where, d is the dimension of the decoder states, $W_h \in \mathbb{R}^{N \times d}$ and $w_h \in \mathbb{R}^d$ are the weights of the halting mechanisms, and $b_h$ their biases. During inference the decoder exits when the halting signal $X_t^n$ exceeds a threshold $\tau_n$ which we tune on the valid set to achieve a better accuracy-speed trade-off. If thresholds $(\tau_n)_{1 \le n < N}$ have not been exceeded, then we default to exiting at block N" teaches that the halting signals/exit probabilities X (halting scores) for the tokens are computed based on multiplying the dimension of the token by the weights (scaling) and adding the biases (shifting) prior to applying the sigmoid function (nonlinear function)).
	Fayyaz et al. and Elbayad et al. are analogous to the claimed invention because they are directed to efficient implementation of transformer neural networks.
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to incorporate the limitation wherein computing the first set of halting scores comprises shifting and scaling the dimension prior to applying the nonlinear function, as taught by Elbayad et al., into the disclosed invention of Fayyaz et al.
One of ordinary skill in the art would have been motivated to make this modification to "adapt the number of layers to each input in order to achieve a good speed-accuracy trade off at inference time" (Elbayad et al. Section 1, second paragraph).
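The "scaling, shifting, then nonlinear function" reading applied above can be illustrated with a short sketch. It assumes the score is a sigmoid of a weighted sum of the hidden state plus a bias, as the examiner characterizes Eq. 12; the equation image itself is not reproduced here. The function name `halting_score` and all numeric values are hypothetical.

```python
import math


def halting_score(hidden, weights, bias):
    """Sigmoid of (weights . hidden + bias).

    The multiplication by the weights plays the role of scaling, the added
    bias plays the role of shifting, and the sigmoid is the nonlinear function.
    """
    if len(hidden) != len(weights):
        raise ValueError("hidden state and weights must have the same length")
    pre_activation = sum(w * h for w, h in zip(weights, hidden)) + bias
    return 1.0 / (1.0 + math.exp(-pre_activation))


score_zero = halting_score([0.0, 0.0], [1.5, -2.0], 0.0)  # pre-activation 0
score_two = halting_score([1.0, 2.0], [1.0, 0.5], 0.0)    # pre-activation 2
```

A pre-activation of zero maps to a score of exactly 0.5, and larger pre-activations push the score toward 1, which is why a fixed threshold on the score corresponds to a cutoff on the shifted and scaled value.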
Regarding Claim 10,
	Fayyaz et al. in view of Elbayad et al. teaches the computer-implemented method of claim 1.
	In addition, Fayyaz et al. further teaches wherein causing the first token to not be processed by the one or more layers comprises removing the first token from the first set of tokens prior to inputting the first set of tokens into the one or more layers that are subsequent to the first layer (Fig. 9; Fig. 10; [0071]: "FIGS. 9 and 10 together show a process 902 that explains one manner of operation of the transformer neural network 106 of FIG. 1. In block 904, the transformer neural network 106 receives embedding vectors 110 that represent a plurality of item tokens (e.g., image tokens 114) generated based on a data item (e.g., the image 116), and the classification token 118. In blocks 906, 908, and 1002, the transformer neural network 106 performs an attention operation. The attention operation includes, in block 906, generating original attention information 212 based on the embedding vectors 110, the original attention information 212 having a plurality of attention values, each attention value describing an importance that a particular token plays in an interpretation of another particular token. In block 908, the transformer neural network 106 generates score information 216 based on attention values in the original attention information 212 that pertain to the classification token 118. In block 1002, the transformer neural network 106 generates modified attention information 218 by removing attention values from the original attention information 212 based on the score information 216. In block 804, the transformer neural network 106 performs subsequent operations based on the modified attention information 218. The subsequent operations perform fewer operations by using the modified attention information 218 rather than the original attention information 212" teaches that based on score information (e.g. halting score) for a token (e.g. the score exceeding a threshold), attention information for the token is removed (e.g. 
token is removed from processing subsequent operations for the transformer neural network (from use in subsequent layers))).
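The claim 10 limitation (removing a token from the set before the set is input to later layers) can be pictured with the following sketch. It illustrates the claim language only and is not the mechanism of Fayyaz et al., which operates on attention values. The helper name `drop_halted_tokens`, the token labels, and the threshold are hypothetical.

```python
def drop_halted_tokens(tokens, halting_scores, threshold):
    """Return the tokens that should still be passed to subsequent layers.

    A token is removed when its halting score exceeds the threshold.
    """
    if len(tokens) != len(halting_scores):
        raise ValueError("need one halting score per token")
    return [tok for tok, s in zip(tokens, halting_scores) if s <= threshold]


tokens = ["t0", "t1", "t2", "t3"]
scores = [0.20, 0.95, 0.50, 0.99]
remaining = drop_halted_tokens(tokens, scores, threshold=0.9)
```

Because the halted tokens are filtered out before the next layer is called, every later layer sees a shorter input, which is where the computational saving would come from.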
Regarding Claim 11,
	Fayyaz et al. teaches one or more non-transitory computer-readable media storing instructions that, when executed by one or more processors ([0098]: "According to another illustrative aspect, another method (e.g., the processes 802 and 902) is described for processing a data item (e.g., the image 116) using a transformer neural network" teaches a method for using a transformer neural network to process a data item (execute a transformer neural network). Fig. 12; [0101]: "In yet another aspect, some implementations of the technology described herein include a computer-readable storage medium (e.g., the computer-readable storage media 1206) for storing computer-readable instructions (e.g., 1208). The computer-readable instructions, when executed by one or more hardware processors (e.g., 1204), perform any of the methods described herein (e.g., methods of A1-A11 and B1-B2)" teaches a computer-readable storage medium storing computer-readable instructions for execution by one or more hardware processors to perform the method for executing a transformer neural network), cause the one or more processors to perform the steps of: 
in response to the first halting score exceeding the threshold value, causing a first token that is included in the first set of tokens and is associated with the first halting score not to be processed by one or more layers within the transformer neural network that are subsequent to the first layer (Fig. 9; Fig. 10; [0071]: "FIGS. 9 and 10 together show a process 902 that explains one manner of operation of the transformer neural network 106 of FIG. 1. In block 904, the transformer neural network 106 receives embedding vectors 110 that represent a plurality of item tokens (e.g., image tokens 114) generated based on a data item (e.g., the image 116), and the classification token 118. In blocks 906, 908, and 1002, the transformer neural network 106 performs an attention operation. The attention operation includes, in block 906, generating original attention information 212 based on the embedding vectors 110, the original attention information 212 having a plurality of attention values, each attention value describing an importance that a particular token plays in an interpretation of another particular token. In block 908, the transformer neural network 106 generates score information 216 based on attention values in the original attention information 212 that pertain to the classification token 118. In block 1002, the transformer neural network 106 generates modified attention information 218 by removing attention values from the original attention information 212 based on the score information 216. In block 804, the transformer neural network 106 performs subsequent operations based on the modified attention information 218. The subsequent operations perform fewer operations by using the modified attention information 218 rather than the original attention information 212" teaches that based on score information (e.g. halting score) for a token (e.g. 
the score exceeding a threshold), attention information for the token is removed from processing subsequent operations for the transformer neural network (from use in subsequent layers)).
	Fayyaz et al. does not appear to explicitly teach computing a first set of halting scores for a first set of tokens that has been input into a first layer of a transformer neural network; and determining that a first halting score included in the first set of halting scores exceeds a threshold value.
	However, Elbayad et al. teaches computing a first set of halting scores for a first set of tokens that has been input into a first layer of a transformer neural network (Eq. 12; Section 3.2, first - third paragraphs: "The token-specific approach can choose a different exit at every time-step. We consider two options for the exit distribution qt at time-step t: a multinomial with a classifier conditioned on the first decoder hidden state                         
$h_t^1$ and a geometric-like where an exit probability $X_t^n$ is estimated after each block based on the activations of the current block $h_t^n$. …

[Equation image: media_image1.png]

where, d is the dimension of the decoder states, $W_h \in \mathbb{R}^{N \times d}$ and $w_h \in \mathbb{R}^d$ are the weights of the halting mechanisms, and $b_h$ their biases. During inference the decoder exits when the halting signal $X_t^n$ exceeds a threshold $\tau_n$ which we tune on the valid set to achieve a better accuracy-speed trade-off. If thresholds $(\tau_n)_{1 \le n < N}$ have not been exceeded, then we default to exiting at block N" teaches that the halting signals/exit probabilities X (halting scores) are computed for the tokens (including first halting scores for first set of tokens). Section 1, fourth paragraph: "We encode the input sequence using a standard Transformer encoder to generate the output sequence with a varying amount of computation in the decoder network. Dynamic computation poses a challenge for self-attention because omitted layers in prior time-steps may be required in the future. We experiment with two approaches to address this and show that a simple approach works well (§2). Next, we investigate different mechanisms to control the amount of computation in the decoder network, either for the entire sequence or on a per-token basis" teaches that the input sequence input (encoded as tokens) to the transformer (input to first layer of transformer neural network) controls the amount of computation on a per-token basis (e.g. halting probability is computed for each token input to the transformer)); and
determining that a first halting score included in the first set of halting scores exceeds a threshold value (Eq. 12; Section 3.2, first - third paragraphs: "The token-specific approach can choose a different exit at every time-step. We consider two options for the exit distribution qt at time-step t: a multinomial with a classifier conditioned on the first decoder hidden state                         
$h_t^1$ and a geometric-like where an exit probability $X_t^n$ is estimated after each block based on the activations of the current block $h_t^n$. …

[Equation image: media_image1.png]

where, d is the dimension of the decoder states, $W_h \in \mathbb{R}^{N \times d}$ and $w_h \in \mathbb{R}^d$ are the weights of the halting mechanisms, and $b_h$ their biases. During inference the decoder exits when the halting signal $X_t^n$ exceeds a threshold $\tau_n$ which we tune on the valid set to achieve a better accuracy-speed trade-off. If thresholds $(\tau_n)_{1 \le n < N}$ have not been exceeded, then we default to exiting at block N" teaches that the halting signals/exit probabilities X (halting scores) for the tokens are compared to a threshold to see if they exceed the threshold).
	Fayyaz et al. and Elbayad et al. are analogous to the claimed invention because they are directed to efficient implementation of transformer neural networks.
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to incorporate computing a first set of halting scores for a first set of tokens that has been input into a first layer of a transformer neural network, and determining that a first halting score included in the first set of halting scores exceeds a threshold value, as taught by Elbayad et al., into the disclosed invention of Fayyaz et al.
One of ordinary skill in the art would have been motivated to make this modification to "adapt the number of layers to each input in order to achieve a good speed-accuracy trade off at inference time" (Elbayad et al. Section 1, second paragraph).
Regarding Claim 12,
	Fayyaz et al. in view of Elbayad et al. teaches the one or more non-transitory computer-readable media of claim 11.
In addition, Elbayad et al. further teaches wherein the instructions further cause the one or more processors to perform the steps of: computing one or more losses based on a second set of halting scores for a second set of tokens (Eq. 6; Eq. 7; Section 3, second paragraph: "We model the distribution of exiting at time-step t with a parametric distribution qt where qt(n) is the probability of computing block1, . . . , blockn and then emitting a prediction with Cn. The parameters of qt are optimized to match an oracle distribution                         
$q_t^*$ with cross-entropy:

[Equation image: media_image2.png]

The exit loss (Lexit) is back-propagated to the encoder-decoder parameters. We simultaneously optimize the decoding loss (Eq. (4)) and the exit loss (Eq. (6)) balanced by a hyper-parameter α to ensure that the model maintains good generation accuracy. The final loss takes the form:

[Equation image: media_image3.png]
" teaches computing losses based on a parametric distribution qt. Eq. 12; Eq. 13; Section 3.2, first - third paragraphs: "The token-specific approach can choose a different exit at every time-step. We consider two options for the exit distribution qt at time-step t: a multinomial with a classifier conditioned on the first decoder hidden state                         
$h_t^1$ and a geometric-like where an exit probability $X_t^n$ is estimated after each block based on the activations of the current block $h_t^n$. …

[Equation image: media_image1.png]

where, d is the dimension of the decoder states, $W_h \in \mathbb{R}^{N \times d}$ and $w_h \in \mathbb{R}^d$ are the weights of the halting mechanisms, and $b_h$ their biases. During inference the decoder exits when the halting signal $X_t^n$ exceeds a threshold $\tau_n$ which we tune on the valid set to achieve a better accuracy-speed trade-off. If thresholds $(\tau_n)_{1 \le n < N}$ have not been exceeded, then we default to exiting at block N" teaches that the parametric distribution $q_t$ is based on halting signals/exit probabilities X (halting scores) that are computed for the tokens (including second halting scores for a second set of tokens)); and
modifying at least one layer included in the transformer neural network based on the one or more losses as part of training the transformer neural network (Eq. 4; Fig. 10; Algorithm 1; Appendix A, second paragraph: "Adding intermediate supervision at different levels of the decoder results in richer gradients for lower blocks compared to upper blocks. This is because earlier layers affect more loss terms in the compound loss of Eq. (4). To balance the gradients of each block in the decoder, we scale up the gradients of each loss term (− LLn) when it is updating the parameters of its associated block (blockn with parameters θn) and revert it back to its normal scale before back-propagating it to the previous blocks. Figure 10 and Algorithm 1 illustrate this gradient scaling procedure. The θn are updated with γn-amplified gradients from the block’s supervision and (N−n) gradients from the subsequent blocks" teaches updating parameters θ for the blocks (layers) of the transformer based on gradients of each loss term of the compound loss (weighted combination of the one or more losses) and performing back-propagation (training) on the blocks (layers) of the transformer).
	Fayyaz et al. and Elbayad et al. are analogous to the claimed invention because they are directed to efficient implementation of transformer neural networks.
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to incorporate the steps of computing one or more losses based on a second set of halting scores for a second set of tokens, and modifying at least one layer included in the transformer neural network based on the one or more losses as part of training the transformer neural network, as taught by Elbayad et al., into the disclosed invention of Fayyaz et al.
One of ordinary skill in the art would have been motivated to make this modification to "adapt the number of layers to each input in order to achieve a good speed-accuracy trade off at inference time" (Elbayad et al. Section 1, second paragraph).
Regarding Claim 16,
	Fayyaz et al. in view of Elbayad et al. teaches the one or more non-transitory computer-readable media of claim 11.
In addition, Elbayad et al. further teaches wherein computing the first set of halting scores for the first set of tokens comprises applying a sigmoid function to a combination of a dimension of a token, a shifting parameter, and a scaling parameter (Equation 12; Section 3.2, first - third paragraphs: "The token-specific approach can choose a different exit at every time-step. We consider two options for the exit distribution q_t at time-step t: a multinomial with a classifier conditioned on the first decoder hidden state h_t^1 and a geometric-like where an exit probability X_t^n is estimated after each block based on the activations of the current block h_t^n. … 

    [Equation image: media_image1.png]

where, d is the dimension of the decoder states, W_h ∈ R^{N×d} and w_h ∈ R^d are the weights of the halting mechanisms, and b_h their biases. During inference the decoder exits when the halting signal X_t^n exceeds a threshold τ_n which we tune on the valid set to achieve a better accuracy-speed trade-off. If thresholds (τ_n)_{1≤n<N} have not been exceeded, then we default to exiting at block N" teaches that the halting signals/exit probabilities X (halting scores) for the tokens are computed based on multiplying the dimension of the token by the weights (scaling parameter) and adding the biases (shifting parameter) and applying the sigmoid function).
	Fayyaz et al. and Elbayad et al. are analogous to the claimed invention because they are directed to efficient implementation of transformer neural networks.
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to incorporate wherein computing the first set of halting scores for the first set of tokens comprises applying a sigmoid function to a combination of a dimension of a token, a shifting parameter, and a scaling parameter as taught by Elbayad et al. to the disclosed invention of Fayyaz et al.
One of ordinary skill in the art would have been motivated to make this modification to "adapt the number of layers to each input in order to achieve a good speed-accuracy trade off at inference time" (Elbayad et al. Section 1, second paragraph).
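For illustration only, the halting-score computation discussed for claim 16 can be sketched in Python. The function names `sigmoid`, `halting_score`, and `halting_score_single_dim` are hypothetical and do not come from the cited references or the application. The first function follows the quoted description (weights applied to the block activations, plus a bias, passed through a sigmoid), and the second follows the claim wording (one dimension of a token, a scaling parameter, and a shifting parameter). It is a minimal sketch under those assumptions, not the implementation of either reference.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def halting_score(hidden, weights, bias):
    """Halting signal as sigmoid(w . h + b) for one token at one block.

    `hidden` stands in for the block activations h_t^n, `weights` for w_h,
    and `bias` for b_h (assumed roles, named here for illustration).
    """
    if len(hidden) != len(weights):
        raise ValueError("hidden and weights must have the same length")
    z = sum(w * h for w, h in zip(weights, hidden)) + bias
    return sigmoid(z)


def halting_score_single_dim(token, dim, scale, shift):
    """Claim-style variant: sigmoid(scale * token[dim] + shift)."""
    return sigmoid(scale * token[dim] + shift)
```

Both variants return a value strictly between 0 and 1 for moderate inputs, which is what allows the score to be compared against a threshold.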
Regarding Claim 18,
	Fayyaz et al. in view of Elbayad et al. teaches the one or more non-transitory computer-readable media of claim 11.
	In addition, Fayyaz et al. further teaches wherein causing the first token not to be processed by the one or more layers comprises omitting computation of one or more attention scores associated with the first token by the one or more layers that are subsequent to the first layer (Fig. 9; Fig. 10; [0071]: "FIGS. 9 and 10 together show a process 902 that explains one manner of operation of the transformer neural network 106 of FIG. 1. In block 904, the transformer neural network 106 receives embedding vectors 110 that represent a plurality of item tokens (e.g., image tokens 114) generated based on a data item (e.g., the image 116), and the classification token 118. In blocks 906, 908, and 1002, the transformer neural network 106 performs an attention operation. The attention operation includes, in block 906, generating original attention information 212 based on the embedding vectors 110, the original attention information 212 having a plurality of attention values, each attention value describing an importance that a particular token plays in an interpretation of another particular token. In block 908, the transformer neural network 106 generates score information 216 based on attention values in the original attention information 212 that pertain to the classification token 118. In block 1002, the transformer neural network 106 generates modified attention information 218 by removing attention values from the original attention information 212 based on the score information 216. In block 804, the transformer neural network 106 performs subsequent operations based on the modified attention information 218. The subsequent operations perform fewer operations by using the modified attention information 218 rather than the original attention information 212" teaches that based on score information (e.g. halting score) for a token (e.g. the score exceeding a threshold), attention information for the token (e.g. computation of attention scores associated with the token) is removed from processing subsequent operations for the transformer neural network (from use in subsequent layers)).
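For illustration only, the effect described for claim 18 (a halted token is left out of the work done by later layers) can be sketched in Python. The names `active_indices` and `run_layer_on_active`, and the callable `layer_fn`, are hypothetical. The selection rule used here follows the claim wording (a token stops once its score exceeds the threshold) and is an assumption, not the selection rule of the cited reference.

```python
def active_indices(scores, threshold):
    """Indices of tokens whose halting score has not exceeded the threshold."""
    return [i for i, s in enumerate(scores) if s <= threshold]


def run_layer_on_active(tokens, scores, threshold, layer_fn):
    """Apply `layer_fn` only to tokens that are still active.

    Halted tokens are never passed to `layer_fn`, so no computation
    (attention or otherwise) is spent on them, and they are carried
    through unchanged.
    """
    keep = active_indices(scores, threshold)
    updated = layer_fn([tokens[i] for i in keep])
    out = list(tokens)
    for i, value in zip(keep, updated):
        out[i] = value
    return out, keep
```

A call such as `run_layer_on_active([1.0, 2.0, 3.0], [0.2, 0.9, 0.5], 0.8, lambda xs: [2 * x for x in xs])` passes only the first and third tokens to the layer, leaving the second token at its earlier value.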
Regarding Claim 20,
	Fayyaz et al. teaches a system, comprising: one or more memories that store instructions, and one or more processors that are coupled to the one or more memories and, when executing the instructions ([0098]: "According to another illustrative aspect, another method (e.g., the processes 802 and 902) is described for processing a data item (e.g., the image 116) using a transformer neural network" teaches a method for using a transformer neural network to process a data item (execute a transformer neural network). Fig. 12; [0080]: "The computing system 1202 may perform any of the functions described above when the hardware processor(s) 1204 carry out computer-readable instructions stored in any instance of the computer-readable storage media 1206. For instance, the computing system 1202 may carry out computer-readable instructions to perform each block of the processes described in Section B" teaches a system comprising hardware processors (one or more processors) coupled to computer-readable storage media (one or more memories) storing executable instructions for performing the method for executing a transformer neural network. Fig. 12; [0079]: "The computing system 1202 can utilize any instance of the computer-readable storage media 1206 in different ways. For example, any instance of the computer-readable storage media 1206 may represent a hardware memory unit (such as Random Access Memory (RAM)) for storing transient information during execution of a program by the computing system 1202, and/or a hardware storage unit (such as a hard disk) for retaining/archiving information on a more permanent basis" teaches that the computer-readable storage media may represent hardware memory units), are configured to: 
in response to the first halting score exceeding the threshold value, cause a first token that is included in the first set of tokens and is associated with the first halting score not to be processed by one or more layers within the transformer neural network that are subsequent to the first layer (Fig. 9; Fig. 10; [0071]: "FIGS. 9 and 10 together show a process 902 that explains one manner of operation of the transformer neural network 106 of FIG. 1. In block 904, the transformer neural network 106 receives embedding vectors 110 that represent a plurality of item tokens (e.g., image tokens 114) generated based on a data item (e.g., the image 116), and the classification token 118. In blocks 906, 908, and 1002, the transformer neural network 106 performs an attention operation. The attention operation includes, in block 906, generating original attention information 212 based on the embedding vectors 110, the original attention information 212 having a plurality of attention values, each attention value describing an importance that a particular token plays in an interpretation of another particular token. In block 908, the transformer neural network 106 generates score information 216 based on attention values in the original attention information 212 that pertain to the classification token 118. In block 1002, the transformer neural network 106 generates modified attention information 218 by removing attention values from the original attention information 212 based on the score information 216. In block 804, the transformer neural network 106 performs subsequent operations based on the modified attention information 218. The subsequent operations perform fewer operations by using the modified attention information 218 rather than the original attention information 212" teaches that based on score information (e.g. halting score) for a token (e.g. the score exceeding a threshold), attention information for the token is removed from processing subsequent operations for the transformer neural network (from use in subsequent layers)).
	Fayyaz et al. does not appear to explicitly teach compute a first set of halting scores for a first set of tokens that has been input into a first layer of a transformer neural network; and determine that a first halting score included in the first set of halting scores exceeds a threshold value.
	However, Elbayad et al. teaches compute a first set of halting scores for a first set of tokens that has been input into a first layer of a transformer neural network (Eq. 12; Section 3.2, first - third paragraphs: "The token-specific approach can choose a different exit at every time-step. We consider two options for the exit distribution q_t at time-step t: a multinomial with a classifier conditioned on the first decoder hidden state h_t^1 and a geometric-like where an exit probability X_t^n is estimated after each block based on the activations of the current block h_t^n. … 

    [Equation image: media_image1.png]

where, d is the dimension of the decoder states, W_h ∈ R^{N×d} and w_h ∈ R^d are the weights of the halting mechanisms, and b_h their biases. During inference the decoder exits when the halting signal X_t^n exceeds a threshold τ_n which we tune on the valid set to achieve a better accuracy-speed trade-off. If thresholds (τ_n)_{1≤n<N} have not been exceeded, then we default to exiting at block N" teaches that the halting signals/exit probabilities X (halting scores) are computed for the tokens (including first halting scores for first set of tokens). Section 1, fourth paragraph: "We encode the input sequence using a standard Transformer encoder to generate the output sequence with a varying amount of computation in the decoder network. Dynamic computation poses a challenge for self-attention because omitted layers in prior time-steps may be required in the future. We experiment with two approaches to address this and show that a simple approach works well (§2). Next, we investigate different mechanisms to control the amount of computation in the decoder network, either for the entire sequence or on a per-token basis" teaches that the input sequence input (encoded as tokens) to the transformer (input to first layer of transformer neural network) controls the amount of computation on a per-token basis (e.g. halting probability is computed for each token input to the transformer)); and 
determine that a first halting score included in the first set of halting scores exceeds a threshold value (Eq. 12; Section 3.2, first - third paragraphs: "The token-specific approach can choose a different exit at every time-step. We consider two options for the exit distribution q_t at time-step t: a multinomial with a classifier conditioned on the first decoder hidden state h_t^1 and a geometric-like where an exit probability X_t^n is estimated after each block based on the activations of the current block h_t^n. … 

    [Equation image: media_image1.png]

where, d is the dimension of the decoder states, W_h ∈ R^{N×d} and w_h ∈ R^d are the weights of the halting mechanisms, and b_h their biases. During inference the decoder exits when the halting signal X_t^n exceeds a threshold τ_n which we tune on the valid set to achieve a better accuracy-speed trade-off. If thresholds (τ_n)_{1≤n<N} have not been exceeded, then we default to exiting at block N" teaches that the halting signals/exit probabilities X (halting scores) for the tokens are compared to a threshold to see if they exceed the threshold).
	Fayyaz et al. and Elbayad et al. are analogous to the claimed invention because they are directed to efficient implementation of transformer neural networks.
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to incorporate compute a first set of halting scores for a first set of tokens that has been input into a first layer of a transformer neural network; and determine that a first halting score included in the first set of halting scores exceeds a threshold value as taught by Elbayad et al. to the disclosed invention of Fayyaz et al.
One of ordinary skill in the art would have been motivated to make this modification to "adapt the number of layers to each input in order to achieve a good speed-accuracy trade off at inference time" (Elbayad et al. Section 1, second paragraph).
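For illustration only, the exit rule quoted from Elbayad et al. for claim 20 (exit at the first block whose halting signal exceeds its threshold, otherwise default to block N) can be sketched in Python. The function name `exit_block` is hypothetical, and the signals are passed in as a precomputed list purely for simplicity; in practice each signal would be computed only after its block has run.

```python
def exit_block(signals, thresholds, num_blocks):
    """Return the 1-based block index at which decoding of a token exits.

    `signals[n-1]` stands in for the halting signal after block n and
    `thresholds[n-1]` for its threshold, for n = 1 .. num_blocks - 1.
    If no threshold is exceeded, the token exits at the last block.
    """
    if len(signals) != num_blocks - 1 or len(thresholds) != num_blocks - 1:
        raise ValueError("expected one signal and one threshold per non-final block")
    for n, (x, tau) in enumerate(zip(signals, thresholds), start=1):
        if x > tau:
            return n
    return num_blocks
```

With four blocks and thresholds of 0.5, signals `[0.1, 0.7, 0.2]` give an exit at block 2, while signals `[0.1, 0.2, 0.3]` fall through to block 4.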

Claims 3, 9, 13, and 17 are rejected under 35 U.S.C. 103 as being unpatentable over Fayyaz et al. (US 2023/0153379 A1) in view of Elbayad et al. ("Depth-Adaptive Transformer") and further in view of Graves et al. ("Adaptive Computation Time for Recurrent Neural Networks").
Regarding Claim 3,
Fayyaz et al. in view of Elbayad et al. teaches the computer-implemented method of claim 2.
Fayyaz et al. in view of Elbayad et al. does not appear to explicitly teach wherein computing the one or more losses comprises computing a ponder loss based on the second set of halting scores and a set of layers included in the transformer neural network associated with halting the second set of tokens.
However, Graves et al. teaches wherein computing the one or more losses comprises computing a ponder loss based on the second set of halting scores and a set of layers included in the transformer neural network associated with halting the second set of tokens (Equation 11; Equation 12; Section 2.1: "If no constraints are placed on the number of updates R can take at each step it will naturally tend to ‘ponder’ each input for as long as possible (so as to avoid making predictions and incurring errors). We therefore require a way of limiting the amount of computation the network performs. Given a length T input sequence x, define the ponder sequence (ρ1, . . . , ρT ) of R as

    [Equation image: media_image4.png]

Since R(t) ∈ (0, 1), P(x) is an upper bound on the (non-differentiable) property we ultimately want to reduce, namely the total computation ∑_{t=1}^{T} N(t) during the sequence. We can encourage the network to minimize P(x) by modifying the sequence loss function L(x, y) used for training: 

    [Equation image: media_image5.png]

where τ is a time penalty parameter that weights the relative cost of computation versus error" teaches computing a ponder loss based on the halting probabilities (halting scores) and the associated layers at each step. Equations 5-8; Section 2, third paragraph: "To determine how many updates R performs at each input step an extra sigmoidal halting unit h is added to the network output, with associated weight matrix Wh and bias bh:

    [Equation image: media_image6.png]

As with the output weights, some columns of Wh may be fixed to zero to give selective access to the network state. The activation of the halting unit is then used to determine the halting probability p_t^n of the intermediate steps: 

    [Equation image: media_image7.png]

and ε is a small constant (0.01 for the experiments in this paper), whose purpose is to allow computation to halt after a single update if h_t^1 >= 1 − ε, as otherwise a minimum of two updates would be required for every input step. It follows directly from the definition that ∑_{n=1}^{N(t)} p_t^n = 1 and 0 ≤ p_t^n ≤ 1 ∀n, so this is a valid probability distribution" teaches that the functions N(t) and R(t) used to compute the ponder loss are based on the halting probabilities (halting scores)).
	Fayyaz et al. and Elbayad et al. are analogous to the claimed invention because they are directed to efficient implementation of transformer neural networks.
	Graves et al. is analogous to the claimed invention because it is directed to the use of halting scores for tokens for neural network efficiency.
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to incorporate wherein computing the one or more losses comprises computing a ponder loss based on the second set of halting scores and a set of layers included in the transformer neural network associated with halting the second set of tokens as taught by Graves et al. to the disclosed invention of Fayyaz et al. in view of Elbayad et al.
One of ordinary skill in the art would have been motivated to make this modification "to dynamically adapt the amount of computation it uses to the demands of the data" (Graves et al. Section 4, first paragraph).
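For illustration only, the ponder loss discussed for claim 3 can be sketched in Python. The equations themselves appear in this action only as images, so the definitions used below are stated as assumptions based on a common reading of the adaptive computation time formulation: N(t) is the first update at which the running sum of halting activations reaches 1 − ε, R(t) is one minus the sum of the activations before that update, the per-step ponder is N(t) + R(t), P(x) is the sum over steps, and the training loss is L(x, y) + τ·P(x). The function names are hypothetical.

```python
def steps_and_remainder(halting, eps=0.01):
    """Return (N, R) for one input step from its halting activations."""
    if not halting:
        raise ValueError("need at least one halting activation")
    total = 0.0
    for n, h in enumerate(halting, start=1):
        if total + h >= 1.0 - eps:
            # Remainder is what is left after the earlier activations.
            return n, 1.0 - total
        total += h
    # No activation reached the limit: force a halt on the last update.
    return len(halting), 1.0 - (total - halting[-1])


def ponder_cost(halting_per_step, eps=0.01):
    """Sum of N(t) + R(t) over all input steps."""
    cost = 0.0
    for halting in halting_per_step:
        n_t, r_t = steps_and_remainder(halting, eps)
        cost += n_t + r_t
    return cost


def loss_with_ponder(task_loss, halting_per_step, tau, eps=0.01):
    """Task loss plus the time penalty tau times the ponder cost."""
    return task_loss + tau * ponder_cost(halting_per_step, eps)
```

Under these assumptions, a step whose first activation already reaches 1 − ε halts after a single update, which matches the role of ε described in the quoted passage.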
Regarding Claim 9,
Fayyaz et al. in view of Elbayad et al. teaches the computer-implemented method of claim 1.
Fayyaz et al. in view of Elbayad et al. does not appear to explicitly teach wherein computing the first set of halting scores for the first set of tokens comprises aggregating a second set of halting scores computed for the first set of tokens by a layer included in the transformer neural network that precedes the first layer and a third set of halting scores computed for the first set of tokens by the first layer.
However, Graves et al. teaches wherein computing the first set of halting scores for the first set of tokens comprises aggregating a second set of halting scores computed for the first set of tokens by a layer included in the transformer neural network that precedes the first layer and a third set of halting scores computed for the first set of tokens by the first layer (Equations 5-8; Section 2, third paragraph: "To determine how many updates R performs at each input step an extra sigmoidal halting unit h is added to the network output, with associated weight matrix Wh and bias bh:

    [Equation image: media_image6.png]

As with the output weights, some columns of Wh may be fixed to zero to give selective access to the network state. The activation of the halting unit is then used to determine the halting probability p_t^n of the intermediate steps: 

    [Equation image: media_image7.png]

and ε is a small constant (0.01 for the experiments in this paper), whose purpose is to allow computation to halt after a single update if h_t^1 >= 1 − ε, as otherwise a minimum of two updates would be required for every input step. It follows directly from the definition that ∑_{n=1}^{N(t)} p_t^n = 1 and 0 ≤ p_t^n ≤ 1 ∀n, so this is a valid probability distribution" teaches that the halting probabilities (halting scores) are computed based on aggregating previous (preceding layer) halting scores (second set of halting scores) with the current halting score (third set of halting scores)).
	Fayyaz et al. and Elbayad et al. are analogous to the claimed invention because they are directed to efficient implementation of transformer neural networks.
	Graves et al. is analogous to the claimed invention because it is directed to the use of halting scores for tokens for neural network efficiency.
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to incorporate wherein computing the first set of halting scores for the first set of tokens comprises aggregating a second set of halting scores computed for the first set of tokens by a layer included in the transformer neural network that precedes the first layer and a third set of halting scores computed for the first set of tokens by the first layer as taught by Graves et al. to the disclosed invention of Fayyaz et al. in view of Elbayad et al.
One of ordinary skill in the art would have been motivated to make this modification "to dynamically adapt the amount of computation it uses to the demands of the data" (Graves et al. Section 4, first paragraph).
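For illustration only, the aggregation discussed for claim 9 (combining halting scores from a preceding layer with those computed by the current layer) can be sketched in Python as a running per-token sum. The names `accumulate` and `halted_mask` are hypothetical, and the use of 1 − ε as the halting limit is an assumption taken from the role of ε in the quoted passage rather than from the claim language.

```python
def accumulate(prev_cumulative, layer_scores):
    """Add the current layer's per-token scores to the running totals."""
    if len(prev_cumulative) != len(layer_scores):
        raise ValueError("one score per token is required")
    return [c + s for c, s in zip(prev_cumulative, layer_scores)]


def halted_mask(cumulative, eps=0.01):
    """True for each token whose aggregated score has reached 1 - eps."""
    return [c >= 1.0 - eps for c in cumulative]
```

For two tokens with earlier totals `[0.5, 0.2]` and current-layer scores `[0.6, 0.3]`, the aggregated scores are about `[1.1, 0.5]`, so only the first token would be treated as halted.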
Regarding Claim 13,
	Fayyaz et al. in view of Elbayad et al. teaches the one or more non-transitory computer-readable media of claim 12.
Fayyaz et al. in view of Elbayad et al. does not appear to explicitly teach wherein computing the one or more losses comprises computing a ponder loss based on the second set of halting scores and a set of layers included in the transformer neural network associated with halting the second set of tokens.
However, Graves et al. teaches wherein computing the one or more losses comprises computing a ponder loss based on the second set of halting scores and a set of layers included in the transformer neural network associated with halting the second set of tokens (Equation 11; Equation 12; Section 2.1: "If no constraints are placed on the number of updates R can take at each step it will naturally tend to ‘ponder’ each input for as long as possible (so as to avoid making predictions and incurring errors). We therefore require a way of limiting the amount of computation the network performs. Given a length T input sequence x, define the ponder sequence (ρ1, . . . , ρT ) of R as

    [Equation image: media_image4.png]

Since R(t) ∈ (0, 1), P(x) is an upper bound on the (non-differentiable) property we ultimately want to reduce, namely the total computation ∑_{t=1}^{T} N(t) during the sequence. We can encourage the network to minimize P(x) by modifying the sequence loss function L(x, y) used for training: 

    [Equation image: media_image5.png]

where τ is a time penalty parameter that weights the relative cost of computation versus error" teaches computing a ponder loss based on the halting probabilities (halting scores) and the associated layers at each step. Equations 5-8; Section 2, third paragraph: "To determine how many updates R performs at each input step an extra sigmoidal halting unit h is added to the network output, with associated weight matrix Wh and bias bh:

    [Equation image: media_image6.png]

As with the output weights, some columns of Wh may be fixed to zero to give selective access to the network state. The activation of the halting unit is then used to determine the halting probability p_t^n of the intermediate steps: 

    [Equation image: media_image7.png]

and ε is a small constant (0.01 for the experiments in this paper), whose purpose is to allow computation to halt after a single update if h_t^1 >= 1 − ε, as otherwise a minimum of two updates would be required for every input step. It follows directly from the definition that ∑_{n=1}^{N(t)} p_t^n = 1 and 0 ≤ p_t^n ≤ 1 ∀n, so this is a valid probability distribution" teaches that the functions N(t) and R(t) used to compute the ponder loss are based on the halting probabilities (halting scores)).
	Fayyaz et al. and Elbayad et al. are analogous to the claimed invention because they are directed to efficient implementation of transformer neural networks.
	Graves et al. is analogous to the claimed invention because it is directed to the use of halting scores for tokens for neural network efficiency.
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to incorporate wherein computing the one or more losses comprises computing a ponder loss based on the second set of halting scores and a set of layers included in the transformer neural network associated with halting the second set of tokens as taught by Graves et al. to the disclosed invention of Fayyaz et al. in view of Elbayad et al.
One of ordinary skill in the art would have been motivated to make this modification "to dynamically adapt the amount of computation it uses to the demands of the data" (Graves et al. Section 4, first paragraph).
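For illustration only, the halting computation described in the quoted passage of Graves et al. can be sketched as follows. The sketch assumes the usual adaptive computation time definitions, which appear in the equation images rather than in the quoted text: N(t) is the first update at which the running sum of halting activations reaches 1 − ε, the remainder R(t) is one minus the sum of the earlier activations, and the ponder cost is N(t) + R(t). The function name act_halting and the example values are hypothetical.

```python
def act_halting(h, eps=0.01):
    """Return (p, n_updates, remainder) for one input step.

    h   : halting-unit activations h_t^1, h_t^2, ... (each in [0, 1])
    eps : the small constant from the quoted passage (0.01 there)
    Assumed definitions: N(t) is the first n whose running sum of h
    reaches 1 - eps; R(t) = 1 - sum of the activations before N(t).
    """
    if not h:
        raise ValueError("at least one halting activation is required")
    cumulative = 0.0
    for n, h_n in enumerate(h, start=1):
        # Halt when the running sum reaches 1 - eps, or at the last
        # available update (a cap that this sketch adds).
        if cumulative + h_n >= 1.0 - eps or n == len(h):
            remainder = 1.0 - cumulative
            # p_t^n equals h_t^n before the final update and R(t) at it.
            p = list(h[: n - 1]) + [remainder]
            return p, n, remainder
        cumulative += h_n


p, n_updates, remainder = act_halting([0.1, 0.3, 0.7, 0.9])
ponder_cost = n_updates + remainder  # assumed form N(t) + R(t)
```

With these example activations the computation halts at the third update, and the resulting probabilities sum to one, consistent with the quoted statement that they form a valid probability distribution.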
Regarding Claim 17,
	Fayyaz et al. in view of Elbayad et al. teaches the one or more non-transitory computer-readable media of claim 11.
Fayyaz et al. in view of Elbayad et al. does not appear to explicitly teach wherein computing the first set of halting scores for the first set of tokens comprises summing a second set of halting scores computed for the first set of tokens by a layer included in the transformer neural network that precedes the first layer and a third set of halting scores computed for the first set of tokens by the first layer.
However, Graves et al. teaches wherein computing the first set of halting scores for the first set of tokens comprises summing a second set of halting scores computed for the first set of tokens by a layer included in the transformer neural network that precedes the first layer and a third set of halting scores computed for the first set of tokens by the first layer (Equations 5-8; Section 2, third paragraph: "To determine how many updates R performs at each input step an extra sigmoidal halting unit h is added to the network output, with associated weight matrix Wh and bias bh:

    [Equation image: media_image6.png]

As with the output weights, some columns of Wh may be fixed to zero to give selective access to the network state. The activation of the halting unit is then used to determine the halting probability $p_t^n$ of the intermediate steps:

    [Equation image: media_image7.png]

and ε is a small constant (0.01 for the experiments in this paper), whose purpose is to allow computation to halt after a single update if $h_t^1$ ≥ 1 − ε, as otherwise a minimum of two updates would be required for every input step. It follows directly from the definition that $\sum_{n=1}^{N(t)} p_t^n = 1$ and 0 ≤ $p_t^n$ ≤ 1 ∀n, so this is a valid probability distribution" teaches that the halting probabilities (halting scores) are computed based on aggregating previous (preceding layer) halting scores (second set of halting scores) with the current halting score (third set of halting scores)).
	Fayyaz et al. and Elbayad et al. are analogous to the claimed invention because they are directed to efficient implementation of transformer neural networks.
	Graves et al. is analogous to the claimed invention because it is directed to the use of halting scores for tokens for neural network efficiency.
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to incorporate wherein computing the first set of halting scores for the first set of tokens comprises summing a second set of halting scores computed for the first set of tokens by a layer included in the transformer neural network that precedes the first layer and a third set of halting scores computed for the first set of tokens by the first layer as taught by Graves et al. to the disclosed invention of Fayyaz et al. in view of Elbayad et al.
One of ordinary skill in the art would have been motivated to make this modification "to dynamically adapt the amount of computation it uses to the demands of the data" (Graves et al. Section 4, first paragraph).
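The summing of per-layer halting scores recited in claim 17 can be pictured with a short sketch. It is an illustration under stated assumptions, not the claimed or cited implementation: the function name accumulate_halting, the threshold of 1.0, and the score values are all hypothetical.

```python
def accumulate_halting(prev_scores, layer_scores, threshold=1.0):
    """Sum each token's earlier halting score with the current layer's score.

    prev_scores  : scores accumulated by preceding layers, one per token
    layer_scores : scores computed by the current layer, one per token
    Returns the cumulative scores and a per-token flag that is False once
    the cumulative score exceeds the (assumed) threshold.
    """
    if len(prev_scores) != len(layer_scores):
        raise ValueError("one score per token is required in both lists")
    cumulative = [p + s for p, s in zip(prev_scores, layer_scores)]
    # Tokens flagged False would be skipped by subsequent layers.
    still_active = [c <= threshold for c in cumulative]
    return cumulative, still_active


cumulative, still_active = accumulate_halting([0.2, 0.7, 0.5], [0.3, 0.4, 0.1])
```

In this example only the second token crosses the threshold, so it alone would be withheld from later layers.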

Claims 4 and 14 are rejected under 35 U.S.C. 103 as being unpatentable over Fayyaz et al. (US 2023/0153379 A1) in view of Elbayad et al. ("Depth-Adaptive Transformer") and further in view of Banino et al. ("PonderNet: Learning to Ponder").
Regarding Claim 4,
Fayyaz et al. in view of Elbayad et al. teaches the computer-implemented method of claim 2.
Fayyaz et al. in view of Elbayad et al. does not appear to explicitly teach wherein computing the one or more losses comprises: aggregating the second set of halting scores into a distribution of halting scores across a series of layers included in the transformer neural network; and computing a distributional loss based on a divergence of the distribution of halting scores from a target distribution.
However, Banino et al. teaches wherein computing the one or more losses comprises: aggregating the second set of halting scores into a distribution of halting scores across a series of layers included in the transformer neural network (Eq. 1; Eq. 2; Section 2.2: "The PonderNet architecture requires a step function s of the form ˆyn, hn+1, λn = s(x, hn), as well as an initial state h0. The output ˆyn and λn are respectively the network’s prediction and scalar probability of halting at step n. The step function s can be any neural network, such as MLPs, LSTMs, or encoder-decoder architectures such as transformers. We apply the step function recurrently up to N times.
The output ˆyn is a learned prediction conditioned on the dynamic number of steps n ∈ {1, . . . , N}. We rely on the value of λn to learn the optimal value of n. We define a Bernoulli random variable Λn in order to represent a Markov process for the halting with two states “continue” (Λn = 0) and “halt” (Λn = 1). The decision process starts from state “continue” (Λ0 = 0). We set the transition probability:

    [Equation image: media_image8.png]

that is the conditional probability of entering state “halt” at step n conditioned that there has been no previous halting. Note that “halt” is a terminal state. We can then estimate the unconditioned probability that the halting happened in steps 0, 1, 2, ..., N where N is the maximum number of steps allowed before halting. We derive this probability distribution pn as a generalization of the geometric distribution:

    [Equation image: media_image9.png]

which is a valid probability distribution" teaches aggregating the halting probabilities (halting scores) across each step N (each layer) for the transformer neural network into a distribution of halting probabilities (halting scores). Section D.2, first paragraph: "we set an upper bound N to the number of layers" teaches that N corresponds to the number of layers); and 
computing a distributional loss based on a divergence of the distribution of halting scores from a target distribution (Eq. 3; Section 2.4: "The total loss is composed of reconstruction LRec and regularization LReg terms:

    [Equation image: media_image10.png]

where L is a pre-defined loss for the prediction (usually mean squared error, or cross-entropy); and λp is a hyper-parameter that defines a geometric prior distribution pG(λp) on the halting policy (truncated at N). LRec is the expectation of the pre-defined reconstruction loss L across halting steps. LReg is the KL divergence between the distribution of halting probabilities pn and the prior (a geometric distribution truncated at N, parameterized by λp). This hyper-parameter defines a prior on how likely it is that the network will halt at each step. This regularization serves two purposes. First, it biases the network towards the expected prior number of steps 1/λp. Second, it provides an incentive to give a non-zero probability to all possible number of steps, thus promoting exploration" teaches computing a loss LReg (distributional loss) based on a KL divergence between the distribution of halting probabilities (halting scores) and a geometric prior distribution (target distribution)).
	Fayyaz et al., Elbayad et al., and Banino et al. are analogous to the claimed invention because they are directed to efficient implementation of transformer neural networks.
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to incorporate wherein computing the one or more losses comprises: aggregating the second set of halting scores into a distribution of halting scores across a series of layers included in the transformer neural network; and computing a distributional loss based on a divergence of the distribution of halting scores from a target distribution as taught by Banino et al. to the disclosed invention of Fayyaz et al. in view of Elbayad et al.
One of ordinary skill in the art would have been motivated to make this modification "to achieve an effective compromise between training prediction accuracy, computational cost and generalization" (Banino et al. Abstract).
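As a rough illustration of the mapping applied to claim 4, the sketch below aggregates per-step halting probabilities into a distribution and measures its KL divergence from a truncated geometric prior. It assumes that the unconditioned halting probability at step n is λn multiplied by the probability of not having halted earlier, which matches the quoted description of a generalized geometric distribution but is shown only in the equation images. Renormalizing the truncated prior is a further assumption, and all function names and values are hypothetical.

```python
import math


def halting_distribution(lambdas):
    """Turn conditional halting probabilities into unconditioned ones.

    Assumed form: p_n = lambda_n * prod_{j<n} (1 - lambda_j). The result
    sums to one only if the final lambda is 1.0 (forced halt at step N).
    """
    probs, not_yet_halted = [], 1.0
    for lam in lambdas:
        probs.append(lam * not_yet_halted)
        not_yet_halted *= 1.0 - lam
    return probs


def truncated_geometric(lambda_p, n_steps):
    """Geometric prior truncated at n_steps and renormalized (assumption)."""
    raw = [lambda_p * (1.0 - lambda_p) ** k for k in range(n_steps)]
    total = sum(raw)
    return [r / total for r in raw]


def kl_divergence(p, q):
    """KL(p || q) for discrete distributions; terms with p_i = 0 contribute 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


p = halting_distribution([0.5, 0.5, 1.0])
prior = truncated_geometric(0.5, 3)
regularizer = kl_divergence(p, prior)
```

The regularizer is zero only when the halting distribution equals the prior, which is the sense in which the quoted LReg term biases the network toward the prior number of steps.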
Regarding Claim 14,
	Fayyaz et al. in view of Elbayad et al. teaches the one or more non-transitory computer-readable media of claim 12.
Fayyaz et al. in view of Elbayad et al. does not appear to explicitly teach wherein computing the one or more losses comprises: aggregating the second set of halting scores into a distribution of halting scores across a series of layers included in the transformer neural network; and computing a distributional loss based on a Kullback-Leibler divergence of the distribution of halting scores from a target distribution.
However, Banino et al. teaches wherein computing the one or more losses comprises: aggregating the second set of halting scores into a distribution of halting scores across a series of layers included in the transformer neural network (Eq. 1; Eq. 2; Section 2.2: "The PonderNet architecture requires a step function s of the form ˆyn, hn+1, λn = s(x, hn), as well as an initial state h0. The output ˆyn and λn are respectively the network’s prediction and scalar probability of halting at step n. The step function s can be any neural network, such as MLPs, LSTMs, or encoder-decoder architectures such as transformers. We apply the step function recurrently up to N times.
The output ˆyn is a learned prediction conditioned on the dynamic number of steps n ∈ {1, . . . , N}. We rely on the value of λn to learn the optimal value of n. We define a Bernoulli random variable Λn in order to represent a Markov process for the halting with two states “continue” (Λn = 0) and “halt” (Λn = 1). The decision process starts from state “continue” (Λ0 = 0). We set the transition probability:

    [Equation image: media_image8.png]

that is the conditional probability of entering state “halt” at step n conditioned that there has been no previous halting. Note that “halt” is a terminal state. We can then estimate the unconditioned probability that the halting happened in steps 0, 1, 2, ..., N where N is the maximum number of steps allowed before halting. We derive this probability distribution pn as a generalization of the geometric distribution:

    [Equation image: media_image9.png]

which is a valid probability distribution" teaches aggregating the halting probabilities (halting scores) across each step N (each layer) for the transformer neural network into a distribution of halting probabilities (halting scores). Section D.2, first paragraph: "we set an upper bound N to the number of layers" teaches that N corresponds to the number of layers); and 
computing a distributional loss based on a Kullback-Leibler divergence of the distribution of halting scores from a target distribution (Eq. 3; Section 2.4: "The total loss is composed of reconstruction LRec and regularization LReg terms: 

    [Equation image: media_image10.png]

where L is a pre-defined loss for the prediction (usually mean squared error, or cross-entropy); and λp is a hyper-parameter that defines a geometric prior distribution pG(λp) on the halting policy (truncated at N). LRec is the expectation of the pre-defined reconstruction loss L across halting steps. LReg is the KL divergence between the distribution of halting probabilities pn and the prior (a geometric distribution truncated at N, parameterized by λp). This hyper-parameter defines a prior on how likely it is that the network will halt at each step. This regularization serves two purposes. First, it biases the network towards the expected prior number of steps 1/λp. Second, it provides an incentive to give a non-zero probability to all possible number of steps, thus promoting exploration" teaches computing a loss LReg (distributional loss) based on a KL divergence between the distribution of halting probabilities (halting scores) and a geometric prior distribution (target distribution)).
	Fayyaz et al., Elbayad et al., and Banino et al. are analogous to the claimed invention because they are directed to efficient implementation of transformer neural networks.
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to incorporate wherein computing the one or more losses comprises: aggregating the second set of halting scores into a distribution of halting scores across a series of layers included in the transformer neural network; and computing a distributional loss based on a Kullback-Leibler divergence of the distribution of halting scores from a target distribution as taught by Banino et al. to the disclosed invention of Fayyaz et al. in view of Elbayad et al.
One of ordinary skill in the art would have been motivated to make this modification "to achieve an effective compromise between training prediction accuracy, computational cost and generalization" (Banino et al. Abstract).

Claims 5 and 15 are rejected under 35 U.S.C. 103 as being unpatentable over Fayyaz et al. (US 2023/0153379 A1) in view of Elbayad et al. ("Depth-Adaptive Transformer") and further in view of Jiang et al. ("All Tokens Matter: Token Labeling for Training Better Vision Transformers").
Regarding Claim 5,
Fayyaz et al. in view of Elbayad et al. teaches the computer-implemented method of claim 2.
Fayyaz et al. in view of Elbayad et al. does not appear to explicitly teach wherein computing the one or more losses comprises computing a task loss associated with a prediction generated by a task network based on the second set of tokens.
However, Jiang et al. teaches wherein computing the one or more losses comprises computing a task loss associated with a prediction generated by a task network based on the second set of tokens (Eq. 1; Section 3.1, second paragraph: "In loss computing, the class token from the output tokens of the last transformer block is usually selected and sent into a linear layer for the classification score prediction. Mathematically, given an image I, denote the output of the last transformer block as [Xcls, X1, ..., XN], where N is the total number of patch tokens, and Xcls and X1, ..., XN correspond to the class token and the patch tokens, respectively. The classification loss for image I can be written as

    [Equation image: media_image11.png]

where H(·, ·) is the SoftMax cross-entropy loss and ycls is the class label" teaches computing a classification loss (task loss) associated with a classification score prediction in a linear layer (task network) based on a SoftMax cross entropy loss (weighted sum) using values of class tokens).
	Fayyaz et al., Elbayad et al., and Jiang et al. are analogous to the claimed invention because they are directed to efficient implementation of transformer neural networks.
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to incorporate wherein computing the one or more losses comprises computing a task loss associated with a prediction generated by a task network based on the second set of tokens as taught by Jiang et al. to the disclosed invention of Fayyaz et al. in view of Elbayad et al.
One of ordinary skill in the art would have been motivated to make this modification "to improve the object grounding and recognition capabilities of vision transformers with negligible computation overhead" (Jiang et al. Section 1, third paragraph).
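The classification loss described in the quoted passage of Jiang et al., in which the class token is sent through a linear layer and scored with a SoftMax cross-entropy loss, can be sketched as follows. This is a minimal illustration rather than the reference's code: the function names, weights, and token values are hypothetical.

```python
import math


def softmax_cross_entropy(logits, label):
    """Cross-entropy of softmax(logits) against an integer class label."""
    m = max(logits)  # subtract the max for numerical stability
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[label]


def classification_loss(class_token, weights, bias, label):
    """Apply a linear layer to the class token, then softmax cross-entropy.

    weights holds one row per class; each row has one entry per token
    dimension. All values here are illustrative.
    """
    logits = [
        sum(w * x for w, x in zip(row, class_token)) + b
        for row, b in zip(weights, bias)
    ]
    return softmax_cross_entropy(logits, label)


loss = classification_loss(
    class_token=[1.0, 2.0],
    weights=[[1.0, 0.0], [0.0, 1.0]],
    bias=[0.0, 0.0],
    label=1,
)
```

Only the class token enters the loss in this sketch; the patch tokens are not used, in line with the quoted description of selecting the class token from the last transformer block.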
Regarding Claim 15,
	Fayyaz et al. in view of Elbayad et al. teaches the one or more non-transitory computer-readable media of claim 12.
Fayyaz et al. in view of Elbayad et al. does not appear to explicitly teach wherein computing the one or more losses comprises computing a task loss associated with a prediction generated by a task network based on a weighted sum of values of a class token included in the second set of tokens.
However, Jiang et al. teaches wherein computing the one or more losses comprises computing a task loss associated with a prediction generated by a task network based on a weighted sum of values of a class token included in the second set of tokens (Eq. 1; Section 3.1, second paragraph: "In loss computing, the class token from the output tokens of the last transformer block is usually selected and sent into a linear layer for the classification score prediction. Mathematically, given an image I, denote the output of the last transformer block as [Xcls, X1, ..., XN], where N is the total number of patch tokens, and Xcls and X1, ..., XN correspond to the class token and the patch tokens, respectively. The classification loss for image I can be written as

    [Equation image: media_image11.png]

where H(·, ·) is the SoftMax cross-entropy loss and ycls is the class label" teaches computing a classification loss (task loss) associated with a classification score prediction in a linear layer (task network) based on a SoftMax cross entropy loss (weighted sum) using values of class tokens).
	Fayyaz et al., Elbayad et al., and Jiang et al. are analogous to the claimed invention because they are directed to efficient implementation of transformer neural networks.
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to incorporate wherein computing the one or more losses comprises computing a task loss associated with a prediction generated by a task network based on a weighted sum of values of a class token included in the second set of tokens as taught by Jiang et al. to the disclosed invention of Fayyaz et al. in view of Elbayad et al.
One of ordinary skill in the art would have been motivated to make this modification "to improve the object grounding and recognition capabilities of vision transformers with negligible computation overhead" (Jiang et al. Section 1, third paragraph).

Claim 19 is rejected under 35 U.S.C. 103 as being unpatentable over Fayyaz et al. (US 2023/0153379 A1) in view of Elbayad et al. ("Depth-Adaptive Transformer") and further in view of Touvron et al. ("Training data-efficient image transformers & distillation through attention").
Regarding Claim 19,
	Fayyaz et al. in view of Elbayad et al. teaches the one or more non-transitory computer-readable media of claim 11.
Fayyaz et al. in view of Elbayad et al. does not appear to explicitly teach wherein the instructions further cause the one or more processors to perform the step of converting a set of patches included in an input image into the first set of tokens.
However, Touvron et al. teaches wherein the instructions further cause the one or more processors to perform the step of converting a set of patches included in an input image into the first set of tokens (Section 3, sixth-seventh paragraphs: "In order to get a transformer to process images, our work builds upon the ViT model [15]. It is a simple and elegant architecture that processes input images as if they were a sequence of input tokens. The fixed-size input RGB image is decomposed into a batch of N patches of a fixed size of 16 × 16 pixels (N = 14 × 14). Each patch is projected with a linear layer that conserves its overall dimension 3 × 16 × 16 = 768. The transformer block described above is invariant to the order of the patch embeddings, and thus does not consider their relative position. The positional information is incorporated as fixed [52] or trainable [18] positional embeddings. They are added before the first transformer block to the patch tokens, which are then fed to the stack of transformer blocks" teaches converting patches of an input image into a first set of tokens).
	Fayyaz et al., Elbayad et al., and Touvron et al. are analogous to the claimed invention because they are directed to efficient implementation of transformer neural networks.
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to incorporate wherein the instructions further cause the one or more processors to perform the step of converting a set of patches included in an input image into the first set of tokens as taught by Touvron et al. to the disclosed invention of Fayyaz et al. in view of Elbayad et al.
One of ordinary skill in the art would have been motivated to make this modification "In order to get a transformer to process images" that "is invariant to the order of the patch embeddings" (Touvron et al. Section 3, sixth-seventh paragraphs).
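The patch decomposition quoted from Touvron et al. can be checked with a short sketch. It assumes a 224 × 224 RGB input, which is implied by the quoted figures of N = 14 × 14 patches of 16 × 16 pixels, and it covers only the splitting and flattening step. The linear projection and positional embeddings are omitted, and the function name image_to_patches is hypothetical.

```python
def image_to_patches(image, patch=16):
    """Split an H x W x C image (nested lists) into flattened square patches.

    Returns a list of patches in row-major order; each patch is a flat
    list of patch * patch * C values.
    """
    height, width = len(image), len(image[0])
    if height % patch or width % patch:
        raise ValueError("image dimensions must be multiples of the patch size")
    patches = []
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            flat = [
                value
                for row in image[top : top + patch]
                for pixel in row[left : left + patch]
                for value in pixel
            ]
            patches.append(flat)
    return patches


# A blank 224 x 224 RGB image, used only to check the shapes.
image = [[[0, 0, 0] for _ in range(224)] for _ in range(224)]
patches = image_to_patches(image)
```

For this input the sketch yields 196 patches of 768 values each, matching the quoted 14 × 14 patch count and the 3 × 16 × 16 = 768 dimension.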
Response to Arguments
Applicant’s arguments, filed 10/28/2025, with respect to the claim rejections under 35 U.S.C. 112(b) have been fully considered and are persuasive. Therefore, the 35 U.S.C. 112(b) rejections have been withdrawn.
Applicant's arguments, filed 10/28/2025, with respect to the 35 U.S.C. 101 abstract idea rejections of the claims have been fully considered but they are not persuasive. Applicant asserts “The Examiner rejected claims 1-20 under 35 U.S.C. § 101 as being directed to an abstract idea without reciting significantly more. See Office Action at 2-30. Applicant traverses the rejections with respect to the pending claims. A. 35 U.S.C. § 101 Guidance - Step 2A:
Applicant submits that the present claims are patent eligible under 35 U.S.C. § 101 for at least two reasons based on the 2019 Revised Patent Subject Matter Eligibility Guidance issued by the United States Patent and Trademark Office ("2019 Guidance"). See Manual of Patent Examining Procedure (MPEP) § 2106(1) (9th Ed., Rev. 10.2019, Last Revised Jun. 2020). 
First, according to the 2019 Guidance, for a claim to be an abstract idea, the claim must recite limitations that incorporate mathematical concepts or constitute mental processes or certain methods or techniques of organizing human activity. See MPEP § 2106.04(a). Applicant submits that the present claims do not recite any limitations falling within any of these enumerated groupings. 
In that regard, the present claims do not recite any methods or techniques for organizing human activities, such as fundamental economic principles or practices, commercial or legal interactions, or personal behaviors or relationships or interactions between people. See MPEP § 2106.04(a)(2)(II). 
In addition, the present claims are not directed towards mental processes because the claimed steps are computer-implemented techniques that cannot be practically performed in the human mind or using pen/paper. See MPEP § 2106.04(a)(2)(III). 
Lastly, the present claims do not recite any mathematical relations, formulas, or calculations. See MPEP § 2106.04(a)(2)(I). Importantly, the MPEP makes clear that in order for a claim to recite a mathematical concept, the claim must recite the mathematical concept itself and not merely recite limitations that are based on or involve a mathematical concept. See MPEP § 2106.04(a)(2)(I). Here, the claims recite only limitations that are based on or involve mathematical concepts - including computing certain values, such as computing a first set of halting scores for a first set of tokens - and do not constitute mathematical concepts themselves. Accordingly, because the independent claims do not recite any mathematical relationships, mathematical formulas, or mathematical calculations, the independent claims cannot be properly interpreted as reciting a mathematical concept under the MPEP.
Because none of the limitations recited in the present claims are directed towards any of the enumerated categories of abstract ideas, the present claims cannot be properly interpreted as being abstract. 
Second, the present claims recite limitations that integrate any purported abstract idea into a practical application. See MPEP § 2106.04(d). In that regard, the claimed approach is directed towards the practical application of adaptive token depth adjustment in transformer neural networks. See Application at paragraphs [0077]-[0079]. A set of tokens is generated from discrete portions of input data and iteratively processed by a series of transformer blocks included in the transformer neural network. See id. A halting module after each transformer block computes a halting score for each token from one or more dimensions in the token. See id. The halting module also computes a cumulative halting score for each token as a sum or another aggregation of existing halting scores for the token. See id. When the cumulative halting score exceeds a threshold, the token is not processed by subsequent transformer blocks or halting modules. See id.
Through this practical application, the claimed approach imparts the technological improvement of reducing the number of tokens processed by a transformer neural network as inferencing operations proceed. See id. Accordingly, with the disclosed techniques, the transformer neural network can execute more quickly and efficiently than a conventional transformer neural network that processes all input tokens using all layers. See id. The improvements in execution speed and efficiency additionally enable transformer neural networks to be deployed on mobile phones, autonomous vehicles, or other edge devices with limited computational capabilities, memory, power, and/or network bandwidth. See id. The disclosed techniques also enable a transformer neural network to be trained in a way that balances the accuracy of the transformer neural network in performing a task with the efficiency with which the transformer neural network performs the task. See id. 
As the foregoing illustrates, any purported abstract idea recited in the present claims is integrated into a practical application. Accordingly, the present claims are subject-matter eligible. 
Because the present claims do not recite an abstract idea, and because the present claims recite limitations that integrate any purported abstract idea into a practical application, the present claims are subject-matter eligible under Step 2A of the 2019 Guidance. 
B. Federal Circuit Case Law: 
The Federal Circuit has ruled in numerous cases that claims directed towards technological solutions to technological problems are not abstract under the two-step Alice test. Applicant submits that the present claims are similarly directed towards a technological solution to a technological problem. 
In that regard, the present Application makes clear that a technical problem that existed in the prior art prior to the development of the claimed approach was that transformers incur more latency and resource overhead than other neural network architectures. See Application at paragraph [0005]. In particular, the series of matrix multiplication operations used to compute attention scores at each transformer block is usually more computationally intensive than the operations involved in executing convolutional neural networks, recurrent neural networks, or other non-transformer neural network architectures. See id. For example, the computational cost of the attention unit in a transformer neural network architecture could scale quadratically with the number of tokens, while the computational cost of a recurrent neural network could scale only linearly with the number of inputs. See id. The high computational costs associated with transformers limit the usefulness of transformers in devices or environments with limited computational capabilities, power, memory, and/or network bandwidth. See id.
The present Application also makes clear that one of the technical advantages of the claimed approach is that the claimed approach reduces the number of tokens processed by a transformer neural network as inferencing operations proceed. See Application at paragraph [0008]. Accordingly, with the disclosed techniques, the transformer neural network can execute more quickly and efficiently than a conventional transformer neural network that processes all input tokens using all layers. See id. The improvements in execution speed and efficiency additionally enable transformer neural networks to be deployed on mobile phones, autonomous vehicles, or other edge devices with limited computational capabilities, memory, power, and/or network bandwidth. See id. Further, with the disclosed techniques, the transformer neural network can be trained in a way that balances the accuracy of the transformer neural network in performing a task with the efficiency with which the transformer neural network performs the task. See id.
Thus, among other things, the claimed approach solves the above technical problem that existed in the prior art. Further, each of the independent claims recites the limitations of computing a first set of halting scores for a first set of tokens that has been input into a first layer of the transformer neural network, determining that a first halting score included in the first set of halting scores exceeds a threshold value, and in response to the first halting score exceeding the threshold value, causing a first token that is included in the first set of tokens and is associated with the first halting score not to be processed by one or more layers within the transformer neural network that are subsequent to the first layer. These limitations are specific to imparting the technological improvement of the claimed approach. See Application at Paragraphs [0008], [0077]-[0079]. Accordingly, the present claims are subject-matter eligible under the legal rule set forth in Ex Parte Desjardins et al., Appeal 2024-000567 (Decided September 26, 2025) (finding claims directed to improvements to how a machine learning model operates to be patent eligible because the claims, when considered as a whole, integrate an abstract idea into a practical application), Finjan, Inc. v. Blue Coat Sys., Inc, 879 F.3d 1299 (Fed. Cir. 2018) and McRO, Inc. v. Bandai Namco Games America Inc, 837 F.3d 1299 (Fed. Cir. 2016) (claims that recite specifically limited steps or elements that effect a technological improvement or useful result are not abstract), under the legal rule set forth in Visual Memory LLC v. NVIDIA Corp., 867 F.3d 1253 (Fed. Cir. 2017) (claims directed towards a technological improvement are not abstract), the rule set forth in Data Engine Techs. LLC v. Google LLC, 906 F.3d 999 (Fed. Cir. 2018) and Enfish LLC v. Microsoft Corp., 822 F.3d 1327 (Fed. Cir. 2016) (claims directed towards an improvement in the functioning or operation of a computer or computer network are not abstract), and the legal rule set forth in Weisner v. Google LLC, No. 2021-2228 (Fed. Cir. 2022) and DDR Holdings, LLC v. Hotels.com, L.P., 773 F.3d 1245 (Fed. Cir. 2014) (claims directed towards a technical solution to a technical problem necessarily recite more than an abstract idea).
In addition, Federal Circuit case law has provided that claims that recite a technical improvement and details of the claimed approach that do not preempt all approaches that achieve the technical improvement are not directed towards an abstract idea. See PowerBlock Holdings, Inc. v. iFit, Inc., No. 24-1177 (Fed. Cir. Aug. 11, 2025). Similar to the claims at issue in PowerBlock, as discussed above, the present claims impart the technical improvement of reducing the number of tokens processed by a transformer neural network as inferencing operations proceed, thereby allowing the transformer neural network to execute more quickly and efficiently than a conventional transformer neural network that processes all input tokens using all layers. Further, the present claims impart the technical improvement of enabling transformer neural networks to be deployed on edge devices with limited computational capabilities, memory, power, and/or network bandwidth. Further, with the disclosed techniques, the transformer neural network can be trained in a way that balances the accuracy of the transformer neural network in performing a task with the efficiency with which the transformer neural network performs the task. The present claims also recite technical details of the claimed approach, including computing a first set of halting scores for a first set of tokens that has been input into a first layer of the transformer neural network, determining that a first halting score included in the first set of halting scores exceeds a threshold value, and in response to the first halting score exceeding the threshold value, causing a first token that is included in the first set of tokens and is associated with the first halting score not to be processed by one or more layers within the transformer neural network that are subsequent to the first layer. Accordingly, the present claims are patent-eligible for at least the reasons outlined by the Federal Circuit in the PowerBlock case.
As the foregoing illustrates, any purported abstract idea recited in the present claims is necessarily integrated into a practical application. Accordingly, the present claims are not directed towards an abstract idea. 
Because the present claims do not recite an abstract idea, and because the present claims recite limitations that integrate any purported abstract idea into a practical application, the present claims are subject-matter eligible under Step 2A of the 2019 Guidance” (Remarks Pages 8-13).
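For orientation only, the scaling argument in the remarks quoted above can be put in rough numbers. The sketch below is an illustration, not part of the record: the cost formulas are common back-of-the-envelope approximations, and the function names, the embedding dimension of 64, and the per-layer token counts are all hypothetical values chosen for the example.

```python
def attention_cost(num_tokens: int, dim: int) -> int:
    """Approximate multiply-adds for the QK^T and attention-times-V products.

    Both products touch every pair of tokens, so the cost grows with the
    square of the token count (assumed approximation).
    """
    return 2 * num_tokens * num_tokens * dim


def recurrent_cost(num_inputs: int, dim: int) -> int:
    """Approximate multiply-adds for a recurrent pass: one dim x dim update per input."""
    return num_inputs * dim * dim


def total_attention_cost(tokens_per_layer, dim: int) -> int:
    """Sum the attention cost over layers, given how many tokens each layer sees."""
    return sum(attention_cost(n, dim) for n in tokens_per_layer)


# Hypothetical 12-layer network: every layer sees all 196 tokens ...
full = total_attention_cost([196] * 12, 64)
# ... versus a schedule in which tokens drop out as inference proceeds.
reduced = total_attention_cost(
    [196, 196, 160, 160, 120, 120, 90, 90, 60, 60, 40, 40], 64
)
```

Under these assumptions, doubling the token count quadruples the attention cost but only doubles the recurrent cost, and the shrinking schedule costs less than the full one. The actual savings in any real network depend on details not modeled here.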
Examiner’s Response:
	The examiner respectfully disagrees. Applicant has made general assertions that claim 1 recites claim elements that are not directed to an abstract idea and that even if the claim elements are directed to an abstract idea, the judicial exceptions are integrated into a practical application because the claims recite elements that cannot reasonably be characterized as covering mental processes or reflect an improvement to a technology or technical field. Regarding the “computing a first set of halting scores for a first set of tokens that has been input into a first layer of the transformer neural network” limitation of claim 1, this limitation, under its broadest reasonable interpretation, is considered an abstract idea encompassing computing a first set of halting scores for a first set of tokens input to the first layer of the transformer neural network (corresponds to mathematical calculations; in particular, equations 3-5 in paragraphs [0039]-[0041] of the specification of the instant application show that halting scores can be mathematically calculated). Additionally, regarding the “determining that a first halting score included in the first set of halting scores exceeds a threshold value” limitation of claim 1, this limitation, under its broadest reasonable interpretation, is considered an abstract idea encompassing determining that a first halting score of the set of halting scores exceeds a threshold value (corresponds to evaluation and judgement; in particular, a human, with the assistance of pen and paper, can determine if a first halting score exceeds a threshold value).
In addition, regarding the “in response to the first halting score exceeding the threshold value, causing a first token that is included in the first set of tokens and is associated with the first halting score not to be processed by one or more layers within the transformer neural network that are subsequent to the first layer” limitation of claim 1, this limitation, under its broadest reasonable interpretation, is considered an abstract idea encompassing not processing a first token associated with the first halting score in subsequent layers of the transformer neural network if the first halting score exceeds the threshold value (corresponds to evaluation and judgement; in particular, a human, with the assistance of pen and paper, can cause the first token associated with the first halting score that exceeds a threshold to not be used for subsequent processing). Furthermore, since the “computing a first set of halting scores …”, “determining that a first halting score included in the first set of halting scores exceeds a threshold value”, and “in response to the first halting score exceeding the threshold value, causing a first token that is included in the first set of tokens and is associated with the first halting score not to be processed …” limitations are directed to a judicial exception, they cannot provide any alleged solution or improvement. See MPEP 2106.05(a): “It is important to note, the judicial exception alone cannot provide the improvement. The improvement can be provided by one or more additional elements. See the discussion of Diamond v. Diehr, 450 U.S. 175, 187 and 191-92, 209 USPQ 1, 10 (1981)) in subsection II, below.”
Thus, it is the additional elements that are analyzed to determine whether the judicial exception is integrated into a practical application, not the judicial exception itself. The additional elements in claim 1 of “a computer” and “a transformer neural network” as drafted, under their broadest reasonable interpretations, are high level recitations of applying a generic computer and neural network to implement the abstract ideas such that it amounts to no more than merely using a computer as a tool to perform generic computer functions. As such, the recitation that the abstract ideas are to be performed with such circuitry is a mere instruction to apply the judicial exception using a generic computer component. See MPEP 2106.05(f): “Another consideration when determining whether a claim integrates a judicial exception into a practical application in Step 2A Prong Two or recites significantly more than a judicial exception in Step 2B is whether the additional elements amount to more than a recitation of the words "apply it" (or an equivalent) or are more than mere instructions to implement an abstract idea or other exception on a computer. … Thus, for example, claims that amount to nothing more than an instruction to apply the abstract idea using a generic computer do not render an abstract idea eligible.” Accordingly, the additional elements do not integrate the abstract ideas into a practical application.
Furthermore, the claim does not include additional elements that are sufficient to amount to significantly more than the judicial exception. As discussed above with respect to integration of the abstract idea into a practical application, all of the additional elements are “mere instructions to apply an exception” (i.e., the additional elements describe a generic computer and neural network for applying the abstract ideas). Mere instructions to apply an exception cannot provide an inventive concept. The claim is not patent eligible.
In other words, the limitations of “computing a first set of halting scores for a first set of tokens that has been input into a first layer of the transformer neural network”, “determining that a first halting score included in the first set of halting scores exceeds a threshold value”, and “in response to the first halting score exceeding the threshold value, causing a first token that is included in the first set of tokens and is associated with the first halting score not to be processed by one or more layers within the transformer neural network that are subsequent to the first layer” are abstract ideas that are directed to a judicial exception, so they cannot provide any alleged solution or improvement. Furthermore, the additional elements recited in claim 1 are directed to mere instructions to apply an abstract idea. Therefore, claim 1 does not recite additional element(s) that can provide any alleged solution, improvement, or inventive concept. 
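For readers following the three claim 1 limitations discussed above (computing halting scores, comparing a score to a threshold, and withholding the associated token from later layers), a minimal sketch of that control flow follows. It is an illustration under stated assumptions and is not the method of the application: the halting score here is a hypothetical sigmoid of the token's first embedding element accumulated across layers (equations 3-5 of the specification are not reproduced in this action), `layer` is a trivial stand-in for a transformer block, and all names and values are invented for the example.

```python
import math


def halting_score(token):
    """Hypothetical per-layer score: sigmoid of the token's first embedding element."""
    return 1.0 / (1.0 + math.exp(-token[0]))


def layer(token):
    """Stand-in for a transformer block: a trivial elementwise update."""
    return [x + 0.5 for x in token]


def run_with_halting(tokens, num_layers, threshold):
    """Process tokens layer by layer, dropping a token once its accumulated
    halting score exceeds the threshold so later layers never see it."""
    active = dict(enumerate(tokens))           # tokens still being processed
    cumulative = {i: 0.0 for i in active}      # accumulated halting score per token
    layers_run = {i: 0 for i in active}        # how many layers each token passed through
    outputs = {}
    for _ in range(num_layers):
        for i in list(active):                 # copy keys: we remove entries while looping
            active[i] = layer(active[i])
            layers_run[i] += 1
            cumulative[i] += halting_score(active[i])
            if cumulative[i] > threshold:
                outputs[i] = active.pop(i)     # halted: excluded from subsequent layers
    outputs.update(active)                     # tokens that never halted
    return outputs, layers_run


outputs, layers_run = run_with_halting([[3.0, 0.0], [-3.0, 0.0]], num_layers=4, threshold=1.5)
```

In this toy run the first token accumulates a score above 1.5 after two layers and is withheld from layers three and four, while the second token passes through all four layers.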
Applicant relies on the arguments above regarding independent claims 11 and 20 and dependent claims 2-10 and 12-19; therefore, the response above is applicable to those claims.
Applicant's arguments, filed 10/28/2025, with respect to the 35 U.S.C. 103 prior art rejections to the claims have been fully considered but they are not persuasive. Applicant asserts “Claim 1 recites the limitations of in response to the first halting score exceeding the threshold value, causing a first token that is included in the first set of tokens and is associated with the first halting score not to be processed by one or more layers within the transformer neural network that are subsequent to the first layer. None of the cited references teaches or suggests these limitations. Therefore, no combination of the cited references can teach or suggest each and every limitation of amended claim 1. 
Fayyaz discloses techniques for increasing the efficiency of transformer-based technology by using a modified attention component. See Fayyaz at [0003]. The transformer neural network 106 receives embedding vectors 110 that represent a plurality of item tokens (e.g., image tokens 114) generated based on a data item (e.g., the image 116), and the classification token 118. See id. at [0071]. The transformer neural network 106 performs an attention operation that includes, in block 906, generating original attention information 212 based on the embedding vectors 110. See id. at [0071]. In block 908, the transformer neural network 106 generates score information 216 based on attention values in the original attention information 212 that pertain to the classification token 118. See id. at [0071]. In block 1002, the transformer neural network 106 generates modified attention information 218 by removing attention values from the original attention information 212 based on the score information 216. See id. at [0071]. In block 804, the transformer neural network 106 performs subsequent operations based on the modified attention information 218. See id. at [0071]. The subsequent operations perform fewer operations by using the modified attention information 218 rather than the original attention information 212. See id. at [0071].
The sampling component 144 produces modified attention information 218 based on the score information 216 produced by the scoring component 142. See id. at [0044]. From a high-level perspective, the sampling component 144 treats the score information as a probability distribution. See id. at [0044]. Guided by that distribution, it picks rows of attribute values in the original attention information 212 that should be retained. See id. at [0044]. Implicitly, the sampling component 144 also chooses which rows of the attribute values should be omitted (if any) in the modified attention information 218. See id. at [0044]. Rows in the original attention information are associated with respective image tokens. See id. at [0044]. Thus, the sampling component 144 can be said to choose which image tokens are allowed to contribute to the task of classifying the image 116, and which image tokens are not. See id. at [0044]. Note that the removal (or effective removal) of a row of attention values does not entirely remove the influence of an associated image token in the final classification operation. See id. at [0046]. For example, each row that is retained in the modified attention information 218, which is associated with a retained image token, still describes an extent to which the image token that has been removed is important to the retained image token's own interpretation. See id. at [0046]. In this sense, the sampling component 144 can be said to "softly" down-sample within the pipeline of image-processing operations. See id. at [0046]. The sampling component 144 does not erase the memory of the existence of image tokens that have been removed, as their influence remains in the modified attention information 218. See id. at [0046].
In the rejections, the Examiner maps the halting score and the first set of tokens, recited in claim 1, to the score information and image tokens, disclosed in Fayyaz, respectively. See Office Action at 31-32. Based on this claim mapping, to teach or suggest the above limitations of claim 1, Fayyaz would have to disclose that in response to the score information exceeding a threshold value, causing a first token that is included in the image tokens and is associated with the score information not to be processed by one or more layers within the transformer neural network that are subsequent to the first layer. Importantly, Fayyaz contains no such teachings. Instead, Fayyaz teaches using the score information to generate modified attention information by removing (or effectively removing) rows of attention values associated with respective image tokens, thereby choosing which image tokens are allowed to contribute to the task of classifying the image, and which image tokens are not. However, Fayyaz also teaches that "the removal (or effective removal) of a row of attention values does not entirely remove the influence of an associated image token in the final classification operation." See Fayyaz at [0046]. Accordingly, Fayyaz does not teach or suggest the idea of causing a token not to be processed by a given layer, let alone in response to the score information exceeding a threshold value, causing a first token that is included in the image tokens and is associated with the score information not to be processed by one or more layers within the transformer neural network that are subsequent to the first layer, as required by the claim language. In view of at least these distinctions, Applicant submits that Fayyaz cannot be properly interpreted as teaching or suggesting the above limitations of claim 1. 
A careful review of the remaining references cited by the Examiner shows that those references also fail to teach or suggest the above limitations of claim 1. 
As the foregoing illustrates, no combination of the cited references can teach or suggest each and every limitation of claim 1. Therefore, claim 1 and all claims dependent thereon are in condition for allowance in view of the cited references. Each of independent claims 11 and 20 recites limitations similar to those discussed above with respect to allowable claim 1. Therefore, claims 11 and 20 and all claims dependent thereon, respectively, are in condition for allowance for at least the reasons set forth herein” (Remarks Pages 13-15).
Examiner’s Response:
	The examiner respectfully disagrees. Regarding claim 1, the examiner respectfully disagrees with applicant’s assertion that “in response to the first halting score exceeding the threshold value, causing a first token that is included in the first set of tokens and is associated with the first halting score not to be processed by one or more layers within the transformer neural network that are subsequent to the first layer” is at least not taught by the cited prior art. In particular, examiner points to paragraph [0071] of Fayyaz et al. (US 2023/0153379 A1), which specifically discloses, with respect to Fig. 9 and Fig. 10, “FIGS. 9 and 10 together show a process 902 that explains one manner of operation of the transformer neural network 106 of FIG. 1. In block 904, the transformer neural network 106 receives embedding vectors 110 that represent a plurality of item tokens (e.g., image tokens 114) generated based on a data item (e.g., the image 116), and the classification token 118. In blocks 906, 908, and 1002, the transformer neural network 106 performs an attention operation. The attention operation includes, in block 906, generating original attention information 212 based on the embedding vectors 110, the original attention information 212 having a plurality of attention values, each attention value describing an importance that a particular token plays in an interpretation of another particular token. In block 908, the transformer neural network 106 generates score information 216 based on attention values in the original attention information 212 that pertain to the classification token 118. In block 1002, the transformer neural network 106 generates modified attention information 218 by removing attention values from the original attention information 212 based on the score information 216. In block 804, the transformer neural network 106 performs subsequent operations based on the modified attention information 218. The subsequent operations perform fewer operations by using the modified attention information 218 rather than the original attention information 212” (i.e. based on score information (e.g. halting score) for a token (e.g. the score exceeding a threshold), attention information for the token is removed from processing subsequent operations for the transformer neural network (from use in subsequent layers)).
The attention values representing a token in the set of tokens are removed from the plurality of attention values based on score information (e.g. halting score) for the token. This set of modified attention values (e.g. without the removed attention values for the token) is then used for subsequent processing operations for the transformer neural network (e.g. processing subsequent layers), meaning that the removed attention values for a token are not processed by subsequent layers of the neural network. Furthermore, regarding applicant’s argument that “Fayyaz also teaches that "the removal (or effective removal) of a row of attention values does not entirely remove the influence of an associated image token in the final classification operation." See Fayyaz at [0046]. Accordingly, Fayyaz does not teach or suggest the idea of causing a token not to be processed by a given layer”, the examiner respectfully disagrees.
In the limitation “in response to the first halting score exceeding the threshold value, causing a first token that is included in the first set of tokens and is associated with the first halting score not to be processed by one or more layers within the transformer neural network that are subsequent to the first layer” of claim 1, “causing a first token that is included in the first set of tokens and is associated with the first halting score not to be processed by one or more layers within the transformer neural network that are subsequent to the first layer”, under its broadest reasonable interpretation, encompasses not using the first token in subsequent processing of the transformer neural network after the first layer; it does not require that the first token, once removed from subsequent processing, have no influence on a final operation. Therefore, the limitation is taught by paragraph [0071] of Fayyaz et al. (US 2023/0153379 A1), which specifically discloses, with respect to Fig. 9 and Fig. 10, “In block 1002, the transformer neural network 106 generates modified attention information 218 by removing attention values from the original attention information 212 based on the score information 216. In block 804, the transformer neural network 106 performs subsequent operations based on the modified attention information 218. The subsequent operations perform fewer operations by using the modified attention information 218 rather than the original attention information 212” (i.e. tokens identified based on score information have their attention values removed from the modified set of attention values, which is then used to perform subsequent operations). The removed attention values for the identified token based on score information are not part of the modified attention values, meaning they are not used to process subsequent operations, which is why there are fewer operations for the modified attention information compared to the original attention information.
While the removed attention values may have an influence on a final classification result due to their existence in an initial operation, the removed attention values are not actually used for any processing of subsequent layers in the transformer neural network after they have been removed.
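The disagreement above turns on what it means to remove rows of attention values. The sketch below illustrates the general idea described in the quoted passages of Fayyaz (scoring tokens from the classification token's attention row, then keeping only some rows), purely as a reading aid. It is not Fayyaz's implementation: Fayyaz is described as sampling from the scores as a probability distribution, whereas this example swaps in a deterministic top-k selection for reproducibility, and the function name, matrix values, and `keep` parameter are hypothetical.

```python
def prune_attention_rows(attention, keep):
    """Keep the classification-token row plus the `keep` highest-scoring token rows.

    `attention` is a square list-of-lists; row 0 / column 0 belong to the
    classification token (an assumed layout for this sketch).
    """
    scores = attention[0][1:]                      # how much the class token attends to each token
    total = sum(scores)
    probs = [s / total for s in scores]            # normalized score information
    ranked = sorted(range(len(probs)), key=lambda j: probs[j], reverse=True)
    kept_tokens = sorted(ranked[:keep])            # top-k stand-in for sampling
    kept_rows = [0] + [j + 1 for j in kept_tokens]
    modified = [attention[r] for r in kept_rows]   # rows removed, columns left intact
    return modified, kept_rows


attention = [
    [0.10, 0.50, 0.10, 0.30],   # classification token row
    [0.25, 0.25, 0.25, 0.25],
    [0.40, 0.20, 0.20, 0.20],
    [0.10, 0.30, 0.30, 0.30],
]
modified, kept_rows = prune_attention_rows(attention, keep=2)
```

In this toy case the row for the second image token is dropped, so later operations work on three rows instead of four, yet each retained row still carries a column entry for the dropped token. That residual column is the "influence" applicant points to in paragraph [0046], while the dropped row is the reduction in subsequent operations the examiner relies on.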
Applicant relies on the arguments above regarding independent claims 11 and 20 and dependent claims 2-10 and 12-19; therefore, the response above is applicable to those claims.
Conclusion
THIS ACTION IS MADE FINAL. Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a).
A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action. In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any nonprovisional extension fee (37 CFR 1.17(a)) pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action. In no event, however, will the statutory period for reply expire later than SIX MONTHS from the mailing date of this final action.
Any inquiry concerning this communication or earlier communications from the examiner should be directed to BRIAN J HALES whose telephone number is (571)272-0878. The examiner can normally be reached M-F 9:00am - 5:00pm.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Kamran Afshar can be reached at (571) 272-7796. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.

/BRIAN J HALES/ Examiner, Art Unit 2125

/KAMRAN AFSHAR/ Supervisory Patent Examiner, Art Unit 2125
Prosecution Timeline

Jun 15, 2022: Application Filed
Sep 11, 2025: Non-Final Rejection mailed (§101, §103)
Oct 28, 2025: Response Filed
Feb 18, 2026: Final Rejection mailed (§101, §103)
Apr 03, 2026: Response after Non-Final Action
May 15, 2026: Request for Continued Examination
May 18, 2026: Response after Non-Final Action
Precedent Cases

Applications granted by this same examiner with similar technology

17/718,612 (Patent 12572788): WEIGHT CONFIRMATION METHOD FOR AN ANALOG SYNAPTIC DEVICE OF AN ARTIFICIAL NEURAL NETWORK. 3y 11m to grant; granted Mar 10, 2026.
17/304,365 (Patent 12547910): DISTRIBUTING STRUCTURE RISK ASSESSMENT USING INFORMATION DISTRIBUTION STATIONS. 4y 7m to grant; granted Feb 10, 2026.
17/124,018 (Patent 12493796): USING GENERATIVE ADVERSARIAL NETWORKS TO CONSTRUCT REALISTIC COUNTERFACTUAL EXPLANATIONS FOR MACHINE LEARNING MODELS. 4y 11m to grant; granted Dec 09, 2025.
17/566,885 (Patent 12475369): BUILDING AND EXECUTING DEEP LEARNING-BASED DATA PIPELINES. 3y 10m to grant; granted Nov 18, 2025.
17/492,172 (Patent 12450468): PHYSICS AUGMENTED NEURAL NETWORKS CONFIGURED FOR OPERATING IN ENVIRONMENTS THAT MIX ORDER AND CHAOS. 4y 0m to grant; granted Oct 21, 2025.
Study what changed to get past this examiner. Based on 5 most recent grants.
Prosecution Projections

Expected OA Rounds: 2-3
Grant Probability: 78%
With Interview (+31.3%): 99%
Median Time to Grant: 3y 10m (~0m remaining)
PTA Risk: Moderate
Based on 87 resolved cases by this examiner. Grant probability derived from career allowance rate.