Prosecution Insights
Last updated: May 29, 2026
Application No. 17/840,169

Generation and Explanation of Transformer Computation Graph Using Graph Attention Model

Final Rejection §101 §103 §112
Filed: Jun 14, 2022
Examiner: KIM, SEHWAN
Art Unit: 2129
Tech Center: 2100 (Computer Architecture & Software)
Assignee: Microsoft Technology Licensing, LLC
OA Round: 2 (Final)
Grant Probability: 60% (Moderate)
OA Rounds: 3-4
Est. Remaining: 0m
With Interview: 99%

Examiner Intelligence

Career Allowance Rate: 60% (87 granted / 146 resolved; +4.6% vs TC avg)
Interview Lift: +65.9% (strong; resolved cases with interview)
Avg Prosecution: 4y 0m (26 currently pending)
Total Applications: 180 (across all art units)

Statute-Specific Performance

§101: 5.2% (-34.8% vs TC avg)
§103: 87.7% (+47.7% vs TC avg)
§102: 2.2% (-37.8% vs TC avg)
§112: 4.7% (-35.3% vs TC avg)
Black line = Tech Center average estimate • Based on career data from 146 resolved cases

Office Action

§101 §103 §112
Notice of Pre-AIA or AIA Status

The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Examiner’s Note

Providing supporting paragraph(s) for each limitation of amended/new claim(s) in Remarks is strongly requested for clear and definite claim interpretations by Examiner (e.g., to avoid rejections under 35 U.S.C. § 112(a) “Lack of written description”). Applicant can schedule an interview at any stage of the prosecution (e.g., Non-Final, Final, and After-Final) to discuss any issues related to, for example, rejections under 35 U.S.C. § 101 and § 103, for moving toward allowance.

Priority

Acknowledgment is made of applicant's claim for the present application filed on 06/14/2022.

Response to Arguments

Applicant's arguments filed on 01/27/2026 have been fully considered but they are not persuasive.

In Remarks, pp. 8-12, Applicant contends:

Applicant respectfully submits the aforementioned claim limitations provide the technical benefit that the information output by the graph attention model provides an important insight into the behavior of the machine learning model that cannot be determined by merely comparing the output of the model to expected values. Moreover, Applicant respectfully submits one skilled in the art would understand the associated improvement to the functioning of the computer provided by these technical improvements, such as reducing the computing, memory, and/or network resources associated with the development of machine learning models (e.g., the first model). The claimed techniques can be used to analyze the performance of the model and to determine whether the model is working as expected.
Such a framework provides significant insight into the behavior of the model without requiring extensive testing of the model with test data and/or comparing the predictions made by the model with the test data to expected results and without having to expend the associated computing resources (e.g., and computing resources associated therewith in the process) to do so. Examiner’s response: The examiner understands the applicant’s assertion. However, it appears that each processing step is just applying the abstract idea to a general field of endeavor with additional elements. In addition, improvements to technology or technical field are not necessarily reflected in the claims. Thus, the claim does not integrate the judicial exception into a practical application, and the claim does not amount to significantly more than the judicial exception. The examiner understands the applicant’s assertion “the information output by the graph attention model provides an important insight into the behavior of the machine learning model that cannot be determined by merely comparing the output of the model to expected values” and “reducing the computing, memory, and/or network resources associated with the development of machine learning models” and “significant insight into the behavior of the model without requiring extensive testing of the model with test data and/or comparing the predictions made by the model with the test data to expected results and without having to expend the associated computing resources (e.g., and computing resources associated therewith in the process) to do so”. (emphasized with underlines) However, it is not clear how the information output by the graph attention model provides an important insight into the behavior of the machine learning model. 
The claim just recites “the model behavior information identifying which layers of model performed specific tasks associated with generating predictions by the first machine learning model.” Thus, it is not clear if “the information output by the graph attention model” just indicates “which layers of model performed specific tasks” or something else. In addition, the Applicant mentioned “reducing the computing, memory, and/or network resources”, but it is not clear how the recited claim limitations provide the asserted improvements. Basically, it is not clear how the claims reflect the asserted improvements. Currently, it does not appear that the limitations clearly show e.g., improvements in computer technology and improvements to other technical fields. Rather, the improvements in Remarks are about just improving the abstract ideas of the independent claims. It doesn’t seem that the specification and/or the independent claims clearly show how the inventive concept of the claims enables improvements and how they are tied together. The applicant may need to amend the claims to show how the claim languages and improvements are tied together. To find a valid improvement to a technology, MPEP 2106.04(d)(1) says the specification must explain the improvement and that the claim must reflect the disclosed improvement. Furthermore, the improvement should not be merely a consequence of the abstract idea. See MPEP 2106.05(a). An improvement in the abstract idea itself is not an improvement to technology. For at least these reasons, Applicant's arguments are not convincing. In Remarks, pp. 12-15, Applicant contends: Applicant respectfully contends Abnar does not disclose a distance between nodes or distance in any other capacity and thus differs from this limitations recited by the amended independent claims. (Emphasis added). 
Moreover, although the Office Action generally cites to particular portions of Abnar when rejection associated claims using the Abnar reference, the rejection of original claims 3 and 10 include no such notes tying particular portions of Abnar to the limitations of claim 3 and 7 allegedly disclosed by Abnar. For example, the rejection of claims 3 and 4 cite the same exact portions of Abnar, yet the claim 4 rejection ends with an indication of which portions of Abnar allegedly disclose the limitations of claim 4, whereas claim 3 is silent in this respect. Examiner’s response: The relevant claim limitation(s) appear(s) to be wherein the computation graph includes a representation of the pair-wise similarity values as relative distances between nodes representing the plurality of tokens. As noted in the rejections, Abnar teaches wherein the computation graph includes a representation of the pair-wise similarity values as relative distances between nodes representing the plurality of tokens. (Abnar [fig(s) 1-4] [sec(s) Abs] “In this paper, we consider the problem of quantifying this flow of information through self-attention.” [sec(s) 1] “We propose two simple but effective methods to compute attention scores to input tokens (i.e., token attention) at each layer, by taking raw attentions (i.e., embedding attention) of that layer as well as those from the precedent layers. These methods are based on modelling the information flow in the network with a DAG (Directed Acyclic Graph), in which the nodes are input tokens and hidden embeddings, edges are the attentions from the nodes in each layer to those in the previous layer, and the weights of the edges are the attention weights. The first method, attention rollout, assumes that the identities of input tokens are linearly combined through the layers based on the attention weights. 
To adjust attention weights, it rolls out the weights to capture the propagation of information from input tokens to intermediate hidden embeddings. The second method, attention flow, considers the attention graph as a flow network. Using a maximum flow algorithm, it computes maximum flow values, from hidden embeddings (sources) to input tokens (sinks). In both methods, we take the residual connection in the network into account to better model the connections between input tokens and hidden embedding.” [sec(s) 2] “Figure 2 (left) gives raw attention scores of the CLS token over input tokens (x-axis) at different layers (y-axis), which similarly lack an interpretable pattern.” [sec(s) 3] “Given the attention module with residual connection, we compute values in layer $l+1$ as $V_{l+1} = V_l + W_{att}V_l$, where $W_{att}$ is the attention matrix. Thus, we have $V_{l+1} = (W_{att} + I)V_l$. So, to account for residual connections, we add an identity matrix to the attention matrix and re-normalize the weights. This results in $A = 0.5W_{att} + 0.5I$, where $A$ is the raw attention updated by residual connections.” [sec(s) 4] “Figures 2 and 3 show the weights from raw attention, attention rollout and attention flow for the CLS embedding over input tokens (x-axis) in all 6 layers (y-axis) for three examples. The first example is the same as the one in Figure 1”;)

In other words, Abnar teaches an attention matrix which is basically computed by comparing each token to every other token. Thus, Abnar clearly and definitely teaches “pair-wise similarity values” based on an attention matrix.
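For orientation, the following is a minimal sketch of the two quantities discussed in this passage: an attention matrix obtained by comparing each token to every other token, and the residual-adjusted raw attention $A = 0.5W_{att} + 0.5I$ from the quoted Section 3 text. The scaled dot-product scoring and the function names `attention_matrix` and `add_residual` are assumptions made for illustration; neither the Office Action nor the quoted passages specify them.

```python
import math


def softmax(row):
    """Normalize one row of scores so that it sums to 1."""
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention_matrix(queries, keys):
    """Return W_att, where row i holds token i's weights over all tokens j.

    Assumption: scaled dot-product scoring, as in a standard Transformer.
    """
    scale = math.sqrt(len(keys[0]))
    scores = [
        [sum(q * k for q, k in zip(q_vec, k_vec)) / scale for k_vec in keys]
        for q_vec in queries
    ]
    return [softmax(row) for row in scores]


def add_residual(w_att):
    """Apply A = 0.5 * W_att + 0.5 * I from the quoted Section 3 passage."""
    n = len(w_att)
    return [
        [0.5 * w_att[i][j] + (0.5 if i == j else 0.0) for j in range(n)]
        for i in range(n)
    ]
```

Because every row of `attention_matrix` sums to 1, every row of `add_residual` also sums to 1, which matches the re-normalization described in the quote.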
In addition, under a broadest reasonable interpretation (BRI), attention weights may be interpreted as “relative distances”, e.g., based on Section Abstract “We propose two methods for approximating the attention to input tokens given attention weights, attention rollout and attention flow, as post hoc methods when we use attention weights as the relative relevance of the input tokens”, Section 3 “They differ in the assumptions they make about how attention weights in lower layers affect the flow of information to the higher layers and whether to compute the token attentions relative to each other or independently” and Section 4 “In the case of the correctly classified example, we observe that both attention rollout and attention flow assign relatively high weights to both the subject of the verb, “article’ and the attractor, “systems”. For the miss-classified example, both attention rollout and attention flow assign relatively high scores to the “NNS” token which is not the subject of the verb. This can explain the wrong prediction of the model.” Furthermore, the rejection of claim 3 and the rejection of claim 4 cite different portions of Abnar. Please refer to the previous Office Action. Therefore, the applicant’s arguments are not convincing. In Remarks, pp. 12-15, Applicant contends: Even assuming, arguendo, the block matrix and identity matrix of Abnar disclose the block matrix and identity matrices of amended claims 4 and 11 respectively, Applicant respectfully contends Abnar does not disclose the particular configuration of a block matrix having diagonal blocks and off-diagonal blocks and thus differs from amended claims 4 and 11. 
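The configuration at issue in claims 4 and 11 (diagonal blocks of attention weights, off-diagonal identity blocks for forward connections between layers) can be pictured with a small sketch. This is one possible reading only: placing each identity block immediately to the right of its layer's diagonal block is an assumption, and `build_block_matrix` is a hypothetical helper, not something taken from the application or from Abnar.

```python
def build_block_matrix(layer_attentions):
    """Assemble an (L*n) x (L*n) block matrix from L attention matrices.

    Diagonal block (l, l) holds the n x n attention weights of layer l.
    Off-diagonal block (l, l + 1) holds an n x n identity matrix standing
    for the forward connection from layer l to layer l + 1 (assumed layout).
    All other blocks are zero.
    """
    num_layers = len(layer_attentions)
    n = len(layer_attentions[0])
    size = num_layers * n
    block = [[0.0] * size for _ in range(size)]
    for layer, attention in enumerate(layer_attentions):
        offset = layer * n
        # Diagonal block: this layer's attention weights.
        for i in range(n):
            for j in range(n):
                block[offset + i][offset + j] = attention[i][j]
        # Off-diagonal block: identity linking this layer to the next one.
        if layer + 1 < num_layers:
            for i in range(n):
                block[offset + i][offset + n + i] = 1.0
    return block
```

For two layers of two tokens each, the result is a 4 x 4 matrix whose upper-left and lower-right 2 x 2 blocks are the two attention matrices and whose upper-right block is the identity.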
Examiner’s response: The relevant claim limitation(s) appear(s) to be generating a block matrix having diagonal blocks comprising attention weights based on the attention matrices from the plurality of self-attention layers of the first machine learning model and off-diagonal blocks comprising identity matrices representing forward connections between self-attention layers. As noted in the rejections, Abnar teaches generating a block matrix having diagonal blocks comprising attention weights based on the attention matrices from the plurality of self-attention layers of the first machine learning model and off-diagonal blocks comprising identity matrices representing forward connections between self-attention layers. (Abnar [fig(s) 1-4] [sec(s) 1] “We propose two simple but effective methods to compute attention scores to input tokens (i.e., token attention) at each layer, by taking raw attentions (i.e., embedding attention) of that layer as well as those from the precedent layers. These methods are based on modelling the information flow in the network with a DAG (Directed Acyclic Graph), in which the nodes are input tokens and hidden embeddings, edges are the attentions from the nodes in each layer to those in the previous layer, and the weights of the edges are the attention weights. … In both methods, we take the residual connection in the network into account to better model the connections between input tokens and hidden embedding.” [sec(s) 2] “Figure 2 (left) gives raw attention scores of the CLS token over input tokens (x-axis) at different layers (y-axis), which similarly lack an interpretable pattern.” [sec(s) 3] “Hence, to compute attention rollout and attention flow, we augment the attention graph with extra weights to represent residual connections. Given the attention module with residual connection, we compute values in layer $l+1$ as $V_{l+1} = V_l + W_{att}V_l$, where $W_{att}$ is the attention matrix. Thus, we have $V_{l+1} = (W_{att} + I)V_l$.
So, to account for residual connections, we add an identity matrix to the attention matrix and re-normalize the weights. This results in $A = 0.5W_{att} + 0.5I$, where $A$ is the raw attention updated by residual connections.” and “Attention rollout … At the implementation level, to compute the attentions from $l_i$ to $l_j$, we recursively multiply the attention weights matrices in all the layers below.

$$\tilde{A}(l_i) = \begin{cases} A(l_i)\,\tilde{A}(l_{i-1}) & \text{if } i > j \\ A(l_i) & \text{if } i = j \end{cases} \qquad (1)$$

In this equation, $\tilde{A}$ is attention rollout, $A$ is raw attention and the multiplication operation is a matrix multiplication. With this formulation, to compute input attention we set $j = 0$” and “Attention flow … Treating the attention graph as a flow network, where the capacities of the edges are attention weights, using any maximum flow algorithm, we can compute the maximum attention flow from any node in any of the layers to any of the input nodes.” [sec(s) 4] “Figures 2 and 3 show the weights from raw attention, attention rollout and attention flow for the CLS embedding over input tokens (x-axis) in all 6 layers (y-axis) for three examples. The first example is the same as the one in Figure 1”; e.g., figs 2-4 read(s) on “block matrix”. In addition, e.g., “to account for residual connections, we add an identity matrix to the attention matrix” read(s) on “off-diagonal blocks comprising identity matrices representing forward connections between self-attention layers”.)

The examiner understands the applicant’s assertion “Abnar does not disclose the particular configuration of a block matrix having diagonal blocks and off-diagonal blocks.” However, as shown in figs 2-4, under a broadest reasonable interpretation (BRI), the attention maps may be interpreted as a block matrix. In addition, e.g., as shown in figs 2-4, a block matrix may be interpreted as having multiple diagonal blocks.
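The attention rollout recursion described in the quoted Section 3 passage (recursively multiplying the attention matrices of all the layers below) can be sketched as follows. The sketch assumes the per-layer matrices have already been adjusted for residual connections; `matmul` and `attention_rollout` are illustrative names, not code from Abnar or from the application.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    inner = len(b)
    cols = len(b[0])
    return [
        [sum(row[k] * b[k][c] for k in range(inner)) for c in range(cols)]
        for row in a
    ]


def attention_rollout(raw_attentions, j=0):
    """Return the rollout matrices for layers j, j + 1, ..., top layer.

    raw_attentions[i] is the raw attention of layer i. The rollout of
    layer j is its raw attention; each higher layer's rollout is that
    layer's raw attention multiplied by the rollout of the layer below.
    With j = 0 the result expresses attention to the input tokens.
    """
    rollout = [raw_attentions[j]]
    for i in range(j + 1, len(raw_attentions)):
        rollout.append(matmul(raw_attentions[i], rollout[-1]))
    return rollout
```

When each raw attention matrix has rows that sum to 1, each rollout matrix does too, so a row can still be read as a distribution over input tokens.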
Furthermore, as rejected under Claim Rejections - 35 USC § 103, to account for residual connections, an identity matrix is added for each layer. Thus, Abnar clearly and definitely teaches the claimed limitation. Therefore, the applicant’s arguments are not convincing. In Remarks, pp. 12-15, Applicant contends: The limitations of amended claims 7 and 14, however, are not concerned with changing how the attention mechanism of the model works. Instead, amended claims 7 and 14 are concerned with using filtering to reduce the complexity of the computation graph generated based on the behavior of the model being analyzed. The behavior of the model itself is not altered, rather only the representation of that behavior is filtered to focus on the most important connections. Additionally, the lambda is not disclosed as a percentage but rather a negative numerical value (e.g., -3, -4, -6, or -7). Examiner’s response: The relevant claim limitation(s) appear(s) to be filtering the attention weights to exclude weights fall outside a predetermined percentage to generate a sparse version of the computation graph. As noted in the rejections, Abnar teaches [filtering] the attention weights [to exclude weights fall outside a predetermined percentage to generate a sparse version of] the computation graph. (Abnar [fig(s) 1-4] [sec(s) 1] “We propose two simple but effective methods to compute attention scores to input tokens (i.e., token attention) at each layer, by taking raw attentions (i.e., embedding attention) of that layer as well as those from the precedent layers. These methods are based on modelling the information flow in the network with a DAG (Directed Acyclic Graph), in which the nodes are input tokens and hidden embeddings, edges are the attentions from the nodes in each layer to those in the previous layer, and the weights of the edges are the attention weights. 
… In both methods, we take the residual connection in the network into account to better model the connections between input tokens and hidden embedding.” [sec(s) 2] “Figure 2 (left) gives raw attention scores of the CLS token over input tokens (x-axis) at different layers (y-axis), which similarly lack an interpretable pattern.” [sec(s) 3] “Hence, to compute attention rollout and attention flow, we augment the attention graph with extra weights to represent residual connections. Given the attention module with residual connection, we compute values in layer $l+1$ as $V_{l+1} = V_l + W_{att}V_l$, where $W_{att}$ is the attention matrix. Thus, we have $V_{l+1} = (W_{att} + I)V_l$. So, to account for residual connections, we add an identity matrix to the attention matrix and re-normalize the weights. This results in $A = 0.5W_{att} + 0.5I$, where $A$ is the raw attention updated by residual connections.” and “Attention rollout … At the implementation level, to compute the attentions from $l_i$ to $l_j$, we recursively multiply the attention weights matrices in all the layers below.

$$\tilde{A}(l_i) = \begin{cases} A(l_i)\,\tilde{A}(l_{i-1}) & \text{if } i > j \\ A(l_i) & \text{if } i = j \end{cases} \qquad (1)$$

In this equation, $\tilde{A}$ is attention rollout, $A$ is raw attention and the multiplication operation is a matrix multiplication. With this formulation, to compute input attention we set $j = 0$” and “Attention flow … Treating the attention graph as a flow network, where the capacities of the edges are attention weights, using any maximum flow algorithm, we can compute the maximum attention flow from any node in any of the layers to any of the input nodes.” [sec(s) 4] “Figures 2 and 3 show the weights from raw attention, attention rollout and attention flow for the CLS embedding over input tokens (x-axis) in all 6 layers (y-axis) for three examples. The first example is the same as the one in Figure 1”;) and, Cui teaches filtering the attention weights to exclude weights fall outside a predetermined percentage to generate a sparse version of the computation graph.
(Cui [fig(s) 1, 3] [sec(s) 2.2] “By introducing sparsity to refine the attention weight, our sparse self-attention mechanism (SSAM) strengthens the most important relations among different words such as local interactions, and assigns zero probability to those meaningless connections. This enables us to achieve a more expressive representation for the whole input.” [sec(s) 2.3] “In this part, we propose a sparse self-attention fine-tuning model (SSAF). In particular, this finetuning model with BERT is composed of N sparse self-attention layers, where each layer learns a representation by taking the output from the previous layer: [equation image not reproduced in source] where SSAM is adopted to replace the traditional self-attention mechanism, $h^0 = \mathrm{embed}(x)$ denotes the representation for the input sequence $x$ which is the sum of token embeddings and the position embeddings, and LN is the layer normalization operation.” [sec(s) 3.2] “The coefficient λ which controls the sparsity in Equation 4 is set to -3 in SST-1 and SemEval, -4 in SST-2 and SenTube-T, -6 in SenTube-A and SciTail, and -7 in SQuAD. We investigate the influence of different λ settings in the experiment analysis part.”;)

The examiner understands the applicant’s assertion “The behavior of the model itself is not altered, rather only the representation of that behavior is filtered to focus on the most important connections.” However, MPEP 2141.01(a) states “A reference is analogous art to the claimed invention if: (1) the reference is from the same field of endeavor as the claimed invention (even if it addresses a different problem); or (2) the reference is reasonably pertinent to the problem faced by the inventor (even if it is not in the same field of endeavor as the claimed invention)”. That is, a reference still is analogous art to the claimed invention if the reference is reasonably pertinent to the problem faced by the inventor.
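To make the disputed limitation concrete, here is a minimal sketch of one way attention weights could be filtered so that only a predetermined percentage of the largest weights survives, yielding a sparse graph. It is an illustration under stated assumptions, not the applicant's method and not Cui's λ-controlled mechanism: the claim as quoted gives no further detail on the percentage, and `filter_top_fraction` with its `keep_fraction` parameter is hypothetical.

```python
import math


def filter_top_fraction(weights, keep_fraction):
    """Zero out every weight outside the largest keep_fraction of entries.

    weights is a matrix of attention weights (list of rows) and
    keep_fraction is a value in (0, 1], for example 0.1 for the top 10%.
    Ties at the cutoff are all kept, so slightly more than the requested
    fraction may survive.
    """
    if not 0.0 < keep_fraction <= 1.0:
        raise ValueError("keep_fraction must be in (0, 1]")
    flat = sorted((w for row in weights for w in row), reverse=True)
    keep = max(1, math.ceil(keep_fraction * len(flat)))
    threshold = flat[keep - 1]
    return [[w if w >= threshold else 0.0 for w in row] for row in weights]
```

Applied to a computation graph, each zeroed entry corresponds to a dropped edge, while the underlying model and its attention weights are left unchanged.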
Thus, as rejected under Claim Rejections - 35 USC § 103, Abnar and Cui, in combination, still teaches the claim limitation even though the main goal of the reference may be slightly different from that of the present invention. In addition, under a broadest reasonable interpretation (BRI), “different λ settings” still reads on “percentage” since a certain percentage is determined based on the pre-determined settings. Note that the claim just recites “a predetermined percentage” but it does not provide further details on the claimed language. Therefore, the applicant’s arguments are not convincing. Claim Rejections - 35 USC § 112 The following is a quotation of 35 U.S.C. 112(d): (d) REFERENCE IN DEPENDENT FORMS.—Subject to subsection (e), a claim in dependent form shall contain a reference to a claim previously set forth and then specify a further limitation of the subject matter claimed. A claim in dependent form shall be construed to incorporate by reference all the limitations of the claim to which it refers. The following is a quotation of pre-AIA 35 U.S.C. 112, fourth paragraph: Subject to the following paragraph [i.e., the fifth paragraph of pre-AIA 35 U.S.C. 112], a claim in dependent form shall contain a reference to a claim previously set forth and then specify a further limitation of the subject matter claimed. A claim in dependent form shall be construed to incorporate by reference all the limitations of the claim to which it refers. Claim(s) 17 is/are rejected under 35 U.S.C. 112(d) or pre-AIA 35 U.S.C. 112, 4th paragraph, as being of improper dependent form for failing to further limit the subject matter of the claim upon which it depends, or for failing to include all the limitations of the claim upon which it depends. Claim 15 has the same claim language as Claim 17, and thus it does not constitute a further limitation. 
Applicant may cancel the claim(s), amend the claim(s) to place the claim(s) in proper dependent form, rewrite the claim(s) in independent form, or present a sufficient showing that the dependent claim(s) complies with the statutory requirements.

Claim Rejections - 35 USC § 101

35 U.S.C. 101 reads as follows:

Whoever invents or discovers any new and useful process, machine, manufacture, or composition of matter, or any new and useful improvement thereof, may obtain a patent therefor, subject to the conditions and requirements of this title.

Claims 1-2, 4-9, 11-20 are rejected under 35 U.S.C. 101 because the claimed invention is directed to an abstract idea without significantly more.

Regarding claim 1

The claim is rejected under 35 U.S.C. 101 because the claimed invention is directed to an abstract idea without significantly more.

Step 1: The claim recites a system; therefore, it falls into the statutory category of a machine.

Step 2A Prong 1: The limitations of “…: …; and … perform operations comprising: …, …, …; analyzing the attention matrices to generate a computation graph based on the attention matrices, …; and analyzing the computation graph …, …”, as drafted, are a machine that, under its broadest reasonable interpretation, covers performance of the limitation in the mind. That is, nothing in the claim element precludes the step from practically being performed in the mind. For example, the limitations in the context of this claim encompass the user mentally thinking with a physical aid (e.g., pencil and paper). If a claim limitation, under its broadest reasonable interpretation, covers performance of the limitation in the mind but for the recitation of generic computer components, then it falls within the “Mental Processes” grouping of abstract ideas. Accordingly, the claim recites an abstract idea.

Step 2A Prong 2: This judicial exception is not integrated into a practical application.
The claim recites additional elements that are mere instructions to implement an abstract idea on a computer, or merely uses a computer as a tool to perform an abstract idea. See MPEP 2106.05(f). In particular, the claim recites an additional element(s) (“A data processing system comprising: a processor; and a non-transitory machine-readable storage medium storing executable instructions that, when executed, cause the processor to”, “using a second machine learning model”) – using a device and a model to process data. The device and the model in each step are recited at a high-level of generality (i.e., as a generic computer performing a generic computer function of processing data) such that it amounts no more than mere instructions to apply the exception using a generic computer component. Accordingly, these additional elements do not integrate the abstract idea into a practical application because it does not impose any meaningful limits on practicing the abstract idea. The claim is directed to an abstract idea. In particular, the claim recites an additional element(s) (“storing”) – the act of storing data. The claim is adding an insignificant extra-solution activity to the judicial exception – see MPEP 2106.05(g). The act of storing data is recited at a high-level of generality (i.e., as a generic act of storing performing a generic act function of storing data) such that it amounts no more than a mere act to apply the exception using a generic act of storing. Accordingly, this additional element does not integrate the abstract idea into a practical application because it does not impose any meaningful limits on practicing the abstract idea. The claim is directed to an abstract idea. In particular, the claim recites an additional element(s) (“obtaining attention matrices from a first machine learning model”, “receive the computation graph”) – the act of receiving data. 
The claim is adding an insignificant extra-solution activity to the judicial exception – see MPEP 2106.05(g). The act of receiving data is recited at a high-level of generality (i.e., as a generic act of receiving performing a generic act function of receiving data) such that it amounts no more than a mere act to apply the exception using a generic act of receiving. Accordingly, this additional element does not integrate the abstract idea into a practical application because it does not impose any meaningful limits on practicing the abstract idea. The claim is directed to an abstract idea. In particular, the claim recites an additional element(s) (“the first machine learning model having been pretrained”, “the second machine learning model being trained to”). The additional element is recited at such a high level without any details as to how a model is trained such that it amounts to only the idea of a solution or outcome because it fails to recite details of how a solution to a problem is accomplished, and, therefore, represents no more than mere instructions to apply the judicial exception on a computer (see MPEP 2106.05(f)). Accordingly, this additional element does not integrate the abstract idea into a practical application because it does not impose any meaningful limits on practicing the abstract idea. The claim is directed to an abstract idea. 
In particular, the claim recites an additional element (“the first machine learning model including a plurality of self-attention layers, and the attention matrices being associated with the plurality of self-attention layers of the first machine learning model, wherein the attention matrices include pair-wise similarity values for each token of a plurality of tokens of an input to the first machine learning model”, “the computation graph providing a representation of behavior of the first machine learning model across the plurality of self-attention layers, wherein the computation graph includes a representation of the pair-wise similarity values as relative distances between nodes representing the plurality of tokens”, “the model behavior information identifying which layers of the first machine learning model performed specific tasks associated with generating predictions by the first machine learning model”). This is a recitation of a particular type or source of model/data to be used in performing the abstract idea. Limiting the abstract idea to a particular type or source of model/data is an attempt to limit the abstract idea to a particular field of use or technological environment, which does not integrate the abstract idea into a practical application. See MPEP 2106.05(h) In particular, the claim recites an additional element(s) (“to output model behavior information”) – the act of outputting data. The claim is adding an insignificant extra-solution activity to the judicial exception – see MPEP 2106.05(g). The act of outputting data is recited at a high-level of generality (i.e., as a generic act of performing a generic act function of outputting data) such that it amounts no more than a mere act to apply the exception using a generic act of outputting. Accordingly, this additional element does not integrate the abstract idea into a practical application because it does not impose any meaningful limits on practicing the abstract idea. 
The claim is directed to an abstract idea. Step 2B: The claim does not include additional elements that are sufficient to amount to significantly more than the judicial exception. As discussed above, with respect to integration of the abstract idea into a practical application, the additional elements of using a generic computer component to perform each step amount to no more than mere instructions to apply the exception using a generic computer component. Mere instructions to apply an exception using a generic computer component cannot provide an inventive concept. The claim is not patent eligible. MPEP 2106.05(f). As discussed above, the claim recites the additional element(s) of storing data at a high-level of generality and is adding an insignificant extra-solution activity – see MPEP 2106.05(g) – storing data. However, the addition of insignificant extra-solution activity does not amount to an inventive concept, particularly when the activity is well-understood, routine, and conventional. See MPEP 2106.05(d)(II) – “Receiving or transmitting data over a network” or “Storing and retrieving information in memory”. Accordingly, this additional element does not provide an inventive concept and significantly more than the abstract idea. Thus, the claim is not patent eligible. As discussed above, the claim recites the additional element(s) of receiving data at a high-level of generality and is adding an insignificant extra-solution activity – see MPEP 2106.05(g). However, the addition of insignificant extra-solution activity does not amount to an inventive concept, particularly when the activity is well-understood, routine, and conventional. See MPEP 2106.05(d)(II) – “Receiving or transmitting data over a network” or “Storing and retrieving information in memory”. Accordingly, this additional element does not provide an inventive concept and significantly more than the abstract idea. Thus, the claim is not patent eligible. 
The additional elements regarding training are recited at such a high level without any details as to how a model is trained such that it amounts to only the idea of a solution or outcome because it fails to recite details of how a solution to a problem is accomplished, and, therefore, represents no more than mere instructions to apply the judicial exception on a computer (see MPEP 2106.05(f)). Accordingly, this additional element does not amount to significantly more than the abstract idea. The claim is directed to an abstract idea. This is a recitation of a particular type or source of model/data to be used in performing the abstract idea. Limiting the abstract idea to a particular type or source of model/data is an attempt to limit the abstract idea to a particular field of use or technological environment, which does not amount to significantly more than the abstract idea. See MPEP 2106.05(h). As discussed above, the claim recites the additional element(s) of outputting data at a high-level of generality and is adding an insignificant extra-solution activity – see MPEP 2106.05(g). However, the addition of insignificant extra-solution activity does not amount to an inventive concept, particularly when the activity is well-understood, routine, and conventional. See MPEP 2106.05(d)(II) – “Receiving or transmitting data over a network” or “Storing and retrieving information in memory”. Accordingly, this additional element does not provide an inventive concept and significantly more than the abstract idea. Thus, the claim is not patent eligible. Regarding claim 2 The claim is rejected under 35 U.S.C. 101 because the claimed invention is directed to an abstract idea without significantly more. Step 1: The claim recites a system; therefore, it falls into the statutory category of a machine. Step 2A Prong 1: The claim recites the abstract idea identified above regarding claim 1. Step 2A Prong 2: This judicial exception is not integrated into a practical application. 
In particular, the claim recites an additional element (“wherein the first machine learning model is a transformer model, and wherein the plurality of self-attention layers are one or more encoding layers or one or more encoding layers and one or more decoding layers”). This is a recitation of a particular type or source of model/data to be used in performing the abstract idea. Limiting the abstract idea to a particular type or source of model/data is an attempt to limit the abstract idea to a particular field of use or technological environment, which does not integrate the abstract idea into a practical application. See MPEP 2106.05(h) Step 2B: The claim does not include additional elements that are sufficient to amount to significantly more than the judicial exception. This is a recitation of a particular type or source of model/data to be used in performing the abstract idea. Limiting the abstract idea to a particular type or source of model/data is an attempt to limit the abstract idea to a particular field of use or technological environment, which does not amount to significantly more than the abstract idea. See MPEP 2106.05(h). Regarding claim 4 The claim is rejected under 35 U.S.C. 101 because the claimed invention is directed to an abstract idea without significantly more. Step 1: The claim recites a system; therefore, it falls into the statutory category of a machine. Step 2A Prong 1: The limitations of “generating a block matrix having diagonal blocks comprising attention weights based on the attention matrices from the plurality of self-attention layers of the first machine learning model and off-diagonal blocks comprising identity matrices representing forward connections between self-attention layers”, as drafted, are a machine that, under its broadest reasonable interpretation, covers performance of the limitation in the mind. That is, nothing in the claim element precludes the step from practically being performed in the mind. 
For example, the limitations in the context of this claim encompass the user mentally thinking with a physical aid (e.g., pencil and paper). If a claim limitation, under its broadest reasonable interpretation, covers performance of the limitation in the mind but for the recitation of generic computer components, then it falls within the “Mental Processes” grouping of abstract ideas. Accordingly, the claim recites an abstract idea. Step 2A Prong 2: This judicial exception is not integrated into a practical application. In particular, the claim does not recite additional elements. Thus, the claim is directed to an abstract idea. Step 2B: The claim does not include additional elements that are sufficient to amount to significantly more than the judicial exception. Thus, the claim is not patent eligible. Regarding claim 5 The claim is rejected under 35 U.S.C. 101 because the claimed invention is directed to an abstract idea without significantly more. Step 1: The claim recites a system; therefore, it falls into the statutory category of a machine. Step 2A Prong 1: The claim recites the abstract idea identified above regarding claim 4. Step 2A Prong 2: This judicial exception is not integrated into a practical application. In particular, the claim recites an additional element (“wherein a respective forward connection represents a connection between a respective token at a respective self-attention layer and the respective token at a next self-attention layer of the first machine learning model”). This is a recitation of a particular type or source of model/data to be used in performing the abstract idea. Limiting the abstract idea to a particular type or source of model/data is an attempt to limit the abstract idea to a particular field of use or technological environment, which does not integrate the abstract idea into a practical application. 
See MPEP 2106.05(h) Step 2B: The claim does not include additional elements that are sufficient to amount to significantly more than the judicial exception. This is a recitation of a particular type or source of model/data to be used in performing the abstract idea. Limiting the abstract idea to a particular type or source of model/data is an attempt to limit the abstract idea to a particular field of use or technological environment, which does not amount to significantly more than the abstract idea. See MPEP 2106.05(h). Regarding claim 6 The claim is rejected under 35 U.S.C. 101 because the claimed invention is directed to an abstract idea without significantly more. Step 1: The claim recites a system; therefore, it falls into the statutory category of a machine. Step 2A Prong 1: The limitations of “wherein the block matrix is generated by processing the attention matrices associated with each of the plurality of self-attention layers sequentially such that representations of sequential self-attention layers are adjacent on the computation graph”, as drafted, are a machine that, under its broadest reasonable interpretation, covers performance of the limitation in the mind. That is, nothing in the claim element precludes the step from practically being performed in the mind. For example, the limitations in the context of this claim encompass the user mentally thinking with a physical aid (e.g., pencil and paper). If a claim limitation, under its broadest reasonable interpretation, covers performance of the limitation in the mind but for the recitation of generic computer components, then it falls within the “Mental Processes” grouping of abstract ideas. Accordingly, the claim recites an abstract idea. Step 2A Prong 2: This judicial exception is not integrated into a practical application. In particular, the claim does not recite additional elements. Thus, the claim is directed to an abstract idea. 
Step 2B: The claim does not include additional elements that are sufficient to amount to significantly more than the judicial exception. Thus, the claim is not patent eligible. Regarding claim 7 The claim is rejected under 35 U.S.C. 101 because the claimed invention is directed to an abstract idea without significantly more. Step 1: The claim recites a system; therefore, it falls into the statutory category of a machine. Step 2A Prong 1: The limitations of “…: filtering the attention weights to exclude weights that fall outside a predetermined percentage to generate a sparse version of the computation graph”, as drafted, are a machine that, under its broadest reasonable interpretation, covers performance of the limitation in the mind. That is, nothing in the claim element precludes the step from practically being performed in the mind. For example, the limitations in the context of this claim encompass the user mentally thinking with a physical aid (e.g., pencil and paper). If a claim limitation, under its broadest reasonable interpretation, covers performance of the limitation in the mind but for the recitation of generic computer components, then it falls within the “Mental Processes” grouping of abstract ideas. Accordingly, the claim recites an abstract idea. Step 2A Prong 2: This judicial exception is not integrated into a practical application. The claim recites additional elements that are mere instructions to implement an abstract idea on a computer, or merely uses a computer as a tool to perform an abstract idea. See MPEP 2106.05(f). In particular, the claim recites an additional element(s) (“wherein the non-transitory machine-readable storage medium includes instructions configured to cause the processor to perform operations of”) – using a device and a model to process data. 
The device and the model in each step are recited at a high level of generality (i.e., as a generic computer performing a generic computer function of processing data) such that they amount to no more than mere instructions to apply the exception using a generic computer component. Accordingly, these additional elements do not integrate the abstract idea into a practical application because they do not impose any meaningful limits on practicing the abstract idea. The claim is directed to an abstract idea. Step 2B: The claim does not include additional elements that are sufficient to amount to significantly more than the judicial exception. As discussed above, with respect to integration of the abstract idea into a practical application, the additional elements of using a generic computer component to perform each step amount to no more than mere instructions to apply the exception using a generic computer component. Mere instructions to apply an exception using a generic computer component cannot provide an inventive concept. The claim is not patent eligible. MPEP 2106.05(f). Regarding claim 8 The claim recites “A method implemented in a data processing system analyzing performance of a machine learning model, the method comprising:” to perform precisely the system of Claim 1. As performance of an abstract idea on generic computer components (see MPEP 2106.05(f)) cannot integrate the abstract idea into a practical application nor provide significantly more than the abstract idea itself, the claim is rejected for reasons set forth in the rejection of Claim 1. Regarding claim 9 The claim is rejected for the reasons set forth in the rejection of Claim 2 under 35 U.S.C. 101, mutatis mutandis, as reciting an abstract idea without integrating the judicial exception into a practical application nor providing significantly more than the judicial exception. Regarding claim 11 The claim is rejected for the reasons set forth in the rejection of Claim 4 under 35 U.S.C. 
101, mutatis mutandis, as reciting an abstract idea without integrating the judicial exception into a practical application nor providing significantly more than the judicial exception. Regarding claim 12 The claim is rejected for the reasons set forth in the rejection of Claim 5 under 35 U.S.C. 101, mutatis mutandis, as reciting an abstract idea without integrating the judicial exception into a practical application nor providing significantly more than the judicial exception. Regarding claim 13 The claim is rejected for the reasons set forth in the rejection of Claim 6 under 35 U.S.C. 101, mutatis mutandis, as reciting an abstract idea without integrating the judicial exception into a practical application nor providing significantly more than the judicial exception. Regarding claim 14 The claim is rejected for the reasons set forth in the rejection of Claim 7 under 35 U.S.C. 101, mutatis mutandis, as reciting an abstract idea without integrating the judicial exception into a practical application nor providing significantly more than the judicial exception. Regarding claim 15 The claim recites “A machine-readable medium on which are stored instructions that, when executed, cause a processor of a programmable device to perform operations of:” to perform precisely the system of Claim 1. As performance of an abstract idea on generic computer components (see MPEP 2106.05(f)) and “Storing and retrieving information in memory” (see MPEP 2106.05(g) on Insignificant Extra-Solution Activity, and MPEP 2106.05(d) on Well-Understood, Routine, Conventional Activity) cannot integrate the abstract idea into a practical application nor provide significantly more than the abstract idea itself, the claim is rejected for reasons set forth in the rejection of Claim 1. Regarding claim 16 The claim is rejected for the reasons set forth in the rejection of Claim 2 under 35 U.S.C. 
101, mutatis mutandis, as reciting an abstract idea without integrating the judicial exception into a practical application nor providing significantly more than the judicial exception. Regarding claim 17 The claim is rejected for the reasons set forth in the rejection of Claim 3 under 35 U.S.C. 101, mutatis mutandis, as reciting an abstract idea without integrating the judicial exception into a practical application nor providing significantly more than the judicial exception. Regarding claim 18 The claim is rejected for the reasons set forth in the rejection of Claim 4 under 35 U.S.C. 101, mutatis mutandis, as reciting an abstract idea without integrating the judicial exception into a practical application nor providing significantly more than the judicial exception. Regarding claim 19 The claim is rejected for the reasons set forth in the rejection of Claim 5 under 35 U.S.C. 101, mutatis mutandis, as reciting an abstract idea without integrating the judicial exception into a practical application nor providing significantly more than the judicial exception. Regarding claim 20 The claim is rejected for the reasons set forth in the rejection of Claim 6 under 35 U.S.C. 101, mutatis mutandis, as reciting an abstract idea without integrating the judicial exception into a practical application nor providing significantly more than the judicial exception. Claim Rejections - 35 USC § 103 In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status. This application currently names joint inventors. 
In considering patentability of the claims the examiner presumes that the subject matter of the various claims was commonly owned as of the effective filing date of the claimed invention(s) absent any evidence to the contrary. Applicant is advised of the obligation under 37 CFR 1.56 to point out the inventor and effective filing dates of each claim that was not commonly owned as of the effective filing date of the later invention in order for the examiner to consider the applicability of 35 U.S.C. 102(b)(2)(C) for any potential 35 U.S.C. 102(a)(2) prior art against the later invention. The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action: A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made. Claim(s) 1-2, 4-6, 8-9, 11-13, 15-20 is/are rejected under 35 U.S.C. 103 as being unpatentable over Abnar et al. (Quantifying Attention Flow in Transformers) in view of Dalli et al. (US20220198254A1) in view of Yuan et al. 
(Explainability in Graph Neural Networks: A Taxonomic Survey) Regarding claim 1 Abnar teaches A data processing system comprising: [a processor; and A non-transitory machine-readable storage medium storing executable instructions that, when executed, cause the processor to] perform operations comprising: obtaining attention matrices from a first machine learning model, the first machine learning model having been pretrained, the first machine learning model including a plurality of self-attention layers, and the attention matrices being associated with the plurality of self-attention layers of the first machine learning model, wherein the attention matrices include pair-wise similarity values for each token of a plurality of tokens of an input to the first machine learning model; (Abnar [fig(s) 1-2] [sec(s) Abs] “In this paper, we consider the problem of quantifying this flow of information through self-attention.” [sec(s) 1] “We propose two simple but effective methods to compute attention scores to input tokens (i.e., token attention) at each layer, by taking raw attentions (i.e., embedding attention) of that layer as well as those from the precedent layers. These methods are based on modelling the information flow in the network with a DAG (Directed Acyclic Graph), in which the nodes are input tokens and hidden embeddings, edges are the attentions from the nodes in each layer to those in the previous layer, and the weights of the edges are the attention weights. The first method, attention rollout, assumes that the identities of input tokens are linearly combined through the layers based on the attention weights. To adjust attention weights, it rolls out the weights to capture the propagation of information from input tokens to intermediate hidden embeddings. The second method, attention flow, considers the attention graph as a flow network. Using a maximum flow algorithm, it computes maximum flow values, from hidden embeddings (sources) to input tokens (sinks). 
In both methods, we take the residual connection in the network into account to better model the connections between input tokens and hidden embedding. We show that compared to raw attention, the token attentions from attention rollout and attention flow have higher correlations with the importance scores obtained from input gradients as well as blank-out, an input ablation based attribution method. Furthermore, we visualize the token attention weights and demonstrate that they are better approximations of how input tokens contribute to a predicted output, compared to raw attention.” [sec(s) 3] “Given the attention module with residual connection, we compute values in layer $l+1$ as $V_{l+1} = V_l + W_{att} V_l$, where $W_{att}$ is the attention matrix. Thus, we have $V_{l+1} = (W_{att} + I) V_l$. So, to account for residual connections, we add an identity matrix to the attention matrix and re-normalize the weights. This results in $A = 0.5 W_{att} + 0.5 I$, where $A$ is the raw attention updated by residual connections.” [sec(s) 4] “At last, to illustrate the application of attention flow and attention rollout on different tasks and different models, we examine them on two pretrained BERT models. We use the models available at https://github.com/huggingface/transformers.” [sec(s) 2] “Figure 2 (left) gives raw attention scores of the CLS token over input tokens (x-axis) at different layers (y-axis), which similarly lack an interpretable pattern.” [sec(s) 4] “Figures 2 and 3 show the weights from raw attention, attention rollout and attention flow for the CLS embedding over input tokens (x-axis) in all 6 layers (y-axis) for three examples. 
The first example is the same as the one in Figure 1”;) analyzing the attention matrices to generate a computation graph based on the attention matrices, the computation graph providing a representation of behavior of the first machine learning model across the plurality of self-attention layers, wherein the computation graph includes a representation of the pair-wise similarity values as relative distances between nodes representing the plurality of tokens; and (Abnar [fig(s) 3-4] “attention maps” [sec(s) 3] “They differ in the assumptions they make about how attention weights in lower layers affect the flow of information to the higher layers and whether to compute the token attentions relative to each other or independently.” and “Given the attention module with residual connection, we compute values in layer $l+1$ as $V_{l+1} = V_l + W_{att} V_l$, where $W_{att}$ is the attention matrix. Thus, we have $V_{l+1} = (W_{att} + I) V_l$. So, to account for residual connections, we add an identity matrix to the attention matrix and re-normalize the weights. This results in $A = 0.5 W_{att} + 0.5 I$, where $A$ is the raw attention updated by residual connections.” and “Attention rollout … At the implementation level, to compute the attentions from $l_i$ to $l_j$, we recursively multiply the attention weights matrices in all the layers below. $$\tilde{A}(l_i) = \begin{cases} A(l_i)\,\tilde{A}(l_{i-1}) & \text{if } i > j \\ A(l_i) & \text{if } i = j \end{cases} \tag{1}$$ In this equation, $\tilde{A}$ is attention rollout, $A$ is raw attention and the multiplication operation is a matrix multiplication. With this formulation, to compute input attention we set $j = 0$” and “Attention flow … Treating the attention graph as a flow network, where the capacities of the edges are attention weights, using any maximum flow algorithm, we can compute the maximum attention flow from any node in any of the layers to any of the input nodes. We can use this maximum-flow-value as an approximation of the attention to input nodes. 
In attention flow, the weight of a single path is the minimum value of the weights of the edges in the path, instead of the product of the weights.” [sec(s) 2] “Figure 2 (left) gives raw attention scores of the CLS token over input tokens (x-axis) at different layers (y-axis), which similarly lack an interpretable pattern.” [sec(s) 4] “Figures 2 and 3 show the weights from raw attention, attention rollout and attention flow for the CLS embedding over input tokens (x-axis) in all 6 layers (y-axis) for three examples. The first example is the same as the one in Figure 1”;) analyzing the computation graph using [a second machine learning model, the second machine learning model being trained to receive] the computation graph to output model behavior information, the model behavior information identifying which layers of the first machine learning model performed specific tasks associated with generating predictions by the first machine learning model. (Abnar [fig(s) 3-4] “attention maps” [sec(s) 4] “For all cases, the raw attention weights are almost uniform above layer three (discussed before). In the case of the correctly classified example, we observe that both attention rollout and attention flow assign relatively high weights to both the subject of the verb, “article’ and the attractor, “systems”. For the miss-classified example, both attention rollout and attention flow assign relatively high scores to the “NNS” token which is not the subject of the verb. This can explain the wrong prediction of the model. The main difference between attention rollout and attention flow is that attention flow weights are amortized among the set of most attended tokens, as expected. Attention flow can indicate a set of input tokens that are important for the final decision. Thus we do not get sharp distinctions among them. 
On the other hand, attention rollout weights are more focused compared to attention flow weights, which is sensible for the third example but not as much for the second one. … Furthermore, in Figure 4, we show an example of applying these methods to a pre-trained Bert to see how it resolves the pronouns in a sentence. What we do here is to feed the model with a sentence, masking a pronoun. Next, we look at the prediction of the model for the masked word and compare the probabilities assigned to “her” and “his”. Then we look at raw attention, attention rollout and attention flow weights of the embeddings for the masked pronoun at all the layers. In the first example, in Figure 4a, attention rollout and attention flow are consistent with each other and the prediction of the model. Whereas, the final layer of raw attention does not seem to be consistent with the prediction of the models, and it varies a lot across different layers. In the second example, in Figure 4b, only attention flow weights are consistent with the prediction of the model.” [sec(s) 5] “Translating embedding attentions to token attentions can provide us with better explanations about models’ internals.”;) However, Abnar does not appear to explicitly teach: [a processor; and a non-transitory machine-readable storage medium storing executable instructions that, when executed, cause the processor to] perform operations comprising: analyzing the computation graph using [a second machine learning model, the second machine learning model being trained to receive] the computation graph to output model behavior information, the model behavior information identifying which layers of the first machine learning model performed specific tasks associated with generating predictions by the first machine learning model. 
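For orientation, the attention rollout computation described in the quoted passages of Abnar (add an identity matrix for the residual connection, re-normalize to A = 0.5Watt + 0.5I, then multiply the per-layer matrices from the lowest layer upward) can be sketched as follows. This is a minimal illustration in plain Python, not code taken from Abnar or from the application. The function names and the two-token toy matrices are hypothetical, and the sketch assumes one square, row-normalized attention matrix per layer (how attention heads are combined is left open).

```python
def identity(n):
    """Return an n x n identity matrix as nested lists."""
    return [[1.0 if r == c else 0.0 for c in range(n)] for r in range(n)]


def matmul(a, b):
    """Plain matrix product of two nested-list matrices."""
    return [
        [sum(a[r][k] * b[k][c] for k in range(len(b))) for c in range(len(b[0]))]
        for r in range(len(a))
    ]


def add_residual(w_att):
    """A = 0.5 * W_att + 0.5 * I, as in the passage quoted from Abnar, sec. 3."""
    n = len(w_att)
    eye = identity(n)
    return [[0.5 * w_att[r][c] + 0.5 * eye[r][c] for c in range(n)] for r in range(n)]


def attention_rollout(layer_attentions):
    """Roll attention out to the input tokens (j = 0 in the quoted formulation).

    layer_attentions: list of n x n matrices, lowest layer first (assumed layout).
    Returns one rolled-out matrix per layer.
    """
    rollout = []
    running = None
    for w_att in layer_attentions:
        a = add_residual(w_att)
        # Lowest layer: rollout is A itself. Higher layers: A(l_i) times the rollout below.
        running = a if running is None else matmul(a, running)
        rollout.append(running)
    return rollout


# Toy input: two layers, two tokens (values are made up for illustration).
layers = [
    [[0.9, 0.1], [0.2, 0.8]],
    [[0.6, 0.4], [0.3, 0.7]],
]
rollout = attention_rollout(layers)
```

Because each residual-adjusted matrix is row-normalized, every row of a rolled-out matrix still sums to 1, so the entries can be read as shares of attention attributed to each input token.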
Dalli teaches a processor; and a non-transitory machine-readable storage medium storing executable instructions that, when executed, cause the processor to perform operations comprising: (DALLI [par(s) 61] “Further, many of the embodiments described herein are described in terms of sequences of actions to be performed by, for example, elements of a computing device. It should be recognized by those skilled in the art that the various sequences of actions described herein can be performed by specific circuits (e.g., application specific integrated circuits (ASICs)) and/or by program instructions executed by at least one processor. Additionally, the sequence of actions described herein can be embodied entirely within any form of computer-readable storage medium such that execution of the sequence of actions enables the at least one processor to perform the functionality described herein. Furthermore, the sequence of actions described herein can be embodied in a combination of hardware and software. Thus, the various aspects of the present invention may be embodied in a number of different forms, all of which have been contemplated to be within the scope of the claimed subject matter. In addition, for each of the embodiments described herein, the corresponding form of any such embodiment may be described herein as, for example, “a computer configured to” perform the described action.” See also [par(s) 250], [par(s) 73-75] “Explainable architectures utilized in the explainable transformer XTT models include, but are not limited to, eXplainable artificial intelligence (XAI) models, Interpretable Neural Nets (INNs), eXplainable Neural Nets (XNN), eXplainable Spiking Nets (XSN) and eXplainable Memory Nets (XMN) models. A further exemplary embodiment may present methods for detecting bias both globally and locally by harnessing the white-box nature of eXplainable Reinforcement Learning (XRL). 
… Explainable Neural Networks (XNNs) are a new type of Artificial Neural Networks (ANNs) that are inherently interpretable and explainable. The main concept behind an XNN is that it is that the inner network structure is fully interpretable. Interpretability is built within the architecture itself, yet it functions like a standard neural network. This eliminates the need to apply additional techniques or processing for interpreting the result of a neural network. XNNs compute both the answer and its explanation in a single feed-forward step without any need for simulations, iterations, perturbation, etc. XNNs are also designed to be easily implementable both in software but also in hardware efficiently, leading to substantial speed and space improvements.”;) Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the system of Abnar with the processor and the memory of Dalli. One of ordinary skill in the art would have been motivated to combine the references in order to obtain a system that is easily implementable in both software and hardware, leading to substantial speed and space improvements, as Dalli describes for Explainable Neural Networks (XNNs). (Dalli [par(s) 75] “Explainable Neural Networks (XNNs) are a new type of Artificial Neural Networks (ANNs) that are inherently interpretable and explainable. The main concept behind an XNN is that it is that the inner network structure is fully interpretable. Interpretability is built within the architecture itself, yet it functions like a standard neural network. This eliminates the need to apply additional techniques or processing for interpreting the result of a neural network. XNNs compute both the answer and its explanation in a single feed-forward step without any need for simulations, iterations, perturbation, etc. 
XNNs are also designed to be easily implementable both in software but also in hardware efficiently, leading to substantial speed and space improvements.”) However, the combination of Abnar and Dalli does not appear to explicitly teach: analyzing the computation graph using [a second machine learning model, the second machine learning model being trained to receive] the computation graph to output model behavior information, the model behavior information identifying which layers of the first machine learning model performed specific tasks associated with generating predictions by the first machine learning model. Yuan teaches analyzing the computation graph using a second machine learning model, the second machine learning model being trained to receive the computation graph to output model behavior information, the model behavior information identifying which layers of the first machine learning model performed specific tasks associated with generating predictions by the first machine learning model. (Yuan [fig(s) 3] “The general pipeline of the surrogate methods. Given an input graph and its prediction, they first sample a local dataset to represent the relationships around the target data. Then different surrogate methods are applied to fit the local dataset. Note that surrogate models are generally simple and interpretable ML models. Finally, the explanations from the surrogate model can be regarded as the explanations of the original prediction” [sec(s) 4.3] “Recently, several surrogate methods are proposed to explain deep graph models, including GraphLime [59], RelEx [60], and PGM-Explainer [61]. The general pipeline of these methods in shown in Figure 3. To explain the prediction of a given input graph, they first obtain a local dataset containing multiple neighboring data objects and their predictions. Then they fit a interpretable model to learn the local dataset. 
Finally, the explanations from the interpretable model are regarded as the explanations of the original model for the input graph. While these methods share a similar high-level idea, the key difference lies in two aspects: how to obtain the local dataset and what interpretable surrogate model to use.”;) Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the system of Abnar and Dalli with the surrogate model of Yuan. One of ordinary skill in the art would have been motivated to combine the references in order to provide instance-level explanations efficiently and to effectively approximate the predictions of the complex deep model using a simple and interpretable surrogate model. (Yuan [sec(s) 4.3] “Deep models are challenging to explain because of the complex and non-linear relationships between the input space and output predictions. A popular way to provide instance-level explanations for image models is known as surrogate method. The underlying idea is to employ a simple and interpretable surrogate model to approximate the predictions of the complex deep model for the neighboring areas of the input example.”) Regarding claim 2 The combination of Abnar, Dalli, Yuan teaches claim 1. Abnar further teaches wherein the first machine learning model is a transformer model, and wherein the plurality of self-attention layers are one or more encoding layers or one or more encoding layers and one or more decoding layers. (Abnar [fig(s) 1-2] [sec(s) 2] “We train a Transformer encoder, with GPT2 Transformer blocks as described in (Radford et al., 2019; Wolf et al., 2019) (without masking). The model has 6 layers, and 8 heads, with hidden/embedding size of 128. 
Similar to Bert (Devlin et al., 2019) we add a CLS token and use its embedding in the final layer as the input to the classifier.” [sec(s) Abs] “In this paper, we consider the problem of quantifying this flow of information through self-attention.” [sec(s) 1] “We propose two simple but effective methods to compute attention scores to input tokens (i.e., token attention) at each layer, by taking raw attentions (i.e., embedding attention) of that layer as well as those from the precedent layers. These methods are based on modelling the information flow in the network with a DAG (Directed Acyclic Graph), in which the nodes are input tokens and hidden embeddings, edges are the attentions from the nodes in each layer to those in the previous layer, and the weights of the edges are the attention weights. The first method, attention rollout, assumes that the identities of input tokens are linearly combined through the layers based on the attention weights. To adjust attention weights, it rolls out the weights to capture the propagation of information from input tokens to intermediate hidden embeddings. The second method, attention flow, considers the attention graph as a flow network.” [sec(s) 4] “At last, to illustrate the application of attention flow and attention rollout on different tasks and different models, we examine them on two pretrained BERT models. We use the models available at https://github.com/huggingface/transformers.”;) Regarding claim 4 The combination of Abnar, Dalli, Yuan teaches claim 1. Abnar further teaches wherein analyzing the attention matrices to generate the computation graph further comprises: (See claim 1) Abnar further teaches generating a block matrix having diagonal blocks comprising attention weights based on the attention matrices from the plurality of self-attention layers of the first machine learning model and off-diagonal blocks comprising identity matrices representing forward connections between self-attention layers. 
(Abnar [fig(s) 1-4] [sec(s) 1] “We propose two simple but effective methods to compute attention scores to input tokens (i.e., token attention) at each layer, by taking raw attentions (i.e., embedding attention) of that layer as well as those from the precedent layers. These methods are based on modelling the information flow in the network with a DAG (Directed Acyclic Graph), in which the nodes are input tokens and hidden embeddings, edges are the attentions from the nodes in each layer to those in the previous layer, and the weights of the edges are the attention weights. … In both methods, we take the residual connection in the network into account to better model the connections between input tokens and hidden embedding.” [sec(s) 2] “Figure 2 (left) gives raw attention scores of the CLS token over input tokens (x-axis) at different layers (y-axis), which similarly lack an interpretable pattern.” [sec(s) 3] “Hence, to compute attention rollout and attention flow, we augment the attention graph with extra weights to represent residual connections. Given the attention module with residual connection, we compute values in layer l+1 as Vl+1 = Vl+WattVl , where Watt is the attention matrix. Thus, we have Vl+1 = (Watt + I)Vl . So, to account for residual connections, we add an identity matrix to the attention matrix and re-normalize the weights. This results in A = 0.5Watt + 0.5I, where A is the raw attention updated by residual connections.” and “Attention rollout … At the implementation level, to compute the attentions from li to lj, we recursively multiply the attention weights matrices in all the layers below. $$\tilde{A}(l_i) = \begin{cases} A(l_i)\,\tilde{A}(l_{i-1}) & \text{if } i > j \\ A(l_i) & \text{if } i = j \end{cases} \qquad (1)$$ In this equation, Ã is attention rollout, A is raw attention and the multiplication operation is a matrix multiplication.
With this formulation, to compute input attention we set j = 0” and “Attention flow … Treating the attention graph as a flow network, where the capacities of the edges are attention weights, using any maximum flow algorithm, we can compute the maximum attention flow from any node in any of the layers to any of the input nodes.” [sec(s) 4] “Figures 2 and 3 show the weights from raw attention, attention rollout and attention flow for the CLS embedding over input tokens (x-axis) in all 6 layers (y-axis) for three examples. The first example is the same as the one in Figure 1”; e.g., figs 2-4 read(s) on “block matrix”. In addition, e.g., “to account for residual connections, we add an identity matrix to the attention matrix” read(s) on “off-diagonal blocks comprising identity matrices representing forward connections between self-attention layers”.) Regarding claim 5 The combination of Abnar, Dalli, Yuan teaches claim 4. Abnar further teaches wherein a respective forward connection represents a connection between a respective token at a respective self-attention layer and the respective token at a next self-attention layer of the first machine learning model. (Abnar [fig(s) 1-4] [sec(s) 1] “We propose two simple but effective methods to compute attention scores to input tokens (i.e., token attention) at each layer, by taking raw attentions (i.e., embedding attention) of that layer as well as those from the precedent layers. These methods are based on modelling the information flow in the network with a DAG (Directed Acyclic Graph), in which the nodes are input tokens and hidden embeddings, edges are the attentions from the nodes in each layer to those in the previous layer, and the weights of the edges are the attention weights. 
… In both methods, we take the residual connection in the network into account to better model the connections between input tokens and hidden embedding.” [sec(s) 2] “Figure 2 (left) gives raw attention scores of the CLS token over input tokens (x-axis) at different layers (y-axis), which similarly lack an interpretable pattern.” [sec(s) 3] “Hence, to compute attention rollout and attention flow, we augment the attention graph with extra weights to represent residual connections. Given the attention module with residual connection, we compute values in layer l+1 as Vl+1 = Vl+WattVl , where Watt is the attention matrix. Thus, we have Vl+1 = (Watt + I)Vl . So, to account for residual connections, we add an identity matrix to the attention matrix and re-normalize the weights. This results in A = 0.5Watt + 0.5I, where A is the raw attention updated by residual connections.” and “Attention rollout … At the implementation level, to compute the attentions from li to lj, we recursively multiply the attention weights matrices in all the layers below. $$\tilde{A}(l_i) = \begin{cases} A(l_i)\,\tilde{A}(l_{i-1}) & \text{if } i > j \\ A(l_i) & \text{if } i = j \end{cases} \qquad (1)$$ In this equation, Ã is attention rollout, A is raw attention and the multiplication operation is a matrix multiplication. With this formulation, to compute input attention we set j = 0” and “Attention flow … Treating the attention graph as a flow network, where the capacities of the edges are attention weights, using any maximum flow algorithm, we can compute the maximum attention flow from any node in any of the layers to any of the input nodes.” [sec(s) 4] “Figures 2 and 3 show the weights from raw attention, attention rollout and attention flow for the CLS embedding over input tokens (x-axis) in all 6 layers (y-axis) for three examples. The first example is the same as the one in Figure 1”; e.g., “to account for residual connections, we add an identity matrix to the attention matrix” read(s) on “respective forward connection”.)
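The residual adjustment and the rollout recursion quoted from Abnar above can be sketched in a few lines. This is a minimal illustration only, not code from Abnar or from the application: it assumes one already head-averaged attention matrix per layer, row-stochastic, ordered from the lowest layer up, and computes rollout with respect to the input (j = 0). The function names `add_residual` and `attention_rollout` are hypothetical.

```python
def identity(n):
    """Return an n x n identity matrix as nested lists."""
    return [[1.0 if r == c else 0.0 for c in range(n)] for r in range(n)]


def matmul(a, b):
    """Plain matrix product of two nested-list matrices."""
    return [
        [sum(a[r][k] * b[k][c] for k in range(len(b))) for c in range(len(b[0]))]
        for r in range(len(a))
    ]


def add_residual(w_att):
    """A = 0.5 * W_att + 0.5 * I, the residual-adjusted attention quoted above."""
    n = len(w_att)
    eye = identity(n)
    return [[0.5 * w_att[r][c] + 0.5 * eye[r][c] for c in range(n)] for r in range(n)]


def attention_rollout(raw_attentions):
    """Return one rollout matrix per layer, taken with respect to the input (j = 0).

    raw_attentions: list of square matrices, lowest layer first (an assumption
    of this sketch); row r holds the attention of token r over the tokens of
    the layer below.
    """
    rollouts = []
    for w_att in raw_attentions:
        a = add_residual(w_att)
        if not rollouts:
            rollouts.append(a)                        # base case: i = j
        else:
            rollouts.append(matmul(a, rollouts[-1]))  # A(l_i) times rollout of l_{i-1}
    return rollouts


# Two tokens, two layers; the values are made up for illustration.
layers = [
    [[1.0, 0.0], [0.5, 0.5]],
    [[0.0, 1.0], [0.0, 1.0]],
]
rollout = attention_rollout(layers)
```

Because each adjusted matrix is row-stochastic, every rollout row still sums to 1, so a row can be read as a distribution of one embedding's attention over the input tokens.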
Regarding claim 6 The combination of Abnar, Dalli, Yuan teaches claim 4. Abnar further teaches wherein the block matrix is generated by processing the attention matrices associated with each of the plurality of self-attention layers sequentially such that representations of sequential self-attention layers are adjacent on the computation graph. (Abnar [fig(s) 1-4] [sec(s) 1] “We propose two simple but effective methods to compute attention scores to input tokens (i.e., token attention) at each layer, by taking raw attentions (i.e., embedding attention) of that layer as well as those from the precedent layers. These methods are based on modelling the information flow in the network with a DAG (Directed Acyclic Graph), in which the nodes are input tokens and hidden embeddings, edges are the attentions from the nodes in each layer to those in the previous layer, and the weights of the edges are the attention weights. … In both methods, we take the residual connection in the network into account to better model the connections between input tokens and hidden embedding.” [sec(s) 2] “Figure 2 (left) gives raw attention scores of the CLS token over input tokens (x-axis) at different layers (y-axis), which similarly lack an interpretable pattern.” [sec(s) 3] “Hence, to compute attention rollout and attention flow, we augment the attention graph with extra weights to represent residual connections. Given the attention module with residual connection, we compute values in layer l+1 as Vl+1 = Vl+WattVl , where Watt is the attention matrix. Thus, we have Vl+1 = (Watt + I)Vl . So, to account for residual connections, we add an identity matrix to the attention matrix and re-normalize the weights. This results in A = 0.5Watt + 0.5I, where A is the raw attention updated by residual connections.” and “Attention rollout … At the implementation level, to compute the attentions from li to lj, we recursively multiply the attention weights matrices in all the layers below. 
$$\tilde{A}(l_i) = \begin{cases} A(l_i)\,\tilde{A}(l_{i-1}) & \text{if } i > j \\ A(l_i) & \text{if } i = j \end{cases} \qquad (1)$$ In this equation, Ã is attention rollout, A is raw attention and the multiplication operation is a matrix multiplication. With this formulation, to compute input attention we set j = 0” and “Attention flow … Treating the attention graph as a flow network, where the capacities of the edges are attention weights, using any maximum flow algorithm, we can compute the maximum attention flow from any node in any of the layers to any of the input nodes.” [sec(s) 4] “Figures 2 and 3 show the weights from raw attention, attention rollout and attention flow for the CLS embedding over input tokens (x-axis) in all 6 layers (y-axis) for three examples. The first example is the same as the one in Figure 1”; e.g., figs 3-4 read(s) on “representations of sequential self-attention layers”.)
Regarding claim 8 The claim is a method claim corresponding to the system claim 1, and is directed to largely the same subject matter. Thus, it is rejected for the same reasons as given in the rejections of the system claim.
Regarding claim 9 The claim is a method claim corresponding to the system claim 2, and is directed to largely the same subject matter. Thus, it is rejected for the same reasons as given in the rejections of the system claim.
Regarding claim 11 The claim is a method claim corresponding to the system claim 4, and is directed to largely the same subject matter. Thus, it is rejected for the same reasons as given in the rejections of the system claim.
Regarding claim 12 The claim is a method claim corresponding to the system claim 5, and is directed to largely the same subject matter. Thus, it is rejected for the same reasons as given in the rejections of the system claim.
Regarding claim 13 The claim is a method claim corresponding to the system claim 6, and is directed to largely the same subject matter. Thus, it is rejected for the same reasons as given in the rejections of the system claim.
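For orientation, the block-matrix limitation recited in claims 4 to 6 (and the corresponding method claims 11 to 13) can be pictured with a small sketch. This is one possible reading of the claim language, offered as an assumption: layers are processed in order so that sequential layers occupy adjacent diagonal blocks, and the block immediately to the right of each diagonal block is an identity matrix linking token t at one layer to token t at the next. Neither the application nor Abnar is quoted here as specifying this exact layout, and `build_block_matrix` is a hypothetical name.

```python
def build_block_matrix(layer_attentions):
    """Assemble an (L*n) x (L*n) matrix from L per-layer n x n attention matrices.

    Diagonal block k holds the attention weights of layer k. The block directly
    to its right holds an identity matrix standing for the forward connection
    from each token at layer k to the same token at layer k + 1 (assumed layout).
    """
    num_layers = len(layer_attentions)
    n = len(layer_attentions[0])
    size = num_layers * n
    block = [[0.0] * size for _ in range(size)]
    for layer, att in enumerate(layer_attentions):
        base = layer * n
        # Diagonal block: attention weights of this layer.
        for r in range(n):
            for c in range(n):
                block[base + r][base + c] = att[r][c]
        # Off-diagonal block: identity linking this layer to the next one.
        if layer + 1 < num_layers:
            for t in range(n):
                block[base + t][base + n + t] = 1.0
    return block


# Two layers of two tokens each; the values are made up for illustration.
att_layers = [
    [[0.9, 0.1], [0.4, 0.6]],
    [[0.2, 0.8], [0.5, 0.5]],
]
block_matrix = build_block_matrix(att_layers)
```

Read as a weighted adjacency matrix, the result is a single graph over all (layer, token) nodes, which is the kind of object a graph model could take as input.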
Regarding claim 15 The claim is a machine-readable medium claim corresponding to the system claim 1, and is directed to largely the same subject matter. Thus, it is rejected for the same reasons as given in the rejections of the system claim. Regarding claim 16 The claim is a machine-readable medium claim corresponding to the system claim 2, and is directed to largely the same subject matter. Thus, it is rejected for the same reasons as given in the rejections of the system claim. Regarding claim 17 The claim is a machine-readable medium claim corresponding to the system claim 3, and is directed to largely the same subject matter. Thus, it is rejected for the same reasons as given in the rejections of the system claim. Regarding claim 18 The claim is a machine-readable medium claim corresponding to the system claim 4, and is directed to largely the same subject matter. Thus, it is rejected for the same reasons as given in the rejections of the system claim. Regarding claim 19 The claim is a machine-readable medium claim corresponding to the system claim 5, and is directed to largely the same subject matter. Thus, it is rejected for the same reasons as given in the rejections of the system claim. Regarding claim 20 The claim is a machine-readable medium claim corresponding to the system claim 6, and is directed to largely the same subject matter. Thus, it is rejected for the same reasons as given in the rejections of the system claim. Claim(s) 7, 14 is/are rejected under 35 U.S.C. 103 as being unpatentable over Abnar et al. (Quantifying Attention Flow in Transformers) in view of Dalli et al. (US20220198254A1) in view of Yuan et al. (Explainability in Graph Neural Networks: A Taxonomic Survey) in view of Cui et al. (Fine-tune BERT with Sparse Self-Attention Mechanism) Regarding claim 7 The combination of Abnar, Dalli, Yuan teaches claim 4. 
wherein the non-transitory machine-readable storage medium includes instructions configured to cause the processor to perform operations of: (See claim 1) Abnar further teaches [filtering] the attention weights [to exclude weights that fall outside a predetermined percentage to generate a sparse version of] the computation graph. (Abnar [fig(s) 1-4] [sec(s) 1] “We propose two simple but effective methods to compute attention scores to input tokens (i.e., token attention) at each layer, by taking raw attentions (i.e., embedding attention) of that layer as well as those from the precedent layers. These methods are based on modelling the information flow in the network with a DAG (Directed Acyclic Graph), in which the nodes are input tokens and hidden embeddings, edges are the attentions from the nodes in each layer to those in the previous layer, and the weights of the edges are the attention weights. … In both methods, we take the residual connection in the network into account to better model the connections between input tokens and hidden embedding.” [sec(s) 2] “Figure 2 (left) gives raw attention scores of the CLS token over input tokens (x-axis) at different layers (y-axis), which similarly lack an interpretable pattern.” [sec(s) 3] “Hence, to compute attention rollout and attention flow, we augment the attention graph with extra weights to represent residual connections. Given the attention module with residual connection, we compute values in layer l+1 as Vl+1 = Vl+WattVl , where Watt is the attention matrix. Thus, we have Vl+1 = (Watt + I)Vl . So, to account for residual connections, we add an identity matrix to the attention matrix and re-normalize the weights. This results in A = 0.5Watt + 0.5I, where A is the raw attention updated by residual connections.” and “Attention rollout … At the implementation level, to compute the attentions from li to lj, we recursively multiply the attention weights matrices in all the layers below. 
$$\tilde{A}(l_i) = \begin{cases} A(l_i)\,\tilde{A}(l_{i-1}) & \text{if } i > j \\ A(l_i) & \text{if } i = j \end{cases} \qquad (1)$$ In this equation, Ã is attention rollout, A is raw attention and the multiplication operation is a matrix multiplication. With this formulation, to compute input attention we set j = 0” and “Attention flow … Treating the attention graph as a flow network, where the capacities of the edges are attention weights, using any maximum flow algorithm, we can compute the maximum attention flow from any node in any of the layers to any of the input nodes.” [sec(s) 4] “Figures 2 and 3 show the weights from raw attention, attention rollout and attention flow for the CLS embedding over input tokens (x-axis) in all 6 layers (y-axis) for three examples. The first example is the same as the one in Figure 1”;)
However, the combination of Abnar, Dalli, Yuan does not appear to explicitly teach: [filtering] the attention weights [to exclude weights that fall outside a predetermined percentage to generate a sparse version of] the computation graph. Cui teaches filtering the attention weights to exclude weights that fall outside a predetermined percentage to generate a sparse version of the computation graph. (Cui [fig(s) 1, 3] [sec(s) 2.2] “By introducing sparsity to refine the attention weight, our sparse self-attention mechanism (SSAM) strengthens the most important relations among different words such as local interactions, and assigns zero probability to those meaningless connections. This enables us to achieve a more expressive representation for the whole input.” [sec(s) 2.3] “In this part, we propose a sparse self-attention fine-tuning model (SSAF).
In particular, this finetuning model with BERT is composed of N sparse self-attention layers, where each layer learns a representation by taking the output from the previous layer: [equation image: media_image2.png] where SSAM is adopted to replace the traditional self-attention mechanism, h0 = embed(x) denotes the representation for the input sequence x which is the sum of token embeddings and the position embeddings, and LN is the layer normalization operation.” [sec(s) 3.2] “The coefficient λ which controls the sparisty in Equation 4 is set to -3 in SST-1 and SemEval, -4 in SST-2 and SenTube-T, -6 in SenTube-A and SciTail, and -7 in SQuAD. We investigate the influence of different λ settings in the experiment analysis part.”;)
Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the system of Abnar, Dalli, Yuan with the filtering of Cui. One of ordinary skill in the art would have been motivated to combine in order to achieve remarkable performance improvements over other competing models in sentiment analysis, question answering, and natural language inference. (Cui [sec(s) 4.3] “Extensive experiments are conducted on three NLP tasks to investigate the performances of SSAF, which include sentiment analysis, question answering, and natural language inference. Evaluation results on seven public datasets show that the proposed approach achieves remarkable improvements over other competing models.”)
Regarding claim 14 The claim is a method claim corresponding to the system claim 7, and is directed to largely the same subject matter. Thus, it is rejected for the same reasons as given in the rejections of the system claim.
Prior Art The prior art made of record and not relied upon is considered pertinent to applicant's disclosure. Ying et al.
(GNNExplainer: Generating Explanations for Graph Neural Networks) teaches GNNEXPLAINER, the first general, model-agnostic approach for providing interpretable explanations for predictions of any GNN-based model on any graph-based machine learning task. Conclusion Applicant's amendment necessitated the new ground(s) of rejection presented in this Office action. Accordingly, THIS ACTION IS MADE FINAL. See MPEP § 706.07(a). Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a). A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action. In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any nonprovisional extension fee (37 CFR 1.17(a)) pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action. In no event, however, will the statutory period for reply expire later than SIX MONTHS from the mailing date of this final action. Any inquiry concerning this communication or earlier communications from the examiner should be directed to SEHWAN KIM whose telephone number is (571)270-7409. The examiner can normally be reached Mon - Thu 7:00 AM - 5:00 PM. Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice. If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Michael J Huntley can be reached on (303) 297-4307. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300. 
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000. /SEHWAN KIM/Examiner, Art Unit 2129

Prosecution Timeline

Jun 14, 2022: Application Filed
Aug 27, 2025: Non-Final Rejection mailed (§101, §103, §112)
Oct 17, 2025: Interview Requested
Oct 31, 2025: Applicant Interview (Telephonic)
Oct 31, 2025: Examiner Interview Summary
Jan 27, 2026: Response Filed
May 04, 2026: Final Rejection mailed (§101, §103, §112; current)

Precedent Cases

Applications granted by this same examiner with similar technology

Patent 12619853: DECISION-MAKING DEVICE, UNMANNED SYSTEM, DECISION-MAKING METHOD, AND PROGRAM (5y 6m to grant; granted May 05, 2026)
Patent 12619921: PREDICTIVE FOG DATA CENTER MIGRATION (3y 8m to grant; granted May 05, 2026)
Patent 12608592: AUTOMATED ELECTRIC SUBMERSIBLE PUMP (ESP) FAILURE ANALYSIS (3y 4m to grant; granted Apr 21, 2026)
Patent 12602595: SYSTEM AND METHOD OF USING A KNOWLEDGE REPRESENTATION FOR FEATURES IN A MACHINE LEARNING CLASSIFIER (9y 4m to grant; granted Apr 14, 2026)
Patent 12602580: Dataset Dependent Low Rank Decomposition Of Neural Networks (6y 9m to grant; granted Apr 14, 2026)
Based on 5 most recent grants.

Prosecution Projections

Expected OA Rounds: 3-4
Grant Probability: 60%
With Interview: 99% (+65.9%)
Median Time to Grant: 4y 0m (~0m remaining)
PTA Risk: Moderate
Based on 146 resolved cases by this examiner. Grant probability derived from career allowance rate.
