Prosecution Insights
Last updated: April 19, 2026
Application No. 18/335,685

PROPAGATING ATTENTION INFORMATION IN EFFICIENT MACHINE LEARNING MODELS

Non-Final OA: §101, §102, §103
Filed: Jun 15, 2023
Examiner: KIM, SEHWAN
Art Unit: 2129
Tech Center: 2100 — Computer Architecture & Software
Assignee: Qualcomm Incorporated
OA Round: 1 (Non-Final)
Grant Probability: 60% (Moderate)
Estimated OA Rounds: 1-2
Estimated Time to Grant: 4y 1m
Grant Probability With Interview: 99%

Examiner Intelligence

Career Allow Rate: 60% (86 granted / 144 resolved; +4.7% vs TC avg)
Interview Lift: +65.6% (strong) for resolved cases with interview
Typical Timeline: 4y 1m average prosecution; 35 applications currently pending
Career History: 179 total applications across all art units

Statute-Specific Performance

§101: 20.8% (-19.2% vs TC avg)
§103: 46.2% (+6.2% vs TC avg)
§102: 6.3% (-33.7% vs TC avg)
§112: 23.3% (-16.7% vs TC avg)
Comparisons use a Tech Center average estimate • Based on career data from 144 resolved cases

Office Action

§101 §102 §103
Notice of Pre-AIA or AIA Status

The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Examiner’s Note

The Examiner encourages Applicant to schedule an interview to discuss issues related to, for example, the rejections noted below under 35 U.S.C. § 101 and § 103, for moving forward to allowance. Providing supporting paragraph(s) for each limitation of amended/new claim(s) in Remarks is strongly requested for clear and definite claim interpretation by the Examiner. For clarification, claim 35 may be amended to include one or more processors and a non-transitory memory (as in claim 18) to make sure that the claim falls within one of the four statutory categories. Alternatively, it may be amended to a non-transitory computer-readable medium claim, since claim 18 is already a system claim.

Priority

Acknowledgment is made of Applicant's claim for priority to the provisional application filed on 11/11/2022.

Claim Interpretation

The following is a quotation of 35 U.S.C. 112(f):

(f) Element in Claim for a Combination. – An element in a claim for a combination may be expressed as a means or step for performing a specified function without the recital of structure, material, or acts in support thereof, and such claim shall be construed to cover the corresponding structure, material, or acts described in the specification and equivalents thereof.

The following is a quotation of pre-AIA 35 U.S.C. 112, sixth paragraph:

An element in a claim for a combination may be expressed as a means or step for performing a specified function without the recital of structure, material, or acts in support thereof, and such claim shall be construed to cover the corresponding structure, material, or acts described in the specification and equivalents thereof.

The claims in this application are given their broadest reasonable interpretation using the plain meaning of the claim language in light of the specification as it would be understood by one of ordinary skill in the art. The broadest reasonable interpretation of a claim element (also commonly referred to as a claim limitation) is limited by the description in the specification when 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph, is invoked.

As explained in MPEP § 2181, subsection I, claim limitations that meet the following three-prong test will be interpreted under 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph:

(A) the claim limitation uses the term “means” or “step” or a term used as a substitute for “means” that is a generic placeholder (also called a nonce term or a non-structural term having no specific structural meaning) for performing the claimed function;
(B) the term “means” or “step” or the generic placeholder is modified by functional language, typically, but not always, linked by the transition word “for” (e.g., “means for”) or another linking word or phrase, such as “configured to” or “so that”; and
(C) the term “means” or “step” or the generic placeholder is not modified by sufficient structure, material, or acts for performing the claimed function.

Use of the word “means” (or “step”) in a claim with functional language creates a rebuttable presumption that the claim limitation is to be treated in accordance with 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph. The presumption that the claim limitation is interpreted under 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph, is rebutted when the claim limitation recites sufficient structure, material, or acts to entirely perform the recited function.

Absence of the word “means” (or “step”) in a claim creates a rebuttable presumption that the claim limitation is not to be treated in accordance with 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph. The presumption that the claim limitation is not interpreted under 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph, is rebutted when the claim limitation recites function without reciting sufficient structure, material, or acts to entirely perform the recited function.

Claim limitations in this application that use the word “means” (or “step”) are being interpreted under 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph, except as otherwise indicated in an Office action. Conversely, claim limitations in this application that do not use the word “means” (or “step”) are not being interpreted under 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph, except as otherwise indicated in an Office action.

Such claim limitation(s) is/are:

claim 35: “means for generating, using a first transformer block of a plurality of transformer blocks, a first attention propagation output” (Note that fig 9 and pars 126-49 of the present application describe sufficient structure for performing the claimed function.)
claim 35: “the means for generating being configured to process input data for the first transformer block using a first self-attention sub-block of the first transformer block” (Note that fig 9 and pars 126-49 of the present application describe sufficient structure for performing the claimed function.)
claim 35: “means for propagating the first attention propagation output to a second transformer block of the plurality of transformer blocks” (Note that fig 9 and pars 126-49 of the present application describe sufficient structure for performing the claimed function.)
claim 35: “means for generating an output for the second transformer block” (Note that fig 9 and pars 126-49 of the present application describe sufficient structure for performing the claimed function.)
claim 35: “the means for generating the output for the second transformer block being configured to output features for the second transformer block based on the first attention propagation output” (Note that fig 9 and pars 126-49 of the present application describe sufficient structure for performing the claimed function.)

Claim Rejections - 35 USC § 101

35 U.S.C. 101 reads as follows: Whoever invents or discovers any new and useful process, machine, manufacture, or composition of matter, or any new and useful improvement thereof, may obtain a patent therefor, subject to the conditions and requirements of this title.

Claims 1-35 are rejected under 35 U.S.C. 101 because the claimed invention is directed to an abstract idea without significantly more.

Regarding claim 1

The claim is rejected under 35 U.S.C. 101 because the claimed invention is directed to an abstract idea without significantly more.

Step 1: The claim recites a method; therefore, it falls into the statutory category of processes.
Step 2A Prong 1: The limitations of “… method, comprising: generating, …, a first attention propagation output, the generating comprising processing input data for the first transformer block …; propagating the first attention propagation output to a second transformer block of the plurality of transformer blocks; and generating an output for the second transformer block, the generating the output for the second transformer block comprising generating output features for the second transformer block based on the first attention propagation output.”, as drafted, are a process that, under its broadest reasonable interpretation, covers performance of the limitation in the mind. That is, nothing in the claim element precludes the step from practically being performed in the mind. For example, the limitations in the context of this claim encompass the user mentally thinking with a physical aid (e.g., pencil and paper). If a claim limitation, under its broadest reasonable interpretation, covers performance of the limitation in the mind but for the recitation of generic computer components, then it falls within the “Mental Processes” grouping of abstract ideas. Accordingly, the claim recites an abstract idea. Step 2A Prong 2: This judicial exception is not integrated into a practical application. The claim recites additional elements that are mere instructions to implement an abstract idea on a computer, or merely uses a computer as a tool to perform an abstract idea. See MPEP 2106.05(f). In particular, the claim recites an additional element(s) (“A computer-implemented”, “using a first transformer block of a plurality of transformer blocks”, “using a first self-attention sub-block of the first transformer block”) – using a device and/or a model to process data. The device and the model in each step are recited at a high-level of generality (i.e., as a generic computer performing a generic computer function of processing data) such that it amounts no more than mere instructions to apply the exception using a generic computer component. Accordingly, these additional elements do not integrate the abstract idea into a practical application because it does not impose any meaningful limits on practicing the abstract idea. The claim is directed to an abstract idea. Step 2B: The claim does not include additional elements that are sufficient to amount to significantly more than the judicial exception. As discussed above, with respect to integration of the abstract idea into a practical application, the additional elements of using a generic computer component to perform each step amount to no more than mere instructions to apply the exception using a generic computer component. Mere instructions to apply an exception using a generic computer component cannot provide an inventive concept. The claim is not patent eligible. MPEP 2106.05(f). Regarding claim 2 The claim is rejected under 35 U.S.C. 101 because the claimed invention is directed to an abstract idea without significantly more. Step 1: The claim recites a method; therefore, it falls into the statutory category of processes. Step 2A Prong 1: The limitations of “generating the first attention propagation output further comprises generating, …, an attention matrix; and generating the attention matrix comprises processing a query representation and a key representation of the input data for the first transformer block …”, as drafted, are a process that, under its broadest reasonable interpretation, covers performance of the limitation in the mind. 
That is, nothing in the claim element precludes the step from practically being performed in the mind. For example, the limitations in the context of this claim encompass the user mentally thinking with a physical aid (e.g., pencil and paper). If a claim limitation, under its broadest reasonable interpretation, covers performance of the limitation in the mind but for the recitation of generic computer components, then it falls within the “Mental Processes” grouping of abstract ideas. Accordingly, the claim recites an abstract idea. Step 2A Prong 2: This judicial exception is not integrated into a practical application. The claim recites additional elements that are mere instructions to implement an abstract idea on a computer, or merely uses a computer as a tool to perform an abstract idea. See MPEP 2106.05(f). In particular, the claim recites an additional element(s) (“using the first transformer block”, “using the first self-attention sub-block”) – using a device and/or a model to process data. The device and the model in each step are recited at a high-level of generality (i.e., as a generic computer performing a generic computer function of processing data) such that it amounts no more than mere instructions to apply the exception using a generic computer component. Accordingly, these additional elements do not integrate the abstract idea into a practical application because it does not impose any meaningful limits on practicing the abstract idea. The claim is directed to an abstract idea. Step 2B: The claim does not include additional elements that are sufficient to amount to significantly more than the judicial exception. As discussed above, with respect to integration of the abstract idea into a practical application, the additional elements of using a generic computer component to perform each step amount to no more than mere instructions to apply the exception using a generic computer component. Mere instructions to apply an exception using a generic computer component cannot provide an inventive concept. The claim is not patent eligible. MPEP 2106.05(f). Regarding claim 3 The claim is rejected under 35 U.S.C. 101 because the claimed invention is directed to an abstract idea without significantly more. Step 1: The claim recites a method; therefore, it falls into the statutory category of processes. Step 2A Prong 1: The claim recites the abstract idea identified above regarding claim 1. Step 2A Prong 2: This judicial exception is not integrated into a practical application. In particular, the claim recites an additional element (“wherein the first attention propagation output comprises the attention matrix”). This is a recitation of a particular type or source of data to be used in performing the abstract idea. Limiting the abstract idea to a particular type or source of data is an attempt to limit the abstract idea to a particular field of use or technological environment, which does not integrate the abstract idea into a practical application. See MPEP 2106.05(h) Step 2B: The claim does not include additional elements that are sufficient to amount to significantly more than the judicial exception. This is a recitation of a particular type or source of data to be used in performing the abstract idea. Limiting the abstract idea to a particular type or source of data is an attempt to limit the abstract idea to a particular field of use or technological environment, which does not amount to significantly more than the abstract idea. See MPEP 2106.05(h). 
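For orientation on the computation these rejections characterize as a mental process: claims 2 and 3 recite generating an attention matrix from a query representation and a key representation of the block's input, and claim 3 identifies that matrix as the propagated output. In a standard scaled dot-product self-attention sub-block (a generic formulation offered only for context; the application's exact computation may differ), this is

    S = \operatorname{softmax}\left( Q K^{\top} / \sqrt{d_k} \right), \qquad A = S V

where Q, K, and V are the query, key, and value representations of the input, S is the attention matrix (the "first attention propagation output" of claim 3), and A is the block's output features.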
Regarding claim 4 The claim is rejected under 35 U.S.C. 101 because the claimed invention is directed to an abstract idea without significantly more. Step 1: The claim recites a method; therefore, it falls into the statutory category of processes. Step 2A Prong 1: The limitations of “…, …; and generating, …, the output features for the second transformer block based on the first attention propagation output and a value representation of the output for the third transformer block”, as drafted, are a process that, under its broadest reasonable interpretation, covers performance of the limitation in the mind. That is, nothing in the claim element precludes the step from practically being performed in the mind. For example, the limitations in the context of this claim encompass the user mentally thinking with a physical aid (e.g., pencil and paper). If a claim limitation, under its broadest reasonable interpretation, covers performance of the limitation in the mind but for the recitation of generic computer components, then it falls within the “Mental Processes” grouping of abstract ideas. Accordingly, the claim recites an abstract idea. Step 2A Prong 2: This judicial exception is not integrated into a practical application. In particular, the claim recites an additional element(s) (“accessing an output for a third transformer block of the plurality of transformer blocks”) – the act of accessing data. The claim is adding an insignificant extra-solution activity to the judicial exception – see MPEP 2106.05(g). The act of accessing data is recited at a high-level of generality (i.e., as a generic act of performing a generic act function of accessing data) such that it amounts no more than a mere act to apply the exception using a generic act of accessing. Accordingly, this additional element does not integrate the abstract idea into a practical application because it does not impose any meaningful limits on practicing the abstract idea. The claim is directed to an abstract idea. In particular, the claim recites an additional element (“wherein the third transformer block immediately precedes the second transformer block; and”). This is a recitation of a particular type or source of model/data to be used in performing the abstract idea. Limiting the abstract idea to a particular type or source of model/data is an attempt to limit the abstract idea to a particular field of use or technological environment, which does not integrate the abstract idea into a practical application. See MPEP 2106.05(h) The claim recites additional elements that are mere instructions to implement an abstract idea on a computer, or merely uses a computer as a tool to perform an abstract idea. See MPEP 2106.05(f). In particular, the claim recites an additional element(s) (“using a second self-attention sub-block of the second transformer block”) – using a device and/or a model to process data. The device and the model in each step are recited at a high-level of generality (i.e., as a generic computer performing a generic computer function of processing data) such that it amounts no more than mere instructions to apply the exception using a generic computer component. Accordingly, these additional elements do not integrate the abstract idea into a practical application because it does not impose any meaningful limits on practicing the abstract idea. The claim is directed to an abstract idea. 
Step 2B: The claim does not include additional elements that are sufficient to amount to significantly more than the judicial exception. As discussed above, the claim recites the additional element(s) of accessing data at a high-level of generality and is adding an insignificant extra-solution activity – see MPEP 2106.05(g). However, the addition of insignificant extra-solution activity does not amount to an inventive concept, particularly when the activity is well-understood, routine, and conventional. See MPEP 2106.05(d)(II) – “Receiving or transmitting data over a network” or “Storing and retrieving information in memory”. Accordingly, this additional element does not provide an inventive concept and significantly more than the abstract idea. Thus, the claim is not patent eligible. This is a recitation of a particular type or source of model/data to be used in performing the abstract idea. Limiting the abstract idea to a particular type or source of model/data is an attempt to limit the abstract idea to a particular field of use or technological environment, which does not amount to significantly more than the abstract idea. See MPEP 2106.05(h). As discussed above, with respect to integration of the abstract idea into a practical application, the additional elements of using a generic computer component to perform each step amount to no more than mere instructions to apply the exception using a generic computer component. Mere instructions to apply an exception using a generic computer component cannot provide an inventive concept. The claim is not patent eligible. MPEP 2106.05(f). Regarding claim 5 The claim is rejected under 35 U.S.C. 101 because the claimed invention is directed to an abstract idea without significantly more. Step 1: The claim recites a method; therefore, it falls into the statutory category of processes. Step 2A Prong 1: The limitations of “wherein generating the first attention propagation output comprises generating, …, output features for the first transformer block by processing the attention matrix and a value representation of the input data for the first transformer block …”, as drafted, are a process that, under its broadest reasonable interpretation, covers performance of the limitation in the mind. That is, nothing in the claim element precludes the step from practically being performed in the mind. For example, the limitations in the context of this claim encompass the user mentally thinking with a physical aid (e.g., pencil and paper). If a claim limitation, under its broadest reasonable interpretation, covers performance of the limitation in the mind but for the recitation of generic computer components, then it falls within the “Mental Processes” grouping of abstract ideas. Accordingly, the claim recites an abstract idea. Step 2A Prong 2: This judicial exception is not integrated into a practical application. The claim recites additional elements that are mere instructions to implement an abstract idea on a computer, or merely uses a computer as a tool to perform an abstract idea. See MPEP 2106.05(f). In particular, the claim recites an additional element(s) (“using the first transformer block”, “using the first self-attention sub-block”) – using a device and/or a model to process data. 
The device and the model in each step are recited at a high-level of generality (i.e., as a generic computer performing a generic computer function of processing data) such that it amounts no more than mere instructions to apply the exception using a generic computer component. Accordingly, these additional elements do not integrate the abstract idea into a practical application because it does not impose any meaningful limits on practicing the abstract idea. The claim is directed to an abstract idea. Step 2B: The claim does not include additional elements that are sufficient to amount to significantly more than the judicial exception. As discussed above, with respect to integration of the abstract idea into a practical application, the additional elements of using a generic computer component to perform each step amount to no more than mere instructions to apply the exception using a generic computer component. Mere instructions to apply an exception using a generic computer component cannot provide an inventive concept. The claim is not patent eligible. MPEP 2106.05(f). Regarding claim 6 The claim is rejected under 35 U.S.C. 101 because the claimed invention is directed to an abstract idea without significantly more. Step 1: The claim recites a method; therefore, it falls into the statutory category of processes. Step 2A Prong 1: The claim recites the abstract idea identified above regarding claim 1. Step 2A Prong 2: This judicial exception is not integrated into a practical application. In particular, the claim recites an additional element (“wherein the first attention propagation output comprises the output features for the first transformer block”). This is a recitation of a particular type or source of data to be used in performing the abstract idea. Limiting the abstract idea to a particular type or source of data is an attempt to limit the abstract idea to a particular field of use or technological environment, which does not integrate the abstract idea into a practical application. See MPEP 2106.05(h) Step 2B: The claim does not include additional elements that are sufficient to amount to significantly more than the judicial exception. This is a recitation of a particular type or source of data to be used in performing the abstract idea. Limiting the abstract idea to a particular type or source of data is an attempt to limit the abstract idea to a particular field of use or technological environment, which does not amount to significantly more than the abstract idea. See MPEP 2106.05(h). Regarding claim 7 The claim is rejected under 35 U.S.C. 101 because the claimed invention is directed to an abstract idea without significantly more. Step 1: The claim recites a method; therefore, it falls into the statutory category of processes. Step 2A Prong 1: The limitations of “further comprising generating, …, an output for the first transformer block, wherein generating the output for the first transformer block comprises processing output features of the first self-attention sub-block …”, as drafted, are a process that, under its broadest reasonable interpretation, covers performance of the limitation in the mind. That is, nothing in the claim element precludes the step from practically being performed in the mind. For example, the limitations in the context of this claim encompass the user mentally thinking with a physical aid (e.g., pencil and paper). 
If a claim limitation, under its broadest reasonable interpretation, covers performance of the limitation in the mind but for the recitation of generic computer components, then it falls within the “Mental Processes” grouping of abstract ideas. Accordingly, the claim recites an abstract idea. Step 2A Prong 2: This judicial exception is not integrated into a practical application. The claim recites additional elements that are mere instructions to implement an abstract idea on a computer, or merely uses a computer as a tool to perform an abstract idea. See MPEP 2106.05(f). In particular, the claim recites an additional element(s) (“using the first transformer block”, “using a first feedforward sub-block of the first transformer block”) – using a device and/or a model to process data. The device and the model in each step are recited at a high-level of generality (i.e., as a generic computer performing a generic computer function of processing data) such that it amounts no more than mere instructions to apply the exception using a generic computer component. Accordingly, these additional elements do not integrate the abstract idea into a practical application because it does not impose any meaningful limits on practicing the abstract idea. The claim is directed to an abstract idea. Step 2B: The claim does not include additional elements that are sufficient to amount to significantly more than the judicial exception. As discussed above, with respect to integration of the abstract idea into a practical application, the additional elements of using a generic computer component to perform each step amount to no more than mere instructions to apply the exception using a generic computer component. Mere instructions to apply an exception using a generic computer component cannot provide an inventive concept. The claim is not patent eligible. MPEP 2106.05(f). Regarding claim 8 The claim is rejected under 35 U.S.C. 101 because the claimed invention is directed to an abstract idea without significantly more. Step 1: The claim recites a method; therefore, it falls into the statutory category of processes. Step 2A Prong 1: The limitations of “wherein generating the output for the second transformer block comprises processing the output features of the second self-attention sub-block …”, as drafted, are a process that, under its broadest reasonable interpretation, covers performance of the limitation in the mind. That is, nothing in the claim element precludes the step from practically being performed in the mind. For example, the limitations in the context of this claim encompass the user mentally thinking with a physical aid (e.g., pencil and paper). If a claim limitation, under its broadest reasonable interpretation, covers performance of the limitation in the mind but for the recitation of generic computer components, then it falls within the “Mental Processes” grouping of abstract ideas. Accordingly, the claim recites an abstract idea. Step 2A Prong 2: This judicial exception is not integrated into a practical application. The claim recites additional elements that are mere instructions to implement an abstract idea on a computer, or merely uses a computer as a tool to perform an abstract idea. See MPEP 2106.05(f). In particular, the claim recites an additional element(s) (“using a second feedforward sub-block of the second transformer block”) – using a device and/or a model to process data. 
The device and the model in each step are recited at a high-level of generality (i.e., as a generic computer performing a generic computer function of processing data) such that it amounts no more than mere instructions to apply the exception using a generic computer component. Accordingly, these additional elements do not integrate the abstract idea into a practical application because it does not impose any meaningful limits on practicing the abstract idea. The claim is directed to an abstract idea. Step 2B: The claim does not include additional elements that are sufficient to amount to significantly more than the judicial exception. As discussed above, with respect to integration of the abstract idea into a practical application, the additional elements of using a generic computer component to perform each step amount to no more than mere instructions to apply the exception using a generic computer component. Mere instructions to apply an exception using a generic computer component cannot provide an inventive concept. The claim is not patent eligible. MPEP 2106.05(f). Regarding claim 9 The claim is rejected under 35 U.S.C. 101 because the claimed invention is directed to an abstract idea without significantly more. Step 1: The claim recites a method; therefore, it falls into the statutory category of processes. Step 2A Prong 1: The claim recites the abstract idea identified above regarding claim 1. Step 2A Prong 2: This judicial exception is not integrated into a practical application. In particular, the claim recites an additional element (“the first transformer block comprises an encoder block, and the second transformer block comprises a decoder block.”). This is a recitation of a particular type or source of model/data to be used in performing the abstract idea. Limiting the abstract idea to a particular type or source of model/data is an attempt to limit the abstract idea to a particular field of use or technological environment, which does not integrate the abstract idea into a practical application. See MPEP 2106.05(h) Step 2B: The claim does not include additional elements that are sufficient to amount to significantly more than the judicial exception. This is a recitation of a particular type or source of model/data to be used in performing the abstract idea. Limiting the abstract idea to a particular type or source of model/data is an attempt to limit the abstract idea to a particular field of use or technological environment, which does not amount to significantly more than the abstract idea. See MPEP 2106.05(h). Regarding claim 10 The claim is rejected under 35 U.S.C. 101 because the claimed invention is directed to an abstract idea without significantly more. Step 1: The claim recites a method; therefore, it falls into the statutory category of processes. Step 2A Prong 1: The claim recites the abstract idea identified above regarding claim 1. Step 2A Prong 2: This judicial exception is not integrated into a practical application. In particular, the claim recites an additional element (“the plurality of transformer blocks comprises a sequence of transformer blocks, the sequence of transformer blocks comprises one or more initial blocks, a plurality of intermediate blocks, and one or more final blocks, and the plurality of intermediate blocks comprises the first transformer block and the second transformer block”). This is a recitation of a particular type or source of model/data to be used in performing the abstract idea. 
Limiting the abstract idea to a particular type or source of model/data is an attempt to limit the abstract idea to a particular field of use or technological environment, which does not integrate the abstract idea into a practical application. See MPEP 2106.05(h) Step 2B: The claim does not include additional elements that are sufficient to amount to significantly more than the judicial exception. This is a recitation of a particular type or source of model/data to be used in performing the abstract idea. Limiting the abstract idea to a particular type or source of model/data is an attempt to limit the abstract idea to a particular field of use or technological environment, which does not amount to significantly more than the abstract idea. See MPEP 2106.05(h). Regarding claim 11 The claim is rejected under 35 U.S.C. 101 because the claimed invention is directed to an abstract idea without significantly more. Step 1: The claim recites a method; therefore, it falls into the statutory category of processes. Step 2A Prong 1: The limitations of “wherein generating the first attention propagation output comprises processing the input data for the first transformer block using a plurality of window self-attention operations to generate the output features for the first transformer block”, as drafted, are a process that, under its broadest reasonable interpretation, covers performance of the limitation in the mind. That is, nothing in the claim element precludes the step from practically being performed in the mind. For example, the limitations in the context of this claim encompass the user mentally thinking with a physical aid (e.g., pencil and paper). If a claim limitation, under its broadest reasonable interpretation, covers performance of the limitation in the mind but for the recitation of generic computer components, then it falls within the “Mental Processes” grouping of abstract ideas. Accordingly, the claim recites an abstract idea. Step 2A Prong 2: This judicial exception is not integrated into a practical application. In particular, the claim does not recite additional elements. Thus, the claim is directed to an abstract idea. Step 2B: The claim does not include additional elements that are sufficient to amount to significantly more than the judicial exception. Thus, the claim is not patent eligible. Regarding claim 12 The claim is rejected under 35 U.S.C. 101 because the claimed invention is directed to an abstract idea without significantly more. Step 1: The claim recites a method; therefore, it falls into the statutory category of processes. Step 2A Prong 1: The claim recites the abstract idea identified above regarding claim 1. Step 2A Prong 2: This judicial exception is not integrated into a practical application. In particular, the claim recites an additional element (“wherein the first attention propagation output comprises the output features for the first transformer block”). This is a recitation of a particular type or source of model/data to be used in performing the abstract idea. Limiting the abstract idea to a particular type or source of model/data is an attempt to limit the abstract idea to a particular field of use or technological environment, which does not integrate the abstract idea into a practical application. See MPEP 2106.05(h) Step 2B: The claim does not include additional elements that are sufficient to amount to significantly more than the judicial exception. 
This is a recitation of a particular type or source of model/data to be used in performing the abstract idea. Limiting the abstract idea to a particular type or source of model/data is an attempt to limit the abstract idea to a particular field of use or technological environment, which does not amount to significantly more than the abstract idea. See MPEP 2106.05(h). Regarding claim 13 The claim is rejected under 35 U.S.C. 101 because the claimed invention is directed to an abstract idea without significantly more. Step 1: The claim recites a method; therefore, it falls into the statutory category of processes. Step 2A Prong 1: The limitations of “wherein generating the first attention propagation output comprises generating, …, output features for the first transformer block by processing the attention matrix and a value representation of the input data for the first transformer block …”, as drafted, are a process that, under its broadest reasonable interpretation, covers performance of the limitation in the mind. That is, nothing in the claim element precludes the step from practically being performed in the mind. For example, the limitations in the context of this claim encompass the user mentally thinking with a physical aid (e.g., pencil and paper). If a claim limitation, under its broadest reasonable interpretation, covers performance of the limitation in the mind but for the recitation of generic computer components, then it falls within the “Mental Processes” grouping of abstract ideas. Accordingly, the claim recites an abstract idea. Step 2A Prong 2: This judicial exception is not integrated into a practical application. The claim recites additional elements that are mere instructions to implement an abstract idea on a computer, or merely uses a computer as a tool to perform an abstract idea. See MPEP 2106.05(f). In particular, the claim recites an additional element(s) (“using the first transformer block”, “using the first self-attention sub-block”) – using a device and/or a model to process data. The device and the model in each step are recited at a high-level of generality (i.e., as a generic computer performing a generic computer function of processing data) such that it amounts no more than mere instructions to apply the exception using a generic computer component. Accordingly, these additional elements do not integrate the abstract idea into a practical application because it does not impose any meaningful limits on practicing the abstract idea. The claim is directed to an abstract idea. Step 2B: The claim does not include additional elements that are sufficient to amount to significantly more than the judicial exception. As discussed above, with respect to integration of the abstract idea into a practical application, the additional elements of using a generic computer component to perform each step amount to no more than mere instructions to apply the exception using a generic computer component. Mere instructions to apply an exception using a generic computer component cannot provide an inventive concept. The claim is not patent eligible. MPEP 2106.05(f). Regarding claim 14 The claim is rejected under 35 U.S.C. 101 because the claimed invention is directed to an abstract idea without significantly more. Step 1: The claim recites a method; therefore, it falls into the statutory category of processes. 
Step 2A Prong 1: The limitations of “propagating the first attention propagation output to the second transformer block comprises propagating the first attention propagation output using a propagation operation, and the propagation operation comprises transforming the first attention propagation output using an upsampling operation”, as drafted, are a process that, under its broadest reasonable interpretation, covers performance of the limitation in the mind. That is, nothing in the claim element precludes the step from practically being performed in the mind. For example, the limitations in the context of this claim encompass the user mentally thinking with a physical aid (e.g., pencil and paper). If a claim limitation, under its broadest reasonable interpretation, covers performance of the limitation in the mind but for the recitation of generic computer components, then it falls within the “Mental Processes” grouping of abstract ideas. Accordingly, the claim recites an abstract idea. Step 2A Prong 2: This judicial exception is not integrated into a practical application. In particular, the claim does not recite additional elements. Thus, the claim is directed to an abstract idea. Step 2B: The claim does not include additional elements that are sufficient to amount to significantly more than the judicial exception. Thus, the claim is not patent eligible. Regarding claim 15 The claim is rejected under 35 U.S.C. 101 because the claimed invention is directed to an abstract idea without significantly more. Step 1: The claim recites a method; therefore, it falls into the statutory category of processes. Step 2A Prong 1: The limitations of “wherein propagating the first attention propagation output to the second transformer block comprises propagating the first attention propagation output using a propagation operation”, as drafted, are a process that, under its broadest reasonable interpretation, covers performance of the limitation in the mind. That is, nothing in the claim element precludes the step from practically being performed in the mind. For example, the limitations in the context of this claim encompass the user mentally thinking with a physical aid (e.g., pencil and paper). If a claim limitation, under its broadest reasonable interpretation, covers performance of the limitation in the mind but for the recitation of generic computer components, then it falls within the “Mental Processes” grouping of abstract ideas. Accordingly, the claim recites an abstract idea. Step 2A Prong 2: This judicial exception is not integrated into a practical application. In particular, the claim does not recite additional elements. Thus, the claim is directed to an abstract idea. Step 2B: The claim does not include additional elements that are sufficient to amount to significantly more than the judicial exception. Thus, the claim is not patent eligible. Regarding claim 16 The claim is rejected under 35 U.S.C. 101 because the claimed invention is directed to an abstract idea without significantly more. Step 1: The claim recites a method; therefore, it falls into the statutory category of processes. 
Step 2A Prong 1: The limitations of “wherein the propagation operation comprises transforming the first attention propagation output by performing one or more convolution operations on the first attention propagation output”, as drafted, are a process that, under its broadest reasonable interpretation, covers performance of the limitation based on mathematical relationships and/or mathematical formulas or equations and/or mathematical calculations. That is, nothing in the claim element precludes the step from practically being performed based on mathematical relationships and/or mathematical formulas or equations and/or mathematical calculations. If a claim limitation, under its broadest reasonable interpretation, covers performance of the limitation based on mathematical relationships and/or mathematical formulas or equations and/or mathematical calculations, but for the recitation of generic computer components, then it falls within the “Mathematical concepts” grouping of abstract ideas. Accordingly, the claim recites an abstract idea. Step 2A Prong 2: This judicial exception is not integrated into a practical application. In particular, the claim does not recite additional elements. Thus, the claim is directed to an abstract idea. Step 2B: The claim does not include additional elements that are sufficient to amount to significantly more than the judicial exception. Thus, the claim is not patent eligible. Regarding claim 17 The claim is rejected under 35 U.S.C. 101 because the claimed invention is directed to an abstract idea without significantly more. Step 1: The claim recites a method; therefore, it falls into the statutory category of processes. Step 2A Prong 1: The limitations of “wherein, when generating the output features for the second transformer block, … does not compute an attention matrix”, as drafted, are a process that, under its broadest reasonable interpretation, covers performance of the limitation based on mathematical relationships and/or mathematical formulas or equations and/or mathematical calculations. That is, nothing in the claim element precludes the step from practically being performed based on mathematical relationships and/or mathematical formulas or equations and/or mathematical calculations. If a claim limitation, under its broadest reasonable interpretation, covers performance of the limitation based on mathematical relationships and/or mathematical formulas or equations and/or mathematical calculations, but for the recitation of generic computer components, then it falls within the “Mathematical concepts” grouping of abstract ideas. Accordingly, the claim recites an abstract idea. Step 2A Prong 2: This judicial exception is not integrated into a practical application. The claim recites additional elements that are mere instructions to implement an abstract idea on a computer, or merely uses a computer as a tool to perform an abstract idea. See MPEP 2106.05(f). In particular, the claim recites an additional element(s) (“a second self-attention sub-block”) – using a device and/or a model to process data. The device and the model in each step are recited at a high-level of generality (i.e., as a generic computer performing a generic computer function of processing data) such that it amounts no more than mere instructions to apply the exception using a generic computer component. 
Accordingly, these additional elements do not integrate the abstract idea into a practical application because it does not impose any meaningful limits on practicing the abstract idea. The claim is directed to an abstract idea. Step 2B: The claim does not include additional elements that are sufficient to amount to significantly more than the judicial exception. As discussed above, with respect to integration of the abstract idea into a practical application, the additional elements of using a generic computer component to perform each step amount to no more than mere instructions to apply the exception using a generic computer component. Mere instructions to apply an exception using a generic computer component cannot provide an inventive concept. The claim is not patent eligible. MPEP 2106.05(f). Regarding claim 18 The claim recites “A processing system comprising: a memory comprising computer-executable instructions; and one or more processors configured to execute the computer-executable instructions and cause the processing system to perform an operation comprising:” to perform precisely the method of Claim 1. As performance of an abstract idea on generic computer components (see MPEP 2106.05(f)) cannot integrate the abstract idea into a practical application nor provide significantly more than the abstract idea itself, the claim is rejected for reasons set forth in the rejection of Claim 1. Regarding claim 19 The claim is rejected for the reasons set forth in the rejection of Claim 2 under 35 U.S.C. 101, mutatis mutandis, as reciting an abstract idea without integrating the judicial exception into a practical application nor providing significantly more than the judicial exception. Regarding claim 20 The claim is rejected for the reasons set forth in the rejection of Claim 3 under 35 U.S.C. 101, mutatis mutandis, as reciting an abstract idea without integrating the judicial exception into a practical application nor providing significantly more than the judicial exception. Regarding claim 21 The claim is rejected for the reasons set forth in the rejection of Claim 4 under 35 U.S.C. 101, mutatis mutandis, as reciting an abstract idea without integrating the judicial exception into a practical application nor providing significantly more than the judicial exception. Regarding claim 22 The claim is rejected for the reasons set forth in the rejection of Claim 5 under 35 U.S.C. 101, mutatis mutandis, as reciting an abstract idea without integrating the judicial exception into a practical application nor providing significantly more than the judicial exception. Regarding claim 23 The claim is rejected for the reasons set forth in the rejection of Claim 6 under 35 U.S.C. 101, mutatis mutandis, as reciting an abstract idea without integrating the judicial exception into a practical application nor providing significantly more than the judicial exception. Regarding claim 24 The claim is rejected for the reasons set forth in the rejection of Claim 7 under 35 U.S.C. 101, mutatis mutandis, as reciting an abstract idea without integrating the judicial exception into a practical application nor providing significantly more than the judicial exception. Regarding claim 25 The claim is rejected for the reasons set forth in the rejection of Claim 8 under 35 U.S.C. 101, mutatis mutandis, as reciting an abstract idea without integrating the judicial exception into a practical application nor providing significantly more than the judicial exception. 
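As context for claims 14 and 16 addressed above, which recite transforming the first attention propagation output with an upsampling operation and with one or more convolution operations, respectively: a purely hypothetical illustration of that kind of propagation operation (PyTorch-style code assumed for illustration only; it is neither the specification's implementation nor anything disclosed by the cited art) might look like the following.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class PropagationOp(nn.Module):
        # Hypothetical propagation operation: upsample an attention map produced by
        # a lower-resolution transformer block, then refine it with a convolution.
        def __init__(self, channels: int = 1):
            super().__init__()
            self.conv = nn.Conv2d(channels, channels, kernel_size=3, padding=1)

        def forward(self, attn_map: torch.Tensor, target_hw: tuple) -> torch.Tensor:
            # attn_map: (batch, channels, h, w) attention output from the first block
            upsampled = F.interpolate(attn_map, size=target_hw, mode="bilinear",
                                      align_corners=False)
            return self.conv(upsampled)  # transformed output consumed by the next block

The class name and tensor layout are assumptions made only to show the upsample-then-convolve pattern the claims describe.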
Regarding claim 26 The claim is rejected for the reasons set forth in the rejection of Claim 9 under 35 U.S.C. 101, mutatis mutandis, as reciting an abstract idea without integrating the judicial exception into a practical application nor providing significantly more than the judicial exception. Regarding claim 27 The claim is rejected for the reasons set forth in the rejection of Claim 10 under 35 U.S.C. 101, mutatis mutandis, as reciting an abstract idea without integrating the judicial exception into a practical application nor providing significantly more than the judicial exception. Regarding claim 28 The claim is rejected for the reasons set forth in the rejection of Claim 11 under 35 U.S.C. 101, mutatis mutandis, as reciting an abstract idea without integrating the judicial exception into a practical application nor providing significantly more than the judicial exception. Regarding claim 29 The claim is rejected for the reasons set forth in the rejection of Claim 12 under 35 U.S.C. 101, mutatis mutandis, as reciting an abstract idea without integrating the judicial exception into a practical application nor providing significantly more than the judicial exception. Regarding claim 30 The claim is rejected for the reasons set forth in the rejection of Claim 13 under 35 U.S.C. 101, mutatis mutandis, as reciting an abstract idea without integrating the judicial exception into a practical application nor providing significantly more than the judicial exception. Regarding claim 31 The claim is rejected for the reasons set forth in the rejection of Claim 14 under 35 U.S.C. 101, mutatis mutandis, as reciting an abstract idea without integrating the judicial exception into a practical application nor providing significantly more than the judicial exception. Regarding claim 32 The claim is rejected for the reasons set forth in the rejection of Claim 15 under 35 U.S.C. 101, mutatis mutandis, as reciting an abstract idea without integrating the judicial exception into a practical application nor providing significantly more than the judicial exception. Regarding claim 33 The claim is rejected for the reasons set forth in the rejection of Claim 16 under 35 U.S.C. 101, mutatis mutandis, as reciting an abstract idea without integrating the judicial exception into a practical application nor providing significantly more than the judicial exception. Regarding claim 34 The claim is rejected for the reasons set forth in the rejection of Claim 17 under 35 U.S.C. 101, mutatis mutandis, as reciting an abstract idea without integrating the judicial exception into a practical application nor providing significantly more than the judicial exception. Regarding claim 35 The claim recites “A processing system, comprising:” to perform precisely the method of Claim 1. As performance of an abstract idea on generic computer components (see MPEP 2106.05(f)) cannot integrate the abstract idea into a practical application nor provide significantly more than the abstract idea itself, the claim is rejected for reasons set forth in the rejection of Claim 1. Claim Rejections - 35 USC § 102 In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status. 
The following is a quotation of the appropriate paragraphs of 35 U.S.C. 102 that form the basis for the rejections under this section made in this Office action: A person shall be entitled to a patent unless – (a)(1) the claimed invention was patented, described in a printed publication, or in public use, on sale, or otherwise available to the public before the effective filing date of the claimed invention. Claim(s) 1-8, 10, 15, 17-25, 27, 32, 34-35 is/are rejected under 35 U.S.C. 102(a)(1) as being anticipated by Xiao et al. (Sharing Attention Weights for Fast Transformer) Regarding claim 1 Xiao teaches A computer-implemented method, comprising: generating, using a first transformer block of a plurality of transformer blocks, a first attention propagation output, the generating comprising processing input data for the first transformer block using a first self-attention sub-block of the first transformer block; (Xiao [fig(s) 1] [fig(s) 3] “Layer n=m+1”, “Layer m” [sec(s) 1] “In this work, we observe that the attention model shares a similar distribution among layers in weighting different positions of the sequence. This experience lead us to study the issue in another line of research, in which we reduce redundant computation and re-use some of the hidden states in the attention network. We propose a method to share attention weights in adjacent layers (call it shared attention network, or SAN for short). It leads to a model that shares attention computation in the stacked layers vertically. In addition to the new architecture, we develop a joint method to learn sharing policies and MT models simultaneously. As another “bonus”, SAN reduces the memory footprint because some hidden states are kept in the same piece of memory. SAN is simple and can be implemented in a few hours by anyone with an existing kit of Transformer. Also, it is orthogonal to previous methods and is straightforwardly applicable to the variants of Transformer.” [sec(s) 3.2] “An obvious next step is to develop a faster attention model that makes efficient re-use of the states in Eqs. (1) and (2), instead of computing everything on the fly. In this work we present a shared attention network (SAN) to share weight matrix S for adjacent layers. The idea is that we just compute the weight matrix once and reuse it for upper-level layers. Here we describe SAN for both the self-attention and encoder-decoder attention models. • SAN Self-Attention. We define the self-attention weight matrix in layer m as: Sm = s(Qm, Km) (3) where s(·, ·) is the function described in Eq. (1), Qm and Km are the inputs, and Sm is the attention weight for the output.” [sec(s) 4.1] “The bilingual and evaluation data came from three sources. … All models were trained for 100k steps with a mini-batch of 4,096 tokens on machines with 8 Nvidia 1080Ti GPUs”; e.g., “Sm = s(Qm, Km)” of fig 3 read(s) on “first self-attention sub-block”.) propagating the first attention propagation output to a second transformer block of the plurality of transformer blocks; and (Xiao [fig(s) 1] [fig(s) 3] “Layer n=m+1”, “Layer m” [sec(s) 1] “In this work, we observe that the attention model shares a similar distribution among layers in weighting different positions of the sequence.” [sec(s) 3.2] “An obvious next step is to develop a faster attention model that makes efficient re-use of the states in Eqs. (1) and (2), instead of computing everything on the fly. In this work we present a shared attention network (SAN) to share weight matrix S for adjacent layers. 
The idea is that we just compute the weight matrix once and reuse it for upper-level layers. Here we describe SAN for both the self-attention and encoder-decoder attention models. • SAN Self-Attention. We define the self-attention weight matrix in layer m as: Sm = s(Qm, Km) (3) where s(·, ·) is the function described in Eq. (1), Qm and Km are the inputs, and Sm is the attention weight for the output. In SAN, we can share Sm with the layers above m, like this Sm+i = s(Qm, Km) (4) for i ∈ [1, π − 1] where π indicates how many layers share the same attention weights. For example, in a 6-layer decoder, we can share the self-attention weights for every two layers (π = 2), or share the weights for the first two layers (π1 = 2) and let the remaining 4 layers use another weights (π2 = 4).”;) generating an output for the second transformer block, the generating the output for the second transformer block comprising generating output features for the second transformer block based on the first attention propagation output. (Xiao [fig(s) 1] [fig(s) 3] “Layer n=m+1”, “Layer m”, “An=Sn·V” [sec(s) 1] “In this work, we observe that the attention model shares a similar distribution among layers in weighting different positions of the sequence.” [sec(s) 3.2] “An obvious next step is to develop a faster attention model that makes efficient re-use of the states in Eqs. (1) and (2), instead of computing everything on the fly. In this work we present a shared attention network (SAN) to share weight matrix S for adjacent layers. The idea is that we just compute the weight matrix once and reuse it for upper-level layers. Here we describe SAN for both the self-attention and encoder-decoder attention models. • SAN Self-Attention. We define the self-attention weight matrix in layer m as: Sm = s(Qm, Km) (3) where s(·, ·) is the function described in Eq. (1), Qm and Km are the inputs, and Sm is the attention weight for the output. In SAN, we can share Sm with the layers above m, like this Sm+i = s(Qm, Km) (4) for i ∈ [1, π − 1] where π indicates how many layers share the same attention weights. For example, in a 6-layer decoder, we can share the self-attention weights for every two layers (π = 2), or share the weights for the first two layers (π1 = 2) and let the remaining 4 layers use another weights (π2 = 4).” [sec(s) 2] “Note that S is essentially a weight (or scalar) matrix where every column represents a distribution. The output of self-attention is simply defined as the weighted sum of values: A = S · V (2) Here Q, K and V are generated from the same source with a linear transformation. The self-attention result is then fed into a fully connected feed-forward network (FFN).”;) Regarding claim 2 Xiao teaches claim 1. Xiao further teaches generating the first attention propagation output further comprises generating, using the first transformer block, an attention matrix; and (Xiao [fig(s) 1] [fig(s) 3] “Layer n=m+1”, “Layer m” [sec(s) 1] “In this work, we observe that the attention model shares a similar distribution among layers in weighting different positions of the sequence. This experience lead us to study the issue in another line of research, in which we reduce redundant computation and re-use some of the hidden states in the attention network. We propose a method to share attention weights in adjacent layers (call it shared attention network, or SAN for short). It leads to a model that shares attention computation in the stacked layers vertically. 
In addition to the new architecture, we develop a joint method to learn sharing policies and MT models simultaneously. As another “bonus”, SAN reduces the memory footprint because some hidden states are kept in the same piece of memory. SAN is simple and can be implemented in a few hours by anyone with an existing kit of Transformer. Also, it is orthogonal to previous methods and is straightforwardly applicable to the variants of Transformer.” [sec(s) 3.2] “An obvious next step is to develop a faster attention model that makes efficient re-use of the states in Eqs. (1) and (2), instead of computing everything on the fly. In this work we present a shared attention network (SAN) to share weight matrix S for adjacent layers. The idea is that we just compute the weight matrix once and reuse it for upper-level layers. Here we describe SAN for both the self-attention and encoder-decoder attention models. • SAN Self-Attention. We define the self-attention weight matrix in layer m as: Sm = s(Qm, Km) (3) where s(·, ·) is the function described in Eq. (1), Qm and Km are the inputs, and Sm is the attention weight for the output.” [sec(s) 4.1] “The bilingual and evaluation data came from three sources.”;) generating the attention matrix comprises processing a query representation and a key representation of the input data for the first transformer block using the first self-attention sub-block. (Xiao [fig(s) 1] [fig(s) 3] “Layer n=m+1”, “Layer m” [sec(s) 1] “In this work, we observe that the attention model shares a similar distribution among layers in weighting different positions of the sequence. … SAN is simple and can be implemented in a few hours by anyone with an existing kit of Transformer. Also, it is orthogonal to previous methods and is straightforwardly applicable to the variants of Transformer.” [sec(s) 3.2] “An obvious next step is to develop a faster attention model that makes efficient re-use of the states in Eqs. (1) and (2), instead of computing everything on the fly. In this work we present a shared attention network (SAN) to share weight matrix S for adjacent layers. The idea is that we just compute the weight matrix once and reuse it for upper-level layers. Here we describe SAN for both the self-attention and encoder-decoder attention models. • SAN Self-Attention. We define the self-attention weight matrix in layer m as: Sm = s(Qm, Km) (3) where s(·, ·) is the function described in Eq. (1), Qm and Km are the inputs, and Sm is the attention weight for the output.” [sec(s) 4.1] “The bilingual and evaluation data came from three sources.”;) Regarding claim 3 Xiao teaches claim 2. Xiao further teaches wherein the first attention propagation output comprises the attention matrix. (Xiao [fig(s) 1] [fig(s) 3] “Layer n=m+1”, “Layer m” [sec(s) 1] “In this work, we observe that the attention model shares a similar distribution among layers in weighting different positions of the sequence. … SAN is simple and can be implemented in a few hours by anyone with an existing kit of Transformer. Also, it is orthogonal to previous methods and is straightforwardly applicable to the variants of Transformer.” [sec(s) 3.2] “An obvious next step is to develop a faster attention model that makes efficient re-use of the states in Eqs. (1) and (2), instead of computing everything on the fly. In this work we present a shared attention network (SAN) to share weight matrix S for adjacent layers. The idea is that we just compute the weight matrix once and reuse it for upper-level layers. 
Here we describe SAN for both the self-attention and encoder-decoder attention models. • SAN Self-Attention. We define the self-attention weight matrix in layer m as: Sm = s(Qm, Km) (3) where s(·, ·) is the function described in Eq. (1), Qm and Km are the inputs, and Sm is the attention weight for the output.” [sec(s) 4.1] “The bilingual and evaluation data came from three sources.”;) Regarding claim 4 Xiao teaches claim 3. wherein generating the output features for the second transformer block further comprises: (See claim 1) Xiao further teaches accessing an output for a third transformer block of the plurality of transformer blocks, wherein the third transformer block immediately precedes the second transformer block; and (Xiao [fig(s) 1] [fig(s) 3] “Layer n=m+1”, “Layer m”, “An=Sn·V” [sec(s) 1] “In this work, we observe that the attention model shares a similar distribution among layers in weighting different positions of the sequence.” [sec(s) 3.2] “An obvious next step is to develop a faster attention model that makes efficient re-use of the states in Eqs. (1) and (2), instead of computing everything on the fly. In this work we present a shared attention network (SAN) to share weight matrix S for adjacent layers. The idea is that we just compute the weight matrix once and reuse it for upper-level layers. Here we describe SAN for both the self-attention and encoder-decoder attention models. • SAN Self-Attention. We define the self-attention weight matrix in layer m as: Sm = s(Qm, Km) (3) where s(·, ·) is the function described in Eq. (1), Qm and Km are the inputs, and Sm is the attention weight for the output. In SAN, we can share Sm with the layers above m, like this Sm+i = s(Qm, Km) (4) for i ∈ [1, π − 1] where π indicates how many layers share the same attention weights. For example, in a 6-layer decoder, we can share the self-attention weights for every two layers (π = 2), or share the weights for the first two layers (π1 = 2) and let the remaining 4 layers use another weights (π2 = 4).” [sec(s) 2] “Note that S is essentially a weight (or scalar) matrix where every column represents a distribution. The output of self-attention is simply defined as the weighted sum of values: A = S · V (2) Here Q, K and V are generated from the same source with a linear transformation. The self-attention result is then fed into a fully connected feed-forward network (FFN).”;) generating, using a second self-attention sub-block of the second transformer block, the output features for the second transformer block based on the first attention propagation output and a value representation of the output for the third transformer block. (Xiao [fig(s) 1] [fig(s) 3] “Layer n=m+1”, “Layer m”, “An=Sn·V” [sec(s) 1] “In this work, we observe that the attention model shares a similar distribution among layers in weighting different positions of the sequence.” [sec(s) 3.2] “An obvious next step is to develop a faster attention model that makes efficient re-use of the states in Eqs. (1) and (2), instead of computing everything on the fly. In this work we present a shared attention network (SAN) to share weight matrix S for adjacent layers. The idea is that we just compute the weight matrix once and reuse it for upper-level layers. Here we describe SAN for both the self-attention and encoder-decoder attention models. • SAN Self-Attention. We define the self-attention weight matrix in layer m as: Sm = s(Qm, Km) (3) where s(·, ·) is the function described in Eq. 
(1), Qm and Km are the inputs, and Sm is the attention weight for the output. In SAN, we can share Sm with the layers above m, like this Sm+i = s(Qm, Km) (4) for i ∈ [1, π − 1] where π indicates how many layers share the same attention weights. For example, in a 6-layer decoder, we can share the self-attention weights for every two layers (π = 2), or share the weights for the first two layers (π1 = 2) and let the remaining 4 layers use another weights (π2 = 4).” [sec(s) 2] “Note that S is essentially a weight (or scalar) matrix where every column represents a distribution. The output of self-attention is simply defined as the weighted sum of values: A = S · V (2) Here Q, K and V are generated from the same source with a linear transformation. The self-attention result is then fed into a fully connected feed-forward network (FFN).”;) Regarding claim 5 Xiao teaches claim 2. Xiao further teaches wherein generating the first attention propagation output comprises generating, using the first transformer block, output features for the first transformer block by processing the attention matrix and a value representation of the input data for the first transformer block using the first self-attention sub-block. (Xiao [fig(s) 1] [fig(s) 3] “Layer n=m+1”, “Layer m”, “An=Sn·V” [sec(s) 1] “In this work, we observe that the attention model shares a similar distribution among layers in weighting different positions of the sequence.” [sec(s) 3.2] “An obvious next step is to develop a faster attention model that makes efficient re-use of the states in Eqs. (1) and (2), instead of computing everything on the fly. In this work we present a shared attention network (SAN) to share weight matrix S for adjacent layers. The idea is that we just compute the weight matrix once and reuse it for upper-level layers. Here we describe SAN for both the self-attention and encoder-decoder attention models. • SAN Self-Attention. We define the self-attention weight matrix in layer m as: Sm = s(Qm, Km) (3) where s(·, ·) is the function described in Eq. (1), Qm and Km are the inputs, and Sm is the attention weight for the output. In SAN, we can share Sm with the layers above m, like this Sm+i = s(Qm, Km) (4) for i ∈ [1, π − 1] where π indicates how many layers share the same attention weights. For example, in a 6-layer decoder, we can share the self-attention weights for every two layers (π = 2), or share the weights for the first two layers (π1 = 2) and let the remaining 4 layers use another weights (π2 = 4).” [sec(s) 2] “Note that S is essentially a weight (or scalar) matrix where every column represents a distribution. The output of self-attention is simply defined as the weighted sum of values: A = S · V (2) Here Q, K and V are generated from the same source with a linear transformation. The self-attention result is then fed into a fully connected feed-forward network (FFN).” [sec(s) 4.1] “The bilingual and evaluation data came from three sources.”; e.g., “Sm = s(Qm, Km)” of fig 3 read(s) on “first self-attention sub-block”.) Regarding claim 6 Xiao teaches claim 5. Xiao further teaches wherein the first attention propagation output comprises the output features for the first transformer block. 
(Xiao [fig(s) 1] [fig(s) 3] “Layer n=m+1”, “Layer m”, “Am=Sm·V” [sec(s) 1] “In this work, we observe that the attention model shares a similar distribution among layers in weighting different positions of the sequence.” [sec(s) 3.2] “An obvious next step is to develop a faster attention model that makes efficient re-use of the states in Eqs. (1) and (2), instead of computing everything on the fly. In this work we present a shared attention network (SAN) to share weight matrix S for adjacent layers. The idea is that we just compute the weight matrix once and reuse it for upper-level layers. Here we describe SAN for both the self-attention and encoder-decoder attention models. • SAN Self-Attention. We define the self-attention weight matrix in layer m as: Sm = s(Qm, Km) (3) where s(·, ·) is the function described in Eq. (1), Qm and Km are the inputs, and Sm is the attention weight for the output. In SAN, we can share Sm with the layers above m, like this Sm+i = s(Qm, Km) (4) for i ∈ [1, π − 1] where π indicates how many layers share the same attention weights. For example, in a 6-layer decoder, we can share the self-attention weights for every two layers (π = 2), or share the weights for the first two layers (π1 = 2) and let the remaining 4 layers use another weights (π2 = 4).” [sec(s) 2] “Note that S is essentially a weight (or scalar) matrix where every column represents a distribution. The output of self-attention is simply defined as the weighted sum of values: A = S · V (2) Here Q, K and V are generated from the same source with a linear transformation. The self-attention result is then fed into a fully connected feed-forward network (FFN).” [sec(s) 4.1] “The bilingual and evaluation data came from three sources.”; e.g., “Am” read(s) on “output features”.) Regarding claim 7 Xiao teaches claim 1. Xiao further teaches further comprising generating, using the first transformer block, an output for the first transformer block, wherein generating the output for the first transformer block comprises processing output features of the first self-attention sub-block using a first feedforward sub-block of the first transformer block. (Xiao [fig(s) 1] [fig(s) 3] “Layer n=m+1”, “Layer m”, “Am=Sm·V” [sec(s) 1] “In this work, we observe that the attention model shares a similar distribution among layers in weighting different positions of the sequence.” [sec(s) 3.2] “An obvious next step is to develop a faster attention model that makes efficient re-use of the states in Eqs. (1) and (2), instead of computing everything on the fly. In this work we present a shared attention network (SAN) to share weight matrix S for adjacent layers. The idea is that we just compute the weight matrix once and reuse it for upper-level layers. Here we describe SAN for both the self-attention and encoder-decoder attention models. • SAN Self-Attention. We define the self-attention weight matrix in layer m as: Sm = s(Qm, Km) (3) where s(·, ·) is the function described in Eq. (1), Qm and Km are the inputs, and Sm is the attention weight for the output. In SAN, we can share Sm with the layers above m, like this Sm+i = s(Qm, Km) (4) for i ∈ [1, π − 1] where π indicates how many layers share the same attention weights. For example, in a 6-layer decoder, we can share the self-attention weights for every two layers (π = 2), or share the weights for the first two layers (π1 = 2) and let the remaining 4 layers use another weights (π2 = 4).” [sec(s) 2] “The Transformer system follows the popular encoder-decoder paradigm. 
On the encoder side, there are a number of identical stacked layers. Each of them is composed of a self-attention sub-layer and a feed-forward sub-layer. … Note that S is essentially a weight (or scalar) matrix where every column represents a distribution. The output of self-attention is simply defined as the weighted sum of values: A = S · V (2) Here Q, K and V are generated from the same source with a linear transformation. The self-attention result is then fed into a fully connected feed-forward network (FFN).” [sec(s) 4.1] “The bilingual and evaluation data came from three sources.”; e.g., “FFN” read(s) on “feedforward sub-block”.) Regarding claim 8 Xiao teaches claim 1. Xiao further teaches wherein generating the output for the second transformer block comprises processing the output features of the second self-attention sub-block using a second feedforward sub-block of the second transformer block. (Xiao [fig(s) 1] [fig(s) 3] “Layer n=m+1”, “Layer m”, “Am=Sm·V” [sec(s) 1] “In this work, we observe that the attention model shares a similar distribution among layers in weighting different positions of the sequence.” [sec(s) 3.2] “An obvious next step is to develop a faster attention model that makes efficient re-use of the states in Eqs. (1) and (2), instead of computing everything on the fly. In this work we present a shared attention network (SAN) to share weight matrix S for adjacent layers. The idea is that we just compute the weight matrix once and reuse it for upper-level layers. Here we describe SAN for both the self-attention and encoder-decoder attention models. • SAN Self-Attention. We define the self-attention weight matrix in layer m as: Sm = s(Qm, Km) (3) where s(·, ·) is the function described in Eq. (1), Qm and Km are the inputs, and Sm is the attention weight for the output. In SAN, we can share Sm with the layers above m, like this Sm+i = s(Qm, Km) (4) for i ∈ [1, π − 1] where π indicates how many layers share the same attention weights. For example, in a 6-layer decoder, we can share the self-attention weights for every two layers (π = 2), or share the weights for the first two layers (π1 = 2) and let the remaining 4 layers use another weights (π2 = 4).” [sec(s) 2] “The Transformer system follows the popular encoder-decoder paradigm. On the encoder side, there are a number of identical stacked layers. Each of them is composed of a self-attention sub-layer and a feed-forward sub-layer. … Note that S is essentially a weight (or scalar) matrix where every column represents a distribution. The output of self-attention is simply defined as the weighted sum of values: A = S · V (2) Here Q, K and V are generated from the same source with a linear transformation. The self-attention result is then fed into a fully connected feed-forward network (FFN).” [sec(s) 4.1] “The bilingual and evaluation data came from three sources.”; e.g., “FFN” read(s) on “feedforward sub-block”.) Regarding claim 10 Xiao teaches claim 1. Xiao further teaches the plurality of transformer blocks comprises a sequence of transformer blocks, (Xiao [fig(s) 1] [fig(s) 3] “Layer n=m+1”, “Layer m”, “Am=Sm·V” [sec(s) 1] “In this work, we observe that the attention model shares a similar distribution among layers in weighting different positions of the sequence.” [sec(s) 3.2] “We define the self-attention weight matrix in layer m as: Sm = s(Qm, Km) (3) where s(·, ·) is the function described in Eq. (1), Qm and Km are the inputs, and Sm is the attention weight for the output. 
In SAN, we can share Sm with the layers above m, like this Sm+i = s(Qm, Km) (4) for i ∈ [1, π − 1] where π indicates how many layers share the same attention weights. For example, in a 6-layer decoder, we can share the self-attention weights for every two layers (π = 2), or share the weights for the first two layers (π1 = 2) and let the remaining 4 layers use another weights (π2 = 4).” [sec(s) 2] “The Transformer system follows the popular encoder-decoder paradigm. On the encoder side, there are a number of identical stacked layers. Each of them is composed of a self-attention sub-layer and a feed-forward sub-layer. … Note that S is essentially a weight (or scalar) matrix where every column represents a distribution. The output of self-attention is simply defined as the weighted sum of values: A = S · V (2) Here Q, K and V are generated from the same source with a linear transformation. The self-attention result is then fed into a fully connected feed-forward network (FFN).” [sec(s) 4.1] “The Transformer system used in our experiments consisted of a 6-layer encoder and a 6-layer decoder.”;) the sequence of transformer blocks comprises one or more initial blocks, a plurality of intermediate blocks, and one or more final blocks, and (Xiao [fig(s) 1] [fig(s) 3] “Layer n=m+1”, “Layer m”, “Am=Sm·V” [sec(s) 1] “In this work, we observe that the attention model shares a similar distribution among layers in weighting different positions of the sequence.” [sec(s) 3.2] “We define the self-attention weight matrix in layer m as: Sm = s(Qm, Km) (3) where s(·, ·) is the function described in Eq. (1), Qm and Km are the inputs, and Sm is the attention weight for the output. In SAN, we can share Sm with the layers above m, like this Sm+i = s(Qm, Km) (4) for i ∈ [1, π − 1] where π indicates how many layers share the same attention weights. For example, in a 6-layer decoder, we can share the self-attention weights for every two layers (π = 2), or share the weights for the first two layers (π1 = 2) and let the remaining 4 layers use another weights (π2 = 4).” [sec(s) 2] “The Transformer system follows the popular encoder-decoder paradigm. On the encoder side, there are a number of identical stacked layers. Each of them is composed of a self-attention sub-layer and a feed-forward sub-layer. … Note that S is essentially a weight (or scalar) matrix where every column represents a distribution. The output of self-attention is simply defined as the weighted sum of values: A = S · V (2) Here Q, K and V are generated from the same source with a linear transformation. The self-attention result is then fed into a fully connected feed-forward network (FFN).” [sec(s) 4.1] “The Transformer system used in our experiments consisted of a 6-layer encoder and a 6-layer decoder.”;) the plurality of intermediate blocks comprises the first transformer block and the second transformer block. (Xiao [fig(s) 1] [fig(s) 3] “Layer n=m+1”, “Layer m”, “Am=Sm·V” [sec(s) 1] “In this work, we observe that the attention model shares a similar distribution among layers in weighting different positions of the sequence.” [sec(s) 3.2] “We define the self-attention weight matrix in layer m as: Sm = s(Qm, Km) (3) where s(·, ·) is the function described in Eq. (1), Qm and Km are the inputs, and Sm is the attention weight for the output. In SAN, we can share Sm with the layers above m, like this Sm+i = s(Qm, Km) (4) for i ∈ [1, π − 1] where π indicates how many layers share the same attention weights. 
For example, in a 6-layer decoder, we can share the self-attention weights for every two layers (π = 2), or share the weights for the first two layers (π1 = 2) and let the remaining 4 layers use another weights (π2 = 4).” [sec(s) 2] “The Transformer system follows the popular encoder-decoder paradigm. On the encoder side, there are a number of identical stacked layers. Each of them is composed of a self-attention sub-layer and a feed-forward sub-layer. … Note that S is essentially a weight (or scalar) matrix where every column represents a distribution. The output of self-attention is simply defined as the weighted sum of values: A = S · V (2) Here Q, K and V are generated from the same source with a linear transformation. The self-attention result is then fed into a fully connected feed-forward network (FFN).” [sec(s) 4.1] “The Transformer system used in our experiments consisted of a 6-layer encoder and a 6-layer decoder.”;) Regarding claim 15 Xiao teaches claim 1. Xiao further teaches wherein propagating the first attention propagation output to the second transformer block comprises propagating the first attention propagation output using a propagation operation. (Xiao [fig(s) 1] [fig(s) 3] “Layer n=m+1”, “Layer m” [sec(s) 1] “In this work, we observe that the attention model shares a similar distribution among layers in weighting different positions of the sequence.” [sec(s) 3.2] “An obvious next step is to develop a faster attention model that makes efficient re-use of the states in Eqs. (1) and (2), instead of computing everything on the fly. In this work we present a shared attention network (SAN) to share weight matrix S for adjacent layers. The idea is that we just compute the weight matrix once and reuse it for upper-level layers. Here we describe SAN for both the self-attention and encoder-decoder attention models. • SAN Self-Attention. We define the self-attention weight matrix in layer m as: Sm = s(Qm, Km) (3) where s(·, ·) is the function described in Eq. (1), Qm and Km are the inputs, and Sm is the attention weight for the output. In SAN, we can share Sm with the layers above m, like this Sm+i = s(Qm, Km) (4) for i ∈ [1, π − 1] where π indicates how many layers share the same attention weights. For example, in a 6-layer decoder, we can share the self-attention weights for every two layers (π = 2), or share the weights for the first two layers (π1 = 2) and let the remaining 4 layers use another weights (π2 = 4).”;) Regarding claim 17 Xiao teaches claim 1. Xiao further teaches wherein, when generating the output features for the second transformer block, a second self-attention sub-block does not compute an attention matrix. (Xiao [fig(s) 1] [fig(s) 3] “Layer n=m+1”, “Layer m”, “An=Sn·V” [sec(s) 1] “In this work, we observe that the attention model shares a similar distribution among layers in weighting different positions of the sequence.” [sec(s) 3.2] “An obvious next step is to develop a faster attention model that makes efficient re-use of the states in Eqs. (1) and (2), instead of computing everything on the fly. In this work we present a shared attention network (SAN) to share weight matrix S for adjacent layers. The idea is that we just compute the weight matrix once and reuse it for upper-level layers. Here we describe SAN for both the self-attention and encoder-decoder attention models. • SAN Self-Attention. We define the self-attention weight matrix in layer m as: Sm = s(Qm, Km) (3) where s(·, ·) is the function described in Eq. 
(1), Qm and Km are the inputs, and Sm is the attention weight for the output. In SAN, we can share Sm with the layers above m, like this Sm+i = s(Qm, Km) (4) for i ∈ [1, π − 1] where π indicates how many layers share the same attention weights. For example, in a 6-layer decoder, we can share the self-attention weights for every two layers (π = 2), or share the weights for the first two layers (π1 = 2) and let the remaining 4 layers use another weights (π2 = 4).” [sec(s) 2] “Note that S is essentially a weight (or scalar) matrix where every column represents a distribution. The output of self-attention is simply defined as the weighted sum of values: A = S · V (2) Here Q, K and V are generated from the same source with a linear transformation. The self-attention result is then fed into a fully connected feed-forward network (FFN).”; e.g., “we can share Sm with the layers above m” read(s) on “does not compute an attention matrix”.) Regarding claim 18 The claim is a system claim corresponding to the method claim 1, and is directed to largely the same subject matter. Thus, it is rejected for the same reasons as given in the rejections of the method claim. Regarding claim 19 The claim is a system claim corresponding to the method claim 2, and is directed to largely the same subject matter. Thus, it is rejected for the same reasons as given in the rejections of the method claim. Regarding claim 20 The claim is a system claim corresponding to the method claim 3, and is directed to largely the same subject matter. Thus, it is rejected for the same reasons as given in the rejections of the method claim. Regarding claim 21 The claim is a system claim corresponding to the method claim 4, and is directed to largely the same subject matter. Thus, it is rejected for the same reasons as given in the rejections of the method claim. Regarding claim 22 The claim is a system claim corresponding to the method claim 5, and is directed to largely the same subject matter. Thus, it is rejected for the same reasons as given in the rejections of the method claim. Regarding claim 23 The claim is a system claim corresponding to the method claim 6, and is directed to largely the same subject matter. Thus, it is rejected for the same reasons as given in the rejections of the method claim. Regarding claim 24 The claim is a system claim corresponding to the method claim 7, and is directed to largely the same subject matter. Thus, it is rejected for the same reasons as given in the rejections of the method claim. Regarding claim 25 The claim is a system claim corresponding to the method claim 8, and is directed to largely the same subject matter. Thus, it is rejected for the same reasons as given in the rejections of the method claim. Regarding claim 27 The claim is a system claim corresponding to the method claim 10, and is directed to largely the same subject matter. Thus, it is rejected for the same reasons as given in the rejections of the method claim. Regarding claim 32 The claim is a system claim corresponding to the method claim 15, and is directed to largely the same subject matter. Thus, it is rejected for the same reasons as given in the rejections of the method claim. Regarding claim 34 The claim is a system claim corresponding to the method claim 17, and is directed to largely the same subject matter. Thus, it is rejected for the same reasons as given in the rejections of the method claim. 
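For orientation, the attention-weight sharing that the above rejections read onto the claims can be summarized in a short illustrative sketch. The sketch is not part of the Office Action or of Xiao; the function and variable names (shared_attention_stack, attention_weights, the toy projection matrices) are assumptions introduced here, and the code paraphrases Xiao's Eqs. (1)-(4) using standard scaled dot-product attention rather than reproducing the reference's implementation: the first block computes the attention matrix S once, and the following block reuses S against its own value projection instead of recomputing it.

import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention_weights(Q, K):
    # Eq. (1)/(3): S = softmax(Q K^T / sqrt(d)), i.e., the attention matrix
    return softmax(Q @ K.T / np.sqrt(Q.shape[-1]))

def shared_attention_stack(X, proj):
    # Illustrative sketch: the first block computes S (Eq. 3) and "propagates" it;
    # every block then applies A = S · V (Eq. 2); blocks after the first reuse the
    # propagated S (Eq. 4) and never recompute an attention matrix.
    H, S = X, None
    for layer, (Wq, Wk, Wv) in enumerate(proj):
        if layer == 0:
            S = attention_weights(H @ Wq, H @ Wk)  # first attention propagation output
        A = S @ (H @ Wv)   # output features for this block
        H = A + H          # residual connection; the feed-forward sub-block is omitted
    return H

rng = np.random.default_rng(0)
seq_len, dim = 4, 8
X = rng.normal(size=(seq_len, dim))
proj = [tuple(rng.normal(size=(dim, dim)) for _ in range(3)) for _ in range(2)]
print(shared_attention_stack(X, proj).shape)  # (4, 8)

In this toy run, the second loop iteration never calls attention_weights, which corresponds to the property the claim 17 mapping above relies on (the second self-attention sub-block does not compute an attention matrix).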
Regarding claim 35 The claim is a system claim corresponding to the method claim 1, and is directed to largely the same subject matter. Thus, it is rejected for the same reasons as given in the rejections of the method claim. Claim Rejections - 35 USC § 103 In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status. This application currently names joint inventors. In considering patentability of the claims the examiner presumes that the subject matter of the various claims was commonly owned as of the effective filing date of the claimed invention(s) absent any evidence to the contrary. Applicant is advised of the obligation under 37 CFR 1.56 to point out the inventor and effective filing dates of each claim that was not commonly owned as of the effective filing date of the later invention in order for the examiner to consider the applicability of 35 U.S.C. 102(b)(2)(C) for any potential 35 U.S.C. 102(a)(2) prior art against the later invention. The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action: A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made. Claim(s) 9, 26 is/are rejected under 35 U.S.C. 103 as being unpatentable over Xiao et al. (Sharing Attention Weights for Fast Transformer) in view of Xia et al. (Tied Transformers: Neural Machine Translation with Shared Encoder and Decoder) Regarding claim 9 Xiao teaches claim 1. (Note: Hereinafter, if a limitation has bold brackets (i.e. [·]) around claim languages, the bracketed claim languages indicate that they have not been taught yet by the current prior art reference but they will be taught by another prior art reference afterwards.) Xiao further teaches the first transformer block comprises an [encoder] block, and the second transformer block comprises a decoder block. (Xiao [fig(s) 1] [fig(s) 3] “(c) SAN Encoder-Decoder Attention”, “Qm”, “Km”, “Vm” [sec(s) 4.1] “The Transformer system used in our experiments consisted of a 6-layer encoder and a 6-layer decoder.” [sec(s) 4.2] “In encoder-decoder attention, we share the context V generated by the encoder for further speed-ups (see Figure 3(c)). It is therefore worth a study on how much this method can accelerate the system. Table 4 shows that sharing the context contributes half of the speed improvement. This agrees with our design that weight sharing is more beneficial to the decoder because attention is heavier on the decoder side. Another interesting question is whether SAN can improve the system on the encoder side. To seek an answer, we apply SAN to the encoder-side self-attention sub-layers and see small speed improvements (Table 5). 
This result confirms the previous report that the decoder occupies the inference time and the encoder is light [Zhang et al., 2018].” [sec(s) 3.2] “An obvious next step is to develop a faster attention model that makes efficient re-use of the states in Eqs. (1) and (2), instead of computing everything on the fly. In this work we present a shared attention network (SAN) to share weight matrix S for adjacent layers. The idea is that we just compute the weight matrix once and reuse it for upper-level layers. Here we describe SAN for both the self-attention and encoder-decoder attention models. • SAN Self-Attention. We define the self-attention weight matrix in layer m as: Sm = s(Qm, Km) (3) where s(·, ·) is the function described in Eq. (1), Qm and Km are the inputs, and Sm is the attention weight for the output. In SAN, we can share Sm with the layers above m, like this Sm+i = s(Qm, Km) (4) for i ∈ [1, π − 1] where π indicates how many layers share the same attention weights. For example, in a 6-layer decoder, we can share the self-attention weights for every two layers (π = 2), or share the weights for the first two layers (π1 = 2) and let the remaining 4 layers use another weights (π2 = 4). … SAN Encoder-Decoder Attention … where Am is the attention output of layer m, V is the context representation generated by the encoder.”;) However, Xiao does not appear to explicitly teach: the first transformer block comprises an [encoder] block, and (Note: Hereinafter, if a limitation has one or more bold underlines, the one or more underlined claim languages indicate that they are taught by the current prior art reference, while the one or more non-underlined claim languages indicate that they have been taught already by one or more previous art references.) Xia teaches the first transformer block comprises an encoder block, and (Xia [fig(s) 1] [sec(s) 1] “We make an initial attempt to answer this question and then propose tied transformer. We cast the typical encoder-decoder based sequence-to-sequence model into a more compact one in that there is only one copy of parameter set, which is applicable to both the encoder and decoder. In that way we force the sharing among the weights of the encoder and the decoder, rather than only among the source-side and target-side word embeddings” [sec(s) 3.1] “The architecture of tied transformer is shown in Figure 1. We follow the notations defined in Section 2.3 to mathematically describe our model. Tied transformer is a stacked model with L blocks, where the l’th block consists of a self-attention module ϕlS , a cross-lingual attention ϕlC and a nonlinear function ϕlF , where superscript l represents the layer id. We show how the encoder and decoder are shared.”;) Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the system of Xiao with the encoder of Xia. One of ordinary skill in the art would have been motivated to combine in order to significantly outperform the conventional approach on diverse test datasets. (Xia [sec(s) 4] “The three rows in Table 1 represent the basic Transformer algorithm, our proposed algorithm and the improvements brought by our algorithm. We also list the IWSLT 2014 De→En results reported by previous literature. The tied transformer significantly outperforms the standard transformer.
The improvements are two-folded: first, we could see Transformer outperforms non-transformer systems, which is helpful to get a great score; second, on top of such a strong baseline, we are still capable to further improve the performances, which demonstrate the effectiveness of our method. On the three tasks De→En, Es→En and Ro→En, our proposed method could outperform the Transformer baseline by 1.83, 1.84 and 1.35 points, which indeed verifies our motivation introduced in Section 1 that such a framework is helpful to improve the translations within a same language family. As far as we could survey, we achieve the best result on IWSLT 2014 De→En, whose previously best result in 33.81 provided by (Elbayad et al. 2018).”) Regarding claim 26 The claim is a system claim corresponding to the method claim 9, and is directed to largely the same subject matter. Thus, it is rejected for the same reasons as given in the rejections of the method claim. Claim(s) 11-12, 28-29 is/are rejected under 35 U.S.C. 103 as being unpatentable over Xiao et al. (Sharing Attention Weights for Fast Transformer) in view of Beltagy et al. (Longformer: The Long-Document Transformer) Regarding claim 11 Xiao teaches claim 1. Xiao further teaches wherein generating the first attention propagation output comprises processing the input data for the first transformer block using a plurality of [window] self-attention operations to generate the output features for the first transformer block. (Xiao [fig(s) 1] [fig(s) 3] “Layer n=m+1”, “Layer m”, “Am=Sm·V” [sec(s) 1] “In this work, we observe that the attention model shares a similar distribution among layers in weighting different positions of the sequence.” [sec(s) 3.2] “We define the self-attention weight matrix in layer m as: Sm = s(Qm, Km) (3) where s(·, ·) is the function described in Eq. (1), Qm and Km are the inputs, and Sm is the attention weight for the output. In SAN, we can share Sm with the layers above m, like this Sm+i = s(Qm, Km) (4) for i ∈ [1, π − 1] where π indicates how many layers share the same attention weights. For example, in a 6-layer decoder, we can share the self-attention weights for every two layers (π = 2), or share the weights for the first two layers (π1 = 2) and let the remaining 4 layers use another weights (π2 = 4).” [sec(s) 2] “The Transformer system follows the popular encoder-decoder paradigm. On the encoder side, there are a number of identical stacked layers. Each of them is composed of a self-attention sub-layer and a feed-forward sub-layer. … Note that S is essentially a weight (or scalar) matrix where every column represents a distribution. The output of self-attention is simply defined as the weighted sum of values: A = S · V (2) Here Q, K and V are generated from the same source with a linear transformation. The self-attention result is then fed into a fully connected feed-forward network (FFN).” [sec(s) 4.1] “The Transformer system used in our experiments consisted of a 6-layer encoder and a 6-layer decoder.”;) However, Xiao does not appear to explicitly teach: wherein generating the first attention propagation output comprises processing the input data for the first transformer block using a plurality of [window] self-attention operations to generate the output features for the first transformer block. 
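Before the Beltagy passage relied on immediately below, a brief illustrative sketch of fixed-size windowed (sliding-window) self-attention may help frame the claim 11 limitation. It is not drawn from the record; the function name and the dense band-mask construction are assumptions made here for clarity, whereas the Longformer implementation computes the banded scores directly for efficiency.

import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def sliding_window_attention(Q, K, V, window=2):
    # Each position attends only to positions within +/- window of itself.
    # A dense band mask is used here for clarity; a production implementation
    # would compute only the banded scores so memory scales with sequence length.
    seq_len, dim = Q.shape
    scores = Q @ K.T / np.sqrt(dim)
    idx = np.arange(seq_len)
    scores[np.abs(idx[:, None] - idx[None, :]) > window] = -np.inf
    return softmax(scores) @ V

rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(6, 8)) for _ in range(3))
print(sliding_window_attention(Q, K, V, window=1).shape)  # (6, 8)

Stacking several such windowed layers enlarges the receptive field, which is the behavior described in the Beltagy excerpt quoted below.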
Beltagy teaches wherein generating the first attention propagation output comprises processing the input data for the first transformer block using a plurality of window self-attention operations to generate the output features for the first transformer block. (Beltagy [sec(s) 1] “Longformer’s attention mechanism is a combination of a windowed local-context self-attention and an end task motivated global attention that encodes inductive bias about the task. Through ablations and controlled trials we show both attention types are essential – the local attention is primarily used to build contextual representations, while the global attention allows Longformer to build full sequence representations for prediction.” [sec(s) 3.1] “Sliding Window Given the importance of local context (Kovaleva et al., 2019), our attention pattern employs a fixed-size window attention surrounding each token. Using multiple stacked layers of such windowed attention results in a large receptive field, where top layers have access to all input locations and have the capacity to build representations that incorporate information across the entire input, similar to CNNs (Wu et al., 2019). … Dilated Sliding Window To further increase the receptive field without increasing computation, the sliding window can be “dilated”. This is analogous to dilated CNNs (van den Oord et al., 2016) where the window has gaps of size dilation d (Fig. 2c). Assuming a fixed d and w for all layers, the receptive field is l × d × w, which can reach tens of thousands of tokens even for small values of d. In multi-headed attention, each attention head computes a different attention score. We found settings with different dilation configurations per head improves performance by allowing some heads without dilation to focus on local context, while others with dilation focus on longer context.”;) Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the system of Xiao with the window self-attention operations of Beltagy. One of ordinary skill in the art would have been motivated to combine in order to achieve state-of-the-art results on the character-level language modeling tasks and, when pretrained, consistently outperform the conventional approach on long document tasks. (Beltagy [sec(s) 8] “Longformer achieves state-of-the-art results on the character-level language modeling tasks of text8 and enwik8. When pretrained, Longformer consistently outperforms RoBERTa on long document tasks and sets new state-of-the-art results on WikiHop and TriviaQA. We further present LED, an encoder-decoder variant of Longformer for modeling sequence-to-sequence tasks, and achieve state-of-the-art results on the arXiv long document summarization task.”) Regarding claim 12 The combination of Xiao, Beltagy teaches claim 11. Xiao further teaches wherein the first attention propagation output comprises the output features for the first transformer block. (Xiao [fig(s) 1] [fig(s) 3] “Layer n=m+1”, “Layer m”, “Am=Sm·V” [sec(s) 1] “In this work, we observe that the attention model shares a similar distribution among layers in weighting different positions of the sequence.” [sec(s) 3.2] “We define the self-attention weight matrix in layer m as: Sm = s(Qm, Km) (3) where s(·, ·) is the function described in Eq. (1), Qm and Km are the inputs, and Sm is the attention weight for the output.
In SAN, we can share Sm with the layers above m, like this Sm+i = s(Qm, Km) (4) for i ∈ [1, π − 1] where π indicates how many layers share the same attention weights. For example, in a 6-layer decoder, we can share the self-attention weights for every two layers (π = 2), or share the weights for the first two layers (π1 = 2) and let the remaining 4 layers use another weights (π2 = 4).” [sec(s) 2] “The Transformer system follows the popular encoder-decoder paradigm. On the encoder side, there are a number of identical stacked layers. Each of them is composed of a self-attention sub-layer and a feed-forward sub-layer. … Note that S is essentially a weight (or scalar) matrix where every column represents a distribution. The output of self-attention is simply defined as the weighted sum of values: A = S · V (2) Here Q, K and V are generated from the same source with a linear transformation. The self-attention result is then fed into a fully connected feed-forward network (FFN).” [sec(s) 4.1] “The Transformer system used in our experiments consisted of a 6-layer encoder and a 6-layer decoder.”;) Regarding claim 28 The claim is a system claim corresponding to the method claim 11, and is directed to largely the same subject matter. Thus, it is rejected for the same reasons as given in the rejections of the method claim. Regarding claim 29 The claim is a system claim corresponding to the method claim 12, and is directed to largely the same subject matter. Thus, it is rejected for the same reasons as given in the rejections of the method claim. Claim(s) 13, 30 is/are rejected under 35 U.S.C. 103 as being unpatentable over Xiao et al. (Sharing Attention Weights for Fast Transformer) in view of Beltagy et al. (Longformer: The Long-Document Transformer) in view of Zhang et al. (Densely-Connected Transformer with Co-attentive Information for Matching Text Sequences) Regarding claim 13 The combination of Xiao, Beltagy teaches claim 12. Xiao further teaches propagating the first attention propagation output to the second transformer block comprises propagating the first attention propagation output using a propagation operation, (Xiao [fig(s) 1] [fig(s) 3] “Layer n=m+1”, “Layer m” [sec(s) 1] “In this work, we observe that the attention model shares a similar distribution among layers in weighting different positions of the sequence.” [sec(s) 3.2] “An obvious next step is to develop a faster attention model that makes efficient re-use of the states in Eqs. (1) and (2), instead of computing everything on the fly. In this work we present a shared attention network (SAN) to share weight matrix S for adjacent layers. The idea is that we just compute the weight matrix once and reuse it for upper-level layers. Here we describe SAN for both the self-attention and encoder-decoder attention models. • SAN Self-Attention. We define the self-attention weight matrix in layer m as: Sm = s(Qm, Km) (3) where s(·, ·) is the function described in Eq. (1), Qm and Km are the inputs, and Sm is the attention weight for the output. In SAN, we can share Sm with the layers above m, like this Sm+i = s(Qm, Km) (4) for i ∈ [1, π − 1] where π indicates how many layers share the same attention weights. 
For example, in a 6-layer decoder, we can share the self-attention weights for every two layers (π = 2), or share the weights for the first two layers (π1 = 2) and let the remaining 4 layers use another weights (π2 = 4).”;) the propagation operation comprises transforming the first attention propagation output by [concatenating] output features for a third transformer block of the plurality of transformer blocks to the first attention propagation output, and (Xiao [fig(s) 1] [fig(s) 3] “Layer n=m+1”, “Layer m” [sec(s) 1] “In this work, we observe that the attention model shares a similar distribution among layers in weighting different positions of the sequence.” [sec(s) 3.2] “An obvious next step is to develop a faster attention model that makes efficient re-use of the states in Eqs. (1) and (2), instead of computing everything on the fly. In this work we present a shared attention network (SAN) to share weight matrix S for adjacent layers. The idea is that we just compute the weight matrix once and reuse it for upper-level layers. Here we describe SAN for both the self-attention and encoder-decoder attention models. • SAN Self-Attention. We define the self-attention weight matrix in layer m as: Sm = s(Qm, Km) (3) where s(·, ·) is the function described in Eq. (1), Qm and Km are the inputs, and Sm is the attention weight for the output. In SAN, we can share Sm with the layers above m, like this Sm+i = s(Qm, Km) (4) for i ∈ [1, π − 1] where π indicates how many layers share the same attention weights. For example, in a 6-layer decoder, we can share the self-attention weights for every two layers (π = 2), or share the weights for the first two layers (π1 = 2) and let the remaining 4 layers use another weights (π2 = 4).”;) the third transformer block immediately precedes the second transformer block. (Xiao [fig(s) 1] [fig(s) 3] “Layer n=m+1”, “Layer m”, “An=Sn·V” [sec(s) 1] “In this work, we observe that the attention model shares a similar distribution among layers in weighting different positions of the sequence.” [sec(s) 3.2] “We define the self-attention weight matrix in layer m as: Sm = s(Qm, Km) (3) where s(·, ·) is the function described in Eq. (1), Qm and Km are the inputs, and Sm is the attention weight for the output. In SAN, we can share Sm with the layers above m, like this Sm+i = s(Qm, Km) (4) for i ∈ [1, π − 1] where π indicates how many layers share the same attention weights. For example, in a 6-layer decoder, we can share the self-attention weights for every two layers (π = 2), or share the weights for the first two layers (π1 = 2) and let the remaining 4 layers use another weights (π2 = 4).”;) However, the combination of Xiao, Beltagy does not appear to explicitly teach: the propagation operation comprises transforming the first attention propagation output by [concatenating] output features for a third transformer block of the plurality of transformer blocks to the first attention propagation output, and Zhang teaches the propagation operation comprises transforming the first attention propagation output by concatenating output features for a third transformer block of the plurality of transformer blocks to the first attention propagation output, and (Zhang [sec(s) 3.1] “Transformer Encoder. The inputs of transformer encoder are the representations Ha = {h1a, h2a, ..., hala} and Hb = {h1b, h2b, ..., hblb} for text sequences which are the concatenation of outputs of word embedding layer and all the previous sublayers.
We leverage transformer encoder [31] to obtain refined representation for two text sequences separately. Transformer is a recently proposed neural architecture which based solely on attention mechanisms, it achieves remarkable performance on neural machine translation task while being more parallelizable and requiring significantly less time to train. … Multi-head Co-attention. Motivated by the success of multi-head self-attention in neural machine translation task, we leverage multi-head co-attention mechanism to perform matching. Following is the general single head co-attention mechanism [equation (12), reproduced as an image in the original], where f is the scaled dot-product attention, i represents the i-th head. The multi-head co-attentive representations are then obtained as follows: [equation (13), reproduced as an image in the original]”;) Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the system of Xiao, Beltagy with the concatenation of Zhang. One of ordinary skill in the art would have been motivated to combine in order to show the superiority of the system which outperforms many other neural architectures and achieves competitive performance on five real-world datasets. (Zhang [sec(s) 1] “Extensive experiments are conducted to show the superiority of our proposed model which outperforms many other neural architectures and achieves competitive performance on five real-world datasets.”) Regarding claim 30 The claim is a system claim corresponding to the method claim 13, and is directed to largely the same subject matter. Thus, it is rejected for the same reasons as given in the rejections of the method claim. Claim(s) 14, 31 is/are rejected under 35 U.S.C. 103 as being unpatentable over Xiao et al. (Sharing Attention Weights for Fast Transformer) in view of Beltagy et al. (Longformer: The Long-Document Transformer) in view of Zamir et al. (Restormer: Efficient Transformer for High-Resolution Image Restoration) Regarding claim 14 The combination of Xiao, Beltagy teaches claim 12. Xiao further teaches propagating the first attention propagation output to the second transformer block comprises propagating the first attention propagation output using a propagation operation, and (Xiao [fig(s) 1] [fig(s) 3] “Layer n=m+1”, “Layer m” [sec(s) 1] “In this work, we observe that the attention model shares a similar distribution among layers in weighting different positions of the sequence.” [sec(s) 3.2] “An obvious next step is to develop a faster attention model that makes efficient re-use of the states in Eqs. (1) and (2), instead of computing everything on the fly. In this work we present a shared attention network (SAN) to share weight matrix S for adjacent layers. The idea is that we just compute the weight matrix once and reuse it for upper-level layers. Here we describe SAN for both the self-attention and encoder-decoder attention models. • SAN Self-Attention. We define the self-attention weight matrix in layer m as: Sm = s(Qm, Km) (3) where s(·, ·) is the function described in Eq. (1), Qm and Km are the inputs, and Sm is the attention weight for the output. In SAN, we can share Sm with the layers above m, like this Sm+i = s(Qm, Km) (4) for i ∈ [1, π − 1] where π indicates how many layers share the same attention weights.
For example, in a 6-layer decoder, we can share the self-attention weights for every two layers (π = 2), or share the weights for the first two layers (π1 = 2) and let the remaining 4 layers use another weights (π2 = 4).”;) the propagation operation comprises transforming the first attention propagation output using an [upsampling] operation. (Xiao [fig(s) 1] [fig(s) 3] “Layer n=m+1”, “Layer m” [sec(s) 1] “In this work, we observe that the attention model shares a similar distribution among layers in weighting different positions of the sequence.” [sec(s) 3.2] “An obvious next step is to develop a faster attention model that makes efficient re-use of the states in Eqs. (1) and (2), instead of computing everything on the fly. In this work we present a shared attention network (SAN) to share weight matrix S for adjacent layers. The idea is that we just compute the weight matrix once and reuse it for upper-level layers. Here we describe SAN for both the self-attention and encoder-decoder attention models. • SAN Self-Attention. We define the self-attention weight matrix in layer m as: Sm = s(Qm, Km) (3) where s(·, ·) is the function described in Eq. (1), Qm and Km are the inputs, and Sm is the attention weight for the output. In SAN, we can share Sm with the layers above m, like this Sm+i = s(Qm, Km) (4) for i ∈ [1, π − 1] where π indicates how many layers share the same attention weights. For example, in a 6-layer decoder, we can share the self-attention weights for every two layers (π = 2), or share the weights for the first two layers (π1 = 2) and let the remaining 4 layers use another weights (π2 = 4).”;) However, the combination of Xiao, Beltagy does not appear to explicitly teach: the propagation operation comprises transforming the first attention propagation output using an [upsampling] operation. Zamir teaches the propagation operation comprises transforming the first attention propagation output using an upsampling operation. (Zamir [fig(s) 2] “Transformer Block”, “Concatenation”, “Skip Connections”, “Architecture of Restormer for high-resolution image restoration. Our Restormer consists of multiscale hierarchical design incorporating efficient Transformer blocks. The core modules of Transformer block are: (a) multi-Dconv head transposed attention (MDTA) that performs (spatially enriched) query-key feature interaction across channels rather than the spatial dimension, and (b) Gated-Dconv feed-forward network (GDFN) that performs controlled feature transformation, i.e., to allow useful information to propagate further” [sec(s) 3] “For feature downsampling and upsampling, we apply pixel-unshuffle and pixel-shuffle operations [69], respectively. To assist the recovery process, the encoder features are concatenated with the decoder features via skip connections [66]. The concatenation operation is followed by a 1×1 convolution to reduce channels (by half) at all levels, except the top one. At level-1, we let Transformer blocks to aggregate the low-level image features of the encoder with the high-level features of the decoder. It is beneficial in preserving the fine structural and textural details in the restored images. Next, the deep features Fd are further enriched in the refinement stage operating at high spatial resolution. These design choices yield quality improvements as we shall see in the experiment section (Sec.
4).”;) Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the system of Xiao, Beltagy with the upsampling of Zamir. One of ordinary skill in the art would have been motivated to combine in order to achieve consistent and significant performance gains over existing approaches on all five datasets. (Zamir [sec(s) 4] “We compute PSNR/SSIM scores using the Y channel in YCbCr color space in a way similar to existing methods [32, 61, 93]. Table 1 shows that our Restormer achieves consistent and significant performance gains over existing approaches on all five datasets. Compared to the recent best method SPAIR [61], Restormer achieves 1.05 dB improvement when averaged across all datasets.”) Regarding claim 31 The claim is a system claim corresponding to the method claim 14, and is directed to largely the same subject matter. Thus, it is rejected for the same reasons as given in the rejections of the method claim. Claim(s) 16, 33 is/are rejected under 35 U.S.C. 103 as being unpatentable over Xiao et al. (Sharing Attention Weights for Fast Transformer) in view of Zamir et al. (Restormer: Efficient Transformer for High-Resolution Image Restoration) Regarding claim 16 Xiao teaches claim 15. Xiao further teaches wherein the propagation operation comprises transforming the first attention propagation output by performing one or more [convolution] operations on the first attention propagation output. (Xiao [fig(s) 1] [fig(s) 3] “Layer n=m+1”, “Layer m” [sec(s) 1] “In this work, we observe that the attention model shares a similar distribution among layers in weighting different positions of the sequence.” [sec(s) 3.2] “An obvious next step is to develop a faster attention model that makes efficient re-use of the states in Eqs. (1) and (2), instead of computing everything on the fly. In this work we present a shared attention network (SAN) to share weight matrix S for adjacent layers. The idea is that we just compute the weight matrix once and reuse it for upper-level layers. Here we describe SAN for both the self-attention and encoder-decoder attention models. • SAN Self-Attention. We define the self-attention weight matrix in layer m as: Sm = s(Qm, Km) (3) where s(·, ·) is the function described in Eq. (1), Qm and Km are the inputs, and Sm is the attention weight for the output. In SAN, we can share Sm with the layers above m, like this Sm+i = s(Qm, Km) (4) for i ∈ [1, π − 1] where π indicates how many layers share the same attention weights. For example, in a 6-layer decoder, we can share the self-attention weights for every two layers (π = 2), or share the weights for the first two layers (π1 = 2) and let the remaining 4 layers use another weights (π2 = 4).”;) However, Xiao does not appear to explicitly teach: wherein the propagation operation comprises transforming the first attention propagation output by performing one or more [convolution] operations on the first attention propagation output. Zamir teaches wherein the propagation operation comprises transforming the first attention propagation output by performing one or more convolution operations on the first attention propagation output. (Zamir [fig(s) 2] “Convolution”, “Transformer Block”, “Concatenation”, “Skip Connections”, “Architecture of Restormer for high-resolution image restoration. Our Restormer consists of multiscale hierarchical design incorporating efficient Transformer blocks.
The core modules of Transformer block are: (a) multi-Dconv head transposed attention (MDTA) that performs (spatially enriched) query-key feature interaction across channels rather the spatial dimension, and (b) Gated-Dconv feed-forward network (GDFN) that performs controlled feature transformation, i.e., to allow useful information to propagate further” [sec(s) 3] “Overall Pipeline. Given a degraded image I ∈ RH×W×3, Restormer first applies a convolution to obtain low-level feature embeddings F0 ∈ RH×W×C. … For feature downsampling and upsampling, we apply pixel-unshuffle and pixel-shuffle operations [69], respectively. To assist the recovery process, the encoder features are concatenated with the decoder features via skip connections [66]. The concatenation operation is followed by a 1×1 convolution to reduce channels (by half) at all levels, except the top one. At level-1, we let Transformer blocks to aggregate the low-level image features of the encoder with the high-level features of the decoder. It is beneficial in preserving the fine structural and textural details in the restored images. Next, the deep features Fd are further enriched in the refinement stage operating at high spatial resolution. These design choices yield quality improvements as we shall see in the experiment section (Sec. 4).” [sec(s) 3.1] “As another essential component in MDTA, we introduce depth-wise convolutions to emphasize on the local context before computing feature covariance to produce the global attention map. From a layer normalized tensor Y ∈ RHˆ×Wˆ×Cˆ, our MDTA first generates query (Q), key (K) and value (V) projections, enriched with local context. It is achieved by applying 1×1 convolutions to aggregate pixel-wise cross-channel context followed by 3×3 depth-wise convolutions to encode channel-wise spatial context”;) Xiao is combinable with Zamir for the same rationale as set forth above with respect to claim 14. Regarding claim 33 The claim is a system claim corresponding to the method claim 16, and is directed to largely the same subject matter. Thus, it is rejected for the same reasons as given in the rejections of the method claim. Prior Art The prior art made of record and not relied upon is considered pertinent to applicant's disclosure. Zhu et al. (A Densely Connected Transformer for Machine Translation) teaches concatenation for multiple heads. Conclusion Any inquiry concerning this communication or earlier communications from the examiner should be directed to SEHWAN KIM whose telephone number is (571)270-7409. The examiner can normally be reached Mon - Fri 9:00 AM - 5:00 PM. Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice. If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Michael J Huntley can be reached on (303) 297-4307. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300. Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. 
Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000. /SEHWAN KIM/Examiner, Art Unit 2129 1/22/2026
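For orientation on the cited Xiao reference: its shared attention network (SAN) computes the self-attention weight matrix Sm = s(Qm, Km) once and reuses it for the next π − 1 layers instead of recomputing it per layer. Below is a minimal PyTorch-style sketch of that weight-sharing idea under simplified single-head attention; the names SharedAttentionLayer, SharedAttentionStack, and share_span are illustrative assumptions, not Xiao's implementation.

import torch
import torch.nn as nn
import torch.nn.functional as F

class SharedAttentionLayer(nn.Module):
    # One self-attention layer that can reuse a precomputed attention weight matrix S.
    def __init__(self, d_model):
        super().__init__()
        self.q = nn.Linear(d_model, d_model)
        self.k = nn.Linear(d_model, d_model)
        self.v = nn.Linear(d_model, d_model)
        self.scale = d_model ** -0.5

    def forward(self, x, shared_S=None):
        if shared_S is None:
            # Eq. (3): S_m = softmax(Q_m K_m^T / sqrt(d))
            S = F.softmax(self.q(x) @ self.k(x).transpose(-2, -1) * self.scale, dim=-1)
        else:
            # Eq. (4): reuse the weight matrix computed by a lower layer
            S = shared_S
        return S @ self.v(x), S

class SharedAttentionStack(nn.Module):
    # Every `share_span` consecutive layers reuse one S (share_span plays the role of pi).
    def __init__(self, num_layers=6, d_model=512, share_span=2):
        super().__init__()
        self.layers = nn.ModuleList([SharedAttentionLayer(d_model) for _ in range(num_layers)])
        self.share_span = share_span

    def forward(self, x):
        S = None
        for i, layer in enumerate(self.layers):
            if i % self.share_span == 0:
                S = None  # recompute S at the start of each sharing group
            x, S = layer(x, shared_S=S)
        return x

# e.g., a 6-layer stack sharing weights in pairs (pi = 2)
out = SharedAttentionStack(num_layers=6, d_model=512, share_span=2)(torch.randn(2, 10, 512))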
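On the Zamir reference mapped to the upsampling limitation: Restormer propagates decoder features to the next resolution level by pixel-shuffle upsampling, concatenation with the encoder skip feature, and a 1x1 convolution that halves the channels. A rough sketch of that propagation step, under assumed channel counts, follows; UpsamplePropagate is an illustrative name, not Restormer's published code.

import torch
import torch.nn as nn

class UpsamplePropagate(nn.Module):
    # Upsample a decoder feature one level, merge the encoder skip feature,
    # and reduce channels with a 1x1 convolution (in the spirit of Restormer's decoder path).
    def __init__(self, in_ch):
        super().__init__()
        self.up = nn.Sequential(
            nn.Conv2d(in_ch, in_ch * 2, kernel_size=1, bias=False),  # expand to 2*C channels...
            nn.PixelShuffle(2),                                      # ...so shuffling yields C/2 at 2x resolution
        )
        self.reduce = nn.Conv2d(in_ch, in_ch // 2, kernel_size=1, bias=False)  # halve channels after concat

    def forward(self, dec_feat, enc_skip):
        x = self.up(dec_feat)                # (B, in_ch//2, 2H, 2W)
        x = torch.cat([x, enc_skip], dim=1)  # skip connection by concatenation -> (B, in_ch, 2H, 2W)
        return self.reduce(x)                # (B, in_ch//2, 2H, 2W)

# e.g., propagate a level-2 decoder feature (96 ch, 32x32) to level 1 using a 48-ch skip feature
out = UpsamplePropagate(96)(torch.randn(1, 96, 32, 32), torch.randn(1, 48, 64, 64))  # -> (1, 48, 64, 64)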
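On the convolution limitation of claims 16 and 33: Zamir's MDTA forms query, key, and value with a 1x1 convolution followed by a 3x3 depth-wise convolution, then attends across channels rather than spatial positions. The sketch below approximates that pattern; it uses a fixed softmax scaling where the paper uses normalized Q/K with a learnable temperature, and ChannelAttentionWithConv is an assumed name.

import torch
import torch.nn as nn
import torch.nn.functional as F

class ChannelAttentionWithConv(nn.Module):
    # Q/K/V from a 1x1 conv plus a 3x3 depth-wise conv, attention computed across channels.
    def __init__(self, ch):
        super().__init__()
        self.qkv = nn.Conv2d(ch, ch * 3, kernel_size=1, bias=False)
        self.dwconv = nn.Conv2d(ch * 3, ch * 3, kernel_size=3, padding=1, groups=ch * 3, bias=False)
        self.proj = nn.Conv2d(ch, ch, kernel_size=1, bias=False)

    def forward(self, x):
        b, c, h, w = x.shape
        q, k, v = self.dwconv(self.qkv(x)).chunk(3, dim=1)   # each (B, C, H, W)
        q, k, v = (t.reshape(b, c, h * w) for t in (q, k, v))
        attn = F.softmax(q @ k.transpose(-2, -1) * (h * w) ** -0.5, dim=-1)  # (B, C, C) channel attention
        return self.proj((attn @ v).reshape(b, c, h, w))

# e.g., y = ChannelAttentionWithConv(48)(torch.randn(1, 48, 64, 64))  # -> (1, 48, 64, 64)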

Prosecution Timeline

Jun 15, 2023
Application Filed
Jan 22, 2026
Non-Final Rejection — §101, §102, §103 (current)

Precedent Cases

Applications granted by the same examiner in similar technology

Patent 12602595
SYSTEM AND METHOD OF USING A KNOWLEDGE REPRESENTATION FOR FEATURES IN A MACHINE LEARNING CLASSIFIER
2y 5m to grant • Granted Apr 14, 2026
Patent 12602580
Dataset Dependent Low Rank Decomposition Of Neural Networks
2y 5m to grant • Granted Apr 14, 2026
Patent 12602581
Systems and Methods for Out-of-Distribution Detection
2y 5m to grant • Granted Apr 14, 2026
Patent 12602606
APPARATUSES, COMPUTER-IMPLEMENTED METHODS, AND COMPUTER PROGRAM PRODUCTS FOR IMPROVED GLOBAL QUBIT POSITIONING IN A QUANTUM COMPUTING ENVIRONMENT
2y 5m to grant • Granted Apr 14, 2026
Patent 12541722
MACHINE LEARNING TECHNIQUES FOR VALIDATING AND MUTATING OUTPUTS FROM PREDICTIVE SYSTEMS
2y 5m to grant • Granted Feb 03, 2026
Study what changed to get past this examiner, based on the 5 most recent grants.


Prosecution Projections

1-2
Expected OA Rounds
60%
Grant Probability
99%
With Interview (+65.6%)
4y 1m
Median Time to Grant
Low
PTA Risk
Based on 144 resolved cases by this examiner. Grant probability derived from career allow rate.
