Office Action Analysis: 18179398 — METHOD FOR OPTIMIZING WORKFLOW-BASED NEURAL NETWORK INCLUDING ATTENTION LAYER

Office Action

§101 §103 §112
DETAILED ACTION
Status of Claims
This Office action is responsive to communications filed on 2023-03-07. Claim(s) 1-16 is/are pending and are examined herein.
Claim(s) 2-3 and 6-16 is/are objected to. 
Claim(s) 2-3, 8, 10-11, and 16 is/are rejected under 35 USC 112(b).
Claim(s) 9-16 is/are rejected under 35 USC 101. 
Claim(s) 1-16 is/are rejected under 35 USC 103.

Notice of Pre-AIA or AIA Status
The present application, filed on or after 2013-03-16, is being examined under the first inventor to file provisions of the AIA.

Information Disclosure Statement
The attached information disclosure statement(s) (IDS), submitted on 2023-03-07 and 2023-07-04, is/are in compliance with the provisions of 37 CFR 1.97. Accordingly, the attached information disclosure statement(s) is/are being considered by the examiner.

Examiner’s Remarks
Claims 2-3 and 10-11 recite conditional limitations: 
[Claims 2 and 10] creating the proposed attention mask pattern based on one or more desired attention placements and attention weights on the input elements if the original attention mask pattern does not make intuitive sense. [emphasis added]
[Claims 3 and 11] creating the attention mask updating function if the attention mask updating function can be created to fulfill the proposed attention mask pattern; considering a generation of a new attention pattern proposal if the attention mask updating function cannot be created to fulfill the proposed attention mask pattern; and adopting a reinforcement learning (RL) model as the attention mask updating function if the attention mask updating function cannot be created to fulfill the proposed attention mask pattern and if a new attention pattern proposal cannot be obtained. [emphasis added]
MPEP 2111.04(II) indicates that “[t]he broadest reasonable interpretation of a method (or process) claim having contingent limitations requires only those steps that must be performed and does not include steps that are not required to be performed because the condition(s) are not met”. The conditions recited in these conditional limitations do not need to occur. Thus, the limitations cited above are not part of the broadest reasonable interpretation of the method claims in which they occur. The applicant is advised to amend the claims to positively recite the conditional limitations. Amendment to all claims, not just the method claims, is advisable for consistency and clarity. 

Claim Objections
Claim(s) 2-3 and 6-16 is/are objected to because of the following informalities: 
Claims 2 and 10 recite wherein the generation of attention pattern proposal comprising but this should be “wherein the generating of the attention mask pattern proposal comprises” for proper antecedent basis and grammaticality (a “wherein” clause requires a finite verb). 
Claims 3 and 11 recite wherein the creation of the attention mask updating function comprising but this should be “the creation of the attention mask updating function comprises” for grammaticality (a “wherein” clause requires a finite verb). 
Claims 3 and 11 recite between the original attention mask pattern deviates and the proposed attention mask pattern [emphasis added] but the underlined word (“deviates”) appears to be extraneous; it should be removed for grammaticality. 
Claims 3 and 11 recite determining whether an attention mask updating function can be created to fulfill the attention pattern proposal [emphasis added] but this should be “determining whether the attention mask updating function can be created to fulfill the attention mask pattern proposal” for proper antecedent basis and consistency of nomenclature. 
Claims 6 and 14 recite a key matrix of queries and a key matrix of keys [emphasis added]. For clarity of nomenclature, the examiner suggests “a query matrix of queries and a key matrix of keys”. Dependent claims 7-8 and 15-16 inherit the objection. 
Claims 7 and 15 recite Q represents a key matrix of queries, K represents a key matrix of keys, d_k represents a dimension of both the key matrix of queries and the key matrix of keys, and V represents a value matrix of values [emphasis added]. For clarity of nomenclature and proper antecedent basis, this should be “Q represents the query matrix of queries, K represents the key matrix of keys, d_k represents a dimension of both the query matrix of queries and the key matrix of keys, and V represents the value matrix of values” (since the respective parent claims already introduce entities bearing these names). Dependent claims 8 and 16 inherit the objection.
Claim 9 recites the attention layer having an original attention function [emphasis added] but this should be “the attention layer having the original attention function” for proper antecedent basis (since the claim already recites “an original attention function”). Dependent claims 10-16 inherit the objection. 

Appropriate correction is required.
	
Claim Rejections - 35 USC 112(b)
The following is a quotation of 35 USC 112(b):
(b)  CONCLUSION.—The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the inventor or a joint inventor regards as the invention.

The following is a quotation of 35 USC 112 (pre-AIA), second paragraph:
The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the applicant regards as his invention.

Claim(s) 2-3, 8, 10-11, and 16 is/are rejected under 35 USC 112(b) or 35 USC 112 (pre-AIA), second paragraph, as being indefinite for failing to particularly point out and distinctly claim the subject matter which the inventor or a joint inventor (or for applications subject to pre-AIA 35 USC 112, the applicant), regards as the invention.

Claims 2 and 10 recite:
determining whether the original attention mask pattern makes intuitive sense, comprising: determining whether the attentions are being placed correctly and with appropriate attention weights in relation to the identified elements of the input elements and the original prediction results; and creating the proposed attention mask pattern based on one or more desired attention placements and attention weights on the input elements if the original attention mask pattern does not make intuitive sense.
This limitation is indefinite for the following reasons. First, determining whether something makes “intuitive sense” is subjective: something that makes “intuitive sense” to one person may not to another. MPEP 2173.05(b)(IV) indicates that a “claim term that requires the exercise of subjective judgment without restriction may render the claim indefinite”. In the present instance, the only “restriction” provided by the claim is the indication that this determination is to be based on “attentions being placed correctly and with appropriate attention weights” but this is only further subjective language: it is not clear what it means for attentions to be “placed correctly” or for attention weights to be “appropriate”. The specification also provides no guidance on these points, and in fact, the specification even indicates specifically that the determination is to be made “by a human observer” [specification, 0041; emphasis added], further reinforcing that these limitations are explicitly envisioned by the applicant as being subjective determinations made by a human being. Moreover, the examiner notes that the claim recites only determining “whether” the original attention mask pattern does not make intuitive sense; it does not positively recite actually determining that the original attention mask pattern does not make intuitive sense, which means that it is unclear that the claim actually requires the limitation “creating the proposed attention mask pattern” (since that step is only required to occur “if the original attention mask pattern does not make intuitive sense”). For the purpose of compact prosecution, the claim is interpreted broadly as reciting any step of creating the proposed attention mask pattern in response to some criterion. 

Claims 3 and 11 recite: 
determining whether an attention mask updating function can be created to fulfill the attention pattern proposal; creating the attention mask updating function if the attention mask updating function can be created to fulfill the proposed attention mask pattern;
considering a generation of a new attention pattern proposal if the attention mask updating function cannot be created to fulfill the proposed attention mask pattern; and adopting a reinforcement learning (RL) model as the attention mask updating function if the attention mask updating function cannot be created to fulfill the proposed attention mask pattern and if a new attention pattern proposal cannot be obtained.
These limitations, as best understood by the examiner in view of the specification, appear to be an attempt to formulate a genus claim encompassing at least two species: (1) a species involving creating the attention mask updating function when this is possible, (2) a species involving reinforcement learning when it is impossible to both create an attention mask updating function and to obtain a new attention pattern proposal. However, there are several points of unclarity regarding this interpretation which render the claim indefinite: 
First, the “determining” step mentions fulfilling “the attention pattern proposal” [emphasis added], but the subsequent steps (which appear, at first glance, to recite steps which may occur in response to the result of that determination) instead mention fulfilling “the proposed attention mask pattern” [emphasis added]. This renders unclear whether those subsequent steps are in fact intended to be responsive to the “determining” step. The examiner suggests replacing every instance of “fulfill the proposed attention mask pattern” appearing in the claim with “fulfill the attention mask pattern proposal” for consistency and clarity. 
Second, it is not clear what it means to “fulfill” a proposal. It is possible to fulfill, for example, a requirement, but a proposal is not a requirement. The specification merely repeats the language about fulfilling proposals [specification, 0052-0054] without indicating what it means to “fulfill” a proposal. For the purpose of compact prosecution, the limitation is interpreted broadly as just requiring that the attention mask updating function be related in some way to the proposed attention mask pattern. 
Third, the claim recites “considering a generation of a new attention pattern proposal” but it is not clear what it means to “consider” a generation. The examiner suggests replacing this with “determining whether a new attention pattern proposal can be obtained” for clarity. 
Fourth, the two species as described above do not appear to exhaust the logically available set of possibilities; the claim does not appear to describe the species where the attention mask updating function cannot be created but a new attention pattern proposal can be obtained. 
Fifth, the claim recites “determining whether [the] attention mask updating function can be created” but the claims already inherit a positively recited step of “creating an attention mask updating function” from their respective parent claims. In other words, the claim, when viewed in light of its parents, appears to require that this “determining” step return an affirmative response, since the attention mask function has in fact been created. This suggests that the claim may force the interpretation as species (1) as described above, rendering unclear why the remaining limitations appear in the claim at all. 
The claims are indefinite in view of the above. For the purpose of compact prosecution, the claim is interpreted as encompassing at least species (1), i.e., the one where the attention mask updating function is in fact created. 
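
For illustration only, the fourth point above can be checked by enumerating the two conditions recited in claims 3 and 11. The sketch below is a reading aid and not part of the claims or the rejection; the function name claim3_steps and the step labels are hypothetical, and the branching simply follows the claim language as quoted above.

```python
from itertools import product

def claim3_steps(can_create, new_proposal_obtainable):
    """Return the steps the quoted claim language triggers for one case.

    Both flags are hypothetical stand-ins for the outcomes of the two
    conditions recited in the claim.
    """
    steps = []
    if can_create:
        steps.append("create updating function")
    else:
        steps.append("consider new proposal")
        if not new_proposal_obtainable:
            steps.append("adopt RL model as updating function")
    return steps

# Enumerate all four combinations of the two conditions.
cases = {flags: claim3_steps(*flags) for flags in product([True, False], repeat=2)}
for flags, steps in cases.items():
    print(flags, steps)
```

Under this reading, the case in which the updating function cannot be created but a new proposal can be obtained triggers only the “considering” step, so no attention mask updating function results in that case.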

Claims 8 and 16 recite:
the attention mask updating function is expressed as: 
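    [Formula image media_image1.png not reproduced. As characterized in the rejection below, it is a piecewise definition of f(W_t) over elements e of W, giving “e = 0” when abs(e_x - c_{t-1}) > threshold and “e = e” when abs(e_x - c_{t-1}) ≤ threshold.]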
where W_t represents attention weights at time t, c_{t-1} is an attention weight of a centroid of W_{t-1}, e_x is an attention weight on an element at x-distance from the centroid, and threshold is a pre-defined difference in attention weight threshold value.
The formula as written in the claim is not well-formed as a mathematical expression for several reasons. First, the value of a function on a given input should be a value, but the expression as written appears to indicate that the value of the function f on the input W_t is an equality (either “e = 0” or “e = e”) instead of a value, and it is not clear how to interpret an equality as a value. In fact, the formula involving f as it appears in parent claims 7 and 15 appears to require that the output of f be a matrix of certain dimensionality (in order for the formula appearing in those claims to be well-formed), but it is not at all clear how to interpret the formula given in claims 8 and 16 so that the output of f is a matrix of the required dimensionality. Second, the formula for the function does not appear to reference the input variable W_t at all; the variable W_t occurs nowhere on the right-hand side of the equation defining f(W_t). Third, while the formula appears to be an attempt at a piecewise definition of a function, the conditions on which the piecewise definition depends are also not well-formed: the set W is not defined in the claim or in the formula, and the element e of W over which the conditions are quantified appears nowhere in the conditions themselves (i.e., the conditions “abs(e_x - c_{t-1}) > threshold” and “abs(e_x - c_{t-1}) ≤ threshold” make no mention of the variable e). If the variable e_x appearing in the conditions is a typographical error for e, the conditions do not appear to be exhaustive (since it is possible for “abs(e - c_{t-1}) > threshold” to be true for some values of e and “abs(e - c_{t-1}) ≤ threshold” to be true for other values of e), and the “function” does not appear to be defined in those situations (a well-defined function is required to have a well-defined output for every element in its domain, so the expression for f does not appear to actually define a function). 
However, it is likely that e_x is not a typographical error for e, since an attempted definition of e_x is included in the text of the claim following the formula, but even there, the variable x is still left undefined and it is not clear to what x refers. The specification merely repeats the same formula [specification, 0046-0048] without providing any guidance on any of these points. Consequently, the claims are rejected for indefiniteness. For the purpose of compact prosecution, the claim is interpreted broadly as best understood by the examiner as encompassing any function f which acts on a matrix by setting certain entries of that matrix to 0 and maintaining the others as is. 
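
For illustration only, one function falling within the broad interpretation stated above (a function that sets certain entries of a matrix to 0 and keeps the others as is) can be sketched as follows. This is not the formula of claims 8 and 16, which is indefinite for the reasons given. The sketch assumes, without support in the claim, that every entry of W_t is compared directly against a scalar centroid weight; the names mask_update and centroid_weight are hypothetical.

```python
import numpy as np

def mask_update(weights, centroid_weight, threshold):
    """Zero every entry whose distance from centroid_weight exceeds
    threshold; keep all other entries unchanged (assumed reading)."""
    weights = np.asarray(weights, dtype=float)
    keep = np.abs(weights - centroid_weight) <= threshold
    return np.where(keep, weights, 0.0)

# Hypothetical attention weights at time t.
W_t = np.array([[0.50, 0.30, 0.20],
                [0.10, 0.45, 0.45]])
updated = mask_update(W_t, centroid_weight=0.45, threshold=0.10)
print(updated)
```

Because the comparison is applied entry by entry, the output has the same dimensions as the input, which is the property the formula in parent claims 7 and 15 appears to require of f.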

Claim Rejections - 35 USC 101
35 USC 101 reads as follows:
Whoever invents or discovers any new and useful process, machine, manufacture, or composition of matter, or any new and useful improvement thereof, may obtain a patent therefor, subject to the conditions and requirements of this title.

Claim(s) 9-16 is/are rejected under 35 USC 101 because the claimed invention is directed to non-statutory subject matter. The claims do not fall within at least one of the four categories of patent eligible subject matter because the claims are directed to a workflow-based neural network. A neural network is not a product that has a physical or tangible form; it is software per se. See MPEP 2106.03.

Claim Rejections - 35 USC 103
The following is a quotation of 35 USC 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

This application currently names joint inventors. In considering patentability of the claims the examiner presumes that the subject matter of the various claims was commonly owned as of the effective filing date of the claimed invention(s) absent any evidence to the contrary. Applicant is advised of the obligation under 37 CFR 1.56 to point out the inventor and effective filing dates of each claim that was not commonly owned as of the effective filing date of the later invention in order for the examiner to consider the applicability of 35 USC 102(b)(2)(C) for any potential 35 USC 102(a)(2) prior art against the later invention.

Claim(s) 1-16 is/are rejected under 35 USC 103 as being unpatentable over Andrea GALASSI et al. (Attention in Natural Language Processing, published 2020-05-28; hereafter, “Galassi”) in view of Hongqiu WU et al. (Not All Attention Is All You Need, published 2021-06-01; hereafter, “Wu”).

Claim 1
Galassi discloses: 
A method for optimizing a workflow-based neural network including an attention layer, the method comprising: ([Galassi, abstract]: Galassi discloses the use of attention mechanisms in neural architectures [Galassi, abstract]. Such a neural network maps to the “workflow-based neural network” of the claim, and the attention mechanism to the “attention layer” of the claim.)
training the workflow-based neural network to predict one or more results from one or more input elements under a prediction model ([Galassi, sections II.A and IV.A]: Galassi discloses a use of the neural network having an attention mechanism to produce an output sequence based on an input sequence x [Galassi, section II.A; see also, figure 4]. Galassi also discloses that attention mechanisms are trained alongside the rest of the neural architecture [Galassi, section IV.A first paragraph]. The neural network also maps to the “prediction model” of the claim; its training maps to the “training” step of the claim, the input sequence to the “input elements” of the claim, and the output sequence to the “one or more results” of the claim.) with the attention layer assigning one or more attention placements and weights, ([Galassi, figure 3 and section II.B]: The attention mechanism disclosed in Galassi produces a vector of “attention weights” [Galassi, figure 3]. The vector of attention weights maps to the “one or more attention placements and weights” of the claim (the elements of the vector being the “weights” and the indices at which they are located being the “placements”). The examiner notes that Galassi indicates that attention weights are computed as a = g(e) where e = f(q, K) [Galassi, section II.B equations (6-7)], where g is a distribution function, f is a compatibility function, q is a query vector, and K are the keys [Galassi, figure 3; see also, table II].) based on an original attention function of the attention layer, ([Galassi, figure 4 and section II.B]: Galassi indicates that attention weights are combined with V to obtain the context vector [Galassi, figure 4 and section II.B equations (8-9)]. The overall attention model, which takes keys, queries, and values as input and outputs a context vector [Galassi, figure 4], maps to the “original attention function” of the claim.) 
to the input elements ([Galassi, figure 4 and section II.B]: In the attention mechanism [Galassi, figure 4], the “one or more attention placements and weights” as mapped above are assigned to the “input elements” as mapped above.) until the prediction model converges; ([Galassi, section IV.A]: As noted above, Galassi discloses training the neural network [Galassi, section IV.A first paragraph]. The point when training concludes is when the “prediction model converges” as recited by the claim. The examiner notes that the reference Wu used in the combination as proposed below discusses convergence more explicitly; see, for instance, [Wu, figure 4 and section 6.2].)

While Galassi briefly discusses positional masks [Galassi, section IV.C], it might not distinctly disclose:
generating an attention mask pattern proposal to obtain an original attention mask pattern and a proposed attention mask pattern; 
creating an attention mask updating function based on the original attention mask pattern and the proposed attention mask pattern; 
and combining the attention mask updating function with the original attention function to form an updated attention function of the attention layer. 

Wu is in the field of machine learning. Like Galassi, it discusses attention mechanisms [Wu, abstract] and more specifically mentions scaled dot-product attention [Galassi, table IV and section IV.C; Wu, section 3.1 equation (1)] (cf. “original attention function” of the claim). Moreover, Galassi in view of Wu discloses: 
generating an attention mask pattern proposal to obtain an original attention mask pattern and a proposed attention mask pattern; ([Wu, sections 1 and 3]: Wu discloses a method which “dynamically generate[s] dropout patterns for each attention layer” [Wu, section 1 paragraph beginning “Dropout”]. Each dropout pattern is determined by a mask matrix M [Wu, section 3 first paragraph; see also, section 4 paragraph beginning “Generator”], and the mechanism performs “element-wise multiplication” of Softmax(QK^T/sqrt{d_k}) and M [Wu, section 3.2 paragraph beginning “Weights Dropout” and equation (2)]. The mask matrix M maps to the “attention mask pattern proposal” of the claim, the matrix Softmax(QK^T/sqrt{d_k}) maps to the “original attention mask pattern” of the claim, and the result of element-wise multiplication to the “proposed attention mask pattern” of the claim.) 
creating an attention mask updating function based on the original attention mask pattern and the proposed attention mask pattern; ([Wu, section 3.2]: As noted above, Wu discloses a method of dropout involving “element-wise multiplication” of Softmax(QK^T/sqrt{d_k}) by the mask matrix M [Wu, section 3.2 paragraph beginning “Weights Dropout” and equation (2)]. Element-wise multiplication by M maps to the “attention mask updating function” of the claim. This function is “based on the original attention mask pattern and the proposed attention mask pattern” as mapped above since the “original attention mask pattern” as mapped above is its input and the “proposed attention mask pattern” as mapped above is its output.)
and combining the attention mask updating function with the original attention function to form an updated attention function of the attention layer. ([Wu, section 3.2]: The overall attention function [Wu, section 3.2 equation (2)] maps to the “updated attention function” of the claim. It is a result of “combining” the “original attention function” as mapped above (i.e., the scaled dot-product attention of [Wu, section 3.1 equation (1)]) with the “attention mask updating function” as mapped above (i.e., element-wise multiplication by the mask matrix M).)
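
For illustration only, the mapping above can be sketched numerically: scaled dot-product attention produces Softmax(QK^T/sqrt{d_k}), which is multiplied element-wise by a mask matrix M before being applied to V. This is a minimal sketch under stated assumptions and is not Wu's implementation; the function names, shapes, and mask values are hypothetical, and no rescaling of the retained weights is performed because the passages cited above do not address it.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row maximum for numerical stability.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention_pattern(Q, K):
    # Softmax(QK^T / sqrt(d_k)): the "original attention mask pattern" as mapped.
    d_k = Q.shape[-1]
    return softmax(Q @ K.T / np.sqrt(d_k))

def masked_attention(Q, K, V, M):
    # Element-wise multiplication by M: the "attention mask updating function"
    # as mapped; its result is the "proposed attention mask pattern".
    proposed = attention_pattern(Q, K) * M
    return proposed @ V

rng = np.random.default_rng(0)
Q = rng.normal(size=(3, 4))
K = rng.normal(size=(3, 4))
V = rng.normal(size=(3, 2))
M = np.array([[1, 0, 1],
              [1, 1, 0],
              [0, 1, 1]], dtype=float)  # hypothetical dropout mask
out = masked_attention(Q, K, V, M)
print(out.shape)
```

With M set to all ones, masked_attention reduces to unmasked scaled dot-product attention, which is the sense in which the updated function combines the original function with the mask step.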

Before the effective filing date of the invention, it would have been obvious to a person of ordinary skill in the art to combine attention mechanisms as disclosed by Galassi with the method of generating dropout masks disclosed by Wu because the “proposed approach is universal and qualified to enable more robust task-specific tuning” [Wu, section 7], thereby resulting in a more effective system overall. 

Claim 2
Galassi in view of Wu discloses the elements of the parent claim(s). It also discloses: 
[The method according to claim 1, wherein the generation of attention pattern proposal comprising:] identifying one or more elements of the input elements based on one or more features extracted from the input elements; ([Galassi, section IV.C]: Galassi indicates that masks are used “to focus the attention only on a specific portion of the input” because “relevant features are found in a neighborhood of a certain position” [Galassi, section IV.C paragraph beginning “In some tasks”]. The specific portion of the input on which attention is focused using the masks of Wu maps to the “one or more elements” of the claim, and the relevant features map to the “one or more features” of the claim.)
visualizing the attention layer to identify the original attention mask pattern based on one or more attentions being placed by the attention layer in relation to the identified elements of the input elements and one or more original prediction results; ([Galassi, figures 3-7; Wu, figure 2]: The figures [Galassi, figures 3-7] and [Wu, figure 2] are all visualizations of attention mechanisms, including the attention weights and their relationship to the inputs.)
determining whether the original attention mask pattern makes intuitive sense, comprising: determining whether the attentions are being placed correctly and with appropriate attention weights in relation to the identified elements of the input elements and the original prediction results; and creating the proposed attention mask pattern based on one or more desired attention placements and attention weights on the input elements if the original attention mask pattern does not make intuitive sense. ([Wu, sections 1 and 4]: As noted above, Wu discloses a method which “dynamically generate[s] dropout patterns for each attention layer” [Wu, section 1 paragraph beginning “Dropout”]. More precisely, it describes a “Generator” or “G-Net” which “generat[es] a mask matrix for each attention layer” [Wu, section 4.1 paragraph beginning “Generator”]. The mask matrices it generates are provided to A-Net but not D-Net, and G-Net is rewarded positively when A-Net performs better than D-Net and negatively otherwise [Wu, section 4.1 paragraph beginning “Generator”; see also, section 4.2 algorithm 1]. In other words, a comparison between A-Net and D-Net maps to the “determining” step of the claim, and the generation of a new mask by G-Net after an iteration in which A-Net does not perform as well as D-Net maps to the “creating” step of the claim as best understood by the examiner in view of the 112(b) rejections (since A-Net not performing as well as D-Net can be viewed as the “original attention mask pattern” not making “intuitive sense”).)

The same motivation to combine applies. 

Claim 3
Galassi in view of Wu discloses the elements of the parent claims. It also discloses: 
[The method according to claim 1, wherein the creation of the attention mask updating function comprising:] designing a deviation function to obtain a deviation value representing a quantifiable deviation between the original attention mask pattern deviates and the proposed attention mask pattern; ([Wu, section 4.1]: As noted above, Wu discloses computing rewards for the G-Net based on a comparison between the A-Net (which uses the mask generated by the G-Net) and the D-Net (which does not) [Wu, section 4.1]. In other words, the reward maps to the “deviation value” and the “quantifiable deviation between the original attention mask pattern… and the proposed attention mask pattern” (with the “original attention mask pattern” and the “proposed attention mask pattern” being as mapped under the parent claim). The function computing the reward is the “deviation function” of the claim.)
determining whether an attention mask updating function can be created ([Wu, algorithm 1 and section 4.1]: Wu discloses an iterative process in which, “for each training step” [Wu, algorithm 1 line 2], the algorithm uses the G-Net to produce a mask to apply to the A-Net [Wu, algorithm 1 line 4 and section 4.1]. The algorithm stops creating masks when this “for” loop [Wu, algorithm 1 lines 2-11] terminates; i.e., when the “for” loop has not yet terminated, a mask and thus the “attention mask updating function” can be created, and when the “for” loop has terminated, a mask and thus the “attention mask updating function” cannot be created. In other words, the check regarding whether the “for” loop has terminated [Wu, algorithm 1 line 2] falls under the broadest reasonable interpretation of “determining whether an attention mask updating function can be created” as recited by the claim.) to fulfill the attention pattern proposal; ([Wu, section 3.2]: As noted above, the mask matrix M maps to the “attention [mask] pattern proposal” of the claim, and element-wise multiplication by M maps to the “attention mask updating function” of the claim. Thus the “attention mask updating function” which Wu discloses has the property that it “fulfill[s] the attention [mask] pattern proposal” in the sense that it is related to the “attention [mask] pattern proposal” as mapped above; cf. 112(b) rejections.)
creating the attention mask updating function if the attention mask updating function can be created to fulfill the proposed attention mask pattern; considering a generation of a new attention pattern proposal if the attention mask updating function cannot be created to fulfill the proposed attention mask pattern; and adopting a reinforcement learning (RL) model as the attention mask updating function if the attention mask updating function cannot be created to fulfill the proposed attention mask pattern and if a new attention pattern proposal cannot be obtained. ([Wu, section 3.2]: As noted above, Wu already discloses “creating the attention mask updating function”, which means that it also discloses “creating the attention mask updating function if the attention mask updating function can be created to fulfill the attention [mask] pattern proposal” as required by the first of the at least two species encompassed by these limitations as best understood by the examiner in view of the 112(b) rejections. MPEP 2131.02(I) indicates that a species always anticipates a genus. In other words, the genus claim as presented cannot be allowed over the species disclosed by Galassi in view of Wu. The examiner notes that Wu also discloses iteratively generating new actions a_t, i.e., new mask matrices [Wu, section 4], and that the rewards mechanism of Wu is a reinforcement learning system.) 

The same motivation to combine applies. 

Claim 4
Galassi in view of Wu discloses the elements of the parent claim(s). It also discloses: 
[The method according to claim 1, wherein the combining of the attention mask updating function with the original attention function to form the updated attention function of the attention layer comprises:] training the workflow-based neural network to learn through backpropagation with an auxiliary loss function defined with the original attention function and the attention mask updating function. ([Galassi, section IV.A; Wu, section 4]: As noted under the parent claim, Galassi discloses training the neural network [Galassi, section IV.A first paragraph]. Wu discusses more details about training [Wu, section 4]. The function J(theta_G) [Wu, section 4.2 second displayed equation] maps to the “auxiliary loss function” of the claim; it is defined in terms of the rewards r_t provided to G-Net [Wu, section 4.2 first paragraph], which are based on the comparison between A-Net and D-Net [Wu, section 4.1 paragraph beginning “Generator”], which means that J(theta_G) is “defined with the original attention function and the attention mask updating function” as required by the claim. The use of “policy gradient to update theta_G” [Wu, section 4.2 first paragraph] (i.e., using [Wu, section 4.2 equation (5)]) falls under the broadest reasonable interpretation of “learn[ing] through backpropagation with an auxiliary loss function” as recited by the claim.)

The same motivation to combine applies. 

Claim 5
Galassi in view of Wu discloses the elements of the parent claim(s). It also discloses: 
[The method according to claim 1, wherein the combining of the attention mask updating function with the original attention function to form the updated attention function of the attention layer comprises:] directly applying the attention mask updating function to the original attention mask pattern in the attention layer. ([Wu, section 3.2]: In Wu, element-wise multiplication by M (i.e., the “attention mask updating function” of the claim) is applied directly to the softmax expression (i.e., the “original attention mask pattern” of the claim).)

The same motivation to combine applies. 

Claim 6
Galassi in view of Wu discloses the elements of the parent claim(s). It also discloses: 
[The method according to claim 1, wherein] the original attention function is a scaled dot-product attention function ([Wu, section 3.1]: Wu discloses scaled dot-product attention [Wu, section 3.1 equation (1)]. The examiner notes that scaled dot-product attention can also be found in Galassi by taking the compatibility function to be the “scaled multiplicative” one [Galassi, table IV] and the distribution function to be the “softmax function” [Galassi, section IV.C second paragraph].) having input including a key matrix of queries and a key matrix of keys, both of a dimension d_k, and a value matrix of values; ([Wu, section 3.1]: The matrices Q, K, and V of Wu map, respectively, to the “key matrix of queries”, the “key matrix of keys”, and the “value matrix of values” of the claim. The variable d_k of Wu corresponds to the identically named variable of the claim. The examiner notes that Wu cites the reference Vaswani for details about scaled dot-product attention [Wu, section 3 first paragraph]; the reference Vaswani, made of record in the conclusion of this Office action, indicates more explicitly that d_k is the dimension of Q and K.)
wherein the attention weights are obtained by computing dot products of the queries and the keys, then dividing each of the dot products by a square root of d_k; ([Wu, section 3.1]: The computation of QK^T in [Wu, section 3.1 equation (1)] is a computation of “dot products of the queries and the keys” as recited by the claim, and the division by the square root of d_k in [Wu, section 3.1 equation (1)] is “dividing each of the dot products by a square root of d_k” as recited by the claim.)
wherein the obtained attention weights are applied to the values to obtain the attention function. ([Wu, section 3.1]: The dot product with V [Wu, section 3.1 equation (1)] maps to “appl[ying the obtained attention weights] to the values” as recited by the claim.)
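The three operations mapped above (dot products of queries and keys, division by the square root of d_k, and application of the resulting weights to the values) can be sketched in plain Python as follows. This is an illustration under stated assumptions, not code from Wu, Galassi, or Vaswani: the function names and the toy matrices are hypothetical.

```python
import math

def softmax(row):
    """Numerically stable softmax over one list of scores."""
    peak = max(row)
    exps = [math.exp(s - peak) for s in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention_weights(Q, K):
    """Softmax(QK^T/sqrt{d_k}): one row of attention weights per query."""
    d_k = len(K[0])  # shared dimension of each query vector and key vector
    scale = math.sqrt(d_k)
    return [
        softmax([sum(q_i * k_i for q_i, k_i in zip(q, k)) / scale for k in K])
        for q in Q
    ]

def scaled_dot_product_attention(Q, K, V):
    """Apply the attention weights to the values (weights times V)."""
    weights = attention_weights(Q, K)
    d_v = len(V[0])
    return [
        [sum(w * v[j] for w, v in zip(w_row, V)) for j in range(d_v)]
        for w_row in weights
    ]

# Toy input (illustrative values only): 2 queries, 2 keys, d_k = 2.
Q = [[1.0, 0.0], [0.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0]]
W = attention_weights(Q, K)
out = scaled_dot_product_attention(Q, K, V)
```

Each row of `W` sums to 1, and each query places the larger weight on the key it aligns with, so each output row is a weighted average of the rows of `V`.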

The same motivation to combine applies. 

Claim 7
Galassi in view of Wu discloses the elements of the parent claim(s). It also discloses: 
[The method according to claim 6, wherein] the original attention function is expressed as: 
    [Equation image media_image2.png: Attention(Q, K, V) = Softmax(QK^T/sqrt{d_k}) V]
wherein Q represents a key matrix of queries, K represents a key matrix of keys, d_k represents a dimension of both the key matrix of queries and the key matrix of keys, and V represents a value matrix of values; ([Wu, section 3.1]: The formula [Wu, section 3.1 equation (1)] is identical to the formula appearing in the claim.)
and wherein the updated attention function is expressed as: 
    [Equation image: media_image3.png]
wherein f() represents the attention mask updating function. ([Wu, section 3.2]: As noted under the parent claims, element-wise multiplication by the mask matrix M maps to the “attention mask updating function” of the claim. In other words, taking the function f of the claim to be element-wise multiplication by M, the formula [Wu, section 3.2 equation (2)] is identical to the formula appearing in the claim.)
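The examiner's reading of f() as element-wise multiplication by a binary mask matrix M can be illustrated with a short sketch. The function names, the weight matrix, the mask, and the values are all hypothetical, chosen only so that the arithmetic is easy to follow. The sketch does not renormalize the masked weights; this Office action does not state whether Wu does.

```python
def apply_mask(weights, mask):
    """Element-wise product of an attention-weight matrix and a 0/1 mask."""
    return [[w * m for w, m in zip(w_row, m_row)]
            for w_row, m_row in zip(weights, mask)]

def attend(weights, values):
    """Matrix product of the weights and the values."""
    d_v = len(values[0])
    return [[sum(w * v[j] for w, v in zip(w_row, values)) for j in range(d_v)]
            for w_row in weights]

# Illustrative softmax output for 2 queries over 3 keys (each row sums to 1).
W = [[0.5, 0.25, 0.25],
     [0.25, 0.5, 0.25]]
# Hypothetical binary mask: drop key 2 for query 0 and key 0 for query 1.
M = [[1, 1, 0],
     [0, 1, 1]]
V = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]

masked = apply_mask(W, M)          # weights at masked positions become 0
original_out = attend(W, V)        # output of the unmasked attention
updated_out = attend(masked, V)    # output after applying the mask
```

Entries of `W` where `M` is 0 are set to 0 and the remaining entries are kept as is, so only the unmasked positions contribute to `updated_out`.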

The same motivation to combine applies. 

Claim 8
Galassi in view of Wu discloses the elements of the parent claim(s). It also discloses: 
[The method according to claim 7, wherein] the attention mask updating function is expressed as: 
    [Equation image: media_image1.png]
where W_t represents attention weights at time t, c_{t-1} is an attention weight of a centroid of W_{t-1}, e_x is an attention weight on an element at x-distance from the centroid, and threshold is a pre-defined difference in attention weight threshold value. ([Wu, section 3.2]: In Wu, the mask matrix M is a “binary matrix with elements in {0, 1}” [Wu, section 3.2 paragraph beginning “Weights Dropout”] and element-wise multiplication by such a matrix has the effect of setting certain weights to 0 (namely, the ones corresponding to entries of M which are equal to 0) and maintaining the others as is. The formula given in Wu thus falls under the broadest reasonable interpretation of this claim as best understood by the examiner in view of the 112(b) rejections.)

The same motivation to combine applies. 

Claim 9
Galassi discloses: 
A workflow-based neural network including an attention layer; ([Galassi, abstract]: Galassi discloses the use of attention mechanisms in neural architectures [Galassi, abstract]. Such a neural network maps to the “workflow-based neural network” of the claim, and the attention mechanism to the “attention layer” of the claim.)
wherein the workflow-based neural network is trained to predict one or more results from one or more input elements under a prediction model ([Galassi, sections II.A and IV.A]: Galassi discloses a use of the neural network having an attention mechanism to produce an output sequence based on an input sequence x [Galassi, section II.A; see also, figure 4]. Galassi also discloses that attention mechanisms are trained alongside the rest of the neural architecture [Galassi, section IV.A first paragraph]. The neural network also maps to the “prediction model” of the claim; its training maps to the “training” step of the claim, the input sequence to the “input elements” of the claim, and the output sequence to the “one or more results” of the claim.) with the attention layer assigning one or more attention placements and weights, ([Galassi, figure 3 and section II.B]: The attention mechanism disclosed in Galassi produces a vector of “attention weights” [Galassi, figure 3]. The vector of attention weights maps to the “one or more attention placements and weights” of the claim (the elements of the vector being the “weights” and the indices at which they are located being the “placements”). The examiner notes that Galassi indicates that attention weights are computed as a = g(e) where e = f(q, K) [Galassi, section II.B equations (6-7)], where g is a distribution function, f is a compatibility function, q is a query vector, and K are the keys [Galassi, figure 3; see also, table II].) based on an original attention function of the attention layer, ([Galassi, figure 4 and section II.B]: Galassi indicates that attention weights are combined with V to obtain the context vector [Galassi, figure 4 and section II.B equations (8-9)]. The overall attention model, which takes keys, queries, and values as input and outputs a context vector [Galassi, figure 4], maps to the “original attention function” of the claim.) 
to the input elements ([Galassi, figure 4 and section II.B]: In the attention mechanism [Galassi, figure 4], the “one or more attention placements and weights” as mapped above are assigned to the “input elements” as mapped above.) until the prediction model converges; ([Galassi, section IV.A]: As noted above, Galassi discloses training the neural network [Galassi, section IV.A first paragraph]. The point when training concludes is when the “prediction model converges” as recited by the claim. The examiner notes that the reference Wu used in the combination as proposed below discusses convergence more explicitly; see, for instance, [Wu, figure 4 and section 6.2].)

While Galassi briefly discusses positional masks [Galassi, section IV.C], it might not distinctly disclose:
and wherein the attention layer having an original attention function that is updated to form an updated attention layer by: generating an attention mask pattern proposal to obtain an original attention mask pattern and a proposed attention mask pattern; 
creating an attention mask updating function based on the original attention mask pattern and the proposed attention mask pattern; 
and combining the attention mask updating function with the original attention function to update the original attention function to form the updated attention layer.  

Wu is in the field of machine learning. Like Galassi, it discusses attention mechanisms [Wu, abstract] and more specifically mentions scaled dot-product attention [Galassi, table IV and section IV.C; Wu, section 3.1 equation (1)] (cf. “original attention function” of the claim). Moreover, Galassi in view of Wu discloses:
and wherein the attention layer having an original attention function that is updated to form an updated attention layer by: generating an attention mask pattern proposal to obtain an original attention mask pattern and a proposed attention mask pattern; ([Wu, sections 1 and 3]: Wu discloses a method which “dynamically generate[s] dropout patterns for each attention layer” [Wu, section 1 paragraph beginning “Dropout”]. Each dropout pattern is determined by a mask matrix M [Wu, section 3 first paragraph; see also, section 4 paragraph beginning “Generator”], and the mechanism performs “element-wise multiplication” of Softmax(QK^T/sqrt{d_k}) and M [Wu, section 3.2 paragraph beginning “Weights Dropout” and equation (2)]. The mask matrix M maps to the “attention mask pattern proposal” of the claim, the matrix Softmax(QK^T/sqrt{d_k}) maps to the “original attention mask pattern” of the claim, and the result of element-wise multiplication to the “proposed attention mask pattern” of the claim.)
creating an attention mask updating function based on the original attention mask pattern and the proposed attention mask pattern; ([Wu, section 3.2]: As noted above, Wu discloses a method of dropout involving “element-wise multiplication” of Softmax(QK^T/sqrt{d_k}) by the mask matrix M [Wu, section 3.2 paragraph beginning “Weights Dropout” and equation (2)]. Element-wise multiplication by M maps to the “attention mask updating function” of the claim. This function is “based on the original attention mask pattern and the proposed attention mask pattern” as mapped above since the “original attention mask pattern” as mapped above is its input and the “proposed attention mask pattern” as mapped above is its output.)
and combining the attention mask updating function with the original attention function to update the original attention function to form the updated attention layer. ([Wu, section 3.2]: The overall attention function [Wu, section 3.2 equation (2)] maps to the “updated attention function” of the claim. It is a result of “combining” the “original attention function” as mapped above (i.e., the scaled dot-product attention of [Wu, section 3.1 equation (1)]) with the “attention mask updating function” as mapped above (i.e., element-wise multiplication by the mask matrix M). The attention layer using this attention function maps to the “updated attention layer” of the claim.) 

Before the effective filing date of the invention, it would have been obvious to a person of ordinary skill in the art to combine attention mechanisms as disclosed by Galassi with the method of generating dropout masks disclosed by Wu because the “proposed approach is universal and qualified to enable more robust task-specific tuning” [Wu, section 7], thereby resulting in a more effective system overall. 

Dependent claims 10-16 inherit limitations from claim 9 and recite additional limitations which are substantially similar to those recited by claims 2-8, so they are rejected by the same rationale.

Conclusion
The prior art made of record and not relied upon is considered pertinent to applicant's disclosure.
Ashish VASWANI et al. (Attention is All You Need, published 2017-12-06; hereafter, “Vaswani”) discloses scaled dot-product attention [Vaswani, section 3.2.1], including the fact that the scaling factor d_k is the dimension of Q and K [Vaswani, section 3.2.1 first paragraph]. 
Biswarup CHOUDHURY et al. (US20240127049A1, effectively filed 2022-06-17; hereafter, “Choudhury”) discloses an “attention-based neural network” [Choudhury, abstract] which iteratively updates attention masks [Choudhury, 0025]. 

Any inquiry concerning this communication or earlier communications from the examiner should be directed to Shishir AGRAWAL whose telephone number is +1 703-756-1183. The examiner can normally be reached Monday through Thursday, 08:30-14:30 Pacific Time.

Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.

If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Alexey SHMATOV can be reached on +1 571-270-3428. The fax phone number for the organization where this application or proceeding is assigned is +1 571-273-8300.

Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at +1 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call +1 800-786-9199 (IN USA OR CANADA) or +1 571-272-1000.

/S.A./Examiner, Art Unit 2123


/ALEXEY SHMATOV/Supervisory Patent Examiner, Art Unit 2123