Detailed Action
Notice of Pre-AIA or AIA status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Priority
Receipt is acknowledged of certified copies of papers submitted under 35 U.S.C. § 119(a)–(d), which papers have been placed of record in the file.
Information Disclosure Statement
The information disclosure statement filed on October 9, 2023 complies with the provisions of 37 C.F.R. §§ 1.97 and 1.98 and MPEP § 609, and therefore has been placed in the application file. The information referred to therein has been considered as to the merits.
The information disclosure statement filed on February 7, 2024 fails to comply with the provisions of 37 C.F.R. §§ 1.97 and 1.98 and MPEP § 609 because the title of non-patent literature document no. 1 is incorrect. It has been placed in the application file, but the information referred to therein has not been considered as to the merits. Applicant is advised that the date of any re-submission of any item of information contained in this information disclosure statement or the submission of any missing element(s) will be the date of submission for purposes of determining compliance with the requirements based on the time of filing the statement, including all certification requirements for statements under 37 C.F.R. § 1.97(e). See MPEP § 609.05(a).
Specification
The Office objects to the specification for having the following informalities:
(1) The title of the invention is not descriptive. A new title is required that is clearly indicative of the invention to which the claims are directed.
(2) In paragraph 229 (page 40, line 14), there is a typographical error, likely caused by a defect in the Applicant’s word-processing software, that renders the gating module parameter as a question mark. Based on the Examiner’s understanding of the subject matter (and on line 11 of the same paragraph), it is believed that the question mark should be replaced with a lower-case italic theta (“a skeleton model is fixed, and the gating module parameter θ is updated based on,” etc.).
Appropriate correction is required.
Claim Rejections – 35 U.S.C. § 101
35 U.S.C. § 101 reads as follows:
Whoever invents or discovers any new and useful process, machine, manufacture, or composition of matter, or any new and useful improvement thereof, may obtain a patent therefor, subject to the conditions and requirements of this title.
I. Claim 18 is directed to non-statutory subject matter.
Claim 18 is rejected under 35 U.S.C. § 101 because the claimed invention is directed to non-statutory subject matter. The claim does not fall within at least one of the four categories of patent-eligible subject matter because it is directed to a “computer storage medium.” The plain meaning of “computer storage medium” (and therefore its broadest reasonable interpretation) includes embodiments that are pure signals per se. See Ex parte Mewherter, Appeal No. 2012-007692 (PTAB May 8, 2013) (precedential).
Signals per se do not fall within any of the four categories listed in 35 U.S.C. § 101, and therefore, are not eligible for patenting.
II. Claims 1–18 are directed to a judicial exception without significantly more.
Claims 1–18 are rejected under 35 U.S.C. § 101 because the claimed invention is directed to a judicial exception (i.e., a law of nature, a natural phenomenon, or an abstract idea) without significantly more.
The Office applies a four-part test when examining claims for subject-matter eligibility under § 101. First, the claimed invention must be directed to one of the four statutory categories explicitly listed in § 101 (Step 1). MPEP § 2106(I). Then, the claimed invention is analyzed to determine whether it recites one of § 101’s judicial exceptions (Step 2A Prong One); if so, whether the exception is integrated into a practical application (Step 2A Prong Two); and, if not, whether the claim recites additional elements amounting to significantly more than the exception (Step 2B). MPEP § 2106(I).
With this framework in mind, the claims will now be analyzed for subject matter eligibility under § 101.
Claim 1
Claim 1 is rejected under 35 U.S.C. 101 because the claimed invention is directed to an abstract idea without significantly more.
Claim 1 is considered a “process” within the meaning of 35 U.S.C. § 101. However, the claim recites a mathematical concept, and mathematical concepts are a category of abstract ideas, which are judicial exceptions to 35 U.S.C. § 101. See MPEP § 2106.04(a). Specifically, the claim recites “obtaining,” by any means: to-be-processed data, a target neural network model, and weight values for a task, and then applying the model to the data with the weights by “performing a target operation on an output of the first attention head and the first weight value” or “perform[ing] a target operation on an output of the target FFN and the second weight value.”
The target neural network model and its transformer layer are simply a description of a mathematical relationship between the model’s inputs and its outputs. Indeed, by definition, every node of a neural network is simply a mathematical formula or equation that relates the numbers/vectors it receives as input to the numbers/vectors it outputs. Likewise, the claimed “target operation” (which claim 9 confirms is a product operation, i.e., multiplication) is a mathematical calculation.
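For illustration only, the mathematical character of a single neural-network node can be expressed directly as executable arithmetic (the weights, bias, and sigmoid activation below are hypothetical values chosen by the Examiner, not drawn from the application):

```python
import math

# A single neural-network node is just an equation: a weighted sum of its
# inputs plus a bias, passed through a fixed nonlinear function.
def node(inputs, weights, bias):
    z = sum(w * x for w, x in zip(weights, inputs)) + bias  # linear combination
    return 1.0 / (1.0 + math.exp(-z))                       # sigmoid activation

# Evaluating the node is evaluating a closed-form mathematical expression.
y = node([1.0, 2.0], [0.5, -0.25], 0.0)  # z = 0.5 - 0.5 = 0, so y = 0.5
```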
This judicial exception is not integrated into a practical application. The claim recites no additional elements beyond the mathematical calculation itself. The recitation of the “target neural network model” and its components (transformer layer, residual branches, attention head, FFN) merely describes the mathematical model and framework within which the calculation is performed. The claim as a whole does not integrate the abstract idea into a practical application because it does not impose any meaningful limits on practicing the abstract idea. The claim is directed to an abstract idea.
The claim does not include additional elements that are sufficient to amount to significantly more than the judicial exception. The claim, as a whole, is merely an instruction to apply the mathematical calculation. This amounts to no more than mere instructions to apply the exception and does not provide an inventive concept.
Accordingly, claim 1 is rejected under 35 U.S.C. § 101 as being directed to a judicial exception without significantly more.
Claim 2
Claim 2 is rejected under 35 U.S.C. 101 because the claimed invention is directed to an abstract idea without significantly more.
The claim depends from claim 1 and recites the abstract idea (mathematical relationship and mathematical calculation) identified therein. The claim adds limitations for training, specifically “obtaining a loss function” and “updating the first weight value according to the loss function.” These added steps are themselves mathematical calculations. (See Spec. ¶¶ 229–230).
This judicial exception is not integrated into a practical application. The additional elements (training steps) are themselves mathematical concepts. Adding one abstract idea (training calculations) to another (the target operation calculation) does not integrate the exception into a practical application. The claim is directed to an abstract idea.
The claim does not include additional elements that are sufficient to amount to significantly more than the judicial exception. The additional elements are just more mathematical calculations. Adding more abstract ideas does not add “significantly more” to the exception recited in claim 1.
Claim 3
Claim 3 is rejected under 35 U.S.C. 101 because the claimed invention is directed to an abstract idea without significantly more.
The claim depends from claim 1 and recites the abstract idea (mathematical calculation) identified therein. The claim adds limitations for training, specifically “obtaining a loss function” and “updating the second weight value according to the loss function.” These added steps are themselves mathematical calculations.
This judicial exception is not integrated into a practical application. The additional elements (training steps) are themselves mathematical concepts. Adding one abstract idea (training calculations) to another (the target operation calculation) does not integrate the exception into a practical application. The claim is directed to an abstract idea.
The claim does not include additional elements that are sufficient to amount to significantly more than the judicial exception. The additional elements are just more mathematical calculations. Adding more abstract ideas does not add “significantly more” to the exception recited in claim 1.
Claim 4
Claim 4 is rejected under 35 U.S.C. 101 because the claimed invention is directed to an abstract idea without significantly more.
The claim depends from claim 1 and recites the abstract idea (mathematical calculation) identified therein. The claim adds the limitation of “obtaining, based on a preset mapping relationship, the weight value corresponding to the target task, wherein the preset mapping relationship comprises a correspondence between a task and a weight value.”
This judicial exception is not integrated into a practical application. The added limitation merely describes the source of the data (the weight value) used in the mathematical calculation. This is, at best, a data-gathering step, which is considered insignificant extra-solution activity. Moreover, depending on the broadest reasonable interpretation, the preset mapping relationship may even be defined in the mind or with pen and paper, meaning it should not even be treated as an additional element within the 35 U.S.C. § 101 framework. This step does not integrate the mathematical calculation from claim 1 into a practical application. The claim is directed to an abstract idea.
The claim does not include additional elements that are sufficient to amount to significantly more than the judicial exception. The additional element of “obtaining based on a preset mapping relationship” is insignificant extra-solution activity. It does not add “significantly more” to the abstract idea.
Claim 5
Claim 5 is rejected under 35 U.S.C. 101 because the claimed invention is directed to an abstract idea without significantly more.
The claim depends from claim 1 and recites the abstract idea (mathematical calculation) identified therein. The claim adds limitations for “inputting an identifier of the target task and at least one of the to-be-processed data or the output of the first attention head into a first neural network to obtain the first weight value; or inputting an identifier of the target task and at least one of the to-be-processed data and the output of the target FFN into a second neural network to obtain the second weight value.”
This judicial exception is not integrated into a practical application. The additional elements (using a first or second neural network to obtain a weight) are themselves mathematical models/calculations used to generate an input for the mathematical calculation of claim 1. Adding one abstract idea to another does not integrate the exception into a practical application. The claim is directed to an abstract idea.
The claim does not include additional elements that are sufficient to amount to significantly more than the judicial exception. The additional element is just another mathematical model/calculation. This does not add “significantly more” to the abstract idea of claim 1.
Claim 6
Claim 6 is rejected under 35 U.S.C. 101 because the claimed invention is directed to an abstract idea without significantly more.
The claim depends from claim 5 and recites the abstract ideas (mathematical calculations) identified in claims 1 and 5. The claim adds limitations describing the training of the “first neural network” from claim 5. This training, which involves updating an initial neural network, is itself a mathematical calculation.
This judicial exception is not integrated into a practical application. The additional element is the mathematical calculation of training the first neural network. Adding further mathematical calculations to the abstract ideas of claims 1 and 5 does not integrate the exceptions into a practical application. The claim is directed to an abstract idea.
The claim does not include additional elements that are sufficient to amount to significantly more than the judicial exception. The additional elements are more mathematical calculations and do not add “significantly more” to the abstract ideas.
Claim 7
Claim 7 is rejected under 35 U.S.C. 101 because the claimed invention is directed to an abstract idea without significantly more.
The claim depends from claim 5 and recites the abstract ideas (mathematical calculations) identified in claims 1 and 5. The claim adds limitations describing the training of the “second neural network” from claim 5. This training, which involves updating an initial neural network, is itself a mathematical calculation.
This judicial exception is not integrated into a practical application. The additional element is the mathematical calculation of training the second neural network. Adding further mathematical calculations to the abstract ideas of claims 1 and 5 does not integrate the exceptions into a practical application. The claim is directed to an abstract idea.
The claim does not include additional elements that are sufficient to amount to significantly more than the judicial exception. The additional elements are more mathematical calculations, and do not add “significantly more” to the abstract ideas.
Claim 8
Claim 8 is rejected under 35 U.S.C. 101 because the claimed invention is directed to an abstract idea without significantly more.
The claim depends from claim 1 and recites the abstract idea (mathematical calculation) identified therein. The claim adds limitations specifying a “plurality of attention heads” and performing the “target operation on an output of each attention head and a corresponding weight value.”
This judicial exception is not integrated into a practical application. The additional element of using a “plurality” of attention heads is merely a recitation of performing the mathematical calculation of claim 1 multiple times. This describes the abstract idea in more detail but does not integrate it into a practical application. The claim is directed to an abstract idea.
The claim does not include additional elements that are sufficient to amount to significantly more than the judicial exception. The additional element (plurality of heads) is just an expansion of the abstract idea itself (performing the calculation multiple times). It does not add “significantly more.”
Claim 9
Claim 9 is rejected under 35 U.S.C. 101 because the claimed invention is directed to an abstract idea without significantly more.
The claim depends from claim 1 and recites the abstract idea (mathematical calculation) identified therein. The claim adds the limitation that “the target operation comprises a product operation.”
This judicial exception is not integrated into a practical application. This limitation (“product operation”) explicitly defines the mathematical calculation. It is the abstract idea, not an additional element that integrates it. It does not integrate the exception into a practical application. The claim is directed to an abstract idea.
The claim does not include additional elements that are sufficient to amount to significantly more than the judicial exception. The limitation “product operation” is the abstract idea itself and cannot provide the “significantly more.”
Claim 10
Claim 10 is rejected under 35 U.S.C. 101 because the claimed invention is directed to an abstract idea without significantly more.
The claim depends from claim 1 and recites the abstract idea (mathematical calculation) identified therein. The claim adds a list of “target tasks” such as “reading comprehension, text translation, text classification,” etc.
This judicial exception is not integrated into a practical application. The list of target tasks is a field of use limitation. It generally links the use of the mathematical calculation to a particular technological environment (natural language processing) but does not add any meaningful application beyond the calculation itself. It does not integrate the exception into a practical application. The claim is directed to an abstract idea.
The claim does not include additional elements that are sufficient to amount to significantly more than the judicial exception. The additional element is a field of use limitation, which is not “significantly more” than the abstract idea.
Claims 11–18
Claims 11–17 are rejected for the same reasons as given above for claims 1–7 (hereby incorporated by reference), and claim 18 for the same reasons as given for claim 1, except that at Steps 2A Prong Two and 2B, the Examiner finds that the general-purpose computer hardware recited therein (or the computer-readable medium, in the case of claim 18) is tantamount to a mere instruction to apply the abstract idea on a general-purpose computer. A mere instruction to apply the abstract idea on a general-purpose computer is not a practical application, nor is it “significantly more.”
Claim Rejections – 35 U.S.C. § 102
The following is a quotation of the appropriate paragraphs of 35 U.S.C. § 102 that form the basis for the rejections under this section made in this Office action:
A person shall be entitled to a patent unless –
(a)(1) the claimed invention was patented, described in a printed publication, or in public use, on sale or otherwise available to the public before the effective filing date of the claimed invention.
(a)(2) the claimed invention was described in a patent issued under section 151, or in an application for patent published or deemed published under section 122(b), in which the patent or application, as the case may be, names another inventor and was effectively filed before the effective filing date of the claimed invention.
I. Tay discloses claims 1, 4, 5, 7–11, 14, 15, 17, and 18.
Claims 1, 4, 5, 7–11, 14, 15, 17, and 18 are rejected under 35 U.S.C. § 102(a)(1) as being anticipated by Yi Tay et al., HyperGrid: Efficient Multi-Task Transformers with Grid-wise Decomposable Hyper Projections, arXiv preprint, https://doi.org/10.48550/arXiv.2007.05891 (July 12, 2020) (hereafter “Tay”).
Tay’s method is an improvement technique for “the state-of-the-art Text-to-Text Transformers (T5)” model, and therefore incorporates by reference the “Raffel 2019” paper that announces and describes the T5 model. See Tay 2 ll. 23–25. Accordingly, this rejection will also make reference to the “Raffel 2019” paper, whose full citation is: Colin Raffel et al., Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer, arXiv preprint arXiv:1910.10683, https://doi.org/10.48550/arXiv.1910.10683 (Oct. 23, 2019) (hereafter “Raffel”).
Copies of both references, modified by the Examiner to include line numbers, are attached to this Office Action as “Non-Patent Literature” in the file wrapper. The citations to line numbers in this rejection refer to the line numbers in the file wrapper versions of these two papers.
Claim 1
Tay discloses:
A data processing method, wherein the method comprises:
“In this paper, we propose HyperGrid, a new approach for highly effective multi-task learning.” Tay Abstract.
obtaining to-be-processed data and a target neural network model,
“In our experiments, we equip state-of-the-art pretrained Transformer models with our proposed HyperGrid layers during fine-tuning. Specifically, we imbue the state-of-the-art Text-to-Text Transformers (T5) with HyperGrid.” Tay 2 ll. 23–25 (citing Raffel for the T5 model). Additionally, the model is also supplied with input data. See Tay 2 l. 12 and Tay Figure 2 (illustrating a block of “Input” being fed into the Transformer). More specifically, Raffel clarifies that the input data comprises “an input sequence of tokens.” Raffel 4 l. 2.
wherein the target neural network model comprises a first transformer layer, the first transformer layer comprises a first residual branch and a second residual branch, the first residual branch comprises a first attention head, and the second residual branch comprises a target feed-forward network (FFN) layer;
As shown in Figure 2 (page 4), the T5 model—like all Transformers—comprises one or more “feed-forward transformation layers,” as well as a “self-attention” block. Tay 4 ll. 17–19 and 26; see also Raffel 4 ll. 1–14 (explaining that the T5 Transformer comprises an encoder with “a self-attention layer followed by a small feed-forward network,” and likewise for its decoder).
obtaining a weight value corresponding to a target task, wherein the weight value comprises a first weight value corresponding to the first attention head or a second weight value corresponding to the target FFN; and
In the HyperGrid approach, a hypernetwork H(X) (i.e., the block labeled “HyperGrid” in Figure 2) generates a grid of weights W to be applied to the T5 model’s second positional FFN after the ReLU activations. Tay 3 ll. 19–23 and 4 ll. 17–19. Notably, these weights depend upon a task “prefix token” in the data (which is also part of the T5’s input), “which provides task information to the model.” Tay 4 ll. 10–27; see also Raffel 6 ll. 33–37 (providing further detail about the task prefix token). This corresponds to the claimed “second weight value.”
Tay need not further disclose the “first weight value” to anticipate the claim, because claim 1 only requires a first weight value “or” a second weight value. That said, Tay ultimately does further disclose a first weight value corresponding to the first attention head (and to a target task), because Tay’s HyperGrid approach uses the T5 model in its entirety, which includes a masking mechanism that applies a task prefix-specific “attention mask” during the self-attention block to help the model “attend[] over” the prefix, “i.e. some context provided to the model that is later used when making predictions.” Raffel 12 ll. 8–19, 13 ll. 12–24, and 14 ll. 1–19.
performing target task related processing on the to-be-processed data using the target neural network model to obtain a data processing result, wherein the target neural network model is configured to perform a target operation on an output of the first attention head and the first weight value to obtain an output of the first residual branch, or the target neural network model is configured to perform a target operation on an output of the target FFN and the second weight value to obtain an output of the second residual branch.
The HyperGrid-enhanced T5 model starts by processing the input data in a similar manner as the baseline T5: “[f]irst, an input sequence of tokens is mapped to a sequence of embeddings, which is then passed into the encoder. The encoder consists of a stack of ‘blocks’, each of which comprises two subcomponents: a self-attention layer followed by a small feed-forward network.” The decoder portion of the T5 then does the same, “except that it includes a standard attention mechanism after each self-attention layer that attends to the output of the encoder.” Raffel 4 ll. 1–10.
Importantly, however, the hypernetwork generates a vector U(X) ∈ ℝ^(df) that is “broadcast (multiplied by 1) and multiplied by W, acting as a row-wise scaling of W”; the scaled weight matrix is then applied to the input of the T5’s FFN, X, producing a transformed matrix Y. Tay 3 ll. 19–29.
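The row-wise scaling described in the cited passage can be sketched as follows (toy values only; the vector u merely stands in for the hypernetwork’s output, and this sketch is the Examiner’s illustration, not the reference’s actual implementation):

```python
# Row-wise scaling: each row i of weight matrix W is multiplied by u[i]
# before W is applied to an input vector x (illustrative values only).
def row_scale(W, u):
    return [[u_i * w for w in row] for u_i, row in zip(u, W)]

def matvec(W, x):
    return [sum(w * xj for w, xj in zip(row, x)) for row in W]

W = [[1.0, 2.0], [3.0, 4.0]]   # hypothetical weight matrix
u = [0.5, 2.0]                 # stands in for the hypernetwork-generated vector
y = matvec(row_scale(W, u), [1.0, 1.0])  # -> [1.5, 14.0]
```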
Note that because claim 1 uses the word “or,” Tay anticipates the claim language simply by disclosing one side of the “or” statement (“the target neural network model is configured to perform a target operation on an output of the target FFN and the second weight value to obtain an output of the second residual branch”).
Claim 4
Tay discloses the method according to claim 1, wherein the obtaining a weight value corresponding to a target task comprises:
obtaining, based on a preset mapping relationship, the weight value corresponding to the target task, wherein the weight value corresponding to the target task comprises the first weight value or the second weight value, and the preset mapping relationship comprises a correspondence between a task and a weight value.
In the HyperGrid approach, a hypernetwork H(X) (i.e., the block labeled “HyperGrid” in Figure 2) generates a grid of weights W to be applied to the T5 model’s second positional FFN after the ReLU activations. Tay 3 ll. 19–23 and 4 ll. 17–19. Notably, these weights depend upon a task “prefix token” in the data (which is also part of the T5’s input), “which provides task information to the model.” Tay 4 ll. 10–27; see also Raffel 6 ll. 33–37 (providing further detail about the task prefix token). This corresponds to the claimed “second weight value.”
Note that although the hypernetwork “generates” the weights on demand, it still falls within the broadly claimed scope of a “preset mapping relationship,” because the mapping relationship is defined by the mathematical model of the hypernetwork itself.
Claim 5
Tay discloses the method according to claim 1, wherein the obtaining a weight value corresponding to a target task comprises:
inputting an identifier of the target task and at least one of the to-be-processed data or the output of the first attention head into a first neural network to obtain the first weight value; or inputting an identifier of the target task and at least one of the to-be-processed data and the output of the target FFN into a second neural network to obtain the second weight value.
As shown in Figure 2, the same input for the T5 transformer (right side of the figure) is separately input into the HyperGrid. The input comprises a “prefix token,” in addition to the actual data being processed, which is extracted in a pooling layer P(.), Tay 4 ll. 10–27, and fed into HyperGrid’s local hypernetwork to produce the task-conditioned parameters (i.e., the grid-wise projections) that are multiplied at the second transform block (discussed earlier in the rejection of claim 1).
Claim 7
Tay discloses the method according to claim 5, further comprising
obtaining the second neural network by updating a second initial neural network when the target neural network model is trained for the target task,
The HyperGrid model is trained simultaneously with the T5 model, during the fine-tuning phase. Tay 2 ll. 23–25.
and in a training process of the target neural network model, the target neural network model is configured to input the identifier of the target task and at least one of the to-be-processed data and the output of the target FFN into the second initial neural network, and perform a target operation on an output of the second initial neural network and the output of the target FFN to obtain the output of the second residual branch.
As shown in Figure 2, the same input for the T5 transformer (right side of the figure) is separately input into the HyperGrid. The input comprises a “prefix token,” in addition to the actual data being processed, which is extracted in a pooling layer P(.), Tay 4 ll. 10–27, and fed into HyperGrid’s local hypernetwork to produce the task-conditioned parameters (i.e., the grid-wise projections) that are multiplied at the second transform block (discussed earlier in the rejection of claim 1).
Claim 8
Tay (with Raffel incorporated by reference) discloses the method according to claim 1,
wherein the first transformer layer comprises a plurality of attention heads, each of the plurality of attention heads corresponds to a weight value,
The T5 Transformer implementation of the originally-proposed Transformer includes an encoder with at least one self-attention layer, and a decoder layer that is similar to the encoder, but further includes “a standard attention mechanism after each self-attention layer that attends to the output of the encoder.” Raffel 4 ll. 1–10.
the target neural network model is configured to perform a target operation on an output of each attention head and a corresponding weight value to obtain the output of the first residual branch,
“Recall that the self-attention operation in a Transformer takes a sequence as input and outputs a new sequence of the same length. Each entry of the output sequence is produced by computing a weighted average of entries of the input sequence. Specifically, let yi refer to the ith element of the output sequence and xj refer to the jth entry of the input sequence. yi is computed as Σj wi,j xj, where wi,j is the scalar weight produced by the self-attention mechanism as a function of xi and xj.” Raffel 11 ll. 31–36.
and different attention heads correspond to different weight values.
“An attention mask is then used to zero out certain weights in order to constrain which entries of the input can be attended to at a given output timestep,” wherein the encoder uses a “fully-visible attention mask,” while the decoder’s attention head uses a “causal” masking pattern on the weights. Raffel 12 ll. 5–19.
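Taken together, the quoted weighted-average and masking operations can be sketched as follows (toy numbers chosen by the Examiner; the weights w[i][j] merely stand in for whatever values the attention mechanism would actually produce):

```python
# y[i] = sum_j mask[i][j] * w[i][j] * x[j]: a masked weighted average of the
# input entries, per the quoted passages (illustrative values only).
def attend(x, w, mask):
    n = len(x)
    return [sum(mask[i][j] * w[i][j] * x[j] for j in range(n)) for i in range(n)]

x = [1.0, 2.0, 3.0]
w = [[0.5, 0.25, 0.25]] * 3  # hypothetical attention weights
# A "causal" mask zeroes weights for j > i, so entry i cannot attend to
# later entries; a "fully-visible" mask would be all ones.
causal = [[1 if j <= i else 0 for j in range(3)] for i in range(3)]
y = attend(x, w, causal)  # -> [0.5, 1.0, 1.75]
```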
Claim 9
Tay discloses the method according to claim 1,
wherein the target operation comprises a product operation.
“HyperGrid operates on weight matrices (linear transformations), i.e., Y = WX + b.” Tay 3 ll. 19–21.
Claim 10
Tay discloses the method according to claim 1,
wherein the target task comprises one of the following: reading comprehension, text translation, restatement recognition, named entity recognition, text emotion analysis, natural language reasoning, text automatic question and answer, text intention recognition, text classification, text simplification, and text story generation
“We conduct experiments on GLUE [Wang et al., 2018] and SuperGLUE [Wang et al., 2019] which are consolidated benchmarks of multiple challenging NLP and NLU tasks.” Tay 5 ll. 10–11. Specifically, the GLUE benchmarks test, among other things, sentence sentiment tasks, text similarity tasks, paraphrasing tasks, entailment tasks (given a premise sentence and a hypothesis sentence, the task is to predict whether the premise entails the hypothesis (entailment), contradicts the hypothesis (contradiction), or neither (neutral)), matching questions with their answers, the Winograd Schema Challenge (a reading comprehension task), and others. See Alex Wang et al., GLUE: A multi-task benchmark and analysis platform for natural language understanding, 2018 EMNLP workshop BlackboxNLP: Analyzing and interpreting neural networks for NLP, pages 3–5 (2018), available at https://doi.org/10.48550/arXiv.1804.07461.
Claims 11, 14, 15, 17
Claims 11, 14, 15, and 17 recite a general-purpose computer comprising at least one processor and one or more memories coupled to the at least one processor and storing programming instructions for performing exactly the same method as set forth in corresponding claims 1, 4, 5, and 7. Tay discloses that method for the reasons given in the rejections of claims 1, 4, 5, and 7, and further teaches implementing such methods on a general-purpose computer with a processor and memory. See Tay 8 ll. 5–14. Therefore, claims 11, 14, 15, and 17 are anticipated by those combined findings.
Claim 18
Claim 18 recites a broader version of the memory portion of the system of claim 11, one that fully encompasses it. Therefore, claim 18 is rejected based on all of the findings and rationale provided above for the rejection of claim 11.
II. Shazeer discloses claims 1, 3, 5, 11, 13, 15, and 18.
Claims 1, 3, 5, 11, 13, 15, and 18 are rejected under 35 U.S.C. § 102(a)(2) as being anticipated by U.S. Patent Application Publication No. 2020/0279150 A1 (hereafter “Shazeer”).
In paragraph 70, Shazeer incorporates by reference the paper: Noam Shazeer et al., Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer, arXiv preprint arXiv:1701.06538 (2017), available at https://arxiv.org/abs/1701.06538. This rejection will refer to that paper as “Shazeer II.”
Claim 1
Shazeer discloses
A data processing method, wherein the method comprises:
“FIG. 6 is a flow diagram of an example process for performing a machine learning task on an input.” Shazeer ¶ 76.
obtaining to-be-processed data
“The system receives a request to perform a machine learning task on an input of a first modality (step 602). The machine learning task is a machine learning task from a particular machine learning domain that transforms inputs of the first modality to outputs of a second modality. For example, the system may receive a request to perform a machine translation of an input text segment in an input natural language to a corresponding text segment in a target natural language.” Shazeer ¶ 77.
and a target neural network model, wherein the target neural network model comprises a first transformer layer,
After an initial step 606 of mapping the input to a unified representation space, “[t]he system processes the mapped input of the unified representation space using an encoder neural network and a decoder neural network to generate a decoder output (step 608).” Shazeer ¶ 80.
the first transformer layer comprises a first residual branch and a second residual branch,
As shown in FIG. 3, the encoder network 104 includes a connection to layer 316, and another connection between modules 308 and 312. See Shazeer FIG. 3.
the first residual branch comprises a first attention head, and the second residual branch comprises a target feed-forward network (FFN) layer;
The connection to 316 leads to an “attention neural network layer 316,” whereas modules 308 and 312 “include[] multiple convolutional neural network layers, e.g., depth wise separable convolutional neural network layers, as described above with reference to FIGS. 1 and 2.” Shazeer ¶ 69.
obtaining a weight value corresponding to a target task, wherein the weight value comprises a first weight value corresponding to the first attention head or a second weight value corresponding to the target FFN; and
"Optionally, the encoder neural network 104 may include a sparsely-gated mixture of experts neural network layer 310. A mixture of experts neural network layer includes a number of feed-forward neural networks (experts) and a trainable gating network which selects a sparse combination of the experts to process each input." Shazeer ¶ 70. Shazeer thus discloses at least the second weight value, which corresponds to the output of convolutional module 308, a feed-forward network.
performing target task related processing on the to-be-processed data using the target neural network model to obtain a data processing result,
“Outputs from the mixture of experts layer 310 can be provided to a second convolutional module 312 (which may be similar to the convolutional module 200 described with reference to FIG. 2) and an attention neural network layer 316 for processing,” which are then added back together via residual connection 318 to generate the encoded input 320. Shazeer ¶ 71.
wherein the target neural network model is configured to perform a target operation on an output of the first attention head and the first weight value to obtain an output of the first residual branch, or
The word “or” in the claim language above means that the prior art only needs to disclose one of the two sides of the disjunctive (perform a target operation on an output of the first attention head and the first weight value, or on the output of the target FFN and the second weight value) to anticipate the claim. As discussed below, Shazeer at least discloses that its target neural network model performs the target operation on the output of the target FFN and the second weight value to obtain the output of the second residual branch, thereby satisfying the elements of the claim.
the target neural network model is configured to perform a target operation on an output of the target FFN and the second weight value to obtain an output of the second residual branch.
In this case, the encoder block 104 of multi task multi modal machine learning model 100 uses its mixture of experts neural network layer 310 "to process each input" it is fed from the convolutional module 308. Shazeer ¶ 70. The paper referenced in paragraph 70 further explains that the "trainable gating network" functions by multiplying the input of the mixture of experts layer 310 by a trainable weight matrix Wg and applying softmax. See Shazeer II p. 4.
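For illustration, the softmax gating computation described in Shazeer II may be sketched as follows. This is a minimal sketch only: the helper names and toy values are the Examiner's own, and the noise and top-k sparsification terms of Shazeer II's full gating function are omitted.

```python
import math

def softmax(logits):
    # Numerically stable softmax over a list of gate logits.
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def gate(x, W_g):
    # G(x) = softmax(x . W_g): one gate value per expert, per the
    # gating function described in Shazeer II p. 4 (simplified).
    logits = [sum(xi * wij for xi, wij in zip(x, col)) for col in zip(*W_g)]
    return softmax(logits)

# Toy example: a 2-dimensional input routed over 3 experts.
weights = gate([1.0, 2.0], [[0.5, -0.2, 0.1], [0.3, 0.8, -0.4]])
```

The resulting gate values are non-negative and sum to one; each value weights the output of the corresponding expert.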
Claim 3
Shazeer discloses the method according to claim 1,
wherein the target neural network model is configured to perform the target operation on the output of the target FFN and the second weight value to obtain the output of the second residual branch, and further comprising:
The encoder block 104 of multi task multi modal machine learning model 100 uses its mixture of experts neural network layer 310 "to process each input" it is fed from the convolutional module 308. Shazeer ¶ 70. That layer's "trainable gating network" functions by multiplying the input of the mixture of experts layer 310 by a trainable weight matrix Wg and applying softmax. Shazeer II p. 4.
Shazeer II, which Shazeer incorporates by reference, discloses that the mixture of experts neural network layer 310 is trained as follows.
obtaining a loss function according to the data processing result and a correct data processing result of the to-be-processed data; and updating the second weight value according to the loss function.
“[T]raining data may be used to adjust the input modality neural networks 102a-c, encoder neural network 104, decoder neural network 106, and output modality neural networks 108a-c weights from initial values to trained values, e.g., by processing the training examples and adjusting the neural network weights to minimize a corresponding loss function.” Shazeer ¶ 63.
Notably, with respect to the second weight value in particular, Shazeer II further discloses adding an L-importance term to the loss function, defined as “the square of the coefficient of variation of the set of importance values, multiplied by a hand-tuned scaling factor wimportance,” where the “importance values” are defined as the “batchwise sum of the gate values for [each] expert” relative to a batch of training examples. Shazeer II p. 5. Consequently, by “train[ing] the gating network by simple back-propagation, along with the rest of the model,” Shazeer II p. 4, the gating network produces weights (i.e., the claimed “second weight value”) in accordance with the loss function.
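The L-importance term quoted above may be illustrated with a short sketch. The function and variable names are the Examiner's own; gate_batches holds one list of gate values per training example in the batch.

```python
import math

def importance_loss(gate_batches, w_importance=0.1):
    # "Importance" of an expert: the batchwise sum of its gate values.
    n_experts = len(gate_batches[0])
    importance = [sum(gates[e] for gates in gate_batches) for e in range(n_experts)]
    # L_importance = w_importance * CV(importance)^2, where CV is the
    # coefficient of variation (standard deviation divided by mean).
    mean = sum(v for v in importance) / n_experts
    var = sum((v - mean) ** 2 for v in importance) / n_experts
    cv = math.sqrt(var) / mean
    return w_importance * cv ** 2
```

As the sketch shows, perfectly balanced gating yields a zero loss term, while imbalance among experts is penalized, encouraging the gating network to assign equal importance across experts.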
Claim 5
Shazeer discloses the method of claim 1, further comprising:
inputting an identifier of the target task and at least one of the to-be-processed data or the output of the first attention head into a first neural network to obtain the first weight value; or inputting an identifier of the target task and at least one of the to-be-processed data and the output of the target FFN into a second neural network to obtain the second weight value.
“The multi task multi modal machine learning model 100 is configured to receive as input machine learning model data inputs of different machine learning domains/modalities corresponding to different machine learning tasks,” (i.e., the claimed to-be-processed data) Shazeer ¶ 40, as well as “a command-token indicating the machine learning domain and specific machine learning task” (i.e., the claimed identifier of the target task). Shazeer ¶ 41. As explained in the rejection of claim 1, all of the data flows through the model, and ultimately causes the gating network in the mixture of experts layer 310 to produce a weight for the output of convolutional module 308 (a feed-forward network). See Shazeer ¶¶ 70–71 and Shazeer II at 4.
Claims 11, 13, and 15
Claims 11, 13, and 15 recite a general purpose computer comprising at least one processor and one or more memories coupled to the at least one processor and storing programming instructions for performing exactly the same method as set forth in corresponding claims 1, 3, and 5. Shazeer discloses that method for the reasons given in the rejections of claims 1, 3, and 5, and further teaches implementing such methods on a general purpose computer with a processor and memory. See Shazeer ¶ 88. Therefore, claims 11, 13, and 15 are anticipated by those combined findings.
Claim 18
Claim 18 recites a broader but fully encompassing version of the memory portion of the system of claim 11. Therefore, claim 18 is rejected over all of the findings and rationale provided above for the rejection of claim 11.
Claim Rejections – 35 U.S.C. § 103
The following is a quotation of 35 U.S.C. § 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102 of this title, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.
This application currently names joint inventors. In considering patentability of the claims the examiner presumes that the subject matter of the various claims was commonly owned at the time any inventions covered therein were effectively filed absent any evidence to the contrary. Applicant is advised of the obligation under 37 C.F.R. § 1.56 to point out the inventor and effective filing dates of each claim that was not commonly owned at the time a later invention was effectively filed in order for the examiner to consider the applicability of 35 U.S.C. § 102(b)(2)(C) for any potential 35 U.S.C. § 102(a)(2) prior art against the later invention.
Claims 2, 6, 12, and 16 are rejected under 35 U.S.C. § 103 as being unpatentable over Shazeer as applied to claims 1 and 11 above, and further in view of Jiejie Zhao et al., Multiple Relational Attention Network for Multi-task Learning, 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, https://doi.org/10.1145/3292500.3330861 (July 25, 2019) (hereafter "Zhao").
Claim 2
Shazeer teaches the method according to claim 1, but not the embodiment where the target neural network model is configured to perform the target operation on the output of the first attention head and the first weight value to obtain the output of the first residual branch.
Zhao, however, teaches a target neural network model (Figure 2),
wherein the target neural network model is configured to perform the target operation on the output of the first attention head and the first weight value to obtain the output of the first residual branch, and further comprising:
As shown in Figure 2, and much like the claimed invention, Zhao discloses a deep multi-task learning framework called "MRAN" whose model (the "Feature Predictor") includes at least one attention layer branch and at least one feed-forward network branch, which process an input X to produce an output Y. Zhao 1125.
The Feature Predictor obtains "an attention weight vector αA based on new feature and task embedding which shows the dependence relationships between tasks and features" from a Task-Feature dependence relationship learning module. Zhao 1125. Thus, when in use (e.g., a testing phase), the "feature predictor network directly performs multiple prediction tasks based on input X′ and the learned attention weight." Zhao 1125.
obtaining a loss function according to the data processing result and a correct data processing result of the to-be-processed data; and
"To train MRAN, we minimize a loss function ℒtot," that includes a term for "task-specific losses ℒt with task weightings λt," defined as the following objective function:

ℒt(X, Yᵗ) = ∑ᵢ₌₁ⁿ ‖ŷᵢᵗ − yᵢᵗ‖₂²

"where ŷᵢᵗ is the prediction value of task 𝒯t and yᵢᵗ is the ground truth of 𝒯t." Zhao 1127.
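For scalar predictions, the objective quoted above reduces to a sum of squared errors over the n examples of task 𝒯t, as the following short sketch (with an illustrative function name of the Examiner's own) shows:

```python
def task_loss(predictions, ground_truths):
    # L_t(X, Y^t) = sum over i of ||y_hat_i^t - y_i^t||_2^2
    return sum((p - y) ** 2 for p, y in zip(predictions, ground_truths))
```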
updating the first weight value according to the loss function.
The loss function ℒtot is used to train the network to produce an appropriate attention weight vector αᴬ for each task 𝒯t. See Zhao 1126–27.
It would have been obvious to a person of ordinary skill in the art before the effective filing date of the claimed invention to supplement the attention neural network layer 316 in Shazeer’s multi task multi modal machine learning model 100 with the α weightings learned via Zhao’s MRAN model, and to train said MRAN model exactly in the manner described in Zhao’s paper. One would have been motivated to train and use the MRAN model with Shazeer’s attention neural network layer 316 because several of the different embodiments of MRAN produced lower root mean squared error scores relative to the MMoE model—i.e., the same model used in block 310 of Shazeer’s encoder 104—alone. See Zhao 1128–29.
Claim 6
Shazeer teaches the method according to claim 5, but only discloses how to obtain the second neural network of claim 5, not the