DETAILED ACTION
This action is in response to the application filed 12/12/2022. Claims 1-23 are pending and have been examined.
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Claim Objections
Claim 22 is objected to because of the following informalities:
Claim 22 should have a semicolon between “single feed-forward neural network” and “generating the respective layer” on the second-to-last line in order to keep the limitations separate.
Appropriate correction is required.
Double Patenting
The nonstatutory double patenting rejection is based on a judicially created doctrine grounded in public policy (a policy reflected in the statute) so as to prevent the unjustified or improper timewise extension of the “right to exclude” granted by a patent and to prevent possible harassment by multiple assignees. A nonstatutory double patenting rejection is appropriate where the conflicting claims are not identical, but at least one examined application claim is not patentably distinct from the reference claim(s) because the examined application claim is either anticipated by, or would have been obvious over, the reference claim(s). See, e.g., In re Berg, 140 F.3d 1428, 46 USPQ2d 1226 (Fed. Cir. 1998); In re Goodman, 11 F.3d 1046, 29 USPQ2d 2010 (Fed. Cir. 1993); In re Longi, 759 F.2d 887, 225 USPQ 645 (Fed. Cir. 1985); In re Van Ornum, 686 F.2d 937, 214 USPQ 761 (CCPA 1982); In re Vogel, 422 F.2d 438, 164 USPQ 619 (CCPA 1970); In re Thorington, 418 F.2d 528, 163 USPQ 644 (CCPA 1969).
A timely filed terminal disclaimer in compliance with 37 CFR 1.321(c) or 1.321(d) may be used to overcome an actual or provisional rejection based on nonstatutory double patenting provided the reference application or patent either is shown to be commonly owned with the examined application, or claims an invention made as a result of activities undertaken within the scope of a joint research agreement. See MPEP § 717.02 for applications subject to examination under the first inventor to file provisions of the AIA as explained in MPEP § 2159. See MPEP § 2146 et seq. for applications not subject to examination under the first inventor to file provisions of the AIA. A terminal disclaimer must be signed in compliance with 37 CFR 1.321(b).
The filing of a terminal disclaimer by itself is not a complete reply to a nonstatutory double patenting (NSDP) rejection. A complete reply requires that the terminal disclaimer be accompanied by a reply requesting reconsideration of the prior Office action. Even where the NSDP rejection is provisional, the reply must be complete. See MPEP § 804, subsection I.B.1. For a reply to a non-final Office action, see 37 CFR 1.111(a). For a reply to a final Office action, see 37 CFR 1.113(c). A request for reconsideration, while not provided for in 37 CFR 1.113(c), may be filed after final for consideration. See MPEP §§ 706.07(e) and 714.13.
The USPTO Internet website contains terminal disclaimer forms which may be used. Please visit www.uspto.gov/patent/patents-forms. The actual filing date of the application in which the form is filed determines what form (e.g., PTO/SB/25, PTO/SB/26, PTO/AIA/25, or PTO/AIA/26) should be used. A web-based eTerminal Disclaimer may be filled out completely online using web-screens. An eTerminal Disclaimer that meets all requirements is auto-processed and approved immediately upon submission. For more information about eTerminal Disclaimers, refer to www.uspto.gov/patents/apply/applying-online/eterminal-disclaimer.
Claims 1 and 23 are provisionally rejected on the ground of nonstatutory double patenting as being unpatentable over claims 5 and 17 of copending Application No. 19/358,980 (reference application). Although the claims at issue are not identical, they are not patentably distinct from each other because the ‘980 application discloses all the limitations of claims 1 and 23 as shown in the tables below.
This is a provisional nonstatutory double patenting rejection because the patentably indistinct claims have not in fact been patented.
Regarding claim 1:
Instant Application Claim 1
US Application No. 19/358,980 Claim 5
A system for performing a machine learning task on a network input to generate a network output, the system comprising one or more computers and one or more storage devices storing instructions that, when executed by the one or more computers, cause the one or more computers to implement: an attention neural network configured to perform the machine learning task, the attention neural network comprising a plurality of layers, each layer comprising an attention sub-layer and a feed-forward sub-layer, the attention sub-layer configured to:
1. A system comprising one or more computers and one or more storage devices storing instructions that, when executed by the one or more computers, cause the one or more computers to implement: an attention neural network configured to process a network input to generate a network output for a machine learning task, the attention neural network comprising a plurality of layers, each layer comprising an attention sub-layer and a feed-forward sub-layer, the attention sub-layer configured to:
receive an input sequence for the layer comprising a respective layer input at each of one or more positions;
1. receive an input sequence for the layer comprising a respective layer input at each of one or more positions;
and generate an attended input sequence at least in part by applying an attention mechanism to the input sequence for the layer, the attended input sequence comprising a respective attended layer input at each of the one or more positions, and the feed-forward sub-layer configured to:
1. and generate an attended input sequence at least in part by applying a query-key-value (QKV) attention mechanism that uses a set of queries, a set of keys, and a set of values generated from the input sequence for the layer, the attended input sequence comprising a respective attended layer input at each of the one or more positions, and the feed-forward sub-layer configured to:
receive the attended input sequence;
1. receive the attended input sequence;
and generate an output sequence for the layer from the attended input sequence, the output sequence comprising a respective layer output at each of the one or more positions, wherein, for each layer in a subset of the plurality of layers, the feed-forward sub-layer is a conditional computation sub-layer that (i) comprises a plurality of expert feed-forward neural networks and (ii) is configured to generate the output sequence for the layer by performing operations comprising, for each of the positions in the input sequence for the layer:
1. and generate an output sequence for the layer from the attended input sequence, the output sequence comprising a respective layer output at each of the one or more positions, wherein, for a first subset of the plurality of layers, the feed-forward sub-layer is a conditional computation sub-layer that (i) comprises a plurality of expert feed-forward neural networks and (ii) is configured to generate the output sequence for the layer by performing operations comprising, for each of the positions in the input sequence for the layer:
receiving the respective attended layer input at the position;
1. receiving the respective attended layer input at the position generated by the attention sub-layer at least in part by applying the QKV attention mechanism;
applying a gating function to the respective attended layer input at the position to generate a respective gate score for each of the plurality of expert feed-forward neural networks;
1. applying a gating function to the respective attended layer input at the position to generate a respective gate score for each of the plurality of expert feed-forward neural networks;
selecting, from the plurality of expert feed-forward neural networks, a proper subset of expert feed-forward neural networks based at least on the respective gate scores;
1. selecting, from the plurality of expert feed-forward neural networks, one or more expert feed-forward neural networks based at least on the respective gate scores;
processing the respective attended layer input at the position using each of the expert feed-forward neural networks in the proper subset to generate a respective expert output for each of the expert feed-forward neural networks;
1. processing the respective attended layer input at the position using each expert feed-forward neural network in a proper subset of the plurality of expert feed-forward neural networks to generate a respective expert output for the expert feed-forward neural network in the proper subset, wherein the proper subset of the plurality of expert feed-forward neural networks comprises the one or more expert feed-forward neural networks that have been selected based at least on the respective gate scores;
combining the respective expert outputs to generate a combined expert output;
1. combining the respective expert outputs to generate a combined expert output;
and generating the respective layer output at the position from the combined expert output,
1. generating the respective layer output at the position from the combined expert output.
And wherein, for each layer that is not in the subset of the plurality of layers, the feed-forward sub-layer is configured to generate the output sequence for the layer by processing each respective attended layer input at each of the positions in the input sequence for the layer using a single feed-forward neural network.
5. The system of claim 1, wherein, for a second subset of the plurality of layers, the feed-forward sub-layer includes a single feed forward neural network and is configured to generate the output sequence for the layer by processing the respective attended layer input at each of the one or more positions in the attended input sequence using the single feed forward neural network.
Regarding claim 23:
Instant Application Claim 23
US Application No. 19/358,980 Claim 17
A method performed by one or more computers, the method comprising:
13. A method performed by one or more computers, the method comprising:
receive an input sequence for the layer comprising a respective layer input at each of one or more positions;
13. receive an input sequence for the layer comprising a respective layer input at each of one or more positions;
and generate an attended input sequence at least in part by applying an attention mechanism to the input sequence for the layer, the attended input sequence comprising a respective attended layer input at each of the one or more positions, and the feed-forward sub-layer configured to:
13. and generate an attended input sequence at least in part by applying a query-key-value (QKV) attention mechanism that uses a set of queries, a set of keys, and a set of values generated from the input sequence for the layer, the attended input sequence comprising a respective attended layer input at each of the one or more positions, and the feed-forward sub-layer configured to:
receive the attended input sequence;
13. receive the attended input sequence;
and generate an output sequence for the layer from the attended input sequence, the output sequence comprising a respective layer output at each of the one or more positions, wherein, for each layer in a subset of the plurality of layers, the feed-forward sub-layer is a conditional computation sub-layer that (i) comprises a plurality of expert feed-forward neural networks and (ii) is configured to generate the output sequence for the layer by performing operations comprising, for each of the positions in the input sequence for the layer:
13. and generate an output sequence for the layer from the attended input sequence, the output sequence comprising a respective layer output at each of the one or more positions, wherein, for a first subset of the plurality of layers, the feed-forward sub-layer is a conditional computation sub-layer that (i) comprises a plurality of expert feed-forward neural networks and (ii) is configured to generate the output sequence for the layer by performing operations comprising, for each of the positions in the input sequence for the layer:
receiving the respective attended layer input at the position;
13. receiving the respective attended layer input at the position generated by the attention sub-layer at least in part by applying the QKV attention mechanism;
applying a gating function to the respective attended layer input at the position to generate a respective gate score for each of the plurality of expert feed-forward neural networks;
13. applying a gating function to the respective attended layer input at the position to generate a respective gate score for each of the plurality of expert feed-forward neural networks;
selecting, from the plurality of expert feed-forward neural networks, a proper subset of expert feed-forward neural networks based at least on the respective gate scores;
13. selecting, from the plurality of expert feed-forward neural networks, one or more expert feed-forward neural networks based at least on the respective gate scores;
processing the respective attended layer input at the position using each of the expert feed-forward neural networks in the proper subset to generate a respective expert output for each of the expert feed-forward neural networks;
13. processing the respective attended layer input at the position using each expert feed-forward neural network in a proper subset of the plurality of expert feed-forward neural networks to generate a respective expert output for the expert feed-forward neural network in the proper subset, wherein the proper subset of the plurality of expert feed-forward neural networks comprises the one or more expert feed-forward neural networks that have been selected based at least on the respective gate scores;
combining the respective expert outputs to generate a combined expert output;
13. combining the respective expert outputs to generate a combined expert output;
and generating the respective layer output at the position from the combined expert output,
13. generating the respective layer output at the position from the combined expert output.
And wherein, for each layer that is not in the subset of the plurality of layers, the feed-forward sub-layer is configured to generate the output sequence for the layer by processing each respective attended layer input at each of the positions in the input sequence for the layer using a single feed-forward neural network.
17. The method of claim 13, wherein, for a second subset of the plurality of layers, the feed-forward sub-layer includes a single feed forward neural network and is configured to generate the output sequence for the layer by processing the respective attended layer input at each of the one or more positions in the attended input sequence using the single feed forward neural network.
Claims 2, 4, 6-14, and 20-22 are provisionally rejected on the ground of nonstatutory double patenting as being unpatentable over claim 5 of copending Application No. 19/358,980 in view of Shazeer et al. (“Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer”) (hereafter referred to as Shazeer).
This is a provisional nonstatutory double patenting rejection.
Regarding claim 2, claim 5 of the ‘980 application discloses all of the limitations of claim 1 as shown above. Claim 5 of the ‘980 application does not explicitly disclose but Shazeer discloses:
wherein the subset of the plurality of layers includes fewer than all of the plurality of layers in the attention neural network (Shazeer, page 2, last paragraph, “Our approach to conditional computation is to introduce a new type of general purpose neural network component: a Sparsely-Gated Mixture-of-Experts Layer (MoE). The MoE consists of a number of experts, each with a simple feed-forward neural network, and a trainable gating network which selects a sparse combination of the experts to process each input” and Shazeer, page 2, Figure 1,
[Shazeer, page 2, Figure 1, reproduced in greyscale]
Examiner notes that the selected experts are the subset of the plurality of layers.).
The disclosures by claim 5 of application No. 19/358,980 and Shazeer are considered analogous to the claimed invention because they both use machine learning and attention mechanisms to perform machine translation. It would have been obvious to one having ordinary skill in the art prior to the effective filing date to have modified application No. 19/358,980 to use the mixture of experts like in Shazeer. Doing so is advantageous because “[they] obtain greater than 1000x improvements in model capacity with only minor losses in computational efficiency and significantly advance the state-of-the-art results on public language modeling and translation data sets” (Shazeer, page 2, 2nd to last paragraph).
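For clarity of record, the sparsely-gated selection and combination quoted from Shazeer above can be summarized in the following minimal sketch. The linear expert matrices, dimensions, and random weights are illustrative placeholders for explanation only; they are not the applicant’s claimed networks or Shazeer’s actual implementation.

```python
import numpy as np

def moe_forward(x, expert_weights, gate_weights, k=2):
    """Sparsely-gated mixture-of-experts step for a single position.

    x: (d_model,) attended layer input at one position.
    expert_weights: list of (d_model, d_model) matrices, each standing in
        for an expert feed-forward network (illustrative only).
    gate_weights: (d_model, n_experts) trainable gating matrix.
    k: number of experts selected per input.
    """
    logits = x @ gate_weights                 # gating function -> gate scores
    top_k = np.argsort(logits)[-k:]           # proper subset of the experts
    # Normalize the gate scores over the selected experts only.
    scores = np.exp(logits[top_k] - logits[top_k].max())
    scores /= scores.sum()
    # Combine the respective expert outputs into a combined expert output.
    return sum(s * (x @ expert_weights[i]) for s, i in zip(scores, top_k))

rng = np.random.default_rng(0)
d, n = 8, 4
out = moe_forward(rng.normal(size=d),
                  [rng.normal(size=(d, d)) for _ in range(n)],
                  rng.normal(size=(d, n)))
print(out.shape)  # (8,)
```

The sketch mirrors the claimed sequence of operations: gating, selection of a proper subset, per-expert processing, and combination.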
Regarding claim 4, claim 5 of the ‘980 application discloses all of the limitations of claim 1 as shown above. Claim 5 of the ‘980 application does not explicitly disclose but Shazeer discloses:
wherein the layers in the plurality of layers are arranged in a sequence and wherein every second layer in the sequence has a feed-forward sub-layer that is a conditional computation sub-layer (Shazeer, page 2, Figure 1,
[Shazeer, page 2, Figure 1, reproduced in greyscale]
and (Shazeer, page 2, last paragraph, “Our approach to conditional computation is to introduce a new type of general purpose neural network component: a Sparsely-Gated Mixture-of-Experts Layer (MoE). The MoE consists of a number of experts, each with a simple feed-forward neural network, and a trainable gating network which selects a sparse combination of the experts to process each input.” Examiner notes that the MoE layer is the second layer, which has a feed-forward sub-layer that is a conditional computation sub-layer.).
The disclosures by claim 5 of application No. 19/358,980 and Shazeer are considered analogous to the claimed invention because they both use machine learning and attention mechanisms to perform machine translation. It would have been obvious to one having ordinary skill in the art prior to the effective filing date to have modified application No. 19/358,980 to use the mixture of experts like in Shazeer. Doing so is advantageous because “[they] obtain greater than 1000x improvements in model capacity with only minor losses in computational efficiency and significantly advance the state-of-the-art results on public language modeling and translation data sets” (Shazeer, page 2, 2nd to last paragraph).
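The “every second layer” arrangement addressed in claim 4 can be expressed as a simple alternating layer schedule. The layer count and labels below are hypothetical and for illustration only:

```python
# Hypothetical schedule in which every second layer's feed-forward
# sub-layer is a conditional computation (MoE) sub-layer and the
# remaining layers use a dense feed-forward sub-layer.
n_layers = 6
schedule = ["moe" if i % 2 == 1 else "dense" for i in range(n_layers)]
print(schedule)  # ['dense', 'moe', 'dense', 'moe', 'dense', 'moe']
```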
Regarding claim 6, claim 5 of the ‘980 application in view of Shazeer discloses all of the limitations of claim 1 as shown above. Claim 5 of the ‘980 application further discloses:
Instant Application Claim 6
US Application No. 19/358980 Claim 5
wherein the system includes a plurality of hardware devices
1. A system comprising one or more computers and one or more storage devices storing instructions
Claim 5 of the ‘980 application does not disclose, but Shazeer does disclose
and wherein implementing the attention neural network comprises: sharding each conditional computational sub-layer across two or more of the plurality of devices (Shazeer, page 4, 2nd to last paragraph, “We distribute the standard layers of the model and the gating network according to conventional data-parallel schemes, but keep only one shared copy of each expert. Each expert in the MoE layer receives a combined batch consisting of the relevant examples from all of the data-parallel input batches. The same set of devices function as data-parallel replicas (for the standard layers and the gating networks) and as model-parallel shards (each hosting a subset of the experts). If the model is distributed over d devices, and each device processes a batch of size b, each expert receives a batch of approximately kbd/n examples. Thus, we achieve a factor of d improvement in expert batch size.” Examiner notes that the model-parallel shards hosting subsets of the experts are the sharding of each conditional computation sub-layer.).
The disclosures by claim 5 of application No. 19/358,980 and Shazeer are considered analogous to the claimed invention because they both use machine learning and attention mechanisms to perform machine translation. It would have been obvious to one having ordinary skill in the art prior to the effective filing date to have modified claim 5 of application No. 19/358,980 to shard the layers across devices like in Shazeer. Doing so is advantageous because “[they] achieve a factor of d improvement in expert batch size” (Shazeer, page 4, 2nd to last paragraph).
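The per-expert batch-size estimate quoted from Shazeer can be stated directly; the numeric values below are illustrative and are not taken from the reference:

```python
def expert_batch_size(k, b, d, n):
    """Approximate per-expert batch size when a model with n experts is
    sharded over d devices, each processing a batch of b examples, with
    each example routed to k experts (Shazeer's k*b*d/n estimate)."""
    return k * b * d / n

# Illustrative values (hypothetical, not from the reference):
print(expert_batch_size(k=4, b=256, d=16, n=256))  # 64.0
```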
Regarding claim 7, claim 5 of the ‘980 application in view of Shazeer discloses all of the limitations of claim 6 as shown above. Claim 5 of the ‘980 application in view of Shazeer further discloses:
wherein implementing the attention neural network comprises: replicating each attention sub-layer across two or more of the plurality of devices (Shazeer, page 4, 2nd to last paragraph, “We distribute the standard layers of the model and the gating network according to conventional data-parallel schemes, but keep only one shared copy of each expert. Each expert in the MoE layer receives a combined batch consisting of the relevant examples from all of the data-parallel input batches. The same set of devices function as data-parallel replicas (for the standard layers and the gating networks) and as model-parallel shards (each hosting a subset of the experts). If the model is distributed over d devices, and each device processes a batch of size b, each expert receives a batch of approximately kbd/n examples. Thus, we achieve a factor of d improvement in expert batch size.” Examiner notes that the data-parallel replicas are the replicating of each sub-layer across two or more of the plurality of devices.).
The disclosures by claim 5 of application No. 19/358,980 and Shazeer are considered analogous to the claimed invention because they both use machine learning and attention mechanisms to perform machine translation. It would have been obvious to one having ordinary skill in the art prior to the effective filing date to have modified claim 5 of application No. 19/358,980 to replicate the layers across devices like in Shazeer. Doing so is advantageous because “[they] achieve a factor of d improvement in expert batch size” (Shazeer, page 4, 2nd to last paragraph).
Regarding claim 8, claim 5 of the ‘980 application discloses all of the limitations of claim 1 as shown above. Claim 5 of the ‘980 application does not explicitly disclose but Shazeer discloses:
wherein generating the layer output from the combined expert output comprises: applying a residual connection and normalization to the combined expert outputs at the positions to generate the output sequence (Shazeer, page 14, Section C 1 Billion Word Language Modeling Benchmark – Experimental Details, 1st paragraph, “Model Architecture: Our model consists of five layers: a word embedding layer, a recurrent Long Short-Term Memory (LSTM) layer (Hochreiter & Schmidhuber, 1997; Gers et al., 2000), a MoE layer, a second LSTM layer, and a softmax layer. The dimensionality of the embedding layer, the number of units in each LSTM layer, and the input and output dimensionality of the MoE layer are all equal to 512. For every layer other than the softmax, we apply dropout (Zaremba et al., 2014) to the layer output, dropping each activation with probability DropProb, otherwise dividing by (1 - DropProb). After dropout, the output of the previous layer is added to the layer output. This residual connection encourages gradient flow (He et al., 2015).” Examiner notes that the softmax layer is the normalization.).
The disclosures by claim 5 of application No. 19/358,980 and Shazeer are considered analogous to the claimed invention because they both use machine learning and attention mechanisms to perform machine translation. It would have been obvious to one having ordinary skill in the art prior to the effective filing date to have modified application No. 19/358,980 to apply a residual connection and normalization to the layer outputs like in Shazeer. Doing so is advantageous because “[t]his residual connection encourages gradient flow (He et al., 2015)” (Shazeer, page 14, Section C, 1st paragraph).
Regarding claim 9, claim 5 of the ‘980 application discloses all of the limitations of claim 1 as shown above. Claim 5 of the ‘980 application does not explicitly disclose but Shazeer discloses:
wherein selecting, from the plurality of expert feed-forward neural networks, a proper subset based at least on the respective gate scores comprises: selecting at most k of a total number E of expert feed-forward neural networks in the plurality of expert feed-forward neural networks (Shazeer, page 14, last paragraph, “We varied the number of experts between models, using ordinary MoE layers with 4, 32 and 256 experts and hierarchical MoE layers with 256, 1024 and 4096 experts. We call the resulting models MoE-4, MoE-32, MoE-256, MoE-256-h, MoE-1024-h and MoE-4096- h. For the hierarchical MoE layers, the first level branching factor was 16, corresponding to the number of GPUs in our cluster. We use Noisy-Top-K Gating (see Section 2.1) with k = 4 for the ordinary MoE layers and k = 2 at each level of the hierarchical MoE layers.”).
The disclosures by claim 5 of application No. 19/358,980 and Shazeer are considered analogous to the claimed invention because they both use machine learning and attention mechanisms to perform machine translation. It would have been obvious to one having ordinary skill in the art prior to the effective filing date to have modified application No. 19/358,980 to use the mixture of experts like in Shazeer. Doing so is advantageous because “[they] obtain greater than 1000x improvements in model capacity with only minor losses in computational efficiency and significantly advance the state-of-the-art results on public language modeling and translation data sets” (Shazeer, page 2, 2nd to last paragraph).
Regarding claim 10, claim 5 of the ‘980 application in view of Shazeer discloses all of the limitations of claim 9 as shown above. Claim 5 of the ‘980 application in view of Shazeer also discloses:
wherein k is 2 (Shazeer, page 14, last paragraph, “We varied the number of experts between models, using ordinary MoE layers with 4, 32 and 256 experts and hierarchical MoE layers with 256, 1024 and 4096 experts. We call the resulting models MoE-4, MoE-32, MoE-256, MoE-256-h, MoE-1024-h and MoE-4096- h. For the hierarchical MoE layers, the first level branching factor was 16, corresponding to the number of GPUs in our cluster. We use Noisy-Top-K Gating (see Section 2.1) with k = 4 for the ordinary MoE layers and k = 2 at each level of the hierarchical MoE layers.” Examiner notes that k=2.).
The disclosures by claim 5 of application No. 19/358,980 and Shazeer are considered analogous to the claimed invention because they both use machine learning and attention mechanisms to perform machine translation. It would have been obvious to one having ordinary skill in the art prior to the effective filing date to have modified application No. 19/358,980 to use the mixture of experts like in Shazeer. Doing so is advantageous because “[they] obtain greater than 1000x improvements in model capacity with only minor losses in computational efficiency and significantly advance the state-of-the-art results on public language modeling and translation data sets” (Shazeer, page 2, 2nd to last paragraph).
Regarding claim 11, claim 5 of the ‘980 application in view of Shazeer discloses all of the limitations of claim 10 as shown above. Claim 5 of the ‘980 application in view of Shazeer also discloses:
wherein E is at least 100 (Shazeer, page 14, last paragraph, “We varied the number of experts between models, using ordinary MoE layers with 4, 32 and 256 experts and hierarchical MoE layers with 256, 1024 and 4096 experts.”).
The disclosures by claim 5 of application No. 19/358,980 and Shazeer are considered analogous to the claimed invention because they both use machine learning and attention mechanisms to perform machine translation. It would have been obvious to one having ordinary skill in the art prior to the effective filing date to have modified application No. 19/358,980 to use the mixture of experts like in Shazeer. Doing so is advantageous because “[they] obtain greater than 1000x improvements in model capacity with only minor losses in computational efficiency and significantly advance the state-of-the-art results on public language modeling and translation data sets” (Shazeer, page 2, 2nd to last paragraph).
Regarding claim 12, claim 5 of the ‘980 application in view of Shazeer discloses all of the limitations of claim 11 as shown above. Claim 5 of the ‘980 application in view of Shazeer also discloses:
wherein E is at least 500 (Shazeer, page 14, last paragraph, “We varied the number of experts between models, using ordinary MoE layers with 4, 32 and 256 experts and hierarchical MoE layers with 256, 1024 and 4096 experts.”).
The disclosures by claim 5 of application No. 19/358,980 and Shazeer are considered analogous to the claimed invention because they both use machine learning and attention mechanisms to perform machine translation. It would have been obvious to one having ordinary skill in the art prior to the effective filing date to have modified application No. 19/358,980 to use the mixture of experts like in Shazeer. Doing so is advantageous because “[they] obtain greater than 1000x improvements in model capacity with only minor losses in computational efficiency and significantly advance the state-of-the-art results on public language modeling and translation data sets” (Shazeer, page 2, 2nd to last paragraph).
Regarding claim 13, claim 5 of the ‘980 application in view of Shazeer discloses all of the limitations of claim 12 as shown above. Claim 5 of the ‘980 application in view of Shazeer also discloses:
wherein E is at least 2000 (Shazeer, page 14, last paragraph, “We varied the number of experts between models, using ordinary MoE layers with 4, 32 and 256 experts and hierarchical MoE layers with 256, 1024 and 4096 experts.”).
The disclosures by claim 5 of application No. 19/358,980 and Shazeer are considered analogous to the claimed invention because they both use machine learning and attention mechanisms to perform machine translation. It would have been obvious to one having ordinary skill in the art prior to the effective filing date to have modified application No. 19/358,980 to use the mixture of experts like in Shazeer. Doing so is advantageous because “[they] obtain greater than 1000x improvements in model capacity with only minor losses in computational efficiency and significantly advance the state-of-the-art results on public language modeling and translation data sets” (Shazeer, page 2, 2nd to last paragraph).
Regarding claim 14, claim 5 of the ‘980 application in view of Shazeer discloses all of the limitations of claim 9 as shown above. Claim 5 of the ‘980 application in view of Shazeer also discloses:
wherein, after training the attention neural network, selecting, from the plurality of expert feed-forward neural networks, a proper subset comprises: selecting the k experts with the k highest gating scores (Shazeer, page 4, 2nd paragraph, “Before taking the softmax function, we add tunable Gaussian noise, then keep only the top k values, setting the rest to -∞ (which causes the corresponding gate values to equal 0)” and “Let us denote by G(x) and Ei(x) the output of the gating network and the output of the i-th expert network for a given input x….We save computation based on the sparsity of the output G(x). Wherever G(x)i = 0, we need not compute Ei(x). In our experiments, we have up to thousands of experts, but only need to evaluate a handful of them for every example.” (Shazeer, page 3, 2nd to last paragraph). Examiner notes that the top k values are the k highest gating scores, and that the selection corresponds to computing only the selected experts Ei(x) for input x.).
The disclosures by claim 5 of application No. 19/358,980 and Shazeer are considered analogous to the claimed invention because they both use machine learning and attention mechanisms to perform machine translation. It would have been obvious to one having ordinary skill in the art prior to the effective filing date to have modified application No. 19/358,980 to use the mixture of experts like in Shazeer. Doing so is advantageous because “[they] obtain greater than 1000x improvements in model capacity with only minor losses in computational efficiency and significantly advance the state-of-the-art results on public language modeling and translation data sets” (Shazeer, page 2, 2nd to last paragraph).
Regarding claim 20, claim 5 of the ‘980 application discloses all of the limitations of claim 1 as shown above. Claim 5 of the ‘980 application does not explicitly disclose but Shazeer discloses:
wherein combining the respective expert outputs to generate a combined expert output comprises: generating a respective normalized gate score for each selected expert feed-forward neural network (Shazeer, page 4, 1st paragraph, “A simple choice of non-sparse gating function (Jordan & Jacobs, 1994) is to multiply the input by a trainable weight matrix Wg and then apply the Softmax function…We add two components to the Softmax gating network: sparsity and noise” and Shazeer, page 2, Figure 1,
[media_image1.png – Shazeer, Figure 1 (greyscale)]
Examiner notes that the gating network is applied to each selected expert and generates a normalized gate score. Examiner further notes that the output of the gating function is the gating score and softmax is the normalization.);
and computing a weighted sum of the respective expert outputs, with each expert output weighted by the normalized gate score for the selected expert feed-forward neural network that generated the expert output (Shazeer, page 3, 2nd to last paragraph, “In a hierarchical MoE, a primary network chooses a sparse weighted combination of “experts”, each of which is itself a secondary mixture-of-experts with its own gating network” where “A simple choice of non-sparse gating function (Jordan & Jacobs, 1994) is to multiply the input by a trainable weight matrix Wg and then apply the Softmax function…We add two components to the Softmax gating network: sparsity and noise” (Shazeer, page 4, 1st paragraph) and Shazeer, page 2, Figure 1,
[media_image1.png – Shazeer, Figure 1 (greyscale)]
Examiner notes that the weighted combination of experts is the weighted sum.).
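For illustration only (not part of the record), the combination step mapped above — a weighted sum Σi G(x)i·Ei(x) in which each selected expert's output is weighted by its normalized gate score — can be sketched as follows. All names are hypothetical.

```python
def combine_expert_outputs(expert_outputs, gate_scores):
    """Weighted sum over the selected experts: sum_i G(x)_i * E_i(x).

    expert_outputs: equal-length output vectors, one per selected expert
    gate_scores:    normalized gate scores, one per selected expert
    """
    dim = len(expert_outputs[0])
    combined = [0.0] * dim
    for out, gate in zip(expert_outputs, gate_scores):
        for j in range(dim):
            combined[j] += gate * out[j]  # weight each expert output
    return combined


# Two selected experts with gate scores 0.25 and 0.75:
y = combine_expert_outputs([[1.0, 2.0], [3.0, 4.0]], [0.25, 0.75])
# y == [2.5, 3.5]
```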
The disclosures by claim 5 of application No. 19/358,980 and Shazeer are considered analogous to the claimed invention because they both use machine learning and attention mechanisms to perform machine translation. It would have been obvious to one having ordinary skill in the art prior to the effective filing date to have modified application No. 19/358,980 to use the mixture of experts like in Shazeer. Doing so is advantageous because “[they] obtain greater than 1000x improvements in model capacity with only minor losses in computational efficiency and significantly advance the state-of-the-art results on public language modeling and translation data sets” (Shazeer, page 2, 2nd to last paragraph).
Regarding claim 21, claim 5 of the ‘980 application discloses all of the limitations of claim 1 as shown above. Claim 5 of the ‘980 application does not explicitly disclose but Shazeer discloses:
wherein the machine learning task is multi-lingual neural machine translation, the network input is a sequence of text in a source language and data identifying a target language, and the network output is a sequence of text in the target language that is a translation of the source language text into the target language (Shazeer, page 3, 1st paragraph, “in this paper we focus on language modeling and machine translation tasks, which are known to benefit from very large models. In particular, we apply a MoE convolutionally between stacked LSTM layers (Hochreiter & Schmidhuber, 1997), as in Figure 1. The MoE is called once for each position in the text selecting a potentially different combination of experts at each position” where “(Johnson et al., 2016) train a single GNMT (Wu et al., 2016) model on a very large combined dataset of twelve language pairs. Results are somewhat worse than those for 12 separately trained single-pair GNMT models. This is not surprising, given that the twelve models have 12 times the capacity and twelve times the aggregate training of the one model. We repeat this experiment with a single MoE-augmented model” (Shazeer, page 9, Section 5.4 Multilingual Machine Translation, 1st paragraph) and where “we also tested the same model on a Google’s Production English to French data” (Shazeer, page 8, Section 5.3 Machine Translation (Single Language Pair), 2nd paragraph). Examiner notes that the input is a sequence of English data and the target language is the French data.).
The disclosures by claim 5 of application No. 19/358,980 and Shazeer are considered analogous to the claimed invention because they both use machine learning and attention mechanisms to perform machine translation. It would have been obvious to one having ordinary skill in the art prior to the effective filing date to have modified application No. 19/358,980 to use the mixture of experts like in Shazeer. Doing so is advantageous because “[they] obtain greater than 1000x improvements in model capacity with only minor losses in computational efficiency and significantly advance the state-of-the-art results on public language modeling and translation data sets” (Shazeer, page 2, 2nd to last paragraph).
Regarding claim 22:
Instant Application Claim 22
US Application No. 19/358,980 Claim 5
an attention neural network configured to perform the machine learning task, the attention neural network comprising a plurality of layers, each layer comprising an attention sub- layer and a feed-forward sub-layer, the attention sub-layer configured to:
1. an attention neural network configured to process a network input to generate a network output for a machine learning task, the attention neural network comprising a plurality of layers, each layer comprising an attention sub-layer and a feed-forward sub-layer, the attention sub- layer configured to:
receive an input sequence for the layer comprising a respective layer input at each of one or more positions;
1. receive an input sequence for the layer comprising a respective layer input at each of one or more positions;
and generate an attended input sequence at least in part by applying an attention mechanism to the input sequence for the layer, the attended input sequence comprising a respective attended layer input at each of the one or more positions, and the feed-forward sub-layer configured to:
1. and generate an attended input sequence at least in part by applying a query-key- value (QKV) attention mechanism that uses a set of queries, a set of keys, and a set of values generated from the input sequence for the layer, the attended input sequence comprising a respective attended layer input at each of the one or more positions, and the feed-forward sub-layer configured to:
receive the attended input sequence;
1. receive the attended input sequence;
and generate an output sequence for the layer from the attended input sequence, the output sequence comprising a respective layer output at each of the one or more positions, wherein, for at least one each layer in a subset of the plurality of layers, the feed-forward sub-layer is a conditional computation sub-layer that (i) comprises a plurality of expert feed- forward neural networks and (ii) is configured to generate the output sequence for the layer by performing operations comprising, for each of the positions in the input sequence for the layer:
1. and generate an output sequence for the layer from the attended input sequence, the output sequence comprising a respective layer output at each of the one or more positions, wherein, for a first subset of the plurality of layers, the feed-forward sub-layer is a conditional computation sub-layer that (i) comprises a plurality of expert feed-forward neural networks and (ii) is configured to generate the output sequence for the layer by performing operations comprising, for each of the positions in the input sequence for the layer:
receiving the respective attended layer input at the position;
1. receiving the respective attended layer input at the position generated by the attention sub-layer at least in part by applying the QKV attention mechanism;
applying a gating function to the respective attended layer input at the position to generate a respective gate score for each of the plurality of expert feed-forward neural networks;
1. applying a gating function to the respective attended layer input at the position to generate a respective gate score for each of the plurality of expert feed-forward neural networks;
selecting, from the plurality of expert feed-forward neural networks, a proper subset of expert feed-forward neural networks based at least on the respective gate scores;
1. selecting, from the plurality of expert feed-forward neural networks, one or more expert feed-forward neural networks based at least on the respective gate scores;
processing the respective attended layer input at the position using each of the expert feed-forward neural networks in the proper subset to generate a respective expert output for each of the expert feed-forward neural networks;
1. processing the respective attended layer input at the position using each expert feed-forward neural network in a proper subset of the plurality of expert feed-forward neural networks to generate a respective expert output for the expert feed-forward neural network in the proper subset, wherein the proper subset of the plurality of expert feed-forward neural networks comprises the one or more expert feed-forward neural networks that have been selected based at least on the respective gate scores;
combining the respective expert outputs to generate a combined expert output;
1. combining the respective expert outputs to generate a combined expert output;
and generating the respective layer output at the position from the combined expert output,
1. generating the respective layer output at the position from the combined expert output.
And wherein, for each layer that is not in the subset of the plurality of layers, the feed-forward sub-layer is configured to generate the output sequence for the layer by processing each respective attended layer input at each of the positions in the input sequence for the layer using a single feed-forward neural network.
5. The system of claim 1, wherein, for a second subset of the plurality of layers, the feed-forward sub-layer includes a single feed forward neural network and is configured to generate the output sequence for the layer by processing the respective attended layer input at each of the one or more positions in the attended input sequence using the single feed forward neural network.
Claim 5 of the ‘980 application does not explicitly disclose but Shazeer discloses:
One or more non-transitory computer-readable storage media storing instructions that when executed by one or more computers cause the one or more computers to implement (Shazeer, page 16, D 100 Billion Word Google News Corpus – Experimental Details, 2nd paragraph, “Models are trained on a cluster of 32 Tesla K40 GPUs” where “Our model is a modified version of the GNMT model described in (Wu et al., 2016). To reduce computation, we decrease the number of LSTM layers in the encoder and decoder from 9 and 8 to 3 and 2 respectively. We insert MoE layers in both the encoder (between layers 2 and 3) and the decoder (between layers 1 and 2). We use an attention mechanism between the encoder and the decoder” (Shazeer, page 17, 2nd paragraph). Examiner notes that the GPUs are the non-transitory computer-readable storage media and the GPUs are executed to implement the attention model.)
The disclosures by claim 5 of application No. 19/358,980 and Shazeer are considered analogous to the claimed invention because they both use machine learning and attention mechanisms to perform machine translation. It would have been obvious to one having ordinary skill in the art prior to the effective filing date to have modified application No. 19/358,980 to use the hardware in Shazeer. Doing so is advantageous because “[they] obtain greater than 1000x improvements in model capacity with only minor losses in computational efficiency and significantly advance the state-of-the-art results on public language modeling and translation data sets” (Shazeer, page 2, 2nd to last paragraph).
Claims 5 and 15-18 are provisionally rejected on the ground of nonstatutory double patenting as being unpatentable over claim 5 of copending Application No. 19/358,980 in view of Shazeer et al. (“Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer”) (hereafter referred to as Shazeer) in further view of Vaswani et al. (“Attention Is All You Need”) (hereafter referred to as Vaswani).
This is a provisional nonstatutory double patenting rejection.
Regarding claim 5, claim 5 of the ‘980 application in view of Shazeer discloses all of the limitations of claim 4 as shown above. Claim 5 of the ‘980 application in view of Shazeer does not disclose, but Vaswani does disclose:
wherein the layers in the plurality of layers are arranged in a sequence and wherein every second layer in the sequence has a feed-forward sub-layer that is a conditional computation sub-layer wherein the sequence includes a plurality of encoder layers followed by a plurality of decoder layers (Vaswani, page 2, 2nd to last paragraph, “The Transformer follows this overall architecture using stacked self-attention and point-wise, fully connected layers for both the encoder and decoder, shown in the left and right halves of Figure 1, respectively” and Vaswani, page 3, Figure 1,
[media_image2.png – Vaswani, Figure 1 (greyscale)]
Examiner notes that the encoder (on the left of Figure 1) has N layers and the decoder (on the right of Figure 1) has N layers.).
The disclosures by claim 5 of application No. 19/358,980, Shazeer and Vaswani are considered analogous to the claimed invention because they use machine learning and attention mechanisms to perform machine translation. It would have been obvious to one having ordinary skill in the art prior to the effective filing date to have modified application No. 19/358,980 and Shazeer to use the architecture in Vaswani. Doing so is advantageous because “the Transformer can be trained significantly faster than architectures based on recurrent or convolutional layers” (Vaswani, page 9, Conclusion, 2nd paragraph).
Regarding claim 15, claim 5 of the ‘980 application in view of Shazeer discloses all of the limitations of claim 4 as shown above. Claim 5 of the ‘980 application in view of Shazeer further discloses:
and selecting, from the plurality of expert feed-forward neural networks, a proper subset comprises: identifying the k expert feed-forward neural networks with the k highest gating scores (Shazeer, page 14, last paragraph, “We varied the number of experts between models, using ordinary MoE layers with 4, 32 and 256 experts and hierarchical MoE layers with 256, 1024 and 4096 experts. We call the resulting models MoE-4, MoE-32, MoE-256, MoE-256-h, MoE-1024-h and MoE-4096- h. For the hierarchical MoE layers, the first level branching factor was 16, corresponding to the number of GPUs in our cluster. We use Noisy-Top-K Gating (see Section 2.1) with k = 4 for the ordinary MoE layers and k = 2 at each level of the hierarchical MoE layers” where “Before taking the softmax function, we add tunable Gaussian noise, then keep only the top k values, setting the rest to -∞ (which causes the corresponding gate values to equal 0)” (Shazeer, page 4, 2nd paragraph) and “Let us denote by G(x) and Ei(x) the output of the gating network and the output of the i-th expert network for a given input x….We save computation based on the sparsity of the output G(x). Wherever G(x)i =0, we need not compute Ei(x). In our experiments, we have up to thousands of experts, but only need to evaluate a handful of them every example.” (Shazeer, page 3, 2nd to last paragraph).);
and for each of the k expert feed-forward neural networks: determining whether the expert feed-forward neural network has already been selected a maximum number of times during the processing of the group (Shazeer, page 19, Batchwise Mask section, “To force each expert to receive the exact same number of examples, we introduce an alternative mask function, Mbatchwise(X,M), which operates over batches of input vectors. Instead of keeping the top k values per example, we keep the top m values per expert across the training batch….As our experiments suggest and also observed in (Ioffe & Szegedy, 2015), using a batchwise function during training (such as Mbatchwise) requires modifications to the inference when we may not have a large batch of examples. Our solution to this is to train a vector T of per-expert threshold values to approximate the effects of the batchwise mask….To learn the threshold values, we apply an additional loss at training time which is minimized when the batchwise mask and the threshold mask are identical.” Examiner notes that the threshold determines whether the expert has been selected a maximum number of times.), and
selecting the expert feed-forward neural network only when the expert feed- forward neural network has not already been selected a maximum number of times during the processing of the group (Shazeer, page 13, 1st paragraph, “as discussed in section 4, for load-balancing purposes, we want to define an additional loss function to encourage experts to receive roughly equal numbers of training examples.” Examiner notes that by having equal numbers of training examples, the experts are selected when the experts have not been selected a maximum number of times.).
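For illustration only (not part of the record), the capacity-limited selection recited in claim 15 — identify the k highest-scoring experts, then select each one only if it has not already been chosen a maximum number of times during the processing of the group — can be sketched as follows. All names are hypothetical.

```python
def select_top_k_with_capacity(gate_scores, k, usage_counts, max_uses):
    """Identify the k highest-scoring experts, then select each only if it
    has been used fewer than max_uses times within the current group."""
    top_k = sorted(range(len(gate_scores)),
                   key=lambda i: gate_scores[i], reverse=True)[:k]
    selected = []
    for i in top_k:
        if usage_counts.get(i, 0) < max_uses:   # capacity check
            selected.append(i)
            usage_counts[i] = usage_counts.get(i, 0) + 1
    return selected


counts = {1: 2}  # expert 1 has already been selected twice in this group
chosen = select_top_k_with_capacity([0.1, 0.5, 0.3, 0.05], k=2,
                                    usage_counts=counts, max_uses=2)
# expert 1 is at capacity, so only expert 2 is selected
```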
The disclosures by claim 5 of application No. 19/358,980 and Shazeer are considered analogous to the claimed invention because they both use machine learning and attention mechanisms to perform machine translation. It would have been obvious to one having ordinary skill in the art prior to the effective filing date to have modified application No. 19/358,980 to use the mixture of experts like in Shazeer. Doing so is advantageous because “[they] obtain greater than 1000x improvements in model capacity with only minor losses in computational efficiency and significantly advance the state-of-the-art results on public language modeling and translation data sets” (Shazeer, page 2, 2nd to last paragraph).
Claim 5 of the ‘980 application in view of Shazeer does not disclose, but Vaswani does disclose:
wherein, during training of the attention neural network, the attended layer input is one of a group of attended layer inputs (Vaswani, page 4, 3.2.2 Multi-Head Attention section, “On each of these projected versions of queries, keys, and values we then perform the attention function in parallel, yielding dv-dimensional output values. These are concatenated and once again projected, resulting in the final values, as depicted in Figure 2. Multi-head attention allows the model to jointly attend to information from different representation subspaces at different positions” and “the encoder is composed of a stack of N=6 identical layers. Each layer has two sub-layers. The first is a multi-head self-attention mechanism, and the second is a simple, position-wise fully connected feed-forward network” (Vaswani, page 2 last paragraph – page 3 first paragraph). Examiner notes that the attended input sequence is the output values. Examiner further notes that since there are N multi-head attention sublayers, the attended layer input is one of a group of attended layer inputs.),
The disclosures by claim 5 of application No. 19/358,980, Shazeer and Vaswani are considered analogous to the claimed invention because they use machine learning and attention mechanisms to perform machine translation. It would have been obvious to one having ordinary skill in the art prior to the effective filing date to have modified application No. 19/358,980 and Shazeer to use the architecture in Vaswani. Doing so is advantageous because “the Transformer can be trained significantly faster than architectures based on recurrent or convolutional layers” (Vaswani, page 9, Conclusion, 2nd paragraph).
Regarding claim 16, claim 5 of the ‘980 application in view of Shazeer and Vaswani discloses all of the limitations of claim 15 as shown above. Claim 5 of the ‘980 application in view of Shazeer further discloses:
wherein, during training of the attention neural network selecting, from the plurality of expert feed-forward neural networks, a proper subset further comprises: for one or more of the k expert feed-forward neural networks: determining a probability for the expert feed-forward neural network from at least the gating score for the expert feed-forward neural network (Shazeer, page 13, 1st paragraph, “We define P(x,i) as the probability that G(x)i is nonzero, given a new random choice of noise on element i, but keeping the already-sampled choices of noise on the other elements.” Examiner notes that G(x)i is the gating score.);
selecting the expert feed-forward neural network only when (i) the expert feed- forward neural network has not already been selected a maximum number of times during the processing of the group (Shazeer, page 13, 1st paragraph, “as discussed in section 4, for load-balancing purposes, we want to define an additional loss function to encourage experts to receive roughly equal numbers of training examples.” Examiner notes that by having equal numbers of training examples, the experts are selected when the experts have not been selected a maximum number of times.).
and (ii) the probability for the expert feed-forward neural network exceeds a randomly sampled value between zero and one (Shazeer, page 13, 1st paragraph, “We define P(x,i) as the probability that G(x)i is nonzero, given a new random choice of noise on element i, but keeping the already-sampled choices of noise on the other elements. To compute P(x,i), we note that the G(x)-i is nonzero if and only if H(x)-i is greater than the kth-greatest element of H(x) excluding itself” and Shazeer, page 13, Equation 8
[media_image3.png – Shazeer, Equation 8 (greyscale)]
).
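For illustration only (not part of the record), the probabilistic selection recited in claim 16 — select an expert only when (i) it is under its per-group capacity and (ii) its selection probability exceeds a uniform random draw in [0, 1) — can be sketched as follows. All names are hypothetical.

```python
import random


def stochastic_select(candidate_ids, probs, usage_counts, max_uses, rng=None):
    """Select expert i only if it is under capacity AND its selection
    probability probs[i] exceeds a uniform random sample in [0, 1)."""
    rng = rng or random.Random(0)
    selected = []
    for i in candidate_ids:
        under_capacity = usage_counts.get(i, 0) < max_uses
        if under_capacity and probs[i] > rng.random():
            selected.append(i)
            usage_counts[i] = usage_counts.get(i, 0) + 1
    return selected
```

With probability 1.0 an under-capacity expert is always selected; with probability 0.0 it never is; an expert already at capacity is skipped regardless of its probability.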
The disclosures by claim 5 of application No. 19/358,980 and Shazeer are considered analogous to the claimed invention because they both use machine learning and attention mechanisms to perform machine translation. It would have been obvious to one having ordinary skill in the art prior to the effective filing date to have modified application No. 19/358,980 to use the mixture of experts like in Shazeer. Doing so is advantageous because “[they] obtain greater than 1000x improvements in model capacity with only minor losses in computational efficiency and significantly advance the state-of-the-art results on public language modeling and translation data sets” (Shazeer, page 2, 2nd to last paragraph).
Regarding claim 17, claim 5 of the ‘980 application in view of Shazeer and Vaswani discloses all of the limitations of claim 15 as shown above. Claim 5 of the ‘980 application in view of Shazeer further discloses:
wherein the group of attended layer inputs includes attended layer inputs generated from a proper subset of the network inputs in a batch of training examples (Shazeer, page 5, Section 4 Balancing Expert Utilization, 1st paragraph, “We take a soft constraint approach. We define the importance of an expert relative to a batch of training examples to be the batchwise sum of the gate values for that expert” and Shazeer, page 2, Figure 1,
[media_image1.png – Shazeer, Figure 1 (greyscale)]
Examiner notes that attended layer inputs are the outputs of the MoE layer.).
The disclosures by claim 5 of application No. 19/358,980 and Shazeer are considered analogous to the claimed invention because they both use machine learning and attention mechanisms to perform machine translation. It would have been obvious to one having ordinary skill in the art prior to the effective filing date to have modified application No. 19/358,980 to use the mixture of experts like in Shazeer. Doing so is advantageous because “[they] obtain greater than 1000x improvements in model capacity with only minor losses in computational efficiency and significantly advance the state-of-the-art results on public language modeling and translation data sets” (Shazeer, page 2, 2nd to last paragraph).
Regarding claim 18, claim 5 of the ‘980 application in view of Shazeer and Vaswani discloses all of the limitations of claim 15 as shown above. Claim 5 of the ‘980 application in view of Shazeer further discloses:
wherein the attention neural network is trained on a loss function that includes a term that encourages the conditional computation feed-forward sub-layer to select each expert feed-forward neural network for a same fraction of attended layer inputs among a total number of attended layer inputs within the group (Shazeer, page 13, 1st paragraph, “as discussed in section 4, for load-balancing purposes, we want to define an additional loss function to encourage experts to receive roughly equal numbers of training examples.” Examiner notes that by having equal numbers of training examples, the experts are selected for a same fraction of attended layer inputs among a total number of attended layer inputs.).
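For illustration only (not part of the record), a load-balancing term of the kind Shazeer describes — define each expert's "importance" as the batchwise sum of its gate values and penalize imbalance, here via the squared coefficient of variation — can be sketched as follows. All names, including the loss weight, are hypothetical.

```python
def importance_loss(gate_matrix, weight=0.1):
    """Auxiliary load-balancing loss: penalize the squared coefficient of
    variation of per-expert importance (batchwise sum of gate values).

    gate_matrix[b][i] = gate value of expert i on batch example b.
    """
    n_experts = len(gate_matrix[0])
    importance = [sum(row[i] for row in gate_matrix)
                  for i in range(n_experts)]
    mean = sum(importance) / n_experts
    var = sum((v - mean) ** 2 for v in importance) / n_experts
    return weight * var / (mean ** 2)  # zero when all experts are equal
```

The loss is zero when every expert receives the same total gate mass over the batch and grows as the routing becomes more lopsided, which encourages the gating function to select each expert for roughly the same fraction of inputs.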
The disclosures by claim 5 of application No. 19/358,980 and Shazeer are considered analogous to the claimed invention because they both use machine learning and attention mechanisms to perform machine translation. It would have been obvious to one having ordinary skill in the art prior to the effective filing date to have modified application No. 19/358,980 to use the mixture of experts like in Shazeer. Doing so is advantageous because “[they] obtain greater than 1000x improvements in model capacity with only minor losses in computational efficiency and significantly advance the state-of-the-art results on public language modeling and translation data sets” (Shazeer, page 2, 2nd to last paragraph).
Claim 19 is provisionally rejected on the ground of nonstatutory double patenting as being unpatentable over claim 5 of copending Application No. 19/358,980 in view of Shazeer et al. (“Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer”) (hereafter referred to as Shazeer) in further view of Vaswani et al. (“Attention Is All You Need”) (hereafter referred to as Vaswani) and Drew (US 2014/0222736 A1) (hereafter referred to as Drew).
Regarding claim 19, claim 5 of the ‘980 application in view of Shazeer and Vaswani discloses all of the limitations of claim 15 as shown above. Claim 5 of the ‘980 application in view of Shazeer and Vaswani further discloses:
when each of the k identified expert feed-forward neural networks have already been selected a maximum number of times during the processing of the group, selecting a subset (Shazeer, page 19, Batchwise Mask section, “To force each expert to receive the exact same number of examples, we introduce an alternative mask function, Mbatchwise(X,M), which operates over batches of input vectors. Instead of keeping the top k values per example, we keep the top m values per expert across the training batch where m = k|X|/n, so that each example is sent to an average of k experts….As our experiments suggest and also observed in (Ioffe & Szegedy, 2015), using a batchwise function during training (such as Mbatchwise) requires modifications to the inference when we may not have a large batch of examples. Our solution to this is to train a vector T of per-expert threshold values to approximate the effects of the batchwise mask….To learn the threshold values, we apply an additional loss at training time which is minimized when the batchwise mask and the threshold mask are identical” and “we want to define an additional loss function to encourage experts to receive roughly equal numbers of training examples” (Shazeer, page 13, 1st paragraph). Examiner notes that the threshold determines whether the expert has been selected a maximum number of times. Examiner further notes that the k experts are the subset.)
The disclosures by claim 5 of application No. 19/358,980 and Shazeer are considered analogous to the claimed invention because they both use machine learning and attention mechanisms to perform machine translation. It would have been obvious to one having ordinary skill in the art prior to the effective filing date to have modified application No. 19/358,980 to use the mixture of experts like in Shazeer. Doing so is advantageous because “[they] obtain greater than 1000x improvements in model capacity with only minor losses in computational efficiency and significantly advance the state-of-the-art results on public language modeling and translation data sets” (Shazeer, page 2, 2nd to last paragraph).
Claim 5 of the ‘980 application in view of Shazeer and Vaswani does not disclose, but Drew does disclose:
selecting a subset that includes zero expert feed-forward neural networks and setting the combined output to zero
Drew, however, does teach
when each of the k identified …networks have already been selected a maximum number of times during the processing of the group, selecting a subset that includes zero…networks and setting the combined output to zero (Drew, page 10, paragraph 0031, “If a transitional Mapping Outputs 18 production capacity has been set and subsequently exceeded the Blocking Mechanism 20 will force transitional Mapping Outputs 18 producers, which are in this case the Mapping Operations Module 16, to “block” or wait until the production capacity falls back below the pre-determined production capacity threshold.” Examiner notes that the production capacity being exceeded is the networks selected a maximum number of times. Examiner further notes that the blocking or waiting is selecting a subset that includes zero networks. Additionally by blocking or waiting, the combined output is set to zero.)
The disclosures by claim 5 of application No. 19/358,980, Shazeer, Vaswani, and Drew are considered analogous to the claimed invention because they use machine learning in a distributed setting to schedule tasks. It would have been obvious to one having ordinary skill in the art prior to the effective filing date to have modified the combination of claim 5 of application No. 19/358,980, Shazeer, and Vaswani to select a subset that includes zero expert feed-forward neural networks and set the combined output to zero. Doing so is advantageous because this “allows performing multiple stages of learning or classification simultaneously by dividing up learning and classification work into individual independent units which facilitate concurrent and rapid parallel processing during multiple learning and classification stages” (Drew, page 9, paragraph 0020).
Claim Rejections - 35 USC § 103
In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis (i.e., changing from AIA to pre-AIA) for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.
The factual inquiries for establishing a background for determining obviousness under 35 U.S.C. 103 are summarized as follows:
1. Determining the scope and contents of the prior art.
2. Ascertaining the differences between the prior art and the claims at issue.
3. Resolving the level of ordinary skill in the pertinent art.
4. Considering objective evidence present in the application indicating obviousness or nonobviousness.
This application currently names joint inventors. In considering patentability of the claims the examiner presumes that the subject matter of the various claims was commonly owned as of the effective filing date of the claimed invention(s) absent any evidence to the contrary. Applicant is advised of the obligation under 37 CFR 1.56 to point out the inventor and effective filing dates of each claim that was not commonly owned as of the effective filing date of the later invention in order for the examiner to consider the applicability of 35 U.S.C. 102(b)(2)(C) for any potential 35 U.S.C. 102(a)(2) prior art against the later invention.
Claim(s) 1-18 and 20-23 is/are rejected under 35 U.S.C. 103 as being unpatentable over Vaswani et al. (“Attention Is All You Need”) (hereafter referred to as Vaswani) in view of Shazeer et al. (“Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer”) (hereafter referred to as Shazeer).
Regarding claim 1, Vaswani teaches
A system for performing a machine learning task on a network input to generate a network output, the system comprising one or more computers and one or more storage devices storing instructions (Vaswani, page 7, Section 5.2 Hardware and Schedule, “We trained our models on one machine with 8 NVIDIA P100 GPUs” and Vaswani, page 3, Figure 1,
[Image: media_image2.png]
Examiner notes that the inputs in Figure 1 are the network input and the output probabilities in Figure 1 are the network output.)
that, when executed by the one or more computers, cause the one or more computers to implement: an attention neural network configured to perform the machine learning task, the attention neural network comprising a plurality of layers, each layer comprising an attention sub-layer and a feed-forward sub-layer, (Vaswani, page 2 last paragraph – page 3 first paragraph, “the encoder is composed of a stack of N=6 identical layers. Each layer has two sub-layers. The first is a multi-head self-attention mechanism, and the second is a simple, position-wise fully connected feed-forward network.” Examiner notes that the multi-head self-attention layer is the attention sub-layer and the position-wise fully connected feed-forward network is the feed-forward sub-layer) the attention sub-layer configured to:
receive an input sequence for the layer comprising a respective layer input at each of one or more positions (Vaswani, page 5, 3.2.3 Applications of Attention in our Model section, “The Transformer uses multi-head attention in three different ways: In ‘encoder-decoder attention’ layers, the queries come from the previous decoder layer, and the memory keys and values come from the output of the encoder. This allows every position in the decoder to attend over all positions in the input sequence.”);
and generate an attended input sequence at least in part by applying an attention mechanism to the input sequence for the layer, the attended input sequence comprising a respective attended layer input at each of the one or more positions (Vaswani, page 4, 3.2.2 Multi-Head Attention section, “On each of these projected versions of queries, keys, and values we then perform the attention function in parallel, yielding dv-dimensional output values. These are concatenated and once again projected, resulting in the final values, as depicted in Figure 2. Multi-head attention allows the model to jointly attend to information from different representation subspaces at different positions.” Examiner notes that the attended input sequence is the output values.),
and the feed-forward sub-layer configured to: receive the attended input sequence (Vaswani, page 5, 3.3 Position-wise Feed-Forward Networks, “In addition to attention sub-layers, each of the layers in our encoder and decoder contains a fully connected feed-forward network, which is applied to each position separately and identically. This consists of two linear transformations with a ReLU activation in between…While the linear transformations are the same across different positions, they use different parameters from layer to layer.”);
and generate an output sequence for the layer from the attended input sequence, the output sequence comprising a respective layer output at each of the one or more positions (Vaswani, page 5, 3.3 Position-wise Feed-Forward Networks, “In addition to attention sub-layers, each of the layers in our encoder and decoder contains a fully connected feed-forward network, which is applied to each position separately and identically. This consists of two linear transformations with a ReLU activation in between…While the linear transformations are the same across different positions, they use different parameters from layer to layer. Another way of describing this is as two convolutions with kernel size 1. The dimensionality of input and output is dmodel=512, and the inner-layer has dimensionality dff=2048.”),
and wherein, for each layer that is not in the subset of the plurality of layers, the feed-forward sub-layer is configured to generate the output sequence for the layer by processing each respective attended layer input at each of the positions in the input sequence for the layer using a single feed-forward neural network (Vaswani, page 5, 3.3 Position-wise Feed-Forward Networks, “In addition to attention sub-layers, each of the layers in our encoder and decoder contains a fully connected feed-forward network, which is applied to each position separately and identically” and Vaswani, page 3, Figure 1,
[Image: media_image2.png]
Examiner notes that the layers not in the subset of the plurality of layers are the multi-head attention layer and the feed-forward layer. Examiner further notes that the output sequence is the output probabilities and that there is a respective attended layer input at each position.)
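The position-wise feed-forward computation quoted from Vaswani can be sketched as follows (toy dimensions and hypothetical names; the paper uses dmodel=512 and dff=2048): two linear transformations with a ReLU in between, applied separately and identically at each position, with the same parameters reused across positions within a layer.

```python
# Toy sketch of a position-wise feed-forward sub-layer (hypothetical names).
def linear(x, weight_cols, bias):
    # weight_cols: one column (list of len(x) weights) per output unit.
    return [sum(xi * wij for xi, wij in zip(x, col)) + bj
            for col, bj in zip(weight_cols, bias)]

def ffn(x, w1, b1, w2, b2):
    hidden = [max(0.0, h) for h in linear(x, w1, b1)]  # ReLU in between
    return linear(hidden, w2, b2)

def ffn_sublayer(sequence, params):
    # The same parameters are applied at every position of the sequence.
    return [ffn(x, *params) for x in sequence]
```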
Vaswani does not teach, but Shazeer does teach
wherein, for each layer in a subset of the plurality of layers, the feed-forward sub-layer is a conditional computation sub-layer that (i) comprises a plurality of expert feed-forward neural networks and (ii) is configured to generate the output sequence for the layer by performing operations (Shazeer, page 2, last paragraph, “Our approach to conditional computation is to introduce a new type of general purpose neural network component: a Sparsely-Gated Mixture-of-Experts Layer (MoE). The MoE consists of a number of experts, each a simple feed-forward neural network, and a trainable gating network which selects a sparse combination of the experts to process each input.”)
for each of the positions in the input sequence for the layer: receiving the respective attended layer input at the position (Shazeer, page 3, 1st paragraph, “The MoE is called once for each position in the text, selecting a potentially different combination of experts at each position” where “We insert MoE layers in both the encoder (between layers 2 and 3) and the decoder (between layers 1 and 2). We use an attention mechanism between the encoder and the decoder, with the first decoder LSTM receiving output from and providing input for the attention” (Shazeer, page 17, Section E Machine Translation – Experimental Details). Examiner notes that the attended layer input is created by the attention mechanism.);
applying a gating function to the respective attended layer input at the position to generate a respective gate score for each of the plurality of expert feed-forward neural networks (Shazeer, page 5, Section 4 Balancing Expert Utilization, 1st paragraph, “We take a soft constraint approach. We define the importance of an expert relative to a batch of training examples to be the batchwise sum of the gate values for that expert” and Shazeer, page 2, Figure 1,
[Image: media_image1.png]
Examiner notes that the gating network is the gating function and the gate values are the gate scores. Examiner further notes that the attended layer input is x in Figure 1.);
selecting, from the plurality of expert feed-forward neural networks, a proper subset of expert feed-forward neural networks based at least on the respective gate scores (Shazeer, page 2, Figure 1: “A Mixture of Experts (MoE) layer embedded within a recurrent language model. In this case, the sparse gating function selects two experts to perform computations. Their outputs are modulated by the outputs of the gating network.” Examiner notes that the two experts are a proper subset.);
processing the respective attended layer input at the position using each of the expert feed-forward neural networks in the proper subset to generate a respective expert output for each of the expert feed-forward neural networks (Shazeer, page 2, 1st paragraph, “The MoE is called once for each position in the text, selecting a potentially different combination of experts at each position” and Shazeer, page 2, Figure 1,
[Image: media_image1.png]
Examiner notes that Expert 2 and Expert n-1 are in the proper subset and each generate an output. Examiner further notes that the attended layer input is x in Figure 1.);
combining the respective expert outputs to generate a combined expert output (Shazeer, page 2, Figure 1,
[Image: media_image1.png]
Examiner notes that Expert 2 and Expert n-1 are concatenated.);
and generating the respective layer output at the position from the combined expert output (Shazeer, page 2, Figure 1,
[Image: media_image1.png]
Examiner notes that the concatenation of Expert 2 and Expert n-1 is the respective layer output from the combined expert output.).
Vaswani and Shazeer are considered analogous to the claimed invention because they both use machine learning and attention mechanisms to perform machine translation. It would have been obvious to one having ordinary skill in the art prior to the effective filing date to have modified Vaswani to use the mixture of experts like in Shazeer. Doing so is advantageous because “[they] obtain greater than 1000x improvements in model capacity with only minor losses in computational efficiency and significantly advance the state-of-the-art results on public language modeling and translation data sets” (Shazeer, page 2, 2nd to last paragraph).
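The Sparsely-Gated Mixture-of-Experts computation mapped above can be sketched as follows (hypothetical names; toy scalar experts stand in for feed-forward networks): a gating function scores every expert, only a proper subset (the top k) is evaluated, and the selected experts' outputs are combined weighted by their gate values.

```python
import math

# Hedged sketch of a sparsely-gated MoE step (hypothetical names).
def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def moe_layer(x, experts, gate_weights, k=2):
    # Gating function: one score per expert, normalized to gate values.
    scores = [sum(wi * xi for wi, xi in zip(w, x)) for w in gate_weights]
    gates = softmax(scores)
    top_k = sorted(range(len(experts)), key=lambda i: gates[i])[-k:]
    # Conditional computation: only the k selected experts are evaluated,
    # and their outputs are combined weighted by the gate values.
    return sum(gates[i] * experts[i](x) for i in top_k)
```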
Regarding claim 2, Vaswani in view of Shazeer teach the system of claim 1. Vaswani does not teach, but Shazeer does teach
wherein the subset of the plurality of layers includes fewer than all of the plurality of layers in the attention neural network (Shazeer, page 2, last paragraph, “Our approach to conditional computation is to introduce a new type of general purpose neural network component: a Sparsely-Gated Mixture-of-Experts Layer (MoE). The MoE consists of a number of experts, each with a simple feed-forward neural network, and a trainable gating network which selects a sparse combination of the experts to process each input” and Shazeer, page 2, Figure 1,
[Image: media_image1.png]
Examiner notes that the selected experts is the subset of the plurality of layers.).
Vaswani and Shazeer are considered analogous to the claimed invention because they both use machine learning and attention mechanisms to perform machine translation. It would have been obvious to one having ordinary skill in the art prior to the effective filing date to have modified Vaswani to use the mixture of experts like in Shazeer. Doing so is advantageous because “[they] obtain greater than 1000x improvements in model capacity with only minor losses in computational efficiency and significantly advance the state-of-the-art results on public language modeling and translation data sets” (Shazeer, page 2, 2nd to last paragraph).
Regarding claim 4, Vaswani in view of Shazeer teach the system of claim 1. Vaswani does not teach, but Shazeer does teach
wherein the layers in the plurality of layers are arranged in a sequence and wherein every second layer in the sequence has a feed-forward sub-layer that is a conditional computation sub-layer (Shazeer, page 2, Figure 1,
[Image: media_image1.png]
and Shazeer, page 2, last paragraph, “Our approach to conditional computation is to introduce a new type of general purpose neural network component: a Sparsely-Gated Mixture-of-Experts Layer (MoE). The MoE consists of a number of experts, each with a simple feed-forward neural network, and a trainable gating network which selects a sparse combination of the experts to process each input.” Examiner notes that the MoE layer is the second layer, which has a feed-forward sub-layer that is a conditional computation sub-layer.)
Vaswani and Shazeer are considered analogous to the claimed invention because they both use machine learning and attention mechanisms to perform machine translation. It would have been obvious to one having ordinary skill in the art prior to the effective filing date to have modified Vaswani to use the mixture of experts like in Shazeer. Doing so is advantageous because “[they] obtain greater than 1000x improvements in model capacity with only minor losses in computational efficiency and significantly advance the state-of-the-art results on public language modeling and translation data sets” (Shazeer, page 2, 2nd to last paragraph).
Regarding claim 5, Vaswani in view of Shazeer teach the system of claim 4. Vaswani teaches
wherein the layers in the plurality of layers are arranged in a sequence and wherein every second layer in the sequence has a feed-forward sub-layer that is a conditional computation sub-layer wherein the sequence includes a plurality of encoder layers followed by a plurality of decoder layers (Vaswani, page 2, 2nd to last paragraph, “The Transformer follows this overall architecture using stacked self-attention and point-wise, fully connected layers for both the encoder and decoder, showing in the left and right halves of Figure 1, respectively” and Vaswani, page 3, Figure 1,
[Image: media_image2.png]
Examiner notes that the encoder (on the left of Figure 1) has N layers and the decoder (on the right of Figure 1) has N layers.).
Regarding claim 6, Vaswani in view of Shazeer teach the system of claim 1. Vaswani further teaches
wherein the system includes a plurality of hardware devices (Vaswani, page 7, Section 5.2 Hardware and Schedule, “We trained our models on one machine with 8 NVIDIA P100 GPUs.”),
Vaswani does not teach, but Shazeer does teach
and wherein implementing the attention neural network comprises: sharding each conditional computational sub-layer across two or more of the plurality of devices (Shazeer, page 4, 2nd to last paragraph, “We distribute the standard layers of the model and the gating network according to conventional data-parallel schemes, but keep only one shared copy of each expert. Each expert in the MoE layer receives a combined batch consisting of the relevant examples from all of the data-parallel input batches. The same set of devices function as data-parallel replicas (for the standard layers and the gating networks) and as model-parallel shards (each hosting a subset of the experts). If the model is distributed over d devices, and each device processes a batch of size b, each expert receives a batch of approximately
kbd/n
examples. Thus, we achieve a factor of d improvement in expert batch size.” Examiner notes that the model-parallel shards, each hosting a subset of the experts, correspond to sharding each conditional computation sub-layer.).
Vaswani and Shazeer are considered analogous to the claimed invention because they both use machine learning and attention mechanisms to perform machine translation. It would have been obvious to one having ordinary skill in the art prior to the effective filing date to have modified Vaswani to shard the layers across devices like in Shazeer. Doing so is advantageous because “[they] achieve a factor of d improvement in expert batch size” (Shazeer, page 4, 2nd to last paragraph).
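The quoted batch-size relation can be checked with a short worked example (numbers are illustrative only, not drawn from the record): with d devices, per-device batch size b, n experts, and k experts selected per example, each expert receives approximately k*b*d/n examples.

```python
# Worked check of the quoted relation (illustrative numbers).
def expert_batch_size(k, b, d, n):
    return k * b * d / n

single_device = expert_batch_size(k=4, b=1024, d=1, n=256)   # 16.0 examples
distributed = expert_batch_size(k=4, b=1024, d=16, n=256)    # 256.0 examples
# distributed / single_device == 16 == d: the factor-of-d improvement.
```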
Regarding claim 7, Vaswani in view of Shazeer teach the system of claim 6. Vaswani does not teach, but Shazeer does teach
wherein implementing the attention neural network comprises: replicating each attention sub-layer across two or more of the plurality of devices (Shazeer, page 4, 2nd to last paragraph, “We distribute the standard layers of the model and the gating network according to conventional data-parallel schemes, but keep only one shared copy of each expert. Each expert in the MoE layer receives a combined batch consisting of the relevant examples from all of the data-parallel input batches. The same set of devices function as data-parallel replicas (for the standard layers and the gating networks) and as model-parallel shards (each hosting a subset of the experts). If the model is distributed over d devices, and each device processes a batch of size b, each expert receives a batch of approximately
kbd/n
examples. Thus, we achieve a factor of d improvement in expert batch size.” Examiner notes that the data-parallel replicas correspond to replicating each attention sub-layer across two or more of the plurality of devices.).
Vaswani and Shazeer are considered analogous to the claimed invention because they both use machine learning and attention mechanisms to perform machine translation. It would have been obvious to one having ordinary skill in the art prior to the effective filing date to have modified Vaswani to replicating layers across devices like in Shazeer. Doing so is advantageous because “[they] achieve a factor of d improvement in expert batch size” (Shazeer, page 4, 2nd to last paragraph).
Regarding claim 8, Vaswani in view of Shazeer teach the system of claim 1. Vaswani does not teach, but Shazeer does teach
wherein generating the layer output from the combined expert output comprises: applying a residual connection and normalization to the combined expert outputs at the positions to generate the output sequence (Shazeer, page 14, Section C 1 Billion Word Language Modeling Benchmark – Experimental Details, 1st paragraph, “Model Architecture: Our model consists of five layers: a word embedding layer, a recurrent Long Short-Term Memory (LSTM) layer (Hochreiter & Schmidhuber, 1997; Gers et al., 2000), a MoE layer, a second LSTM layer, and a softmax layer. The dimensionality of the embedding layer, the number of units in each LSTM layer, and the input and output dimensionality of the MoE layer are all equal to 512. For every layer other than the softmax, we apply dropout (Zaremba et al., 2014) to the layer output, dropping each activation with probability DropProb, otherwise dividing by (1 - DropProb). After dropout, the output of the previous layer is added to the layer output. This residual connection encourages gradient flow (He et al., 2015).” Examiner notes that the softmax layer is the normalization.).
Vaswani and Shazeer are considered analogous to the claimed invention because they both use machine learning and attention mechanisms to perform machine translation. It would have been obvious to one having ordinary skill in the art prior to the effective filing date to have modified Vaswani to use the mixture of experts like in Shazeer. Doing so is advantageous because “[they] obtain greater than 1000x improvements in model capacity with only minor losses in computational efficiency and significantly advance the state-of-the-art results on public language modeling and translation data sets” (Shazeer, page 2, 2nd to last paragraph).
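The residual-connection-and-normalization step recited in claim 8 can be sketched as follows (hypothetical names; a simple mean/variance normalization stands in here for the cited normalization): the combined expert output is added back to the sub-layer input, and the sum is normalized.

```python
import math

# Sketch of residual connection followed by normalization (hypothetical
# names; mean/variance normalization used for illustration).
def normalize(v, eps=1e-5):
    mean = sum(v) / len(v)
    var = sum((x - mean) ** 2 for x in v) / len(v)
    return [(x - mean) / math.sqrt(var + eps) for x in v]

def residual_and_norm(sublayer_input, combined_expert_output):
    # Residual connection: add the input back to the sub-layer output.
    summed = [a + b for a, b in zip(sublayer_input, combined_expert_output)]
    return normalize(summed)
```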
Regarding claim 9, Vaswani in view of Shazeer teach the system of claim 1. Vaswani does not teach, but Shazeer does teach
wherein selecting, from the plurality of expert feed-forward neural networks, a proper subset based at least on the respective gate scores comprises: selecting at most k of a total number E of expert feed-forward neural networks in the plurality of expert feed-forward neural networks (Shazeer, page 14, last paragraph, “We varied the number of experts between models, using ordinary MoE layers with 4, 32 and 256 experts and hierarchical MoE layers with 256, 1024 and 4096 experts. We call the resulting models MoE-4, MoE-32, MoE-256, MoE-256-h, MoE-1024-h and MoE-4096- h. For the hierarchical MoE layers, the first level branching factor was 16, corresponding to the number of GPUs in our cluster. We use Noisy-Top-K Gating (see Section 2.1) with k = 4 for the ordinary MoE layers and k = 2 at each level of the hierarchical MoE layers.”).
Vaswani and Shazeer are considered analogous to the claimed invention because they both use machine learning and attention mechanisms to perform machine translation. It would have been obvious to one having ordinary skill in the art prior to the effective filing date to have modified Vaswani to use the mixture of experts like in Shazeer. Doing so is advantageous because “[they] obtain greater than 1000x improvements in model capacity with only minor losses in computational efficiency and significantly advance the state-of-the-art results on public language modeling and translation data sets” (Shazeer, page 2, 2nd to last paragraph).
Regarding claim 10, Vaswani in view of Shazeer teach the system of claim 9. Vaswani does not teach, but Shazeer does teach
wherein k is 2 (Shazeer, page 14, last paragraph, “We varied the number of experts between models, using ordinary MoE layers with 4, 32 and 256 experts and hierarchical MoE layers with 256, 1024 and 4096 experts. We call the resulting models MoE-4, MoE-32, MoE-256, MoE-256-h, MoE-1024-h and MoE-4096- h. For the hierarchical MoE layers, the first level branching factor was 16, corresponding to the number of GPUs in our cluster. We use Noisy-Top-K Gating (see Section 2.1) with k = 4 for the ordinary MoE layers and k = 2 at each level of the hierarchical MoE layers.” Examiner notes that k=2.).
Vaswani and Shazeer are considered analogous to the claimed invention because they both use machine learning and attention mechanisms to perform machine translation. It would have been obvious to one having ordinary skill in the art prior to the effective filing date to have modified Vaswani to use the mixture of experts like in Shazeer. Doing so is advantageous because “[they] obtain greater than 1000x improvements in model capacity with only minor losses in computational efficiency and significantly advance the state-of-the-art results on public language modeling and translation data sets” (Shazeer, page 2, 2nd to last paragraph).
Regarding claim 11, Vaswani in view of Shazeer teach the system of claim 10. Vaswani does not teach, but Shazeer does teach
wherein E is at least 100 (Shazeer, page 14, last paragraph, “We varied the number of experts between models, using ordinary MoE layers with 4, 32 and 256 experts and hierarchical MoE layers with 256, 1024 and 4096 experts.”).
Vaswani and Shazeer are considered analogous to the claimed invention because they both use machine learning and attention mechanisms to perform machine translation. It would have been obvious to one having ordinary skill in the art prior to the effective filing date to have modified Vaswani to use the mixture of experts like in Shazeer. Doing so is advantageous because “[they] obtain greater than 1000x improvements in model capacity with only minor losses in computational efficiency and significantly advance the state-of-the-art results on public language modeling and translation data sets” (Shazeer, page 2, 2nd to last paragraph).
Regarding claim 12, Vaswani in view of Shazeer teach the system of claim 11. Vaswani does not teach, but Shazeer does teach
wherein E is at least 500 (Shazeer, page 14, last paragraph, “We varied the number of experts between models, using ordinary MoE layers with 4, 32 and 256 experts and hierarchical MoE layers with 256, 1024 and 4096 experts.”).
Vaswani and Shazeer are considered analogous to the claimed invention because they both use machine learning and attention mechanisms to perform machine translation. It would have been obvious to one having ordinary skill in the art prior to the effective filing date to have modified Vaswani to use the mixture of experts like in Shazeer. Doing so is advantageous because “[they] obtain greater than 1000x improvements in model capacity with only minor losses in computational efficiency and significantly advance the state-of-the-art results on public language modeling and translation data sets” (Shazeer, page 2, 2nd to last paragraph).
Regarding claim 13, Vaswani in view of Shazeer teach the system of claim 12. Vaswani does not teach, but Shazeer does teach
wherein E is at least 2000 (Shazeer, page 14, last paragraph, “We varied the number of experts between models, using ordinary MoE layers with 4, 32 and 256 experts and hierarchical MoE layers with 256, 1024 and 4096 experts.”).
Vaswani and Shazeer are considered analogous to the claimed invention because they both use machine learning and attention mechanisms to perform machine translation. It would have been obvious to one having ordinary skill in the art prior to the effective filing date to have modified Vaswani to use the mixture of experts like in Shazeer. Doing so is advantageous because “[they] obtain greater than 1000x improvements in model capacity with only minor losses in computational efficiency and significantly advance the state-of-the-art results on public language modeling and translation data sets” (Shazeer, page 2, 2nd to last paragraph).
Regarding claim 14, Vaswani in view of Shazeer teach the system of claim 9. Vaswani does not teach, but Shazeer does teach
wherein, after training the attention neural network, selecting, from the plurality of expert feed-forward neural networks, a proper subset comprises: selecting the k experts with the k highest gating scores (Shazeer, page 4, 2nd paragraph, “Before taking the softmax function, we add tunable Gaussian noise, then keep only the top k values, setting the rest to -∞ (which causes the corresponding gate values to equal 0)” and “Let us denote by G(x) and Ei(x) the output of the gating network and the output of the i-th expert network for a given input x….We save computation based on the sparsity of the output G(x). Wherever G(x)i =0, we need not compute Ei(x). In our experiments, we have up to thousands of experts, but only need to evaluate a handful of them every example.”(Shazeer, page 3, 2nd to last paragraph). Examiner notes that the top k values are the k highest gating scores and the selection is the computing the experts on input x.).
Vaswani and Shazeer are considered analogous to the claimed invention because they both use machine learning and attention mechanisms to perform machine translation. It would have been obvious to one having ordinary skill in the art prior to the effective filing date to have modified Vaswani to use the mixture of experts like in Shazeer. Doing so is advantageous because “[they] obtain greater than 1000x improvements in model capacity with only minor losses in computational efficiency and significantly advance the state-of-the-art results on public language modeling and translation data sets” (Shazeer, page 2, 2nd to last paragraph).
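The Noisy Top-K Gating quoted from Shazeer can be sketched as follows (hypothetical names; a plain Gaussian noise term stands in for the paper's tunable noise): noise is added to the gate logits, all but the top k values are set to negative infinity so their gate values become zero, and a softmax is taken over the result.

```python
import math
import random

# Hedged sketch of noisy top-k gating (hypothetical names).
def noisy_top_k_gates(logits, k, noise_scale=1.0, rng=None):
    rng = rng or random.Random(0)
    noisy = [l + rng.gauss(0.0, noise_scale) for l in logits]
    threshold = sorted(noisy, reverse=True)[k - 1]
    # Keep only the top k values; set the rest to -inf.
    kept = [v if v >= threshold else float("-inf") for v in noisy]
    m = max(kept)
    exps = [math.exp(v - m) for v in kept]   # exp(-inf) == 0.0
    total = sum(exps)
    return [e / total for e in exps]         # unselected gates are exactly 0
```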
Regarding claim 15, Vaswani in view of Shazeer teach the system of claim 9. Vaswani further teaches
wherein, during training of the attention neural network, the attended layer input is one of a group of attended layer inputs (Vaswani, page 4, 3.2.2 Multi-Head Attention section, “On each of these projected versions of queries, keys, and values we then perform the attention function in parallel, yielding dv-dimensional output values. These are concatenated and once again projected, resulting in the final values, as depicted in Figure 2. Multi-head attention allows the model to jointly attend to information from different representation subspaces at different positions” and “the encoder is composed of a stack of N=6 identical layers. Each layer has two sub-layers. The first is a multi-head self-attention mechanism, and the second is a simple, position-wise fully connected feed-forward network” (Vaswani, page 2 last paragraph – page 3 first paragraph). Examiner notes that the attended input sequence is the output values. Examiner further notes that since there are N multi-head attention sublayers, the attended layer input is one of a group of attended layer inputs.),
Vaswani does not teach, but Shazeer does teach
and selecting, from the plurality of expert feed-forward neural networks, a proper subset comprises: identifying the k expert feed-forward neural networks with the k highest gating scores (Shazeer, page 14, last paragraph, “We varied the number of experts between models, using ordinary MoE layers with 4, 32 and 256 experts and hierarchical MoE layers with 256, 1024 and 4096 experts. We call the resulting models MoE-4, MoE-32, MoE-256, MoE-256-h, MoE-1024-h and MoE-4096- h. For the hierarchical MoE layers, the first level branching factor was 16, corresponding to the number of GPUs in our cluster. We use Noisy-Top-K Gating (see Section 2.1) with k = 4 for the ordinary MoE layers and k = 2 at each level of the hierarchical MoE layers” where “Before taking the softmax function, we add tunable Gaussian noise, then keep only the top k values, setting the rest to -∞ (which causes the corresponding gate values to equal 0)” (Shazeer, page 4, 2nd paragraph) and “Let us denote by G(x) and Ei(x) the output of the gating network and the output of the i-th expert network for a given input x….We save computation based on the sparsity of the output G(x). Wherever G(x)i =0, we need not compute Ei(x). In our experiments, we have up to thousands of experts, but only need to evaluate a handful of them every example.” (Shazeer, page 3, 2nd to last paragraph).);
and for each of the k expert feed-forward neural networks: determining whether the expert feed-forward neural network has already been selected a maximum number of times during the processing of the group (Shazeer, page 19, Batchwise Mask section, “To force each expert to receive the exact same number of examples, we introduce an alternative mask function, Mbatchwise(X,M), which operates over batches of input vectors. Instead of keeping the top k values per example, we keep the top m values per expert across the training batch….As our experiments suggest and also observed in (Ioffe & Szegedy, 2015), using a batchwise function during training (such as Mbatchwise) requires modifications to the inference when we may not have a large batch of examples. Our solution to this is to train a vector T of per-expert threshold values to approximate the effects of the batchwise mask….To learn the threshold values, we apply an additional loss at training time which is minimized when the batchwise mask and the threshold mask are identical.” Examiner notes that the threshold determines whether the expert has been selected a maximum number of times.), and
selecting the expert feed-forward neural network only when the expert feed-forward neural network has not already been selected a maximum number of times during the processing of the group (Shazeer, page 13, 1st paragraph, “as discussed in section 4, for load-balancing purposes, we want to define an additional loss function to encourage experts to receive roughly equal numbers of training examples.” Examiner notes that by having equal numbers of training examples, the experts are selected when the experts have not been selected a maximum number of times.).
Vaswani and Shazeer are considered analogous to the claimed invention because they both use machine learning and attention mechanisms to perform machine translation. It would have been obvious to one having ordinary skill in the art prior to the effective filing date to have modified Vaswani to use the mixture of experts like in Shazeer. Doing so is advantageous because “[they] obtain greater than 1000x improvements in model capacity with only minor losses in computational efficiency and significantly advance the state-of-the-art results on public language modeling and translation data sets” (Shazeer, page 2, 2nd to last paragraph).
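The capacity-limited selection recited in claim 15 can be sketched as follows (hypothetical names): the k experts with the highest gating scores are identified, but each is selected only if it has not already been chosen the maximum number of times during processing of the group.

```python
from collections import Counter

# Sketch of capacity-limited top-k selection (hypothetical names).
def select_with_capacity(gate_scores, k, counts, max_per_group):
    # Identify the k experts with the k highest gating scores.
    top_k = sorted(range(len(gate_scores)),
                   key=lambda i: gate_scores[i], reverse=True)[:k]
    # Select only experts not yet at their per-group cap.
    chosen = [i for i in top_k if counts[i] < max_per_group]
    for i in chosen:
        counts[i] += 1
    return chosen

counts = Counter()
first = select_with_capacity([0.9, 0.5, 0.1], k=2, counts=counts, max_per_group=1)
second = select_with_capacity([0.9, 0.5, 0.1], k=2, counts=counts, max_per_group=1)
# first selects experts 0 and 1; second selects none, as both are at capacity.
```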
Regarding claim 16, Vaswani in view of Shazeer teach the system of claim 15. Vaswani does not teach, but Shazeer does teach
wherein, during training of the attention neural network selecting, from the plurality of expert feed-forward neural networks, a proper subset further comprises: for one or more of the k expert feed-forward neural networks: determining a probability for the expert feed-forward neural network from at least the gating score for the expert feed-forward neural network (Shazeer, page 13, 1st paragraph, “We define P(x,i) as the probability that G(x)i is nonzero, given a new random choice of noise on element i, but keeping the already-sampled choices of noise on the other elements.” Examiner notes that G(x)i is the gating score.);
selecting the expert feed-forward neural network only when (i) the expert feed-forward neural network has not already been selected a maximum number of times during the processing of the group (Shazeer, page 13, 1st paragraph, “as discussed in section 4, for load-balancing purposes, we want to define an additional loss function to encourage experts to receive roughly equal numbers of training examples.” Examiner notes that by having equal numbers of training examples, the experts are selected when the experts have not been selected a maximum number of times.),
and (ii) the probability for the expert feed-forward neural network exceeds a randomly sampled value between zero and one (Shazeer, page 13, 1st paragraph, “We define P(x,i) as the probability that G(x)i is nonzero, given a new random choice of noise on element i, but keeping the already-sampled choices of noise on the other elements. To compute P(x,i), we note that the G(x)-i is nonzero if and only if H(x)-i is greater than the kth-greatest element of H(x) excluding itself” and Shazeer, page 13, Equation 8
[media_image3.png: Shazeer, Equation 8 (greyscale)]
).
Vaswani and Shazeer are considered analogous to the claimed invention because they both use machine learning and attention mechanisms to perform machine translation. It would have been obvious to one having ordinary skill in the art prior to the effective filing date to have modified Vaswani to use the mixture of experts like in Shazeer. Doing so is advantageous because “[they] obtain greater than 1000x improvements in model capacity with only minor losses in computational efficiency and significantly advance the state-of-the-art results on public language modeling and translation data sets” (Shazeer, page 2, 2nd to last paragraph).
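The two-part test recited in the claim (a selection cap plus a probability compared against a uniform random draw) can be illustrated with a short sketch; the helper name and signature are assumptions for illustration only, not Shazeer's code:

```python
import random

def keep_expert(prob, counts, expert, capacity):
    """Select the expert only if (i) it is under its selection cap for the
    group and (ii) its probability beats a uniform draw in [0, 1)."""
    if counts.get(expert, 0) >= capacity:
        return False                  # already selected the maximum number of times
    return prob > random.random()     # probability exceeds a randomly sampled value

random.seed(0)                        # deterministic for the example
ok = keep_expert(1.0, {}, expert=0, capacity=2)           # prob 1.0 always wins the draw
blocked = keep_expert(1.0, {0: 2}, expert=0, capacity=2)  # cap already reached
```

In Shazeer's formulation, `prob` would be P(x,i), the probability that G(x)i remains nonzero under a fresh random choice of noise on element i.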
Regarding claim 17, Vaswani in view of Shazeer teach the system of claim 15. Vaswani does not teach, but Shazeer does teach
wherein the group of attended layer inputs includes attended layer inputs generated from a proper subset of the network inputs in a batch of training examples (Shazeer, page 5, Section 4 Balancing Expert Utilization, 1st paragraph, “We take a soft constraint approach. We define the importance of an expert relative to a batch of training examples to be the batchwise sum of the gate values for that expert” and Shazeer, page 2, Figure 1,
[media_image1.png: Shazeer, Figure 1 (greyscale)]
Examiner notes that attended layer inputs are the outputs of the MoE layer.).
Vaswani and Shazeer are considered analogous to the claimed invention because they both use machine learning and attention mechanisms to perform machine translation. It would have been obvious to one having ordinary skill in the art prior to the effective filing date to have modified Vaswani to use the mixture of experts like in Shazeer. Doing so is advantageous because “[they] obtain greater than 1000x improvements in model capacity with only minor losses in computational efficiency and significantly advance the state-of-the-art results on public language modeling and translation data sets” (Shazeer, page 2, 2nd to last paragraph).
Regarding claim 18, Vaswani in view of Shazeer teach the system of claim 15. Vaswani does not teach, but Shazeer does teach
wherein the attention neural network is trained on a loss function that includes a term that encourages the conditional computation feed-forward sub-layer to select each expert feed-forward neural network for a same fraction of attended layer inputs among a total number of attended layer inputs within the group (Shazeer, page 13, 1st paragraph, “as discussed in section 4, for load-balancing purposes, we want to define an additional loss function to encourage experts to receive roughly equal numbers of training examples.” Examiner notes that by having equal numbers of training examples, the experts are selected for a same fraction of attended layer inputs among a total number of attended layer inputs.).
Vaswani and Shazeer are considered analogous to the claimed invention because they both use machine learning and attention mechanisms to perform machine translation. It would have been obvious to one having ordinary skill in the art prior to the effective filing date to have modified Vaswani to use the mixture of experts like in Shazeer. Doing so is advantageous because “[they] obtain greater than 1000x improvements in model capacity with only minor losses in computational efficiency and significantly advance the state-of-the-art results on public language modeling and translation data sets” (Shazeer, page 2, 2nd to last paragraph).
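Shazeer's soft-constraint approach to balancing (an auxiliary loss on the batchwise importance of each expert) can be sketched as follows; the weighting `w` and the function name are illustrative assumptions:

```python
import numpy as np

def importance_loss(gates, w=0.1):
    """Auxiliary load-balancing loss: w times the squared coefficient of
    variation of per-expert importance, where importance is the batchwise
    sum of gate values. It is zero exactly when every expert receives the
    same total gate weight, i.e. the same fraction of inputs."""
    importance = gates.sum(axis=0)              # shape: [num_experts]
    cv = importance.std() / importance.mean()   # coefficient of variation
    return w * cv ** 2

balanced = np.full((8, 4), 0.25)                # every expert equally weighted
skewed = np.zeros((8, 4))
skewed[:, 0] = 1.0                              # one expert takes everything
loss_balanced = importance_loss(balanced)       # 0.0: experts share inputs equally
loss_skewed = importance_loss(skewed)           # positive: imbalance is penalized
```

Minimizing such a term during training pushes the gating function toward selecting each expert for the same fraction of attended layer inputs, which is how the Examiner reads the claimed limitation onto Shazeer.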
Regarding claim 20, Vaswani in view of Shazeer teach the system of claim 1. Vaswani does not teach, but Shazeer does teach
wherein combining the respective expert outputs to generate a combined expert output comprises: generating a respective normalized gate score for each selected expert feed-forward neural network (Shazeer, page 4, 1st paragraph, “A simple choice of non-sparse gating function (Jordan & Jacobs, 1994) is to multiply the input by a trainable weight matrix Wg and then apply the Softmax function…We add two components to the Softmax gating network: sparsity and noise” and Shazeer, page 2, Figure 1,
[media_image1.png: Shazeer, Figure 1 (greyscale)]
Examiner notes that the gating network is applied to each selected expert and generates a normalized gate score. Examiner further notes that the output of the gating function is the gating score and softmax is the normalization.);
and computing a weighted sum of the respective expert outputs, with each expert output weighted by the normalized gate score for the selected expert feed-forward neural network that generated the expert output (Shazeer, page 3, 2nd to last paragraph, “In a hierarchical MoE, a primary network chooses a sparse weighted combination of “experts”, each of which is itself a secondary mixture-of-experts with its own gating network” where “A simple choice of non-sparse gating function (Jordan & Jacobs, 1994) is to multiply the input by a trainable weight matrix Wg and then apply the Softmax function…We add two components to the Softmax gating network: sparsity and noise” (Shazeer, page 4, 1st paragraph) and Shazeer, page 2, Figure 1,
[media_image1.png: Shazeer, Figure 1 (greyscale)]
Examiner notes that the weighted combination of experts is the weighted sum.).
Vaswani and Shazeer are considered analogous to the claimed invention because they both use machine learning and attention mechanisms to perform machine translation. It would have been obvious to one having ordinary skill in the art prior to the effective filing date to have modified Vaswani to use the mixture of experts like in Shazeer. Doing so is advantageous because “[they] obtain greater than 1000x improvements in model capacity with only minor losses in computational efficiency and significantly advance the state-of-the-art results on public language modeling and translation data sets” (Shazeer, page 2, 2nd to last paragraph).
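The normalization-and-weighted-sum operation mapped above can be sketched as follows (a minimal illustration; the softmax-over-the-selected-experts detail is an assumption about how the normalized gate scores are formed):

```python
import numpy as np

def combine_experts(x, experts, gate_logits, k=2):
    """Softmax-normalize the gate scores of the top-k experts and return
    the gate-weighted sum of their outputs."""
    top = np.argsort(gate_logits)[-k:]                # indices of the k selected experts
    z = np.exp(gate_logits[top] - gate_logits[top].max())
    weights = z / z.sum()                             # normalized gate scores
    outputs = np.stack([experts[i](x) for i in top])  # respective expert outputs
    return (weights[:, None] * outputs).sum(axis=0)   # weighted sum

experts = [lambda x: x + 1, lambda x: x * 2, lambda x: x - 1]
y = combine_experts(np.array([1.0, 2.0]), experts, np.array([0.0, 0.0, -5.0]), k=2)
# experts 0 and 1 are selected with equal logits, so each gets weight 0.5
```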
Regarding claim 21, Vaswani in view of Shazeer teach the system of claim 1. Vaswani does not teach, but Shazeer does teach
wherein the machine learning task is multi-lingual neural machine translation, the network input is a sequence of text in a source language and data identifying a target language, and the network output is a sequence of text in the target language that is a translation of the source language text into the target language (Shazeer, page 3, 1st paragraph, “in this paper we focus on language modeling and machine translation tasks, which are known to benefit from very large models. In particular, we apply a MoE convolutionally between stacked LSTM layers (Hochreiter & Schmidhuber, 1997), as in Figure 1. The MoE is called once for each position in the text selecting a potentially different combination of experts at each position” where “(Johnson et al., 2016) train a single GNMT (Wu et al., 2016) model on a very large combined dataset of twelve language pairs. Results are somewhat worse than those for 12 separately trained single-pair GNMT models. This is not surprising, given that the twelve models have 12 times the capacity and twelve times the aggregate training of the one model. We repeat this experiment with a single MoE-augmented model” (Shazeer, page 9, Section 5.4 Multilingual Machine Translation, 1st paragraph) and where “we also tested the same model on a Google’s Production English to French data” (Shazeer, page 8, Section 5.3 Machine Translation (Single Language Pair), 2nd paragraph). Examiner notes that the network input is a sequence of English text, the target language is French, and the network output is the French translation.).
Vaswani and Shazeer are considered analogous to the claimed invention because they both use machine learning and attention mechanisms to perform machine translation. It would have been obvious to one having ordinary skill in the art prior to the effective filing date to have modified Vaswani to use the mixture of experts like in Shazeer. Doing so is advantageous because “[they] obtain greater than 1000x improvements in model capacity with only minor losses in computational efficiency and significantly advance the state-of-the-art results on public language modeling and translation data sets” (Shazeer, page 2, 2nd to last paragraph).
Regarding claim 22, Vaswani teaches
One or more non-transitory computer-readable storage media storing instructions (Vaswani, page 7, Section 5.2 Hardware and Schedule, “We trained our models on one machine with 8 NVIDIA P100 GPUs.”)
that, when executed by the one or more computers, cause the one or more computers to implement: an attention neural network configured to perform the machine learning task, the attention neural network comprising a plurality of layers, each layer comprising an attention sub-layer and a feed-forward sub-layer, (Vaswani, page 2 last paragraph – page 3 first paragraph, “the encoder is composed of a stack of N=6 identical layers. Each layer has two sub-layers. The first is a multi-head self-attention mechanism, and the second is a simple, position-wise fully connected feed-forward network.” Examiner notes that the multi-head self-attention layer is the attention sub-layer and the position-wise fully connected feed-forward network is the feed-forward sub-layer) the attention sub-layer configured to:
receive an input sequence for the layer comprising a respective layer input at each of one or more positions (Vaswani, page 5, 3.2.3 Applications of Attention in our Model section, “The Transformer uses multi-head attention in three different ways: In ‘encoder-decoder attention’ layers, the queries come from the previous decoder layer, and the memory keys and values come from the output of the encoder. This allows every position in the decoder to attend over all positions in the input sequence.”);
and generate an attended input sequence at least in part by applying an attention mechanism to the input sequence for the layer, the attended input sequence comprising a respective attended layer input at each of the one or more positions (Vaswani, page 4, 3.2.2 Multi-Head Attention section, “On each of these projected versions of queries, keys, and values we then perform the attention function in parallel, yielding dv-dimensional output values. These are concatenated and once again projected, resulting in the final values, as depicted in Figure 2. Multi-head attention allows the model to jointly attend to information from different representation subspaces at different positions.” Examiner notes that the attended input sequence is the output values.),
and the feed-forward sub-layer configured to: receive the attended input sequence (Vaswani, page 5, 3.3 Position-wise Feed-Forward Networks, “In addition to attention sub-layers, each of the layers in our encoder and decoder contains a fully connected feed-forward network, which is applied to each position separately and identically. This consists of two linear transformations with a ReLU activation in between…While the linear transformations are the same across different positions, they use different parameters from layer to layer.”);
and generate an output sequence for the layer from the attended input sequence, the output sequence comprising a respective layer output at each of the one or more positions (Vaswani, page 5, 3.3 Position-wise Feed-Forward Networks, “In addition to attention sub-layers, each of the layers in our encoder and decoder contains a fully connected feed-forward network, which is applied to each position separately and identically. This consists of two linear transformations with a ReLU activation in between…While the linear transformations are the same across different positions, they use different parameters from layer to layer. Another way of describing this is as two convolutions with kernel size 1. The dimensionality of input and output is dmodel=512, and the inner-layer has dimensionality dff=2048”),
and wherein, for each layer that is not in the subset of the plurality of layers, the feed-forward sub-layer is configured to generate the output sequence for the layer by processing each respective attended layer input at each of the positions in the input sequence for the layer using a single feed-forward neural network (Vaswani, page 5, 3.3 Position-wise Feed-Forward Networks, “In addition to attention sub-layers, each of the layers in our encoder and decoder contains a fully connected feed-forward network, which is applied to each position separately and identically” and Vaswani, page 3, Figure 1,
[media_image2.png: Vaswani, Figure 1 (greyscale)]
Examiner notes that the layers not in the subset of the plurality of layers are the multi-head attention layer and feed-forward layer. Examiner further notes that the output sequence is the output probabilities and each position is the respective attended layer input.).
Vaswani does not teach, but Shazeer does teach
wherein, for each layer in a subset of the plurality of layers, the feed-forward sub-layer is a conditional computation sub-layer that (i) comprises a plurality of expert feed-forward neural networks and (ii) is configured to generate the output sequence for the layer by performing operations (Shazeer, page 2, last paragraph, “Our approach to conditional computation is to introduce a new type of general purpose neural network component: a Sparsely-Gated Mixture-of-Experts Layer (MoE). The MoE consists of a number of experts, each a simple feed-forward neural network, and a trainable gating network which selects a sparse combination of the experts to process each input.”)
for each of the positions in the input sequence for the layer: receiving the respective attended layer input at the position (Shazeer, page 3, 1st paragraph, “The MoE is called once for each position in the text, selecting a potentially different combination of experts at each position” where “We insert MoE layers in both the encoder (between layers 2 and 3) and the decoder (between layers 1 and 2). We use an attention mechanism between the encoder and the decoder, with the first decoder LSTM receiving output from and providing input for the attention” (Shazeer, page 17, Section E Machine Translation – Experimental Details). Examiner notes that the attended layer input is created by the attention mechanism.);
applying a gating function to the respective attended layer input at the position to generate a respective gate score for each of the plurality of expert feed-forward neural networks (Shazeer, page 5, Section 4 Balancing Expert Utilization, 1st paragraph, “We take a soft constraint approach. We define the importance of an expert relative to a batch of training examples to be the batchwise sum of the gate values for that expert” and Shazeer, page 2, Figure 1,
[media_image1.png: Shazeer, Figure 1 (greyscale)]
Examiner notes that the gating network is the gating function and the gate values are the gate scores. Examiner further notes that the attended layer input is x in Figure 1.);
selecting, from the plurality of expert feed-forward neural networks, a proper subset of expert feed-forward neural networks based at least on the respective gate scores (Shazeer, page 2, Figure 1: “A Mixture of Experts (MoE) layer embedded within a recurrent language model. In this case, the sparse gating function selects two experts to perform computations. Their outputs are modulated by the outputs of the gating network.” Examiner notes that the two experts are a proper subset.);
processing the respective attended layer input at the position using each of the expert feed-forward neural networks in the proper subset to generate a respective expert output for each of the expert feed-forward neural networks (Shazeer, page 3, 1st paragraph, “The MoE is called once for each position in the text, selecting a potentially different combination of experts at each position” and Shazeer, page 2, Figure 1,
[media_image1.png: Shazeer, Figure 1 (greyscale)]
Examiner notes that Expert 2 and Expert n-1 are in the proper subset and each generate an output. Examiner further notes that the attended layer input is x in Figure 1.);
combining the respective expert outputs to generate a combined expert output (Shazeer, page 2, Figure 1,
[media_image1.png: Shazeer, Figure 1 (greyscale)]
Examiner notes that Expert 2 and Expert n-1 are concatenated.);
and generating the respective layer output at the position from the combined expert output (Shazeer, page 2, Figure 1,
[media_image1.png: Shazeer, Figure 1 (greyscale)]
Examiner notes that the concatenation of Expert 2 and Expert n-1 is the respective layer output from the combined expert output.).
Vaswani and Shazeer are considered analogous to the claimed invention because they both use machine learning and attention mechanisms to perform machine translation. It would have been obvious to one having ordinary skill in the art prior to the effective filing date to have modified Vaswani to use the mixture of experts like in Shazeer. Doing so is advantageous because “[they] obtain greater than 1000x improvements in model capacity with only minor losses in computational efficiency and significantly advance the state-of-the-art results on public language modeling and translation data sets” (Shazeer, page 2, 2nd to last paragraph).
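Taken together, the per-position operations recited in claim 22 (applying a gating function, selecting a proper subset of experts, processing the attended layer input with each selected expert, and combining the expert outputs) can be sketched as a single sub-layer; the matrix shapes and randomly initialized toy experts below are purely illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_experts, k = 4, 8, 2
Wg = rng.normal(size=(d, n_experts))             # gating weight matrix (illustrative)
experts = [rng.normal(size=(d, d)) for _ in range(n_experts)]  # toy expert weights

def moe_sublayer(attended):
    """Sparse MoE feed-forward sub-layer: one gating decision per position."""
    out = np.zeros_like(attended)
    for t, x in enumerate(attended):             # each position handled independently
        logits = x @ Wg                          # respective gate score per expert
        top = np.argsort(logits)[-k:]            # proper subset of the experts
        z = np.exp(logits[top] - logits[top].max())
        w = z / z.sum()                          # normalized gate scores
        out[t] = sum(wi * (x @ experts[i]) for wi, i in zip(w, top))
    return out

seq = rng.normal(size=(3, d))                    # attended input sequence, 3 positions
out = moe_sublayer(seq)                          # respective layer output per position
```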
Regarding claim 23, Vaswani teaches
A method performed by one or more computers (Vaswani, page 7, Section 5.2 Hardware and Schedule, “We trained our models on one machine with 8 NVIDIA P100 GPUs.”)
the method comprising: receiving a network input (Vaswani, page 3, Figure 1,
[media_image2.png: Vaswani, Figure 1 (greyscale)]
Examiner notes that the inputs in Figure 1 are the network input.);
and processing the network input using an attention neural network to generate a network output for the network input (Vaswani, page 3, Figure 1,
[media_image2.png: Vaswani, Figure 1 (greyscale)]
Examiner notes that the output probabilities in Figure 1 are the network output and the transformer is the attention neural network.),
the attention neural network comprising a plurality of layers, each layer comprising an attention sub-layer and a feed-forward sub-layer, (Vaswani, page 2 last paragraph – page 3 first paragraph, “the encoder is composed of a stack of N=6 identical layers. Each layer has two sub-layers. The first is a multi-head self-attention mechanism, and the second is a simple, position-wise fully connected feed-forward network.” Examiner notes that the multi-head self-attention layer is the attention sub-layer and the position-wise fully connected feed-forward network is the feed-forward sub-layer) the attention sub-layer configured to:
receive an input sequence for the layer comprising a respective layer input at each of one or more positions (Vaswani, page 5, 3.2.3 Applications of Attention in our Model section, “The Transformer uses multi-head attention in three different ways: In ‘encoder-decoder attention’ layers, the queries come from the previous decoder layer, and the memory keys and values come from the output of the encoder. This allows every position in the decoder to attend over all positions in the input sequence.”);
and generate an attended input sequence at least in part by applying an attention mechanism to the input sequence for the layer, the attended input sequence comprising a respective attended layer input at each of the one or more positions (Vaswani, page 4, 3.2.2 Multi-Head Attention section, “On each of these projected versions of queries, keys, and values we then perform the attention function in parallel, yielding dv-dimensional output values. These are concatenated and once again projected, resulting in the final values, as depicted in Figure 2. Multi-head attention allows the model to jointly attend to information from different representation subspaces at different positions.” Examiner notes that the attended input sequence is the output values.),
and the feed-forward sub-layer configured to: receive the attended input sequence (Vaswani, page 5, 3.3 Position-wise Feed-Forward Networks, “In addition to attention sub-layers, each of the layers in our encoder and decoder contains a fully connected feed-forward network, which is applied to each position separately and identically. This consists of two linear transformations with a ReLU activation in between…While the linear transformations are the same across different positions, they use different parameters from layer to layer.”);
and generate an output sequence for the layer from the attended input sequence, the output sequence comprising a respective layer output at each of the one or more positions (Vaswani, page 5, 3.3 Position-wise Feed-Forward Networks, “In addition to attention sub-layers, each of the layers in our encoder and decoder contains a fully connected feed-forward network, which is applied to each position separately and identically. This consists of two linear transformations with a ReLU activation in between…While the linear transformations are the same across different positions, they use different parameters from layer to layer. Another way of describing this is as two convolutions with kernel size 1. The dimensionality of input and output is dmodel=512, and the inner-layer has dimensionality dff=2048”),
and wherein, for each layer that is not in the subset of the plurality of layers, the feed-forward sub-layer is configured to generate the output sequence for the layer by processing each respective attended layer input at each of the positions in the input sequence for the layer using a single feed-forward neural network (Vaswani, page 5, 3.3 Position-wise Feed-Forward Networks, “In addition to attention sub-layers, each of the layers in our encoder and decoder contains a fully connected feed-forward network, which is applied to each position separately and identically” and Vaswani, page 3, Figure 1,
[media_image2.png: Vaswani, Figure 1 (greyscale)]
Examiner notes that the layers not in the subset of the plurality of layers are the multi-head attention layer and feed-forward layer. Examiner further notes that the output sequence is the output probabilities and each position is the respective attended layer input.).
Vaswani does not teach, but Shazeer does teach
wherein, for each layer in a subset of the plurality of layers, the feed-forward sub-layer is a conditional computation sub-layer that (i) comprises a plurality of expert feed-forward neural networks and (ii) is configured to generate the output sequence for the layer by performing operations (Shazeer, page 2, last paragraph, “Our approach to conditional computation is to introduce a new type of general purpose neural network component: a Sparsely-Gated Mixture-of-Experts Layer (MoE). The MoE consists of a number of experts, each a simple feed-forward neural network, and a trainable gating network which selects a sparse combination of the experts to process each input.”)
for each of the positions in the input sequence for the layer: receiving the respective attended layer input at the position (Shazeer, page 3, 1st paragraph, “The MoE is called once for each position in the text, selecting a potentially different combination of experts at each position” where “We insert MoE layers in both the encoder (between layers 2 and 3) and the decoder (between layers 1 and 2). We use an attention mechanism between the encoder and the decoder, with the first decoder LSTM receiving output from and providing input for the attention” (Shazeer, page 17, Section E Machine Translation – Experimental Details). Examiner notes that the attended layer input is created by the attention mechanism.);
applying a gating function to the respective attended layer input at the position to generate a respective gate score for each of the plurality of expert feed-forward neural networks (Shazeer, page 5, Section 4 Balancing Expert Utilization, 1st paragraph, “We take a soft constraint approach. We define the importance of an expert relative to a batch of training examples to be the batchwise sum of the gate values for that expert” and Shazeer, page 2, Figure 1,
[media_image1.png: Shazeer, Figure 1 (greyscale)]
Examiner notes that the gating network is the gating function and the gate values are the gate scores. Examiner further notes that the attended layer input is x in Figure 1.);
selecting, from the plurality of expert feed-forward neural networks, a proper subset of expert feed-forward neural networks based at least on the respective gate scores (Shazeer, page 2, Figure 1: “A Mixture of Experts (MoE) layer embedded within a recurrent language model. In this case, the sparse gating function selects two experts to perform computations. Their outputs are modulated by the outputs of the gating network.” Examiner notes that the two experts are a proper subset.);
processing the respective attended layer input at the position using each of the expert feed-forward neural networks in the proper subset to generate a respective expert output for each of the expert feed-forward neural networks (Shazeer, page 3, 1st paragraph, “The MoE is called once for each position in the text, selecting a potentially different combination of experts at each position” and Shazeer, page 2, Figure 1,
[media_image1.png: Shazeer, Figure 1 (greyscale)]
Examiner notes that Expert 2 and Expert n-1 are in the proper subset and each generate an output. Examiner further notes that the attended layer input is x in Figure 1.);
combining the respective expert outputs to generate a combined expert output (Shazeer, page 2, Figure 1,
[media_image1.png: Shazeer, Figure 1 (greyscale)]
Examiner notes that Expert 2 and Expert n-1 are concatenated.);
and generating the respective layer output at the position from the combined expert output (Shazeer, page 2, Figure 1,
[media_image1.png: Shazeer, Figure 1 (greyscale)]
Examiner notes that the concatenation of Expert 2 and Expert n-1 is the respective layer output from the combined expert output.).
Vaswani and Shazeer are considered analogous to the claimed invention because they both use machine learning and attention mechanisms to perform machine translation. It would have been obvious to one having ordinary skill in the art prior to the effective filing date to have modified Vaswani to use the mixture of experts like in Shazeer. Doing so is advantageous because “[they] obtain greater than 1000x improvements in model capacity with only minor losses in computational efficiency and significantly advance the state-of-the-art results on public language modeling and translation data sets” (Shazeer, page 2, 2nd to last paragraph).
Claim 19 is rejected under 35 U.S.C. 103 as being unpatentable over Vaswani in view of Shazeer in further view of Drew (US 2014/0222736 A1) (hereafter referred to as Drew).
Regarding claim 19, Vaswani in view of Shazeer teach the system of claim 15, wherein, during training of the attention neural network, the attended layer input is one of a group of attended layer inputs, and selecting, from the plurality of expert feed-forward neural networks, a proper subset. Vaswani in view of Shazeer further teach
when each of the k identified expert feed-forward neural networks have already been selected a maximum number of times during the processing of the group, selecting a subset (Shazeer, page 19, Batchwise Mask section, “To force each expert to receive the exact same number of examples, we introduce an alternative mask function, Mbatchwise(X,M), which operates over batches of input vectors. Instead of keeping the top k values per example, we keep the top m values per expert across the training batch where m = k|X|/n, so that each example is sent to an average of k experts….As our experiments suggest and also observed in (Ioffe & Szegedy, 2015), using a batchwise function during training (such as Mbatchwise) requires modifications to the inference when we may not have a large batch of examples. Our solution to this is to train a vector T of per-expert threshold values to approximate the effects of the batchwise mask….To learn the threshold values, we apply an additional loss at training time which is minimized when the batchwise mask and the threshold mask are identical” and “we want to define an additional loss function to encourage experts to receive roughly equal numbers of training examples” (Shazeer, page 13, 1st paragraph). Examiner notes that the threshold determines whether the expert has been selected a maximum number of times. Examiner further notes that the k experts are the subset.)
Vaswani in view of Shazeer does not teach
selecting a subset that includes zero expert feed-forward neural networks and setting the combined output to zero
Drew, however, does teach
when each of the k identified …networks have already been selected a maximum number of times during the processing of the group, selecting a subset that includes zero…networks and setting the combined output to zero (Drew, page 10, paragraph 0031, “If a transitional Mapping Outputs 18 production capacity has been set and subsequently exceeded the Blocking Mechanism 20 will force transitional Mapping Outputs 18 producers, which are in this case the Mapping Operations Module 16, to “block” or wait until the production capacity falls back below the pre-determined production capacity threshold.” Examiner notes that the production capacity being exceeded is the networks selected a maximum number of times. Examiner further notes that the blocking or waiting is selecting a subset that includes zero networks. Additionally by blocking or waiting, the combined output is set to zero.)
Vaswani, Shazeer, and Drew are considered analogous to the claimed invention because they use machine learning in a distributed setting to schedule tasks. It would have been obvious to one having ordinary skill in the art prior to the effective filing date to have modified Vaswani in view of Shazeer to select a subset that includes zero expert feed-forward neural networks and set the combined output to zero. Doing so is advantageous because this “allows performing multiple stages of learning or classification simultaneously by dividing up learning and classification work into individual independent units which facilitate concurrent and rapid parallel processing during multiple learning and classification stages” (Drew, page 9, paragraph 0020).
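The batchwise mask quoted above, which keeps the top m = k|X|/n gate values per expert across the batch so that each expert receives exactly m examples, can be reconstructed as the following sketch (an illustrative reconstruction, not Shazeer's code; the function name and mask representation are assumptions):

```python
import numpy as np

def batchwise_topm_mask(gate_scores, k):
    """Keep the top m = k*|X|/n gate values per expert across the batch,
    so each expert receives exactly m examples (cf. Shazeer's Mbatchwise)."""
    batch_size, n = gate_scores.shape
    m = (k * batch_size) // n                    # examples assigned to each expert
    mask = np.zeros_like(gate_scores, dtype=bool)
    for e in range(n):                           # rank examples per expert, not per example
        top_examples = np.argsort(gate_scores[:, e])[-m:]
        mask[top_examples, e] = True
    return mask

scores = np.random.default_rng(1).random((8, 4))
mask = batchwise_topm_mask(scores, k=2)          # m = 2*8/4 = 4 examples per expert
```

Each column of the resulting mask contains exactly m true entries, which is the "exact same number of examples" property the quoted passage describes.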
Response to Arguments
The previous 112(b) rejections have been overcome in light of the instant amendments.
On pages 10-11, Applicant argues:
The independent claims have each been amended to recite several new features, including the features of "wherein, for each layer in a subset of the plurality of layers, the feed-forward sub-layer is a conditional computation sub-layer that (i) comprises a plurality of expert feed forward neural networks" and "wherein, for each layer that is not in the subset of the plurality of layers, the feed-forward sub-layer is configured to generate the output sequence for the layer by processing each respective attended layer input at each of the positions in the input sequence for the layer using a single feed-forward neural network."
For at least the reasons agreed upon in the interview, Applicant submits that the applied references do not disclose or suggest this combination of features.
Accordingly, Applicant respectfully submits that claim 1 and its dependent claims are in condition for allowance. Independent claims 20 and 21 are allowable for corresponding reasons.
Regarding the Applicant’s argument that the prior art does not disclose or suggest the newly amended limitations of the independent claims, Examiner respectfully disagrees. Specifically, Examiner notes that Shazeer teaches wherein, for each layer in a subset of the plurality of layers, the feed-forward sub-layer is a conditional computation sub-layer that (i) comprises a plurality of expert feed-forward neural networks (Shazeer, page 2, last paragraph, “Our approach to conditional computation is to introduce a new type of general purpose neural network component: a Sparsely-Gated Mixture-of-Experts Layer (MoE). The MoE consists of a number of experts, each a simple feed-forward neural network, and a trainable gating network which selects a sparse combination of the experts to process each input.”). Examiner further notes that, under the broadest reasonable interpretation, the experts are the layers in the subset of the plurality of layers, and the experts are the conditional computation sub-layer.
Examiner notes that Vaswani teaches wherein, for each layer that is not in the subset of the plurality of layers, the feed-forward sub-layer is configured to generate the output sequence for the layer by processing each respective attended layer input at each of the positions in the input sequence for the layer using a single feed-forward neural network (Vaswani, page 5, 3.3 Position-wise Feed-Forward Networks, “In addition to attention sub-layers, each of the layers in our encoder and decoder contains a fully connected feed-forward network, which is applied to each position separately and identically” and Vaswani, page 3, Figure 1,
[media_image2.png: Vaswani, Figure 1, greyscale]
Examiner notes that the layers not in the subset of the plurality of layers are the multi-head attention layer and the feed-forward layer. Examiner further notes that the output sequence is the output probabilities and that each position is the respective attended layer input.)
Additionally, Examiner respectfully notes that no agreements were reached during the interview, and thus there were no “reasons agreed upon in the interview”.
Regarding the Applicant’s argument that the dependent claims are allowable at least due in part to their dependency on the independent claims, the Examiner respectfully disagrees and notes the instant rejections and response to arguments regarding the independent claims above.
Conclusion
The prior art made of record and not relied upon is considered pertinent to applicant's disclosure. Shazeer et al. (US 2018/0341860 A1) also describes an attention neural network for machine translation.
THIS ACTION IS MADE FINAL. Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a).
A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action. In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any nonprovisional extension fee (37 CFR 1.17(a)) pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action. In no event, however, will the statutory period for reply expire later than SIX MONTHS from the mailing date of this final action.
Any inquiry concerning this communication or earlier communications from the examiner should be directed to KAITLYN R HAEFNER whose telephone number is (571)272-1429. The examiner can normally be reached Monday - Thursday: 7:15 am - 5:15 pm EST.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Michelle Bechtold, can be reached at (571) 431-0762. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.
/K.R.H./ Examiner, Art Unit 2148
/MICHELLE T BECHTOLD/ Supervisory Patent Examiner, Art Unit 2148