DETAILED ACTION
Claims 1-20 are pending and have been examined.
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Information Disclosure Statement
The information disclosure statements (IDS) submitted on 05/16/2023 and 06/19/2025 are in compliance with the provisions of 37 CFR 1.97. Accordingly, the information disclosure statements are being considered by the examiner.
Claim Rejections - 35 USC § 101
35 U.S.C. 101 reads as follows:
Whoever invents or discovers any new and useful process, machine, manufacture, or composition of matter, or any new and useful improvement thereof, may obtain a patent therefor, subject to the conditions and requirements of this title.
Claims 1-20 are rejected under 35 U.S.C. 101 because the claimed invention is directed to an abstract idea without significantly more.
Step 1: Claims 1-11 recite a system. Claims 12-19 recite a method. Claim 20 recites a system. Therefore, claims 1-11 and 20 are directed to a machine, and claims 12-19 are directed to a process.
With respect to claims 1 and 12:
2A Prong 1: The claim recites a judicial exception.
computing a gating function output vector based at least in part on the input tensor (mental process – evaluation; a human can manually compute a gating function output vector based at least in part on the input tensor)
computing a sparse encoding of the input tensor and the gating function output vector, wherein the sparse encoding indicates one or more destination expert sub-models included among a plurality of expert sub-models in the MoE layer (mental process – evaluation; a human can manually compute a sparse encoding of the input tensor and the gating function output vector)
computing an expert output tensor… (mental process – evaluation; a human can manually compute an expert output tensor)
computing an MoE layer output at least in part by computing a sparse decoding of the expert output tensor (mental process – evaluation; a human can manually compute an MoE layer output)
2A Prong 2: The judicial exception is not integrated into a practical application.
a plurality of processing devices configured to execute a Mixture-of-Experts (MoE) layer included in an MoE model at least in part by (mere instructions to apply an exception, (2) Whether the claim invokes computers - MPEP 2106.05(f); generic computer components)
receiving an input tensor including a plurality of input tokens (insignificant extra-solution activity – MPEP 2106.05(g), (3) data gathering and outputting)
dispatching the input tensor for processing at the one or more destination expert sub-models (insignificant extra-solution activity – MPEP 2106.05(g), (3) data gathering and outputting)
at the one or more destination expert sub-models (mere instructions to apply an exception – MPEP 2106.05(f), (3) The particularity or generality of the application of the judicial exception; high level recitation of using a sub-model to compute a tensor)
conveying the MoE layer output to an additional computing process (insignificant extra-solution activity – MPEP 2106.05(g), (3) data gathering and outputting)
Since the claim as a whole, looking at the additional elements individually and in combination, does not contain any other additional elements that are indicative of integration into a practical application, the claim is directed to an abstract idea.
2B: The claims do not include additional elements that are sufficient to amount to significantly more than the judicial exception.
a plurality of processing devices configured to execute a Mixture-of-Experts (MoE) layer included in an MoE model at least in part by (mere instructions to apply an exception, (2) Whether the claim invokes computers - MPEP 2106.05(f); generic computer components)
receiving an input tensor including a plurality of input tokens (insignificant extra-solution activity – MPEP 2106.05(g), (3) data gathering and outputting, and WURC (well-understood, routine, and conventional activity): receiving or transmitting data over a network, e.g., using the Internet to gather data, Symantec, 838 F.3d at 1321, 120 USPQ2d at 1362 - MPEP 2106.05(d)(II)(i))
dispatching the input tensor for processing at the one or more destination expert sub-models (insignificant extra-solution activity – MPEP 2106.05(g), (3) data gathering and outputting, and WURC: receiving or transmitting data over a network, e.g., using the Internet to gather data, Symantec, 838 F.3d at 1321, 120 USPQ2d at 1362 - MPEP 2106.05(d)(II)(i))
at the one or more destination expert sub-models (mere instructions to apply an exception – MPEP 2106.05(f), (3) The particularity or generality of the application of the judicial exception; high level recitation of using a sub-model to compute a tensor)
conveying the MoE layer output to an additional computing process (insignificant extra-solution activity – MPEP 2106.05(g), (3) data gathering and outputting, and WURC: receiving or transmitting data over a network, e.g., using the Internet to gather data, Symantec, 838 F.3d at 1321, 120 USPQ2d at 1362 - MPEP 2106.05(d)(II)(i))
Considering the additional elements individually and in combination, and the claim as a whole, the additional elements do not provide significantly more than the abstract idea. Therefore, the claim is not patent eligible.
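For illustration, the sequence of computations recited in claims 1 and 12 can be sketched in NumPy; all shapes, weight names, and the top-1 routing choice are hypothetical assumptions, not Applicant's implementation:

```python
import numpy as np

# Illustrative shapes only (not drawn from the claims or the specification).
rng = np.random.default_rng(0)
num_tokens, d_model, num_experts = 4, 8, 3

x = rng.standard_normal((num_tokens, d_model))        # input tensor of input tokens
w_gate = rng.standard_normal((d_model, num_experts))  # gating function weights

# computing a gating function output vector based on the input tensor
gate_logits = x @ w_gate                              # [num_tokens, num_experts]

# computing a sparse encoding that indicates, per token, the destination
# expert sub-model (top-1 routing in this sketch)
dest = gate_logits.argmax(axis=-1)                    # destination expert per token
dispatch = np.eye(num_experts)[dest]                  # one-hot sparse encoding

# computing an expert output tensor at the destination expert sub-models
w_expert = rng.standard_normal((num_experts, d_model, d_model))
expert_out = np.stack([w_expert[e] @ x[t] for t, e in enumerate(dest)])

# computing the MoE layer output via a sparse decoding: each token's expert
# output is scaled by its softmaxed gate value
probs = np.exp(gate_logits)
probs /= probs.sum(axis=-1, keepdims=True)
moe_out = expert_out * probs[np.arange(num_tokens), dest][:, None]
```

The one-hot dispatch array plays the role of the sparse encoding: it indicates, for each token, the single destination expert sub-model.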
With respect to claims 2 and 13:
2A Prong 1: The claim recites a judicial exception.
compute a SoftMax output vector based at least in part on the gating function output vector (mental process – evaluation; a human can manually compute a SoftMax output vector)
compute the sparse encoding at least in part by computing a sparse SoftMax encoding of the SoftMax output vector (mental process – evaluation; a human can manually compute the sparse encoding)
2A Prong 2: The judicial exception is not integrated into a practical application.
wherein the plurality of processing devices are further configured to (mere instructions to apply an exception, (2) Whether the claim invokes computers - MPEP 2106.05(f); generic computer components)
Since the claim as a whole, looking at the additional elements individually and in combination, does not contain any other additional elements that are indicative of integration into a practical application, the claim is directed to an abstract idea.
2B: The claims do not include additional elements that are sufficient to amount to significantly more than the judicial exception.
wherein the plurality of processing devices are further configured to (mere instructions to apply an exception, (2) Whether the claim invokes computers - MPEP 2106.05(f); generic computer components)
Considering the additional elements individually and in combination, and the claim as a whole, the additional elements do not provide significantly more than the abstract idea. Therefore, the claim is not patent eligible.
With respect to claims 3 and 14:
2A Prong 1: The claim recites a judicial exception.
compute the sparse SoftMax encoding at least in part by setting each SoftMax output element of the SoftMax output vector, other than a predetermined number k of one or more selected SoftMax output elements, equal to zero (mental process – evaluation; a human can manually compute the sparse SoftMax encoding)
2A Prong 2: The judicial exception is not integrated into a practical application.
wherein the plurality of processing devices are configured to (mere instructions to apply an exception, (2) Whether the claim invokes computers - MPEP 2106.05(f); generic computer components)
Since the claim as a whole, looking at the additional elements individually and in combination, does not contain any other additional elements that are indicative of integration into a practical application, the claim is directed to an abstract idea.
2B: The claims do not include additional elements that are sufficient to amount to significantly more than the judicial exception.
wherein the plurality of processing devices are configured to (mere instructions to apply an exception, (2) Whether the claim invokes computers - MPEP 2106.05(f); generic computer components)
Considering the additional elements individually and in combination, and the claim as a whole, the additional elements do not provide significantly more than the abstract idea. Therefore, the claim is not patent eligible.
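For illustration, the recited zeroing of all SoftMax output elements other than k selected elements can be sketched as follows (the helper name and example vector are hypothetical):

```python
import numpy as np

def sparse_topk(softmax_out, k):
    """Set each SoftMax output element, other than the k largest, equal to zero."""
    out = np.zeros_like(softmax_out)
    top_idx = np.argsort(softmax_out)[-k:]   # indices of the k largest elements
    out[top_idx] = softmax_out[top_idx]      # copy only the selected elements
    return out

v = np.array([0.1, 0.5, 0.15, 0.25])
s = sparse_topk(v, 2)                        # keeps only the two largest elements
```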
With respect to claims 4 and part of 15:
2A Prong 1: The claim recites a judicial exception.
assign the input tokens to the one or more destination expert sub-models as specified by the selected SoftMax output elements (mental process – evaluation or judgement; a human can manually assign the input tokens to the models)
the predetermined number k is equal to a number of the one or more destination expert sub-models (mental process – evaluation or judgement; claim 3 recites setting elements other than a predetermined number k of elements to zero, and a human can manually set the number k equal to the number of experts)
2A Prong 2: The judicial exception is not integrated into a practical application.
wherein: the plurality of processing devices are configured to (mere instructions to apply an exception, (2) Whether the claim invokes computers - MPEP 2106.05(f); generic computer components)
Since the claim as a whole, looking at the additional elements individually and in combination, does not contain any other additional elements that are indicative of integration into a practical application, the claim is directed to an abstract idea.
2B: The claims do not include additional elements that are sufficient to amount to significantly more than the judicial exception.
wherein: the plurality of processing devices are configured to (mere instructions to apply an exception, (2) Whether the claim invokes computers - MPEP 2106.05(f); generic computer components)
Considering the additional elements individually and in combination, and the claim as a whole, the additional elements do not provide significantly more than the abstract idea. Therefore, the claim is not patent eligible.
With respect to claims 5 and part of 15:
2A Prong 1: The claim recites a judicial exception.
wherein the predetermined number k of the SoftMax output elements are the top-k largest SoftMax output elements among the plurality of SoftMax output elements (mental process – evaluation or judgement; claim 3 recites setting elements other than a predetermined number k of elements to zero, and a human can manually select the top-k largest SoftMax output elements)
With respect to claims 6 and 16:
2A Prong 1: The claim recites a judicial exception.
wherein: the predetermined number k is equal to one (mental process – evaluation or judgement; a human can manually set the number k to 1)
subsequently to setting each of the plurality of SoftMax output elements other than one SoftMax output element to zero… compress the sparse SoftMax encoding into a scalar equal to the nonzero SoftMax output element (mental process – evaluation or judgement; a human can manually set each of the plurality of SoftMax output elements other than one SoftMax output element to zero and compress the encoding into a scalar)
2A Prong 2: The judicial exception is not integrated into a practical application.
the plurality of processing devices are further configured to (mere instructions to apply an exception, (2) Whether the claim invokes computers - MPEP 2106.05(f); generic computer components)
Since the claim as a whole, looking at the additional elements individually and in combination, does not contain any other additional elements that are indicative of integration into a practical application, the claim is directed to an abstract idea.
2B: The claims do not include additional elements that are sufficient to amount to significantly more than the judicial exception.
the plurality of processing devices are further configured to (mere instructions to apply an exception, (2) Whether the claim invokes computers - MPEP 2106.05(f); generic computer components)
Considering the additional elements individually and in combination, and the claim as a whole, the additional elements do not provide significantly more than the abstract idea. Therefore, the claim is not patent eligible.
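For illustration, the k = 1 case, in which the sparse SoftMax encoding is compressed into a scalar, can be sketched as follows (example values hypothetical):

```python
import numpy as np

softmax_out = np.array([0.1, 0.7, 0.2])      # hypothetical SoftMax output vector

# with k equal to one, zero every element other than the single largest
sparse = np.where(softmax_out == softmax_out.max(), softmax_out, 0.0)

# compress the sparse SoftMax encoding into a scalar equal to the nonzero
# element, remembering which expert it selects
expert_index = int(sparse.argmax())
scalar = float(sparse[expert_index])
```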
With respect to claims 7 and 17:
2A Prong 1: The claim recites a judicial exception.
compute the MoE layer output at least in part by combining respective sparse decodings of a plurality of expert output tensors across the plurality of processing devices in an all-to-all combine operation (mental process – evaluation; a human can manually compute the output by combining decodings generated from all the devices)
2A Prong 2: The judicial exception is not integrated into a practical application.
wherein the plurality of processing devices are further configured to (mere instructions to apply an exception, (2) Whether the claim invokes computers - MPEP 2106.05(f); generic computer components)
dispatch the sparse encoding across the plurality of processing devices in an all-to-all dispatch operation (insignificant extra-solution activity – MPEP 2106.05(g), (3) data gathering and outputting)
Since the claim as a whole, looking at the additional elements individually and in combination, does not contain any other additional elements that are indicative of integration into a practical application, the claim is directed to an abstract idea.
2B: The claims do not include additional elements that are sufficient to amount to significantly more than the judicial exception.
wherein the plurality of processing devices are further configured to (mere instructions to apply an exception, (2) Whether the claim invokes computers - MPEP 2106.05(f); generic computer components)
dispatch the sparse encoding across the plurality of processing devices in an all-to-all dispatch operation (insignificant extra-solution activity – MPEP 2106.05(g), (3) data gathering and outputting, and WURC: receiving or transmitting data over a network, e.g., using the Internet to gather data, Symantec, 838 F.3d at 1321, 120 USPQ2d at 1362 - MPEP 2106.05(d)(II)(i))
Considering the additional elements individually and in combination, and the claim as a whole, the additional elements do not provide significantly more than the abstract idea. Therefore, the claim is not patent eligible.
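For illustration, the all-to-all combine can be simulated in NumPy by re-partitioning the expert-sharded axis to a device-sharded axis and summing per-expert contributions (shapes hypothetical):

```python
import numpy as np

# Hypothetical layout: data sharded as
# [num_experts, num_devices, expert_capacity, d_model]
rng = np.random.default_rng(1)
num_experts = num_devices = 2
expert_capacity, d_model = 3, 4
decoded = rng.standard_normal((num_experts, num_devices, expert_capacity, d_model))

# an all-to-all exchange re-partitions the leading axes: what was sharded by
# expert becomes sharded by device (simulated here as a transpose)
exchanged = decoded.transpose(1, 0, 2, 3)    # [num_devices, num_experts, ...]

# each device then combines the per-expert sparse decodings of its own tokens
combined = exchanged.sum(axis=1)             # [num_devices, expert_capacity, d_model]
```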
With respect to claims 8 and part of 18:
2A Prong 1: The claim recites a judicial exception.
compute the sparse encoding at least in part… compute a respective expert input tensor as a product of the input tensor and the sparse SoftMax encoding of the SoftMax output vector (mental process – evaluation; a human can manually compute the sparse encoding and compute a respective expert input tensor)
2A Prong 2: The judicial exception is not integrated into a practical application.
wherein the plurality of processing devices are configured to… by executing a first kernel via which each processing device of the plurality of processing devices is configured to (mere instructions to apply an exception, (2) Whether the claim invokes computers - MPEP 2106.05(f); generic computer components to execute a function in light of spec [0046])
Since the claim as a whole, looking at the additional elements individually and in combination, does not contain any other additional elements that are indicative of integration into a practical application, the claim is directed to an abstract idea.
2B: The claims do not include additional elements that are sufficient to amount to significantly more than the judicial exception.
wherein the plurality of processing devices are configured to… by executing a first kernel via which each processing device of the plurality of processing devices is configured to (mere instructions to apply an exception, (2) Whether the claim invokes computers - MPEP 2106.05(f); generic computer components to execute a function in light of spec [0046])
Considering the additional elements individually and in combination, and the claim as a whole, the additional elements do not provide significantly more than the abstract idea. Therefore, the claim is not patent eligible.
With respect to claims 9 and part of 18:
2A Prong 1: The claim recites a judicial exception.
compute the sparse decoding at least in part… compute a product of the expert output tensor and the sparse SoftMax encoding of the SoftMax output vector (mental process – evaluation; a human can manually compute the sparse decoding and compute a product of the expert output tensor and the sparse SoftMax encoding)
2A Prong 2: The judicial exception is not integrated into a practical application.
wherein the plurality of processing devices are further configured to… by executing a second kernel via which each processing device of the plurality of processing devices is configured to (mere instructions to apply an exception, (2) Whether the claim invokes computers - MPEP 2106.05(f); generic computer components to execute a function, in light of spec [0047])
Since the claim as a whole, looking at the additional elements individually and in combination, does not contain any other additional elements that are indicative of integration into a practical application, the claim is directed to an abstract idea.
2B: The claims do not include additional elements that are sufficient to amount to significantly more than the judicial exception.
wherein the plurality of processing devices are further configured to… by executing a second kernel via which each processing device of the plurality of processing devices is configured to (mere instructions to apply an exception, (2) Whether the claim invokes computers - MPEP 2106.05(f); generic computer components to execute a function, in light of spec [0047])
Considering the additional elements individually and in combination, and the claim as a whole, the additional elements do not provide significantly more than the abstract idea. Therefore, the claim is not patent eligible.
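For illustration, the two recited products (the expert input tensor as a product of the input tensor and the sparse SoftMax encoding, and the sparse decoding as a product of the expert output tensor and that encoding) can be sketched as follows; shapes, gate values, and the stand-in expert computation are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(2)
num_tokens, d_model, num_experts = 3, 4, 2
x = rng.standard_normal((num_tokens, d_model))

# hypothetical sparse SoftMax encoding: one nonzero gate value per token
enc = np.array([[0.9, 0.0],
                [0.0, 0.8],
                [0.7, 0.0]])                  # [num_tokens, num_experts]

# first kernel: expert input tensor as a product of the input tensor and the
# (boolean) sparse encoding, routing each token to its expert's slot
expert_in = np.einsum('td,te->etd', x, (enc != 0).astype(x.dtype))

# stand-in for the expert computation itself
expert_out = 2.0 * expert_in

# second kernel: sparse decoding as a product of the expert output tensor and
# the sparse SoftMax encoding, scaling each token by its gate value
decoded = np.einsum('etd,te->td', expert_out, enc)
```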
With respect to claims 10 and 19:
2A Prong 1: The claim recites a judicial exception.
compute a training-time sparse decoding… (mental process – evaluation; a human can manually compute a training-time sparse decoding)
compute a training-time input tensor… (mental process – evaluation; a human can manually compute a training-time input tensor)
compute a training-time SoftMax output vector… compute a dot product of: a training-time expert output tensor and the training-time sparse decoding; or the training-time input tensor and the training-time sparse decoding (mental process – evaluation; a human can manually compute a training-time SoftMax output vector and a dot product)
2A Prong 2: The judicial exception is not integrated into a practical application.
wherein: the plurality of processing devices are further configured to perform a backward pass through the MoE layer during training of the MoE layer; and during the backward pass, the plurality of processing devices are further configured to (mere instructions to apply an exception, (2) Whether the claim invokes computers - MPEP 2106.05(f); generic computer components to perform a pass)
at least in part by executing the first kernel… at least in part by executing the second kernel… at least in part by executing a third kernel, wherein, via the third kernel, each processing device of the plurality of processing devices is configured to… (mere instructions to apply an exception, (2) Whether the claim invokes computers - MPEP 2106.05(f); generic computer components to execute a function, in light of spec [0046], [0047], [0054])
Since the claim as a whole, looking at the additional elements individually and in combination, does not contain any other additional elements that are indicative of integration into a practical application, the claim is directed to an abstract idea.
2B: The claims do not include additional elements that are sufficient to amount to significantly more than the judicial exception.
wherein: the plurality of processing devices are further configured to perform a backward pass through the MoE layer during training of the MoE layer; and during the backward pass, the plurality of processing devices are further configured to (mere instructions to apply an exception, (2) Whether the claim invokes computers - MPEP 2106.05(f); generic computer components to perform a pass)
at least in part by executing the first kernel… at least in part by executing the second kernel… at least in part by executing a third kernel, wherein, via the third kernel, each processing device of the plurality of processing devices is configured to… (mere instructions to apply an exception, (2) Whether the claim invokes computers - MPEP 2106.05(f); generic computer components to execute a function, in light of spec [0046], [0047], [0054])
Considering the additional elements individually and in combination, and the claim as a whole, the additional elements do not provide significantly more than the abstract idea. Therefore, the claim is not patent eligible.
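For illustration, the recited training-time dot product can be sketched as a per-token contraction (all quantities are hypothetical stand-ins):

```python
import numpy as np

rng = np.random.default_rng(3)
num_tokens, d_model = 3, 4

# hypothetical training-time quantities
expert_out = rng.standard_normal((num_tokens, d_model))  # training-time expert output tensor
sparse_dec = rng.standard_normal((num_tokens, d_model))  # training-time sparse decoding

# a dot product of the training-time expert output tensor and the
# training-time sparse decoding, computed per token
gate_grad = np.einsum('td,td->t', expert_out, sparse_dec)
```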
With respect to claim 11:
2A Prong 1: The claim recites a judicial exception.
compute a doubled gating function output tensor including two copies of the gating function output vector (mental process – evaluation; a human can manually compute a doubled gating function output tensor)
compute the sparse encoding based at least in part on the doubled gating function output vector (mental process – evaluation; a human can manually compute the sparse encoding)
2A Prong 2: The judicial exception is not integrated into a practical application.
wherein the plurality of processing devices are further configured to (mere instructions to apply an exception, (2) Whether the claim invokes computers - MPEP 2106.05(f); generic computer components)
Since the claim as a whole, looking at the additional elements individually and in combination, does not contain any other additional elements that are indicative of integration into a practical application, the claim is directed to an abstract idea.
2B: The claim does not include additional elements that are sufficient to amount to significantly more than the judicial exception.
wherein the plurality of processing devices are further configured to (mere instructions to apply an exception, (2) Whether the claim invokes computers - MPEP 2106.05(f); generic computer components)
Considering the additional elements individually and in combination, and the claim as a whole, the additional elements do not provide significantly more than the abstract idea. Therefore, the claim is not patent eligible.
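For illustration, a doubled gating function output tensor containing two copies of the gating function output vector can be sketched as follows (values hypothetical):

```python
import numpy as np

gate = np.array([0.2, 0.5, 0.3])     # hypothetical gating function output vector

# doubled gating function output tensor: two copies of the vector
doubled = np.stack([gate, gate])     # shape [2, num_experts]
```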
With respect to claim 20:
2A Prong 1: The claim recites a judicial exception.
computing a gating function output vector based at least in part on the input tensor (mental process – evaluation; a human can manually compute a gating function output vector)
computing a sparse encoding of the input tensor and the gating function output vector, wherein: the sparse encoding indicates one or more destination expert sub-models included among a plurality of expert sub-models in the MoE layer (mental process – evaluation; a human can manually compute a sparse encoding of the input tensor and the gating function output vector)
computing the sparse encoding includes: computing a SoftMax output vector based at least in part on the gating function output vector (mental process – evaluation; a human can manually compute a SoftMax output vector)
computing a sparse SoftMax encoding of the SoftMax output vector (mental process – evaluation; a human can manually compute a sparse SoftMax encoding)
… compute a respective expert input tensor as a product of the input tensor and the sparse SoftMax encoding of the SoftMax output vector (mental process – evaluation; a human can manually compute a respective expert input tensor)
computing an expert output tensor… (mental process – evaluation; a human can manually compute an expert output tensor)
computing an MoE layer output at least in part by computing a sparse decoding of the expert output tensor, wherein computing the sparse decoding includes… compute a product of the expert output tensor and the sparse SoftMax encoding of the SoftMax output vector (mental process – evaluation; a human can manually compute an MoE layer output, and compute a product of the expert output tensor and the sparse SoftMax encoding)
2A Prong 2: The judicial exception is not integrated into a practical application.
a plurality of processing devices configured to execute a Mixture-of-Experts (MoE) layer included in an MoE model at least in part by (mere instructions to apply an exception, (2) Whether the claim invokes computers - MPEP 2106.05(f); generic computer components)
receiving an input tensor including a plurality of input tokens (insignificant extra-solution activity – MPEP 2106.05(g), (3) data gathering and outputting)
executing a first kernel via which each processing device of the plurality of processing devices is configured to (mere instructions to apply an exception, (2) Whether the claim invokes computers - MPEP 2106.05(f); generic computer components to execute a function, in light of spec [0046])
dispatching the input tensor for processing at the one or more destination expert sub-models (insignificant extra-solution activity – MPEP 2106.05(g), (3) data gathering and outputting)
… at the one or more destination expert sub-models (mere instructions to apply an exception – MPEP 2106.05(f), (3) The particularity or generality of the application of the judicial exception; high level recitation of using a model to compute a tensor)
executing a second kernel via which each processing device of the plurality of processing devices is configured to (mere instructions to apply an exception, (2) Whether the claim invokes computers - MPEP 2106.05(f); generic computer components to execute a function, in light of spec [0046])
conveying the MoE layer output to an additional computing process (insignificant extra-solution activity – MPEP 2106.05(g), (3) data gathering and outputting)
Since the claim as a whole, looking at the additional elements individually and in combination, does not contain any other additional elements that are indicative of integration into a practical application, the claim is directed to an abstract idea.
2B: The claim does not include additional elements that are sufficient to amount to significantly more than the judicial exception.
a plurality of processing devices configured to execute a Mixture-of-Experts (MoE) layer included in an MoE model at least in part by (mere instructions to apply an exception, (2) Whether the claim invokes computers - MPEP 2106.05(f); generic computer components)
receiving an input tensor including a plurality of input tokens (insignificant extra-solution activity – MPEP 2106.05(g), (3) data gathering and outputting, and WURC: receiving or transmitting data over a network, e.g., using the Internet to gather data, Symantec, 838 F.3d at 1321, 120 USPQ2d at 1362 - MPEP 2106.05(d)(II)(i))
executing a first kernel via which each processing device of the plurality of processing devices is configured to (mere instructions to apply an exception, (2) Whether the claim invokes computers - MPEP 2106.05(f); generic computer components to execute a function, in light of spec [0046])
dispatching the input tensor for processing at the one or more destination expert sub-models (insignificant extra-solution activity – MPEP 2106.05(g), (3) data gathering and outputting, and WURC: receiving or transmitting data over a network, e.g., using the Internet to gather data, Symantec, 838 F.3d at 1321, 120 USPQ2d at 1362 - MPEP 2106.05(d)(II)(i))
… at the one or more destination expert sub-models (mere instructions to apply an exception – MPEP 2106.05(f), (3) The particularity or generality of the application of the judicial exception; high level recitation of using a model to compute a tensor)
executing a second kernel via which each processing device of the plurality of processing devices is configured to (mere instructions to apply an exception, (2) Whether the claim invokes computers - MPEP 2106.05(f); generic computer components to execute a function, in light of spec [0046])
conveying the MoE layer output to an additional computing process (insignificant extra-solution activity – MPEP 2106.05(g), (3) data gathering and outputting and WURC: receiving or transmitting data over a network, e.g., using the Internet to gather data, Symantec, 838 F.3d at 1321, 120 USPQ2d at 1362 - MPEP 2106.05(d)(II)(i))
Considering the additional elements individually and in combination, and the claim as a whole, the additional elements do not provide significantly more than the abstract idea. Therefore, the claim is not patent eligible.
Claim Rejections - 35 USC § 102
The following is a quotation of the appropriate paragraphs of 35 U.S.C. 102 that form the basis for the rejections under this section made in this Office action:
A person shall be entitled to a patent unless –
(a)(1) the claimed invention was patented, described in a printed publication, or in public use, on sale or otherwise available to the public before the effective filing date of the claimed invention.
Claims 1-5, 7, 12-15, and 17 are rejected under 35 U.S.C. 102(a)(1) as being anticipated by Fedus ("Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity", published June 16, 2022).
In regard to claims 1 and 12, Fedus teaches: A computing system comprising: a plurality of processing devices configured to execute a Mixture-of-Experts (MoE) layer included in an MoE model at least in part by: (Fedus, p. 6, 2.2 Efficient Sparse Routing "We design our model with TPUs [TPU devices: a plurality of processing devices] in mind, which require statically declared sizes. Below we describe our distributed Switch Transformer implementation."; p. 1 Abstract "In deep learning, models typically reuse the same parameters for all inputs. Mixture of Experts (MoE) models defy this and instead select different parameters for each incoming example... We address these with the introduction of the Switch Transformer. [an MoE model]"; p. 5, Figure 2 "Illustration of a Switch Transformer encoder block. We replace the dense feed forward network (FFN) layer present in the Transformer with a sparse Switch FFN layer (light blue)... [a Mixture-of-Experts (MoE) layer] We diagram two tokens (x1 = 'More' and x2 = 'Parameters' below) being routed (solid lines) across four FFN experts, where the router independently routes each token.")
[Image: media_image1.png — Fedus, Figure 2 (greyscale)]
receiving an input tensor including a plurality of input tokens; (Fedus, p. 35, Figure 16: Pseudo code of the Switch Transformer layer... "
# inputs shape: [batch, seq_len, d_model]
batch, seq_len, d_model = inputs.get_shape()
# Each core will route tokens_per_core tokens to the correct experts.
tokens_per_core = batch * seq_len / num_cores";
receiving inputs [an input tensor] including batch * seq_len tokens [a plurality of input tokens]; the total number of tokens in the input tensor is the product of the batch size and the sequence length)
computing a gating function output vector based at least in part on the input tensor; computing… the gating function output vector… (Fedus, p. 34, Figure 15: Pseudo code for the router... "
# router_logits shape: [num_cores, tokens_per_core, num_experts]
router_logits = mtf.einsum([inputs, router_weights], reduced_dim=d_model)
...
# Probabilities for each token of what expert it should be sent to.
router_probs = mtf.softmax(router_logits, axis=-1)";
computing router_probs [a gating function output vector] based on the inputs [the input tensor])
computing a sparse encoding of the input tensor and… , wherein the sparse encoding indicates one or more destination expert sub-models included among a plurality of expert sub-models in the MoE layer; (Fedus, p. 35, Figure 16: Pseudo code of the Switch Transformer layer... "
# Matmul with large boolean tensor to assign tokens to the correct expert
...
# expert_inputs shape: [num_experts, num_cores, expert_capacity, d_model]
expert_inputs = mtf.einsum([inputs, dispatch_tensor], reduce_dims=[tokens_per_core])";
computing expert_inputs [a sparse encoding], where the input data are organized by 'which' expert [one or more destination expert sub-models, top-n expert] is assigned to process it)
dispatching the input tensor for processing at the one or more destination expert sub-models;
(Fedus, p. 35, Figure 16: Pseudo code of the Switch Transformer layer... "
# All-to-All communication. Cores split across num_cores and now we want to split
# across num_experts. This sends tokens, routed locally, to the correct expert now
# split across different cores
...
expert_inputs = mtf.reshape(expert_inputs, [num_experts, num_cores, expert_capacity, d_model])";
All-to-All communication, sending tokens to the correct expert [dispatching the input tensor for processing at the expert])
computing an expert output tensor at the one or more destination expert sub-models;
(Fedus, p. 35, Figure 16: Pseudo code of the Switch Transformer layer... "
# Standard feed forward computation, where each expert will have its own
# unique set of parameters
...
# expert_outputs shape: [num_experts, num_cores, expert_capacity, d_model]
expert_outputs = feed_forward(expert_inputs)";
computing expert_outputs [an expert output tensor] at the expert)
computing an MoE layer output at least in part by computing a sparse decoding of the expert output tensor; and (Fedus, p. 35, Figure 16: Pseudo code of the Switch Transformer layer... "
# Convert back to input shape and multiply outputs of experts by the routing probability.
# expert_outputs shape: [num_experts, num_cores, tokens_per_core, d_model]
# expert_outputs_combined shape: [num_cores, tokens_per_core, d_model]
...
expert_outputs_combined = mtf.einsum([expert_outputs, combine_tensor], reduce_dims=[tokens_per_core])
# Remove tokens_per_core shapes used for local routing dispatching to match input shape
...
outputs = mtf.reshape(expert_outputs_combined, [batch, seq_len, d_model])";
computing outputs [an MoE layer output] by first computing expert_outputs_combined [a sparse decoding])
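For illustration, the full sparse encode → expert compute → sparse decode round trip mapped above can be sketched in NumPy (single core, expert capacity ignored; the stand-in expert "FFN" merely scales its tokens, and all names follow the quoted pseudo code but are an illustrative sketch, not the reference implementation):

```python
import numpy as np

tokens, d_model, num_experts = 4, 6, 2
rng = np.random.default_rng(1)
inputs = rng.normal(size=(tokens, d_model))
expert_index = np.array([0, 1, 1, 0])   # destination expert per token
gate = np.array([0.9, 0.8, 0.7, 0.6])   # top-1 router probability per token

# dispatch_tensor: binary [tokens, num_experts]; combine_tensor weights it by the gate.
dispatch_tensor = np.eye(num_experts)[expert_index]
combine_tensor = dispatch_tensor * gate[:, None]

# Sparse encoding: gather each token into its expert's input block.
expert_inputs = np.einsum('td,te->etd', inputs, dispatch_tensor)
# Stand-in expert computation: expert 0 scales by 1.0, expert 1 by 2.0.
expert_outputs = expert_inputs * np.array([1.0, 2.0])[:, None, None]
# Sparse decoding: scatter back to token order, weighted by the routing probability.
outputs = np.einsum('etd,te->td', expert_outputs, combine_tensor)
```

Each output row equals the token's input scaled by its expert's factor and its gate value, mirroring how expert_outputs_combined is formed before the final reshape.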
conveying the MoE layer output to an additional computing process. (Fedus, p. 13, 4.1 Fine-Tuning "We then fine-tune across a diverse set of tasks using a dropout rate of 0.1 for all layers except the Switch layers"; there are multiple layers in the Switch Transformer model, i.e., the output of a Switch layer is conveyed to the next layer [an additional computing process])
In regard to claims 2 and 13, Fedus teaches: wherein the plurality of processing devices are further configured to: (Fedus, p. 6, 2.2 Efficient Sparse Routing "We design our model with TPUs [processing devices] in mind...")
compute a SoftMax output vector based at least in part on the gating function output vector; and (Fedus, p. 34, Figure 15: Pseudo code for the router... "
# Get the top-1 expert for each token. expert_gate is the top-1 probability
# from the router for each token. expert_index is what expert each token
# is going to be routed to
...
expert_gate, expert_index = mtf.top_1(router_probs, reduced_dim=num_experts)
# expert_mask shape: [num_cores, tokens_per_core, num_experts]
expert_mask = mtf.one_hot(expert_index, dimension=num_experts)";
computing expert_mask [a SoftMax output vector] based on router_probs [the gating function output vector])
compute the sparse encoding at least in part by computing a sparse SoftMax encoding of the SoftMax output vector. (Fedus, p. 35, Figure 16: Pseudo code of the Switch Transformer layer... "
# Matmul with large boolean tensor to assign tokens to the correct expert
...
# expert_inputs shape: [num_experts, num_cores, expert_capacity, d_model]
expert_inputs = mtf.einsum([inputs, dispatch_tensor], reduce_dims=[tokens_per_core])";
computing expert_inputs [the sparse encoding] by first computing dispatch_tensor [a sparse SoftMax encoding])
In regard to claims 3 and 14, Fedus teaches: wherein the plurality of processing devices are configured to (Fedus, p. 6, 2.2 Efficient Sparse Routing "We design our model with TPUs [processing devices] in mind...")
compute the sparse SoftMax encoding at least in part by setting each SoftMax output element of the SoftMax output vector, other than a predetermined number k of one or more selected SoftMax output elements, equal to zero. (Fedus, p. 34, Figure 15: Pseudo code for the router... "
# Get the top-1 expert for each token
...
# expert_mask shape: [num_cores, tokens_per_core, num_experts]
expert_mask = mtf.one_hot(expert_index, dimension=num_experts)
...
combine_tensor = mtf.to_bfloat16(combine_tensor)
...
dispatch_tensor = mtf.cast(combine_tensor, tf.bool)";
computing combine_tensor and dispatch_tensor [the sparse SoftMax encoding] (in top-1 routing [k = 1, a predetermined number k]) by first using the one_hot function to set the elements of expert_mask [the SoftMax output vector]. As a result, expert_mask is a one-hot vector, with one element equal to 1 and all other elements equal to 0)
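For illustration, the top-1 one-hot masking mapped above can be sketched in NumPy (an illustrative sketch mirroring mtf.top_1 and mtf.one_hot from the quoted pseudo code; every SoftMax output element other than the top-1 element is set to zero):

```python
import numpy as np

def top1_mask(router_probs):
    # router_probs: [tokens, num_experts], rows sum to 1.
    expert_index = router_probs.argmax(axis=-1)                 # top-1 expert per token
    expert_mask = np.eye(router_probs.shape[-1])[expert_index]  # one-hot [tokens, experts]
    expert_gate = router_probs.max(axis=-1)                     # top-1 probability per token
    return expert_gate, expert_index, expert_mask
```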
In regard to claims 4 and part of 15, Fedus teaches: wherein: the plurality of processing devices are configured to (Fedus, p. 6, 2.2 Efficient Sparse Routing "We design our model with TPUs [processing devices] in mind...")
assign the input tokens to the one or more destination expert sub-models as specified by the selected SoftMax output elements; and (Fedus, p. 35, Figure 16: Pseudo code of the Switch Transformer layer... "
# Matmul with large boolean tensor to assign tokens to the correct expert
...
# expert_inputs shape: [num_experts, num_cores, expert_capacity, d_model]
expert_inputs = mtf.einsum([inputs, dispatch_tensor], reduce_dims=[tokens_per_core])";
assigning the input tokens to the expert as specified by dispatch_tensor, which is generated by the 'router' function, which includes the calculation of the selected SoftMax output elements)
the predetermined number k is equal to a number of the one or more destination expert sub-models. (Fedus, p. 34, Figure 15: Pseudo code for the router... "
# Get the top-1 expert for each token. expert_gate is the top-1 probability
# from the router for each token. expert_index is what expert each token
# is going to be routed to
...
expert_gate, expert_index = mtf.top_1(router_probs, reduced_dim=num_experts)";
the top-1 [k = 1, a predetermined number k] expert)
In regard to claims 5 and part of 15, Fedus teaches: wherein the predetermined number k of the SoftMax output elements are the top-k largest SoftMax output elements among the plurality of SoftMax output elements. (Fedus, p. 34, Figure 15: Pseudo code for the router... "
# Get the top-1 expert for each token. expert_gate is the top-1 probability
# from the router for each token. expert_index is what expert each token
# is going to be routed to
...
expert_gate, expert_index = mtf.top_1(router_probs, reduced_dim=num_experts)
# expert_mask shape: [num_cores, tokens_per_core, num_experts]
expert_mask = mtf.one_hot(expert_index, dimension=num_experts)";
elements in router_probs correspond to elements in expert_mask; the single element retained in expert_mask corresponds to the largest SoftMax output element)
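For illustration, the top-k-largest selection mapped above generalizes the top-1 case and can be sketched in NumPy (an illustrative sketch under the assumption of a generic k; the quoted Fedus pseudo code implements only k = 1):

```python
import numpy as np

def topk_sparsify(softmax_out, k):
    # Keep the k largest SoftMax output elements per token; zero all others.
    topk_idx = np.argsort(softmax_out, axis=-1)[..., -k:]
    sparse = np.zeros_like(softmax_out)
    np.put_along_axis(sparse, topk_idx,
                      np.take_along_axis(softmax_out, topk_idx, axis=-1), axis=-1)
    return sparse
```

With k = 1 this reproduces the one-hot behavior of expert_mask in Figure 15.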
In regard to claims 7 and 17, Fedus teaches: wherein the plurality of processing devices are further configured to: (Fedus, p. 6, 2.2 Efficient Sparse Routing "We design our model with TPUs [processing devices] in mind…")
dispatch the sparse encoding across the plurality of processing devices in an all-to-all dispatch operation; and (Fedus, p. 35, Figure 16: Pseudo code of the Switch Transformer layer... "
# All-to-All communication. Cores split across num_cores and now we want to split
# across num_experts. This sends tokens, routed locally, to the correct expert now
# split across different cores
...
expert_inputs = mtf.reshape(expert_inputs, [num_experts, num_cores, expert_capacity, d_model])";
All-to-All communication, dispatching expert_inputs [the sparse encoding] to specialized experts distributed across different devices and cores)
compute the MoE layer output at least in part by combining respective sparse decodings of a plurality of expert output tensors across the plurality of processing devices in an all-to-all combine operation. (Fedus, p. 35, Figure 16: Pseudo code of the Switch Transformer layer... "
# All-to-All communication. Cores are currently split across the experts
# dimension, which needs to be switched back to being split across num_cores
...
expert_outputs = mtf.reshape(expert_outputs, [num_experts, num_cores, expert_capacity, d_model])
# Convert back to input shape and multiply outputs of experts by the routing probability.
# expert_outputs shape: [num_experts, num_cores, tokens_per_core, d_model]
# expert_outputs_combined shape: [num_cores, tokens_per_core, d_model]
...
expert_outputs_combined = mtf.einsum([expert_outputs, combine_tensor], reduce_dims=[tokens_per_core])
# Remove tokens_per_core shapes used for local routing dispatching to match input shape
...
outputs = mtf.reshape(expert_outputs_combined, [batch, seq_len, d_model])";
computing outputs [the MoE layer output] by first combining expert_outputs [respective sparse decodings of a plurality of expert output tensors] across num_cores in All-to-All communication, i.e., aggregating the outputs from the selected experts)
Claim Rejections - 35 USC § 103
In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.
The factual inquiries for establishing a background for determining obviousness under 35 U.S.C. 103 are summarized as follows:
1. Determining the scope and contents of the prior art.
2. Ascertaining the differences between the prior art and the claims at issue.
3. Resolving the level of ordinary skill in the pertinent art.
4. Considering objective evidence present in the application indicating obviousness or nonobviousness.
Claims 6, 8-9, 16, 18 and 20 are rejected under 35 U.S.C. 103 as being unpatentable over Fedus as applied to claims 3 and 14 above, and further in view of Jiang ("A novel data transformation and execution strategy for accelerating sparse matrix multiplication on GPUs" 20200219)
In regard to claims 6 and 16, Fedus teaches: wherein: the predetermined number k is equal to one; and (Fedus, p. 34, Figure 15: Pseudo code for the router... "
# Get the top-1 expert for each token. expert_gate is the top-1 probability
# from the router for each token. expert_index is what expert each token
# is going to be routed to
...
expert_gate, expert_index = mtf.top_1(router_probs, reduced_dim=num_experts)";
the top-1 [a predetermined number k] expert)
subsequently to setting each of the plurality of SoftMax output elements other than one SoftMax output element to zero, the plurality of processing devices are further configured to (Fedus, p. 34, Figure 15: Pseudo code for the router... "
# Get the top-1 expert for each token
...
# expert_mask shape: [num_cores, tokens_per_core, num_experts]
expert_mask = mtf.one_hot(expert_index, dimension=num_experts)
...
dispatch_tensor = mtf.cast(combine_tensor, tf.bool)";
after setting the elements in expert_mask)
Fedus does not teach, but Jiang teaches: compress the sparse SoftMax encoding into a scalar equal to the nonzero SoftMax output element. (Jiang, p. 377, 2.1 Compressed Sparse Row Representation "The compressed sparse row (CSR) representation is one of the most widely used data structures for storing sparse matrices [17, 36]. As shown in Fig 1... The corresponding values of these nonzeros are stored in the array value."; compressing the sparse binary matrix)
It would have been obvious to one of ordinary skill in the art, before the effective filing date of the claimed invention, to have modified Fedus to incorporate the teachings of Jiang by including compressed sparse row (CSR) representation and matrix multiplication on GPU devices. Doing so would allow easy access to a row of a sparse matrix, improve data locality for SpMM and SDDMM, and speed up those operations. (Jiang, p. 377, 2.1 Compressed Sparse Row Representation "With CSR format, we can easily access a row of the sparse matrix"; p. 376, Abstract "Experimental evaluation using 1084 sparse matrices from SuiteSparse collection and Network Repository shows that our technique achieves up to 2.91x speedup for SpMM and up to 3.19x speedup for SDDMM against the state-of-the-art alternatives on an Nvidia P100 GPU.")
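For illustration, the CSR compression described by Jiang can be sketched in NumPy (an illustrative sketch of the standard CSR layout, not Jiang's GPU implementation; `to_csr` is a name introduced here):

```python
import numpy as np

def to_csr(dense):
    # Compress a sparse matrix into the three CSR arrays:
    # values (nonzeros), col_idx (their columns), row_ptr (row boundaries).
    values, col_idx, row_ptr = [], [], [0]
    for row in dense:
        for j, v in enumerate(row):
            if v != 0:
                values.append(v)
                col_idx.append(j)
        row_ptr.append(len(values))
    return np.array(values), np.array(col_idx), np.array(row_ptr)
```

Row i is recovered as values[row_ptr[i]:row_ptr[i+1]] with columns col_idx[row_ptr[i]:row_ptr[i+1]]; for a one-hot row of a dispatch mask, the stored data for that row reduces to a single nonzero value, consistent with compressing the sparse SoftMax encoding into a scalar as mapped in the rejection of claim 6.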
In regard to claims 8 and part of 18, Fedus teaches: wherein the plurality of processing devices are configured to… (Fedus, p. 6, 2.2 Efficient Sparse Routing "We design our model with TPUs [processing devices] in mind...")
… compute the sparse encoding at least in part… compute a respective expert input tensor as a product of the input tensor and the sparse SoftMax encoding of the SoftMax output vector. (Fedus, p. 35, Figure 16: Pseudo code of the Switch Transformer layer... "
# Matmul with large boolean tensor to assign tokens to the correct expert
...
# expert_inputs shape: [num_experts, num_cores, expert_capacity, d_model]
expert_inputs = mtf.einsum([inputs, dispatch_tensor], reduce_dims=[tokens_per_core])";
p. 22 "This binary matrix is then used to do a gather via matrix multiplication [a product] with the input tensor of [n, B/n, d_model]. einsum(...) (7)"; computing expert_inputs [the sparse encoding, a respective expert input tensor] as einsum [a product] of inputs [the input tensor] and dispatch_tensor [the sparse SoftMax encoding])
Fedus does not teach, but Jiang teaches: by executing a first kernel via which each processing device of the plurality of processing devices is configured to (Jiang, p. 376, Abstract "In this work, we propose a novel row-reordering technique to improve data locality for SpMM and SDDMM on GPUs. [executing a kernel via (GPU) devices]"; p. 386, 7 Conclusion "In this paper, we have developed a row-reordering technique to improve the performance of two important sparse matrix multiplication kernels: [a kernel] SpMM and SDDMM."; in light of spec [0044][0046]; a kernel can be a function running on GPU, executing a kernel is performing matrix multiplication)
The rationale for combining the teachings of Fedus and Jiang is the same as set forth in the rejection of claim 6.
In regard to claims 9 and part of 18, Fedus teaches: wherein the plurality of processing devices are further configured to… (Fedus, p. 6, 2.2 Efficient Sparse Routing "We design our model with TPUs [processing devices] in mind...") compute the sparse decoding … compute a product of the expert output tensor and the sparse SoftMax encoding of the SoftMax output vector. (Fedus, p. 35, Figure 16: Pseudo code of the Switch Transformer layer... "
# Convert back to input shape and multiply outputs of experts by the routing probability.
# expert_outputs shape: [num_experts, num_cores, tokens_per_core, d_model]
# expert_outputs_combined shape: [num_cores, tokens_per_core, d_model]
...
expert_outputs_combined = mtf.einsum([expert_outputs, combine_tensor], reduce_dims=[tokens_per_core])";
computing einsum [a product] of expert_outputs [the expert output tensor] and combine_tensor [the sparse SoftMax encoding])
Fedus does not teach, but Jiang teaches: at least in part by executing a second kernel via which each processing device of the plurality of processing devices is configured to (Jiang, p. 376, Abstract "In this work, we propose a novel row-reordering technique to improve data locality for SpMM and SDDMM on GPUs. [executing a kernel via (GPU) devices]"; p. 386, 7 Conclusion "In this paper, we have developed a row-reordering technique to improve the performance of two important sparse matrix multiplication kernels: [a kernel] SpMM and SDDMM."; in light of spec [0044][0047]; a kernel can be a function running on GPU, executing a kernel is performing matrix multiplication)
The rationale for combining the teachings of Fedus and Jiang is the same as set forth in the rejection of claim 6.
In regard to claim 20, Fedus teaches: A computing system comprising: a plurality of processing devices configured to execute a Mixture-of-Experts (MoE) layer included in an MoE model at least in part by: (Fedus, p. 6, 2.2 Efficient Sparse Routing "We design our model with TPUs [TPU devices: a plurality of processing devices] in mind, which require statically declared sizes. Below we describe our distributed Switch Transformer implementation."; p.1 Abstract "In deep learning, models typically reuse the same parameters for all inputs. Mixture of Experts (MoE) models defy this and instead select different parameters for each in coming example... We address these with the introduction of the Switch Transformer. [an MoE model]"; p. 5, Figure 2 "Illustration of a Switch Transformer encoder block. We replace the dense feed forward network (FFN) layer present in the Transformer with a sparse Switch FFN layer (light blue)... [a Mixture-of-Experts (MoE) layer] We diagram two tokens (x1 = 'More' and x2 = 'Parameters' below) being routed (solid lines) across four FFN experts, where the router independently routes each token.")
receiving an input tensor including a plurality of input tokens; (Fedus, p. 35, Figure 16: Pseudo code of the Switch Transformer layer... "
# inputs shape: [batch, seq_len, d_model]
batch, seq_len, d_model = inputs.get_shape()
# Each core will route tokens_per_core tokens to the correct experts.
tokens_per_core = batch * seq_len / num_cores";
receiving inputs [an input tensor] including batch * seq_len tokens [a plurality of input tokens]; the total number of tokens in the input tensor is the product of the batch size and the sequence length)
computing a gating function output vector based at least in part on the input tensor; computing... the gating function output vector (Fedus, p. 34, Figure 15: Pseudo code for the router... "
# router_logits shape: [num_cores, tokens_per_core, num_experts]
router_logits = mtf.einsum([inputs, router_weights], reduced_dim=d_model)
...
# Probabilities for each token of what expert it should be sent to.
router_probs = mtf.softmax(router_logits, axis=-1)";
computing router_probs [a gating function output vector] based on the inputs [the input tensor])
computing a sparse encoding of the input tensor and..., wherein: the sparse encoding indicates one or more destination expert sub-models included among a plurality of expert sub-models in the MoE layer; and computing the sparse encoding includes: (Fedus, p. 35, Figure 16: Pseudo code of the Switch Transformer layer... "
# Matmul with large boolean tensor to assign tokens to the correct expert
...
# expert_inputs shape: [num_experts, num_cores, expert_capacity, d_model]
expert_inputs = mtf.einsum([inputs, dispatch_tensor], reduce_dims=[tokens_per_core])";
computing expert_inputs [a sparse encoding], where the input data are organized by which expert [one or more destination expert sub-models, top-n expert] is assigned to process them)
computing a SoftMax output vector based at least in part on the gating function output vector; (Fedus, p. 34, Figure 15: Pseudo code for the router... "
# Get the top-1 expert for each token. expert_gate is the top-1 probability
# from the router for each token. expert_index is what expert each token
# is going to be routed to
...
expert_gate, expert_index = mtf.top_1(router_probs, reduced_dim=num_experts)
# expert_mask shape: [num_cores, tokens_per_core, num_experts]
expert_mask = mtf.one_hot(expert_index, dimension=num_experts)";
computing expert_mask [a SoftMax output vector] based on router_probs [the gating function output vector])
computing a sparse SoftMax encoding of the SoftMax output vector; and (Fedus, p. 34, Figure 15: Pseudo code for the router... "
# Cast back outputs to bfloat16 for the rest of the layer.
combine_tensor = mtf.to_bfloat16(combine_tensor)
# Create binary dispatch tensor that is 1 if the token gets routed to the corresponding expert.
# dispatch_tensor shape: [num_cores, tokens_per_core, num_experts, expert_capacity]
dispatch_tensor = mtf.cast(combine_tensor, tf.bool)";
computing combine_tensor and dispatch_tensor [a sparse SoftMax encoding])
compute a respective expert input tensor as a product of the input tensor and the sparse SoftMax encoding of the SoftMax output vector; (Fedus, p. 35, Figure 16: Pseudo code of the Switch Transformer layer... "
# Matmul with large boolean tensor to assign tokens to the correct expert
...
# expert_inputs shape: [num_experts, num_cores, expert_capacity, d_model]
expert_inputs = mtf.einsum([inputs, dispatch_tensor], reduce_dims=[tokens_per_core])";
p. 22 "This binary matrix is then used to do a gather via matrix multiplication [a product] with the input tensor of [n, B/n, d_model]. einsum(...) (7)"; computing expert_inputs [a respective expert input tensor] as einsum [a product] of inputs [the input tensor] and dispatch_tensor [the sparse SoftMax encoding])
dispatching the input tensor for processing at the one or more destination expert sub-models; (Fedus, p. 35, Figure 16: Pseudo code of the Switch Transformer layer... "
# All-to-All communication. Cores split across num_cores and now we want to split
# across num_experts. This sends tokens, routed locally, to the correct expert now
# split across different cores
...
expert_inputs = mtf.reshape(expert_inputs, [num_experts, num_cores, expert_capacity, d_model])";
All-to-All communication, sending tokens to the correct expert [dispatching the input tensor for processing at the expert])
computing an expert output tensor at the one or more destination expert sub-models; (Fedus, p. 35, Figure 16: Pseudo code of the Switch Transformer layer... "
# Standard feed forward computation, where each expert will have its own
# unique set of parameters
...
# expert_outputs shape: [num_experts, num_cores, expert_capacity, d_model]
expert_outputs = feed_forward(expert_inputs)";
computing expert_outputs [an expert output tensor] at the expert)
computing an MoE layer output at least in part by computing a sparse decoding of the expert output tensor, (Fedus, p. 35, Figure 16: Pseudo code of the Switch Transformer layer... "
# Convert back to input shape and multiply outputs of experts by the routing probability.
# expert_outputs shape: [num_experts, num_cores, tokens_per_core, d_model]
# expert_outputs_combined shape: [num_cores, tokens_per_core, d_model]
...
expert_outputs_combined = mtf.einsum([expert_outputs, combine_tensor], reduce_dims=[tokens_per_core])
# Remove tokens_per_core shapes used for local routing dispatching to match input shape
...
outputs = mtf.reshape(expert_outputs_combined, [batch, seq_len, d_model])";
computing outputs [an MoE layer output] by first computing expert_outputs_combined [a sparse decoding])
compute a product of the expert output tensor and the sparse SoftMax encoding of the SoftMax output vector; and (Fedus, p. 35, Figure 16: Pseudo code of the Switch Transformer layer... "
# Convert back to input shape and multiply outputs of experts by the routing probability.
# expert_outputs shape: [num_experts, num_cores, tokens_per_core, d_model]
# expert_outputs_combined shape: [num_cores, tokens_per_core, d_model]
...
expert_outputs_combined = mtf.einsum([expert_outputs, combine_tensor], reduce_dims=[tokens_per_core])";
computing einsum [a product] of expert_outputs [the expert output tensor] and combine_tensor [the sparse SoftMax encoding])
conveying the MoE layer output to an additional computing process. (Fedus, p. 13, 4.1 Fine-Tuning "We then fine-tune across a diverse set of tasks using a dropout rate of 0.1 for all layers except the Switch layers"; there are multiple layers in the switch transformer model, i.e. the output y of a switch layer is conveyed to another layer [an additional computing process])
Fedus does not teach, but Jiang teaches: executing a first kernel via which each processing device of the plurality of processing devices is configured to (Jiang, p. 376, Abstract "In this work, we propose a novel row-reordering technique to improve data locality for SpMM and SDDMM on GPUs. [executing a kernel via (GPU) devices]"; p. 386, 7 Conclusion "In this paper, we have developed a row-reordering technique to improve the performance of two important sparse matrix multiplication kernels: [a kernel] SpMM and SDDMM."; in light of spec [0044][0046] a kernel can be a function running on GPU, executing a kernel is performing matrix multiplication)
… wherein computing the sparse decoding includes executing a second kernel via which each processing device of the plurality of processing devices is configured to (Jiang, p. 376, Abstract "In this work, we propose a novel row-reordering technique to improve data locality for SpMM and SDDMM on GPUs. [executing a kernel via (GPU) devices]"; p. 386, 7 Conclusion "In this paper, we have developed a row-reordering technique to improve the performance of two important sparse matrix multiplication kernels: [a kernel] SpMM and SDDMM."; in light of spec [0044][0047] a kernel can be a function running on GPU, executing a kernel is performing matrix multiplication)
The rationale for combining the teachings of Fedus and Jiang is the same as set forth in the rejection of claim 6.
Claim 11 is rejected under 35 U.S.C. 103 as being unpatentable over Fedus as applied to claim 1 above, and further in view of Googler ("MoE.py" 20220615)
In regard to claim 11, Fedus teaches: wherein the plurality of processing devices are further configured to: (Fedus, p. 6, 2.2 Efficient Sparse Routing "We design our model with TPUs [processing devices] in mind...")
Fedus does not teach, but Googler teaches: wherein the plurality of processing devices are further configured to: compute a doubled gating function output tensor including two copies of the gating function output vector; and (Googler,
line 1442 "def _top_2_gating(...";
line 1523 "gate_logits = mtf.layers.dense(...)";
line 1528 "raw_gates = mtf.softmax(gate_logits, experts_dim)";
line 1532 "# FIND TOP 2 EXPERTS PER POSITON
# Find the top expert for each position. shape=[batch, group]
gate_1, index_1 = mtf.top_1(raw_gates, experts_dim)
# [batch, group, experts]
mask_1 = mtf.one_hot(index_1, experts_dim, dtype=raw_gates.dtype)
...
gate_2, index_2 = mtf.top_1(gates_without_top_1, experts_dim)
# [batch, group, experts]
mask_2 = mtf.one_hot(index_2, experts_dim, dtype=raw_gates.dtype)";
line 1707 "return dispatch_tensor, combine_tensor, loss";
computing gate_1, index_1, mask_1 and gate_2, index_2, mask_2 [a doubled gating function output tensor including two copies of the gating function output vector])
compute the sparse encoding based at least in part on the doubled gating function output vector. (Googler,
line 481 "expert_inputs = mtf.einsum([inputs, dispatch_tensor]...";
line 741 "if hparams.moe_gating == "top_2": dispatch_tensor_x, combine_tensor_x, loss_outer = _top_2_gating(...)";
computing expert_inputs [the sparse encoding] based on dispatch_tensor, which can be the output of the top-2 gating function when moe_gating == 'top_2')
It would have been obvious to one of ordinary skill in the art, before the effective filing date of the claimed invention, to have modified Fedus to incorporate the teachings of Googler by including a top-2 gating function. Doing so would allow most of the tokens to be routed and guarantee that virtually no tokens are dropped. (Fedus, p. 30, Figure 11 "Stage 1 is equivalent to Switch routing where tokens are routed to the expert with the highest probability from the router. In Stage 2 we look at all tokens that have overflowed and route them to the expert with which has the second highest probability. Tokens can still be overflowed if their second highest expert has too many tokens, but this allows most of the tokens to be routed. This process can be iterated to guarantee virtually no tokens are dropped at all.")
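For illustration, the top-2 gating structure quoted from Googler's _top_2_gating can be sketched in NumPy (the top-1 expert is found, masked out, and top-1 is applied again to find the second expert; an illustrative sketch using the quoted variable names, not the MoE.py implementation):

```python
import numpy as np

def top2_gating(router_probs):
    # router_probs: [tokens, num_experts], rows sum to 1.
    num_experts = router_probs.shape[-1]
    index_1 = router_probs.argmax(axis=-1)
    mask_1 = np.eye(num_experts)[index_1]
    gate_1 = router_probs.max(axis=-1)
    # Zero out the top-1 expert, then take top-1 again for the second expert.
    gates_without_top_1 = router_probs * (1.0 - mask_1)
    index_2 = gates_without_top_1.argmax(axis=-1)
    mask_2 = np.eye(num_experts)[index_2]
    gate_2 = gates_without_top_1.max(axis=-1)
    return (gate_1, index_1, mask_1), (gate_2, index_2, mask_2)
```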
Conclusion
The prior art made of record and not relied upon is considered pertinent to applicant's disclosure.
Huang ("GPipe: Efficient Training of Giant Neural Networks using Pipeline Parallelism" 20190725) teaches (Huang, p.3, 2.1 Interface "The set of parameters corresponding to pk is equivalent to the union of wi, wi+1, ..., wj, and its forward function would be Fk = fj◦...◦fi+1◦fi. The corresponding back-propagation function Bk can be computed from Fk using automatic symbolic differentiation."; p. 4, 2.3 Performance Optimization "During the backward pass, the k-th accelerator recomputes the composite forward function Fk"; p. 5, 3 Performance Analyses "We next trained Transformer models using Cloud TPUv3s with 16GB memory per accelerator core.")
Lepikhin ("GShard: Scaling Giant Models with Conditional Computation and Automatic Sharding" 20200630) teaches (Lepikhin, p. 20, 5.1 Memory Efficiency and Scalability "When the memory requirement exceeds available memory on each device, compiler-based rematerialization will automatically recompute part of the activations in the backward pass [activation re-computation] in order to reduce peak activation memory.")
Huang and Lepikhin are the closest prior art to claim 10; both teach re-computing the intermediate values on the fly during the backward pass.
Any inquiry concerning this communication or earlier communications from the examiner should be directed to SU-TING CHUANG whose telephone number is (408)918-7519. The examiner can normally be reached Monday - Thursday 8-5 PT.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Usmaan Saeed can be reached at (571) 272-4046. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.
/SU-TING CHUANG/Examiner, Art Unit 2146