DETAILED ACTION
Claims 1-20 are presented for examination.
This Office action is in response to the submission of the application on 10-NOVEMBER-2022.
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Response to Amendment
The amendment filed on 24-OCTOBER-2025 in response to the non-final office action mailed 24-JULY-2025 has been entered. Claims 1-20 remain pending in the application.
With regard to the 101 rejection, applicant’s amendments have not overcome the rejection of claim 1. Despite applicant’s amendments, claim 1 remains rejected under 35 U.S.C. 101 as being directed to an abstract idea.
With regard to the 102(a)(1) rejections, applicant’s amendments to the claims have not overcome the rejections of claims 1, 6-7, 11-12, 16-17, and 19-20, as the previously cited prior art sufficiently teaches the newly added limitations of the amended claims.
With regard to the 103 rejections, applicant’s amendments and arguments have not overcome the rejections of claims 2-5 and 12-15, as the previously cited prior art sufficiently teaches the newly added limitations of the amended claims.
Claim Rejections - 35 USC § 101
35 U.S.C. 101 reads as follows:
Whoever invents or discovers any new and useful process, machine, manufacture, or composition of matter, or any new and useful improvement thereof, may obtain a patent therefor, subject to the conditions and requirements of this title.
Claims 1-19 are rejected under 35 U.S.C. 101 because the claimed invention is directed to an abstract idea without significantly more.
Regarding claim 1, in Step 1 of the 101 analysis set forth in MPEP 2106, the claim recites a computing system configured to execute an MoE model. A machine is one of the four statutory categories of invention.
In Step 2A Prong 1 of the 101 analysis set forth in MPEP 2106, the examiner has determined that the following limitations recite a process that, under the broadest reasonable interpretation, covers a mental process but for the recitation of generic computer components:
selecting, from among a plurality of expert sub-models of the MoE layer, one or more destination expert sub-models associated with the plurality of input tokens, wherein respective numbers k differ across the plurality of iterations, k being the number of expert sub-models selected as the one or more destination expert sub-models; (one can mentally select a group of things as a process of simply evaluating data and making a judgment on that data)
If claim limitations, under their broadest reasonable interpretation, cover performance of the limitations as a mental process but for the recitation of generic computer components, then they fall within the mental process grouping of abstract ideas. Accordingly, the claim “recites” an abstract idea.
In Step 2a Prong 2 of the 101 analysis set forth in MPEP 2106, the examiner has determined that
the following additional elements do not integrate this judicial exception into a practical application:
A computing system comprising: a plurality of processing devices configured to execute a Mixture-of-Experts (MoE) layer included in an MoE model at least in part by: in each of a plurality of iterations: at each of the plurality of processing devices: (Generally linking the use of the judicial exception to a particular technological environment or field of use (MPEP 2106.05(h)))
receiving a respective plurality of input tokens; (Adding insignificant extra-solution activity (mere data gathering) to the judicial exception (MPEP 2106.05(g)))
and conveying the plurality of input tokens to the one or more destination expert sub-models; (Adding insignificant extra-solution activity (mere data gathering) to the judicial exception (MPEP 2106.05(g)))
generating one or more respective expert sub-model outputs at the one or more destination expert sub-models based at least in part on the respective input tokens received at the one or more destination expert sub-models; generating an MoE layer output based at least in part on the one or more expert sub-model outputs; (In step 2A prong 2, generating a model is a mere application of a computer tool (M.L. Model), which is not indicative of integration into a practical application. In step 2B, merely applying a computer tool is not indicative of significantly more.)
and outputting the MoE layer output to an additional computing process. (Adding insignificant extra-solution activity (mere data gathering) to the judicial exception (MPEP 2106.05(g)))
Since the claim does not contain any other additional elements that are indicative of integration into a practical application, the claim is “directed” to an abstract idea.
In Step 2b of the 101 analysis set forth in the 2019 PEG, the examiner has determined that the claim does not include additional elements that are sufficient to amount to significantly more than the judicial exception.
As discussed above, additional element (ii) recites generally linking the use of the judicial exception to a particular technological environment or field of use; additional elements (iii), (iv), and (vi) recite adding insignificant extra-solution activity; and additional element (v) recites a mere application of a computer tool, which is not indicative of significantly more. Considering the additional elements individually and in combination, and the claim as a whole, the additional elements do not provide significantly more than the abstract idea. Therefore, the claim is not patent eligible.
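For purposes of illustration only, the sequence of operations recited in claim 1 (receiving input tokens, selecting k destination expert sub-models per iteration, conveying the tokens, and combining the expert outputs into an MoE layer output) can be sketched as follows. All function names, the toy scoring function, and the toy experts below are hypothetical and are not drawn from the application or the cited art:

```python
def select_destination_experts(scores, k):
    # Rank experts by routing score, highest first, and keep the top k
    # ("destination expert sub-models"); k may differ across iterations.
    ranked = sorted(range(len(scores)), key=lambda e: scores[e], reverse=True)
    return ranked[:k]

def moe_layer_forward(tokens, experts, score_fn, k):
    # One iteration at one processing device: for each received input token,
    # score the experts, select k destination experts, convey the token to
    # them, and combine their outputs into the MoE layer output.
    layer_output = []
    for tok in tokens:
        scores = score_fn(tok)
        dests = select_destination_experts(scores, k)
        expert_outs = [experts[e](tok) for e in dests]
        layer_output.append(sum(expert_outs) / len(expert_outs))
    return layer_output

# Toy usage: hypothetical experts and fixed routing scores.
experts = [lambda x: x + 1.0, lambda x: x * 2.0, lambda x: x - 3.0]
score_fn = lambda tok: [0.2, 0.5, 0.3]
outputs_by_iteration = [moe_layer_forward([4.0, 10.0], experts, score_fn, k)
                        for k in (1, 2)]
```

Running the sketch with k = 1 and then k = 2 illustrates the recited feature that the respective numbers k differ across the plurality of iterations.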
Regarding claim 2, it is dependent upon claim 1, and thereby incorporates the limitations of, and corresponding analysis applied to, claim 1. Further, claim 2 recites The computing system of claim 1, wherein: the plurality of processing devices are further configured to set an expert capacity shared by the one or more destination expert sub-models; and the expert capacity is a maximum number of input tokens configured to be processed at each of the one or more destination expert sub-models during an iteration of the plurality of iterations. (Generally linking the use of the judicial exception to a particular technological environment or field of use (MPEP 2106.05(h))) Since the claim does not recite additional elements that either integrate the judicial exception into a practical application or provide significantly more than the judicial exception, the claim is not patent eligible.
Regarding claim 3, it is dependent upon claim 2, and thereby incorporates the limitations of, and corresponding analysis applied to, claim 2. Further, claim 3 recites The computing system of claim 2, wherein the plurality of processing devices are further configured to: compute the expert capacity based at least in part on a capacity factor of the MoE layer; and dynamically modify the capacity factor of the one or more destination expert sub-models over the plurality of iterations. (In Step 2A, Prong 1, this recites a mathematical concept but for the recitation of generic computer components, which is not indicative of integration into a practical application.) Since the claim does not recite additional elements that either integrate the judicial exception into a practical application or provide significantly more than the judicial exception, the claim is not patent eligible.
Regarding claim 4, it is dependent upon claim 3, and thereby incorporates the limitations of, and corresponding analysis applied to, claim 3. Further, claim 4 recites The computing system of claim 3, wherein the plurality of processing devices are further configured to dynamically modify the capacity factor over the plurality of iterations at least in part by, during each of the iterations, setting the capacity factor to a maximum among one or more respective numbers of the input tokens respectively received at the one or more destination expert sub-models during the iteration. (In Step 2A, Prong 1, this recites a mathematical concept but for the recitation of generic computer components, which is not indicative of integration into a practical application.) Since the claim does not recite additional elements that either integrate the judicial exception into a practical application or provide significantly more than the judicial exception, the claim is not patent eligible.
Regarding claim 5, it is dependent upon claim 4, and thereby incorporates the limitations of, and corresponding analysis applied to, claim 4. Further, claim 5 recites The computing system of claim 3, wherein the plurality of processing devices are further configured to set a predefined upper bound on the capacity factor. (Generally linking the use of the judicial exception to a particular technological environment or field of use (MPEP 2106.05(h))) Since the claim does not recite additional elements that either integrate the judicial exception into a practical application or provide significantly more than the judicial exception, the claim is not patent eligible.
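For illustration only, the expert-capacity mechanics recited in claims 2-5 (a capacity computed from a capacity factor of the MoE layer, a capacity factor set each iteration to the maximum number of tokens received at any destination expert, and a predefined upper bound) can be sketched as follows. The function names and the particular capacity formula are hypothetical assumptions, not the applicant's disclosed implementation:

```python
def expert_capacity(tokens_per_batch, num_experts, capacity_factor):
    # Claim 3's relationship, under a commonly used formulation (assumed
    # here): capacity is derived from a capacity factor of the MoE layer.
    return int(capacity_factor * tokens_per_batch / num_experts)

def dynamic_capacity_factor(tokens_received_per_expert, upper_bound=None):
    # Claims 4-5: set the capacity factor to the maximum number of input
    # tokens received at any destination expert sub-model this iteration,
    # optionally clipped to a predefined upper bound.
    factor = max(tokens_received_per_expert)
    if upper_bound is not None:
        factor = min(factor, upper_bound)
    return factor
```

For example, with 64 tokens, 8 experts, and a capacity factor of 1.5, each expert would accept at most 12 tokens; if the destination experts received 3, 7, and 5 tokens in an iteration, the dynamic capacity factor would be 7, or 6 under an upper bound of 6.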
Regarding claim 6, it is dependent upon claim 1, and thereby incorporates the limitations of, and corresponding analysis applied to, claim 1. Further, claim 6 recites The computing system of claim 1, wherein the plurality of processing devices are further configured to select the one or more destination expert sub-models at least in part by identifying the one or more expert sub-models corresponding to the k highest routing scores included in a gating function output vector of a gating function. (In Step 2A, Prong 1, this recites an abstract idea but for the recitation of generic computer components, which is not indicative of integration into a practical application.) Since the claim does not recite additional elements that either integrate the judicial exception into a practical application or provide significantly more than the judicial exception, the claim is not patent eligible.
Regarding claim 7, it is dependent upon claim 6, and thereby incorporates the limitations of, and corresponding analysis applied to, claim 6. Further, claim 7 recites The computing system of claim 6, wherein the gating function includes a linear layer configured to receive the plurality of input tokens. (Generally linking the use of the judicial exception to a particular technological environment or field of use (MPEP 2106.05(h))) Since the claim does not recite additional elements that either integrate the judicial exception into a practical application or provide significantly more than the judicial exception, the claim is not patent eligible.
Regarding claim 8, it is dependent upon claim 7, and thereby incorporates the limitations of, and corresponding analysis applied to, claim 7. Further, claim 8 recites The computing system of claim 7, wherein the gating function further includes: a cosine similarity function configured to receive a linear layer output from the linear layer; and a SoftMax activation function that is computed on a cosine similarity function output of the cosine similarity function to obtain the plurality of routing scores included in the gating function output vector. (Generally linking the use of the judicial exception to a particular technological environment or field of use (MPEP 2106.05(h))) Since the claim does not recite additional elements that either integrate the judicial exception into a practical application or provide significantly more than the judicial exception, the claim is not patent eligible.
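For illustration only, the gating pipeline recited in claims 6-8 (a linear layer, a cosine similarity function on the linear layer output, a SoftMax producing the routing scores, and selection of the k highest-scoring experts) can be sketched as follows. The helper names, dimensions, and the assumption that the linear layer output is nonzero are hypothetical, not the applicant's disclosed implementation:

```python
import math

def _cosine(a, b):
    # Cosine similarity between two vectors (assumes neither is all-zero).
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def gating_scores(token, weight_rows, expert_embeddings):
    # Linear layer: one dot product per output dimension (claim 7).
    h = [sum(w * x for w, x in zip(row, token)) for row in weight_rows]
    # Cosine similarity of the linear layer output against a per-expert
    # embedding, then SoftMax over the similarities to obtain the routing
    # scores of the gating function output vector (claim 8).
    sims = [_cosine(h, emb) for emb in expert_embeddings]
    exps = [math.exp(s) for s in sims]
    total = sum(exps)
    return [e / total for e in exps]

def top_k_experts(scores, k):
    # Claim 6: the k experts with the highest routing scores.
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
```

The SoftMax guarantees the routing scores form a probability distribution, so the top-k selection operates on comparable, normalized values.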
Regarding claim 9, it is dependent upon claim 1, and thereby incorporates the limitations of, and corresponding analysis applied to, claim 1. Further, claim 9 recites The computing system of claim 1, wherein the number k at the iteration is specified via a user input received at an MoE layer application-programming interface (API). (Generally linking the use of the judicial exception to a particular technological environment or field of use (MPEP 2106.05(h))) Since the claim does not recite additional elements that either integrate the judicial exception into a practical application or provide significantly more than the judicial exception, the claim is not patent eligible.
Regarding claim 10, it is dependent upon claim 1, and thereby incorporates the limitations of, and corresponding analysis applied to, claim 1. Further, claim 10 recites The computing system of claim 1, wherein: the MoE layer is included among a plurality of MoE layers in the MoE model; and during the iteration, the numbers k of expert sub-models selected as the one or more destination expert sub-models differ between the plurality of MoE layers. (Generally linking the use of the judicial exception to a particular technological environment or field of use (MPEP 2106.05(h))) Since the claim does not recite additional elements that either integrate the judicial exception into a practical application or provide significantly more than the judicial exception, the claim is not patent eligible.
Regarding claims 11-19, they recite limitations similar to those of claims 1-7 and claims 9-10 and are therefore rejected under a similar rationale.
Regarding claim 20:
Claim 20 is rejected under 35 U.S.C. 101 because the claimed invention is directed to an abstract idea without significantly more.
Regarding claim 20, in Step 1 of the 101 analysis set forth in MPEP 2106, the claim recites a computing system configured to execute an MoE model. A machine is one of the four statutory categories of invention.
In Step 2A Prong 1 of the 101 analysis set forth in MPEP 2106, the examiner has determined that the following limitations recite a process that, under the broadest reasonable interpretation, covers a mental process but for the recitation of generic computer components:
selecting, from among a plurality of expert sub-models of the MoE layer, one or more destination expert sub-models associated with the plurality of input tokens; (one can mentally select a group of things as a process of simply evaluating data and making a judgment on that data)
If claim limitations, under their broadest reasonable interpretation, cover performance of the limitations as a mental process but for the recitation of generic computer components, then they fall within the mental process grouping of abstract ideas. Accordingly, the claim “recites” an abstract idea.
In Step 2a Prong 2 of the 101 analysis set forth in MPEP 2106, the examiner has determined that
the following additional elements do not integrate this judicial exception into a practical application:
A computing system comprising: a plurality of processing devices configured to execute a Mixture-of-Experts (MoE) layer included in an MoE model at least in part by: in each of a plurality of iterations: at each of the plurality of processing devices: (Generally linking the use of the judicial exception to a particular technological environment or field of use (MPEP 2106.05(h)))
receiving a respective plurality of input tokens; (Adding insignificant extra-solution activity (mere data gathering) to the judicial exception (MPEP 2106.05(g)))
and conveying the plurality of input tokens to the one or more destination expert sub-models, wherein the expert capacity of the one or more destination expert sub-models is equal to a maximum among one or more respective numbers of the input tokens respectively received at the one or more destination expert sub-models during the iteration; (Adding insignificant extra-solution activity (mere data gathering) to the judicial exception (MPEP 2106.05(g)))
generating one or more respective expert sub-model outputs at the one or more destination expert sub-models based at least in part on the respective input tokens received at the one or more destination expert sub-models; generating an MoE layer output based at least in part on the one or more expert sub-model outputs; (In step 2A prong 2, generating a model is a mere application of a computer tool (M.L. Model), which is not indicative of integration into a practical application. In step 2B, merely applying a computer tool is not indicative of significantly more.)
and outputting the MoE layer output to an additional computing process. (Adding insignificant extra-solution activity (mere data gathering) to the judicial exception (MPEP 2106.05(g)))
Since the claim does not contain any other additional elements that are indicative of integration into a practical application, the claim is “directed” to an abstract idea.
In Step 2b of the 101 analysis set forth in the 2019 PEG, the examiner has determined that the claim does not include additional elements that are sufficient to amount to significantly more than the judicial exception.
As discussed above, additional element (ii) recites generally linking the use of the judicial exception to a particular technological environment or field of use; additional elements (iii), (iv), and (vi) recite adding insignificant extra-solution activity; and additional element (v) recites a mere application of a computer tool, which is not indicative of significantly more. Considering the additional elements individually and in combination, and the claim as a whole, the additional elements do not provide significantly more than the abstract idea. Therefore, the claim is not patent eligible.
Claim Rejections - 35 USC § 102
In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis (i.e., changing from AIA to pre-AIA) for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
The following is a quotation of the appropriate paragraphs of 35 U.S.C. 102 that form the basis for the rejections under this section made in this Office action:
A person shall be entitled to a patent unless –
(a)(1) the claimed invention was patented, described in a printed publication, or in public use, on sale, or otherwise available to the public before the effective filing date of the claimed invention.
Claim(s) 1, 6-7, 10, 11, 16-17, and 20 are rejected under 35 U.S.C. 102(a)(1) as being anticipated by HUANG (U.S. Pub. No. US 20220237435 A1).
Regarding claim 1, HUANG substantially teaches the claim, including:
A computing system comprising: a plurality of processing devices configured to execute a Mixture-of-Experts (MoE) layer included in an MoE model at least in part by: in each of a plurality of iterations: at each of the plurality of processing devices: receiving a respective plurality of input tokens; ([0005] The present technology concerns systems and methods for routing in mixture-of-expert models. In that regard, in some aspects of the technology, a transformer may have at least one MoE layer in each of its encoder and decoder, with the at least one MoE layer of the encoder having a learned gating function configured to route each token of a task to two or more selected expert FFNs, [0006] In one aspect, the disclosure describes a computer-implemented method of processing an input sequence in a transformer having an encoder and a decoder, the encoder and the decoder each having one or more mixture-of-experts sublayers, the method comprising: (a) generating, by one or more processors of a processing system, a first tokenized input sequence based on the input sequence, the first tokenized input sequence comprising a plurality of tokens; (it should be noted that the encoders and decoders operate iteratively, as the FFNs they employ are inherently iterative.)) selecting, from among a plurality of expert sub-models of the MoE layer, one or more destination expert sub-models associated with the plurality of input tokens, wherein respective numbers k differ across the plurality of iterations, k being the number of expert sub-models selected as the one or more destination expert sub-models; and conveying the plurality of input tokens to the one or more destination expert sub-models; ([0023] The MoE sublayer 212 comprises a learned gating function 214 and a set of E expert feed-forward networks 216a-216e (FFN.sub.1 through FFN.sub.E). E may be any suitable number such as 32, 128, etc. The learned gating function 214 is configured to process the output of the first normalization sublayer 210 (the plurality of input tokens are the output here), route it to two or more selected expert feed-forward networks (from amongst the set of expert feed-forward networks 216a-216e), [0036] The MoE sublayer 318 comprises a learned gating function 320 and a set of F expert feed-forward networks 322a-322f (FFN.sub.1 through FFN.sub.F). F may be any suitable number such as 32, 128, etc. The learned gating function 318 is configured to process the output of the second normalization sublayer 320 (as seen here, there are multiple iterations that expect the number of experts to differ, as shown by the separate naming of the variables. It should further be noted that this is a second sublayer going to a third, and HUANG teaches going up to five different sublayers, making it a plurality, where these steps are repeated)) generating one or more respective expert sub-model outputs at the one or more destination expert sub-models based at least in part on the respective input tokens received at the one or more destination expert sub-models; generating an MoE layer output based at least in part on the one or more expert sub-model outputs; and outputting the MoE layer output to an additional computing process. ([0023]… and combine the output of those two or more selected expert feed-forward networks to create a single vector to be output from the MoE sublayer 212. In that regard, in some examples, the learned gating function 214 may be configured to compute a vector identifying which expert feed-forward networks the output of the first normalization sublayer 210 should be routed to, and what weight should be accorded to each selected expert's output in order to create a final output for the MoE sublayer 212. [0029] FIG. 3 depicts an exemplary decoder architecture 300 for a transformer according to aspects of the present technology. In the example of FIG. 3, the inputs to the decoder are a combined encoder output vector 302 created by combining (e.g., stacking) all encoder outputs 238 of FIG. 2 for a given task, and a target sequence 304. (here, part of the encoder output is the MoE output, which is then used for ‘a given task’, i.e., an additional computing process))
Regarding claim 6, HUANG further teaches:
The computing system of claim 1, wherein the plurality of processing devices are further configured to select the one or more destination expert sub-models at least in part by identifying the one or more expert sub-models corresponding to the k highest routing scores included in a gating function output vector of a gating function. ([0023]… In that regard, in some examples, the learned gating function 214 may be configured to compute a vector identifying which expert feed-forward networks the output of the first normalization sublayer 210 should be routed to, and what weight should be accorded to each selected expert's output in order to create a final output for the MoE sublayer 212. (it should be noted that, in routing based on a vector and a weight, this is a standard way of routing in gating functions that use the highest ‘routing scores’ (weights) to decide the function routed to.))
Regarding claim 7, HUANG further teaches:
The computing system of claim 6, wherein the gating function includes a linear layer configured to receive the plurality of input tokens. ([0024] As the learned gating function 214's routing decisions are based on its training, it will determine which expert feed-forward networks to route to based on whatever criteria it has been trained to prioritize. (the layer receiving the routing is the input layer of an FFN, which is linear in the context of HUANG.))
Regarding claim 10, HUANG further teaches:
The computing system of claim 1, wherein: the MoE layer is included among a plurality of MoE layers in the MoE model; ([0006] In one aspect, the disclosure describes a computer-implemented method of processing an input sequence in a transformer having an encoder and a decoder, the encoder and the decoder each having one or more mixture-of-experts sublayers,) and during the iteration, the numbers k of expert sub-models selected as the one or more destination expert sub-models differ between the plurality of MoE layers. ([0022] Likewise, the output of the first normalization sublayer 210 is connected to the MoE sublayer 212, as well as to the second normalization sublayer 218 through another residual connection. The second normalization sublayer 218 concatenates the output of the first normalization sublayer 210 with the output of the MoE sublayer 212, and normalizes the resulting vector. [0023] The MoE sublayer 212 comprises a learned gating function 214 and a set of E expert feed-forward networks 216a-216e. [0035] Further, the output of the second normalization sublayer 316 is connected to the MoE sublayer 318, as well as to the third normalization sublayer 324 through yet another residual connection. The third normalization sublayer 324 concatenates the output of the second normalization sublayer 316 with the output of the MoE sublayer 318, and normalizes the resulting vector. [0036] The MoE sublayer 318 comprises a learned gating function 320 and a set of F expert feed-forward networks 322a-322f (as denoted here, it can be seen that the FFNs of each sublayer are different as they are labeled differently.))
Regarding claim 11, it recites limitations similar to those of claim 1 and is therefore rejected under a similar rationale. Regarding claims 16-17, they recite limitations similar to those of claims 6-7 and are therefore rejected under a similar rationale. Regarding claim 19, it recites limitations similar to those of claim 10 and is therefore rejected under a similar rationale.
Claim Rejections - 35 USC § 103
In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis (i.e., changing from AIA to pre-AIA ) for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.
Claims 2-5, 12-15, and claim 20 are rejected under 35 U.S.C. 103 as being unpatentable over HUANG (U.S. Pub. No. US 20220237435 A1) in view of SHAZEER (N.P.L. ‘OUTRAGEOUSLY LARGE NEURAL NETWORKS: THE SPARSELY-GATED MIXTURE-OF-EXPERTS LAYER’).
Regarding claim 2, while HUANG teaches claim 1, which claim 2 is dependent upon, it does not explicitly teach:
The computing system of claim 1, wherein: the plurality of processing devices are further configured to set an expert capacity shared by the one or more destination expert sub-models; and the expert capacity is a maximum number of input tokens configured to be processed at each of the one or more destination expert sub-models during an iteration of the plurality of iterations.
However, in analogous art that similarly uses an iterative MoE layer, SHAZEER teaches:
The computing system of claim 1, wherein: the plurality of processing devices are further configured to set an expert capacity shared by the one or more destination expert sub-models; and the expert capacity is a maximum number of input tokens configured to be processed at each of the one or more destination expert sub-models during an iteration of the plurality of iterations. ((Section 2, paragraph 1) The Mixture-of-Experts (MoE) layer consists of a set of n “expert networks" E1, · · · , En, and a “gating network" G whose output is a sparse n-dimensional vector. Figure 1 shows an overview
of the MoE module. The experts are themselves neural networks, each with their own parameters.
Although in principle we only require that the experts accept the same sized inputs (i.e. a capacity as the inputs can be no larger than this given number) and produce the same-sized outputs, in our initial investigations in this paper, we restrict ourselves to the case where the models are feed-forward networks with identical architectures, but with separate parameters. (further, given the context, these parameters are not input, but rules/limitations on the size of the input))
It would have been obvious to a person of ordinary skill in the art before the effective filing date of the invention to have combined SHAZEER's teaching of expert models with a set capacity with HUANG's teaching of a method of using MoE models, with a reasonable expectation of success, such that an expert capacity, as in SHAZEER, is set for each of the expert models, as found in HUANG. A person of ordinary skill would have been motivated to make this combination to improve model capacity (SHAZEER, abstract).
Regarding claim 3, SHAZEER further teaches:
The computing system of claim 2, wherein the plurality of processing devices are further configured to: compute the expert capacity based at least in part on a capacity factor of the MoE layer; and dynamically modify the capacity factor of the one or more destination expert sub-models over the plurality of iterations. ((section 3.1, paragraph 2) In a conventional distributed training setting, multiple copies of the model on different devices asynchronously process distinct batches of data, and parameters are synchronized through a set of parameter servers. In our technique, these different batches run synchronously so that they can be combined for the MoE layer. We distribute the standard layers of the model and the gating network according to conventional data-parallel schemes, but keep only one shared copy of each expert. Each expert in the MoE layer receives a combined batch consisting of the relevant examples from all of the data-parallel input batches. The same set of devices function as data-parallel replicas (for the standard layers and the gating networks) and as model-parallel shards (each hosting a subset of the experts). If the model is distributed over d devices, and each device processes a batch of size b, each expert receives a batch of approximately kbd/n examples. Thus, we achieve a factor of d improvement in expert batch size. (the number of devices increases the capacity and is hard set based on the number of devices in the MoE system, making it a hyperparameter.))
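As a worked check of the cited passage's expert batch-size estimate (with d devices each processing a batch of size b, k experts selected per example, and n experts total, each expert receives approximately kbd/n examples), the arithmetic can be sketched as follows; the function name is illustrative only:

```python
def examples_per_expert(k, b, d, n):
    # SHAZEER's estimate (section 3.1): k experts selected per example,
    # batch size b per device, d devices, n experts total.
    return k * b * d / n

# e.g. k=2 experts per example, batch size b=512, d=16 devices, n=64 experts
per_expert = examples_per_expert(2, 512, 16, 64)  # 256.0 examples per expert
```

Note that the per-expert batch grows linearly in d, which is the "factor of d improvement in expert batch size" the passage describes.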
SHAZEER further teaches:
The computing system of claim 3, wherein the plurality of processing devices are further configured to dynamically modify the capacity factor over the plurality of iterations at least in part by, during each of the iterations, setting the capacity factor to a maximum among one or more respective numbers of the input tokens respectively received at the one or more destination expert sub-models during the iteration. ((section 3.1, paragraph 2) Each expert in the MoE layer receives a combined batch consisting of the relevant examples from all of the data-parallel input batches. The same set of devices function as data-parallel replicas (for the standard layers and the gating networks) and as model-parallel shards (each hosting a subset of the experts). If the model is distributed over d devices, and each device processes a batch of size b, each expert receives a batch of approximately kbd/n examples. Thus, we achieve a factor of d improvement in expert batch size. (the maximum, in this case, would be the number of devices being used) (section 3.1, paragraph 3) In the case of a hierarchical MoE (Section B), the primary gating network employs data parallelism, and the secondary MoEs employ model parallelism. Each secondary MoE resides on one device. This technique allows us to increase the number of experts (and hence the number of parameters) by proportionally increasing the number of devices in the training cluster. The total batch size increases, keeping the batch size per expert constant. The memory and bandwidth requirements per device also remain constant, as do the step times, as does the amount of time necessary to process a number of training examples equal to the number of parameters in the model.)
SHAZEER further teaches:
The computing system of claim 3, wherein the plurality of processing devices are further configured to set a predefined upper bound on the capacity factor. ((appendix D, paragraph 1) For the hierarchical MoE layers, the first level branching factors are 32, 64, 128, 256 and 256, respectively. (these are upper bounds on the factors set at each layer))
Regarding claim 20, HUANG further teaches:
A computing system comprising: a plurality of processing devices configured to execute a Mixture-of-Experts (MoE) layer included in an MoE model at least in part by: In each of a plurality of iterations: at each of the plurality of processing devices: receiving a respective plurality of input tokens; ([0005] The present technology concerns systems and methods for routing in mixture-of-expert models. In that regard, in some aspects of the technology, a transformer may have at least one MoE layer in each of its encoder and decoder, with the at least one MoE layer of the encoder having a learned gating function configured to route each token of a task to two or more selected expert FFNs, [0006] In one aspect, the disclosure describes a computer-implemented method of processing an input sequence in a transformer having an encoder and a decoder, the encoder and the decoder each having one or more mixture-of-experts sublayers, the method comprising: (a) generating, by one or more processors of a processing system, a first tokenized input sequence based on the input sequence, the first tokenized input sequence comprising a plurality of tokens; (it should be noted that the encoders and decoders operate iteratively, as the feed-forward networks they employ inherently process inputs over a plurality of iterations.))
SHAZEER further teaches:
setting an expert capacity of the plurality of expert sub-models; ((Section 3.1, paragraph 2) The same set of devices function as data-parallel replicas (for the standard layers and the gating networks) and as model-parallel shards (each hosting a subset of the experts). If the model is distributed over d devices, and each device processes a batch of size b, each expert receives a batch of approximately kbd/n examples.))
HUANG further teaches:
selecting, from among a plurality of expert sub-models of the MoE layer, one or more destination expert sub-models associated with the plurality of input tokens; ([0023] The MoE sublayer 212 comprises a learned gating function 214 and a set of E expert feed-forward networks 216a-216e (FFN.sub.1 through FFN.sub.E). E may be any suitable number such as 32, 128, etc. The learned gating function 214 is configured to process the output of the first normalization sublayer 210 (the plurality of input tokens are the output here), route it to two or more selected expert feed-forward networks (from amongst the set of expert feed-forward networks 216a-216e), [0036] The MoE sublayer 318 comprises a learned gating function 320 and a set of F expert feed-forward networks 322a-322f (FFN.sub.1 through FFN.sub.F). F may be any suitable number such as 32, 128, etc. The learned gating function 318 is configured to process the output of the second normalization sublayer 320)
SHAZEER further teaches:
and conveying the plurality of input tokens to the one or more destination expert sub-models, wherein the expert capacity of the one or more destination expert sub-models is equal to a maximum among one or more respective numbers of the input tokens respectively received at the one or more destination expert sub-models during the iteration; ((Section 3.1, paragraph 3) In the case of a hierarchical MoE (Section B), the primary gating network employs data parallelism, and the secondary MoEs employ model parallelism. Each secondary MoE resides on one device. This technique allows us to increase the number of experts (and hence the number of parameters) by proportionally increasing the number of devices in the training cluster. The total batch size increases, keeping the batch size per expert constant. The memory and bandwidth requirements per device also remain constant, as do the step times, as does the amount of time necessary to process a number of training examples equal to the number of parameters in the model.)
HUANG further teaches:
generating one or more respective expert sub-model outputs at the one or more destination expert sub-models based at least in part on the respective input tokens received at the one or more destination expert sub-models; generating an MoE layer output based at least in part on the one or more expert sub-model outputs; and outputting the MoE layer output to an additional computing process ([0023]… and combine the output of those two or more selected expert feed-forward networks to create a single vector to be output from the MoE sublayer 212. In that regard, in some examples, the learned gating function 214 may be configured to compute a vector identifying which expert feed-forward networks the output of the first normalization sublayer 210 should be routed to, and what weight should be accorded to each selected expert's output in order to create a final output for the MoE sublayer 212. [0029] FIG. 3 depicts an exemplary decoder architecture 300 for a transformer according to aspects of the present technology. In the example of FIG. 3, the inputs to the decoder are a combined encoder output vector 302 created by combining (e.g., stacking) all encoder outputs 238 of FIG. 2 for a given task, and a target sequence 304. (here, part of the encoder output is the MoE output which is then used for ‘a given task’, i.e. an additional computing process))
Regarding claims 12-15, they comprise limitations similar to those of claims 2-5 and are therefore rejected for similar rationale.
Claim 8 is rejected under 35 U.S.C. 103 as being unpatentable over HUANG (U.S. Pub. No. US 20220237435 A1) in view of CHEUNG (U.S. Pub. No. US 20220237435 A1) in view of SHAZEER (N.P.L. ‘OUTRAGEOUSLY LARGE NEURAL NETWORKS: THE SPARSELY-GATED MIXTURE-OF-EXPERTS LAYER’)
While HUANG does teach claim 7, which claim 8 is dependent upon, it does not explicitly teach:
The computing system of claim 7, wherein the gating function further includes: a cosine similarity function configured to receive a linear layer output from the linear layer; and a SoftMax activation function that is computed on a cosine similarity function output of the cosine similarity function to obtain (the output vector).
However, in analogous art that similarly uses a plurality of networks, CHEUNG teaches:
The computing system of claim 7, wherein the gating function further includes: a cosine similarity function configured to receive a linear layer output from the linear layer; and a SoftMax activation function that is computed on a cosine similarity function output of the cosine similarity function (to obtain the output vector) ((column 9, lines 21-25)Further, the relevant data determination module 206 may compute a probability distribution over data in the data vectors 703 using the SoftMax function based on the cosine similarity computed for each entry in the data vector.)
It would have been obvious to a person skilled in the art before the effective filing date of the invention to have combined CHEUNG's softmax function with HUANG's teaching of obtaining routing scores, with a reasonable expectation of success, applying a softmax function that uses cosine similarity, as in CHEUNG, to a gating function, as found in HUANG. A person of ordinary skill would have been motivated to make this combination to better access and utilize memory (CHEUNG, column 1, lines 32-35).
While HUANG, as modified by CHEUNG, does teach using a softmax function with cosine similarity, it does not explicitly teach:
(using a softmax function) to obtain the plurality of routing scores included in the gating function output vector.
However, in analogous art that similarly uses MoE layers, SHAZEER teaches:
(using a softmax function) to obtain the plurality of routing scores included in the gating function output vector. (Softmax Gating: A simple choice of non-sparse gating function (Jordan & Jacobs, 1994) is to multiply the input by a trainable weight matrix Wg and then apply the Softmax function.)
It would have been obvious to a person skilled in the art before the effective filing date of the invention to have combined SHAZEER's softmax for routing with HUANG's, as modified by CHEUNG, teaching of obtaining routing scores and using a cosine similarity output with the softmax, with a reasonable expectation of success, applying a softmax function that obtains routing scores, as in SHAZEER, using a cosine similarity output, as found in HUANG as modified by CHEUNG. A person of ordinary skill would have been motivated to make this combination to improve model capacity (SHAZEER, abstract).
Claims 9 and 18 are rejected under 35 U.S.C. 103 as being unpatentable over HUANG (U.S. Pub. No. US 20220237435 A1) in view of PADALA (U.S. Pub. No. US 20190304596 A1)
Regarding claim 9, HUANG further teaches:
user input received at an MoE layer application-programming interface (API). ([0016] In all cases, the computing devices described herein may further include any other components normally used in connection with a computing device such as a user interface subsystem. (The computing device relays this input to the processing device used for MoE layers; this process would have to be carried out by an API, thus an API connected to an MoE processing device))
While HUANG does teach receiving user input to an MoE API, it does not explicitly teach:
The computing system of claim 1, wherein the number k at the iteration is specified via a user input
However, in an analogous art that similarly uses a plurality of models, PADALA teaches:
The computing system of claim 1, wherein the number k at the iteration is specified via a user input ([0092] Parameter array 850 can be used to select parameter options for the building and validation of models. Parameter option 852 allows a user to set the maximum number of models the system will construct and validate, for example 12.)
It would have been obvious to a person skilled in the art before the effective filing date of the invention to have combined PADALA's gathering of user input with HUANG's teaching of user input processed through an MoE API, with a reasonable expectation of success, user input specifying the number of models, as in PADALA, being brought into the MoE through an API, as found in HUANG. A person of ordinary skill would have been motivated to make this combination to better control the outcome (PADALA [0006]).
Regarding claim 18, it comprises limitations similar to those of claim 9 and is therefore rejected for similar rationale.
Response to Arguments
Applicant’s arguments filed 24-OCTOBER-2025 have been fully considered, but they are found to be non-persuasive.
With regards to the applicant’s remarks regarding the 101 rejection directed to an abstract idea, the applicant argues that the amendments to claim 1 overcome the rejection.
The Office action states, on p. 2, that the features "selecting, from among a plurality of expert sub-models of the MoE layer, one or more destination expert sub-models associated with the plurality of input tokens, wherein respective numbers k of expert sub-models selected as the one or more destination expert sub-models differ across the plurality of iterations" recites a mental process. However, Applicant respectfully submits that the above limitations are not directed to a mental process and could not practically be performed in the human mind. Since an artificial neural network such as the MoE model of claim 1 is significantly different in structure and activity from the human brain, the expert sub-model selection performed at the MoE layer would not be practical to perform in the human mind or by manual calculation. Claim 1 also recites that the MoE layer is executed across a plurality of processing devices, each of which performs the operations on the input tokens in each of the plurality of iterations. This execution across multiple processing devices provides an additional structural difference between the MoE layer and mental processes. Applicant therefore respectfully submits that claim 1 does not recite any judicial exceptions and is directed to eligible subject matter at Step 2A Prong 1 of the Alice/Mayo subject matter eligibility test.
With regards to this argument, while the MoE model does indeed have a structure separate from a human mind, that alone is not sufficient. The claim limitations reciting the MoE model and processing devices merely provide a field of use for the claimed methodology, and the limitation identified in the 101 rejection is not directed to the structure of the MoE model. Further, the fact that the methodology is repeated across multiple processing devices does not mean that a human mind cannot repeat a methodology across multiple different data sets. The field of use does not provide significant enough structure, by itself, to integrate the abstract idea into a practical application.
Applicant further respectfully submits that the MoE layer included in the MoE model is not a generic computer component or a general link to a particular technological environment or field of use. Claim 1 instead recites a machine learning architecture that differs from conventional architectures. The variable number of expert sub-models recited in claim 1 achieves the advantage of adjusting for changes to the workload of the MoE model, as disclosed, for example, in Para. [0053] of the subject application. Thus, any alleged judicial exceptions that may be recited in claim 1 are integrated into a practical application, and claim 1 is eligible at Step 2A Prong 2 of the Alice/Mayo subject matter eligibility test. As recognized in Ex parte Desjardins (Appeal 2024-000567, ARP Sept. 26, 2025), AI-related patent claims that are directed to improving how the machine learning model itself operates are not generic implementations but rather technological improvements to computer functionality. In Desjardins, the Appeals Review Panel vacated a § 101 rejection because the claims improved "how the machine-learning model operates." The Desjardins decision cautioned against equating any machine learning with an unpatentable "algorithm" and the remaining additional elements with "generic computer components" without adequate explanation.
With regards to this argument, while the specification may state this improvement, it must also be found in the claims themselves. The specification citing the improvement, by itself, does not make the improvement clear within the claims. To use the improvement as justification to integrate the abstract idea into a practical application, the improvement itself must be found within the claims. See MPEP 2106(a)(II): “To show that the involvement of a computer assists in improving the technology, the claims must recite the details regarding how a computer aids the method, the extent to which the computer aids the method, or the significance of a computer to the performance of the method. Merely adding generic computer components to perform the method is not sufficient. Thus, the claim must include more than mere instructions to perform the method on a generic component or machinery to qualify as an improvement to an existing technology.” Further, the improvement cannot be the abstract idea itself. Finally, while there may be similarities between this case and Desjardins, it should be remembered that each case is examined and prosecuted separately. The mere existence of similarities does not mean that the decision in Desjardins overcomes the 101 rejection in the applicant’s case.
Claim 20 omits the features of claim 1 related to the number k of expert sub-models and instead recites features related to setting the expert capacity of the plurality of expert sub-models. Applicant respectfully submits that claim 20, similarly to claim 1, does not recite any mental processes. Applicant also respectfully submits that claim 20 recites an architectural improvement to the MoE model and is therefore directed to an improvement in the functioning of the computing system. Thus, claim 20 is directed to eligible subject matter.
With regards to this argument, the Examiner acknowledges, and apologizes for, the mistake made in the lack of a separate 101 rejection analysis of claim 20 and has included a separate analysis of claim 20 in the above rejection. However, for similar reasons as the arguments above, claim 20 remains rejected under 35 U.S.C. 101.
With regards to the applicant’s remarks regarding the 102(a)(1) rejection in the non-final action, the applicant argues that the prior art does not teach the newly amended claims 1, 6-7, 10-11, 16, 17, 19, and 20. The examiner acknowledges this argument and has adjusted the prior art of HUANG to disclose the newly added limitations. Namely, paragraph 36 has been added to show that HUANG does have a differing number of models for each iteration and does have multiple iterations. Regarding the claim 20 argument, the examiner acknowledges and apologizes for the mistake in mapping and has made a separate 103 rejection for independent claim 20.
With regards to the applicant’s remarks regarding the 103 rejection in the non-final action, the applicant argues that the prior art does not teach claims 2-5 and 12-15.
Applicant further submits that the cited references do not disclose or suggest all features of claims 3 and 13. In the rejection of claims 3 and 13, the Office action cites an excerpt of Shazeer that refers to combining inputs into a concurrently processed batch (section 3.1, paragraph 5). However, this cited excerpt of Shazeer does not refer to a capacity factor of the MoE layer, nor does it disclose or suggest modifying the capacity factor between iterations. The Office interprets a number of unrolled time steps as an example of a capacity factor. However, this interpretation does not match the definition of the capacity factor provided in Para. [0060] of the subject application, where the capacity factor is a scaling factor used as a hyperparameter in the computation of the expert capacity. Thus, Applicant respectfully submits that Huang in view of Shazeer does not disclose or suggest the features of claims 3 and 13.
With regards to this argument, the Examiner acknowledges this argument and has adjusted the prior art mapping of SHAZEER to better represent the claimed subject matter. Specifically, SHAZEER has been adjusted to make a clearer mapping to the capacity factor, indicating that the capacity factor is not the number of unrolled time steps but rather the maximum number of devices in the network. As defined by the specification, that is a hyperparameter that determines the capacity of the MoE as a factor within the formula, therefore teaching the claim limitation in its entirety.
Applicant respectfully submits that the cited references do not disclose or suggest all features of claims 4, 5, 14, and 15. As discussed above, Huang in view of Shazeer does not disclose or suggest the capacity factor. Thus, Huang in view of Shazeer also does not disclose or suggest dynamically modifying the capacity factor over the plurality of iterations, as recited in claims 4 and 14, or setting an upper bound on the capacity factor, as recited in claims 5 and 15. Claim 20 also recites features related to setting the capacity factor. Since the cited references do not appear to disclose or suggest the capacity factor, Applicant respectfully submits that the cited combination of references does not disclose or suggest all features of claim 20.
With regards to this argument, the readjusted mapping has rendered the language of claim 3 disclosed by SHAZEER, and as such the capacity factor is sufficiently taught. Further, SHAZEER does in fact teach the capacity factor by changing the number of devices participating in a time step, and teaches an upper bound on the factor as it will always be bounded by the number of devices in the system.
The rejections of claims 9 and 18 cite Para. [0016] of Huang, which discloses a user interface instead of an API. The rejections of claims 9 and 18 further cite Padala, which is not related to MoE models. Para. [0092] of Padala, which is cited in the rejection of claim 9, instead refers to multiple machine learning models that are not disclosed as having an MoE configuration. Padala also does not disclose or suggest that the number of these models changes between iterations. Thus, Applicant respectfully submits that claims 9 and 18 are not obvious over Huang in view of Padala.
With regards to this argument, a 103 can be used when the intent of the prior art aligns with that of the subject application. Justification for this combination has already been disclosed in the above rejection, namely, that the aim to improve efficiency in a similar neural network exists within PADALA. Further, PADALA has no burden to teach MoE configurations nor does it have a burden to teach the models changing between iterations as that is not what it has been mapped to.
In the rejection of claims 10 and 19, the Office interprets the number k in the limitation "during the iteration, the numbers k of expert sub-models selected as the one or more destination expert sub-models differ between the plurality of MoE layers" as referring to an index number rather than to how many expert sub-models are used during an iteration. This interpretation conflicts with the use of the number k in claim 1, where k is used to refer to the number of selected sub-models instead of an index number. The specification and drawings also consistently use k to refer to how many expert sub-models are utilized. The cited references do not appear to disclose or suggest using different numbers of expert sub-models at different MoE layers of an MoE model during the same iteration. Thus, Applicant respectfully submits that claims 10 and 19 are not obvious over the currently cited references.
With regards to this argument, the Examiner acknowledges this argument and has adjusted the mapping of HUANG to make the Office's interpretation clearer. Namely, the Office is not interpreting k as an index and, as stated above, has shown HUANG to use separate variables to show the usage of multiple models of differing numbers.
Conclusion
THIS ACTION IS MADE FINAL. Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a).
A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action. In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any nonprovisional extension fee (37 CFR 1.17(a)) pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action. In no event, however, will the statutory period for reply expire later than SIX MONTHS from the mailing date of this final action.
Any inquiry concerning this communication or earlier communications from the examiner should be directed to SKIELER A KOWALIK whose telephone number is (571)272-1850. The examiner can normally be reached 8-5.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Mariela D Reyes can be reached at (571)270-1006. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.
/SKIELER ALEXANDER KOWALIK/Examiner, Art Unit 2142 /Mariela Reyes/Supervisory Patent Examiner, Art Unit 2142