Prosecution Insights
Last updated: April 19, 2026
Application No. 18/109,156

SELF-BALANCING MIXTURE OF EXPERTS

Office action: Non-Final OA (§102, §103)
Filed: Feb 13, 2023
Examiner: KAMRAN, MEHRAN
Art Unit: 2196
Tech Center: 2100 — Computer Architecture & Software
Assignee: Microsoft Technology Licensing, LLC
OA Round: 1 (Non-Final)

Outlook: Favorable
Grant probability: 90% (99% with interview)
Expected OA rounds: 1-2
Expected time to grant: 2y 10m

Examiner Intelligence

Career allow rate: 90% (434 granted / 484 resolved), above average (+34.7% vs TC avg)
Interview lift: +14.3% (moderate) on resolved cases with interview
Typical timeline: 2y 10m avg prosecution; 26 applications currently pending
Career history: 510 total applications across all art units

Statute-Specific Performance

§101: 8.8% (-31.2% vs TC avg)
§102: 9.9% (-30.1% vs TC avg)
§103: 58.2% (+18.2% vs TC avg)
§112: 13.2% (-26.8% vs TC avg)

TC average is an estimate; based on career data from 484 resolved cases.
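As a sanity check on the headline numbers above, the displayed percentages can be reproduced from the raw counts on this page. This is a minimal sketch assuming the allow rate is the simple ratio of granted to resolved cases and that the "+34.7% vs TC avg" delta is an absolute percentage-point difference; the variable names are illustrative, not part of any analytics API.

```python
# Reproduce the examiner stats shown above from the raw counts.
granted, resolved = 434, 484

career_allow_rate = granted / resolved       # ~0.897, displayed as 90%
# "+34.7% vs TC avg" read as an absolute delta implies a ~55% TC average.
implied_tc_average = career_allow_rate - 0.347

print(f"career allow rate: {career_allow_rate:.1%}")
print(f"implied TC average: {implied_tc_average:.1%}")
```

If the delta were instead relative, the implied Tech Center average would differ, so treat the 55% figure as an interpretation rather than a reported value.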

Office Action

Rejections under §102 and §103
Notice of Pre-AIA or AIA Status

The present application, filed on or after March 16, 2013, is being examined under the first-inventor-to-file provisions of the AIA.

DETAILED ACTION

Claims 1-20 are presented for examination.

Claim Objections

Claim 5 is objected to for the following reason: claim 5 recites "the method of claim 5"; the examiner believes this should be "the method of claim 1". Appropriate correction is requested.

Claim Rejections - 35 USC § 102

The following is a quotation of the appropriate paragraphs of 35 U.S.C. 102 that form the basis for the rejections under this section made in this Office action:

A person shall be entitled to a patent unless –

(a)(1) the claimed invention was patented, described in a printed publication, or in public use, on sale, or otherwise available to the public before the effective filing date of the claimed invention.

(a)(2) the claimed invention was described in a patent issued under section 151, or in an application for patent published or deemed published under section 122(b), in which the patent or application, as the case may be, names another inventor and was effectively filed before the effective filing date of the claimed invention.

Claims 1, 6, 8, 12-14, and 16-18 are rejected under 35 U.S.C. 102(a)(1) as anticipated by He et al., "FasterMoE: modeling and optimizing training of large-scale dynamic pre-trained models", in Proceedings of the 27th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, 120-134 (hereinafter He).

As per claim 12, He teaches a method for distributing experts of a machine learning model on a computing system, the method comprising:

accessing a computing system comprising a plurality of accelerators (He page 121: "FASTERMOE is tested on 2 different clusters with up to 64 GPUs."; page 128: "We evaluate FASTERMOE on 2 representative clusters. johnny is a cluster with 16 GPUs on 2 worker nodes.");

accessing a machine learning model comprising a plurality of experts distributed on the plurality of accelerators (He page 121: "To reduce training time, expert parallelism is introduced to train MoE models distributedly, where experts are partitioned onto different workers"; page 124: "Given that there are commonly multiple accelerators within a node, each being a worker, ...");

identifying a set of input tokens to be routed to the plurality of experts; and executing one or more computer-executable instructions configured to cause the computing system to apply the machine learning model to the set of input tokens based on the routing assignment (He page 121: "With increasing model size, experts are commonly distributed across different workers. A popular expert receives more tokens than others, which causes its resident worker to be heavily loaded while other workers may idle."; page 122: "When processing a token, only a few experts that best fit their domain are activated.");

identifying a real-time processing imbalance of the set of input tokens based on a current distribution of the plurality of experts on the plurality of accelerators (He page 123: "As shown in Figure 3, expert 0 receives 3 tokens, 3x more workload than expert 2. As a result, worker 2 idles for a long time before the next communication starts, not making full use of its available computational power. Given that training data naturally follow a skewed distribution, some experts are more likely to be selected more than others" and "Therefore, the first challenge presented by an MoE training system is to handle dynamic load imbalance caused by skew expert selection");

determining a new distribution of the plurality of experts on the plurality of accelerators which will result in an improved balanced processing of the set of input tokens by the plurality of accelerators, based on the routing assignment of the set of tokens to the plurality of experts as compared to the current distribution of the plurality of experts on the plurality of accelerators; and applying the new distribution of the plurality of experts on the plurality of accelerators (He page 121: "Dynamic shadowing is enabled to reduce the idling caused by imbalanced expert selection" and "Guided by our performance model, we invent a dynamic shadowing approach to reduce the impact of skew expert popularity"; page 125: "As shown in Figure 6, some experts are replicated on all workers, namely shadowed experts, so that their parameters, instead of their input tokens, are transferred over the network. Their related computation is performed locally, intuitively reducing the load of the worker that contains a popular expert"; page 126: "To address this challenge, we leverage our performance model to analyze whether an expert should be shadowed at runtime. We predict the end-to-end latency of a training iteration to check performance gain, and act accordingly").

As per claim 6, He teaches wherein applying the new distribution of the plurality of experts on the plurality of accelerators comprises duplicating an overloaded expert on a particular accelerator to perform parallel processing of input tokens by a set of duplicate experts (He page 125: "As shown in Figure 6, some experts are replicated on all workers, namely shadowed experts, so that their parameters, instead of their input tokens, are transferred over the network. Their related computation is performed locally, intuitively reducing the load of the worker that contains a popular expert"; page 126: "To address this challenge, we leverage our performance model to analyze whether an expert should be shadowed at runtime. We predict the end-to-end latency of a training iteration to check performance gain, and act accordingly").

As per claim 8, He teaches segmenting one or more experts into a plurality of shards, wherein applying the new distribution of the plurality of experts on the plurality of accelerators comprises relocating at least one shard from a first accelerator to a second accelerator and leaving at least one shard located on the first accelerator (He Fig. 6: Expert 1 has shadow 1 in both Worker 0 and Worker 2).

As per claim 13, He teaches wherein identifying a real-time processing imbalance of the set of tokens comprises identifying an accelerator that is determined to be overloaded (He Fig. 3: Expert 0 (overloaded)).

As per claim 14, He teaches wherein identifying a real-time processing imbalance of the set of tokens comprises identifying an underutilized accelerator (He page 121: "A popular expert receives more tokens than others, which causes its resident worker to be heavily loaded while other workers may idle").

As per claim 16, He teaches wherein identifying a real-time processing imbalance of the set of tokens is based on determining that a first accelerator is processing more tokens than a second accelerator (He page 121: "A popular expert receives more tokens than others, which causes its resident worker to be heavily loaded while other workers may idle" and "FASTERMOE is tested on 2 different clusters with up to 64 GPUs.").

As per claim 17, He teaches wherein applying the new distribution comprises relocating a particular expert from the first accelerator to the second accelerator (He page 124: "Given that there are commonly multiple accelerators within a node, each being a worker" [here we see the worker-to-accelerator correspondence]; Fig. 6 shows Expert 1 being replicated on worker 0 and worker 2).

As to claim 1, the only difference from claim 12 is the limitation "identifying a current distribution of the plurality of experts on the plurality of accelerators" in claim 1 versus "identifying a real-time processing imbalance of the set of input tokens based on a current distribution of the plurality of experts on the plurality of accelerators" in claim 12. He page 123 ("As shown in Figure 3, expert 0 receives 3 tokens, 3x more workload than expert 2. As a result, worker 2 idles for a long time before the next communication starts, not making full use of its available computational power. Given that training data naturally follow a skewed distribution, some experts are more likely to be selected more than others") clearly shows the current distribution as well as its imbalance.

As per claim 18, He teaches a method for distributing experts of a machine learning model on a computing system, the method comprising:

accessing a computing system comprising a plurality of accelerators (He page 121: "FASTERMOE is tested on 2 different clusters with up to 64 GPUs."; page 128: "We evaluate FASTERMOE on 2 representative clusters. johnny is a cluster with 16 GPUs on 2 worker nodes.");

accessing a machine learning model comprising a plurality of experts distributed on the plurality of accelerators (He page 121: "To reduce training time, expert parallelism is introduced to train MoE models distributedly, where experts are partitioned onto different workers"; page 124: "Given that there are commonly multiple accelerators within a node, each being a worker, ...");

identifying a historical processing record of input tokens by the machine learning model (He Fig. 4 and page 123: "We collect the expert selection of each token when training two real-world models of 16 experts to observe the actual popularity of different experts. Some iterations during the training are sampled and visualized in Figure 4.");

identifying a current distribution of the plurality of experts on the plurality of accelerators (He page 123: "As shown in Figure 3, expert 0 receives 3 tokens, 3x more workload than expert 2. As a result, worker 2 idles for a long time before the next communication starts, not making full use of its available computational power. Given that training data naturally follow a skewed distribution, some experts are more likely to be selected more than others" and "Therefore, the first challenge presented by an MoE training system is to handle dynamic load imbalance caused by skew expert selection");

determining a new distribution of the plurality of experts on the plurality of accelerators which will result in an improved balanced processing of the set of input tokens by the plurality of accelerators, based on the historical processing record of the set of tokens by the plurality of experts as compared to the current distribution of the plurality of experts on the plurality of accelerators; and applying the new distribution of the plurality of experts on the plurality of accelerators (He page 121: "Dynamic shadowing is enabled to reduce the idling caused by imbalanced expert selection" and "Guided by our performance model, we invent a dynamic shadowing approach to reduce the impact of skew expert popularity"; page 125: "As shown in Figure 6, some experts are replicated on all workers, namely shadowed experts, so that their parameters, instead of their input tokens, are transferred over the network. Their related computation is performed locally, intuitively reducing the load of the worker that contains a popular expert"; page 126: "To address this challenge, we leverage our performance model to analyze whether an expert should be shadowed at runtime. We predict the end-to-end latency of a training iteration to check performance gain, and act accordingly").

Claim Rejections - 35 USC § 103

The following is a quotation of 35 U.S.C. 103, which forms the basis for all obviousness rejections set forth in this Office action:

A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102 of this title, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

Claim 2 is rejected under 35 U.S.C. 103 as being unpatentable over He in view of Zhai (US 2025/0103922 A1).
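The dynamic-shadowing mechanism the examiner repeatedly quotes from He (count the tokens routed to each expert, then replicate a "popular" expert so its computation runs locally on every worker instead of shipping all of its tokens to one worker) can be illustrated with a minimal sketch. The function name and the simple threshold policy below are illustrative assumptions, not the claimed method and not He's actual performance-model-driven implementation.

```python
# Illustrative sketch (not He's implementation): flag experts whose token
# load is far above the mean, as candidates for replication ("shadowing").
from collections import Counter

def choose_shadowed_experts(token_routing, factor=2.0):
    """Return expert ids whose token count exceeds `factor` x the mean load.

    `token_routing` is a list mapping each token to its chosen expert id.
    The mean is taken over experts that received at least one token.
    """
    load = Counter(token_routing)                 # expert id -> token count
    mean_load = len(token_routing) / max(len(load), 1)
    return sorted(e for e, n in load.items() if n > factor * mean_load)

# Skewed routing: expert 0 is "popular" (cf. He Fig. 3, expert 0 overloaded).
routing = [0, 0, 0, 0, 0, 0, 1, 2, 3]
print(choose_shadowed_experts(routing))  # -> [0]
```

In He the decision is made per iteration by predicting end-to-end latency with a performance model rather than by a fixed threshold; the sketch only shows the load-counting step that precedes that decision.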
As per claim 2, He does not teach that applying the new distribution of the plurality of experts on the plurality of accelerators occurs prior to applying the machine learning model to the set of tokens. However, Zhai teaches this limitation (Zhai [0215]-[0218]: "In an optional embodiment, the apparatus further includes an iterative calculation module configured to: copy each of the shadow experts in the shadow expert set to obtain a shadow model, and send the shadow models of all the shadow experts to other servers in the mixture-of-experts model; calculate gradients of the experts and the shadow models by the shadow models and the experts on all the servers in the mixture-of-experts model based on the corresponding input data, and return the gradients of the shadow models to the servers of the respective shadow experts; and obtain the gradients of the shadow experts based on the gradients of all the received shadow models, obtain a comprehensive gradient based on the gradients of the shadow experts and other experts, and update all the experts based on the comprehensive gradient."). It would have been obvious to a person of ordinary skill in the art before the effective filing date of the claimed invention to combine Zhai with the system of He to apply the new distribution of the plurality of experts. One of ordinary skill in the art would have been motivated to incorporate Zhai into the system of He for the purpose of training a mixture-of-experts model (Zhai paragraph [0002]).

Claim 3 is rejected under 35 U.S.C. 103 as being unpatentable over He in view of Malaya (US 2019/0188577 A1).

As per claim 3, He does not teach determining a particular number of iterations to process between determining if the new distribution of the plurality of experts should be applied, and performing the particular number of iterations prior to applying the new distribution of the plurality of experts on the plurality of accelerators. However, Malaya teaches these limitations:

Malaya [0015]: "In some examples, in addition to simply executing different experts on different hardware devices, the orchestrator component varies one or more model characteristics or parameters. The model characteristics or parameters may change how a particular expert performs on a particular hardware device and may change the relative priority among a plurality of processing devices for a particular invocation of an expert. Some examples of model characteristics or parameters include batch size, number of processors over which the expert is parallelized, and model hyper-parameters, such as the number of hidden layers in a neural network or the number of training iterations. The purpose of varying model parameters is to identify desired model parameters for execution of the expert on a particular hardware device. For example, for the execution parameter of execution speed, an invocation of a particular expert for inference on a small input batch may complete faster on a CPU while an invocation of the same expert for inference on a large batch of inputs may complete faster on a GPU. Desired model parameters may differ for different execution parameters."

Malaya [0020]: "The priority data store 106 also stores model characteristics or parameters for different combinations of hardware devices and execution parameters. The stored model parameters indicate for which model characteristic or parameter values the expert should be executed on the associated hardware device and when optimized for the associated execution parameter. Model parameters include, without limitation, batch size, one or more processor types of the hardware device for executing the expert, number of processors to parallelize execution of the expert, and model hyper-parameters such as the number of hidden layers in a neural network or the number of training iterations. In one example, the priority data indicates that for an execution parameter of execution throughput and for a first input batch size, a first hardware device should be used to execute that expert. In another example, the priority data indicates that for an execution parameter of execution latency and for a second input batch size, a second hardware device should be used to execute that expert."

Malaya [0021]: "The batch size model parameter indicates the number of concurrent training examples being processed in parallel or the number of inputs to be processed during the invocation of the expert for prediction or inference. The number of hidden layers in a neural network indicates the number of sets of neurons that perform computations after the input layer and before the output (model prediction). The number of training iterations indicates the set of iterative steps used by the numerical solver, such as stochastic gradient descent."

It would have been obvious to a person of ordinary skill in the art before the effective filing date of the claimed invention to combine Malaya with the system of He to determine a particular number of iterations. One of ordinary skill in the art would have been motivated to incorporate Malaya into the system of He for the purpose of assigning experts to processing devices in an automated manner (Malaya paragraph [0020]).

Claims 4, 5, and 19 are rejected under 35 U.S.C. 103 as being unpatentable over He in view of Bernat (US 2020/0409748 A1).

As per claim 4, He does not teach that applying the new distribution of experts on the plurality of accelerators comprises relocating an expert from an accelerator that is determined to be overloaded to an underutilized accelerator. However, Bernat teaches this limitation (Bernat [0066]: "Referring now to FIG. 10, in some embodiments, the sled 400 may be embodied as an accelerator sled 1000. The accelerator sled 1000 is configured to perform specialized compute tasks, such as machine learning, encryption, hashing, or other computational-intensive task." [0090]: "To address this, the kernel analysis and decision logic unit 1617 may migrate one or more of the kernels to other accelerator devices in the system 1600 to reduce load on the currently executing accelerator devices, and in turn support compliance with specified policies."). It would have been obvious to a person of ordinary skill in the art before the effective filing date of the claimed invention to combine Bernat with the system of He to relocate an expert. One of ordinary skill in the art would have been motivated to incorporate Bernat into the system of He for the purpose of obtaining telemetry data indicative of resource usage and power consumption of the accelerator (Bernat paragraph [0089]).

As per claim 5, He teaches the method of claim 5, wherein determining a new distribution of the plurality of experts is performed in response to predicting that an overloaded accelerator will likely drop an input token directed to the overloaded accelerator in the current distribution, and such that identifying the expert to be relocated is based on determining that the expert is currently located on the overloaded accelerator (He Fig. 3: Expert 0 (overloaded); page 125: "As shown in Figure 6, some experts are replicated on all workers, namely shadowed experts, so that their parameters, instead of their input tokens, are transferred over the network. Their related computation is performed locally, intuitively reducing the load of the worker that contains a popular expert"; page 126: "To address this challenge, we leverage our performance model to analyze whether an expert should be shadowed at runtime. We predict the end-to-end latency of a training iteration to check performance gain, and act accordingly").

As to claim 19, it is rejected for the same reasons as claim 4.

Claim 7 is rejected under 35 U.S.C. 103 as being unpatentable over He in view of Cotman (US 2002/0186882 A1).

As per claim 7, He does not teach identifying one or more experts that will not be used to process input tokens based on the routing assignment of the set of tokens to the plurality of experts, and removing the one or more experts from one or more corresponding accelerators. However, Cotman teaches these limitations (Cotman [0084]: "Each Gaussian function, called an 'expert' in one embodiment of the invention, accounts for a subset of data points. After each observation of a new data point, the algorithm can add, if necessary, an expert to a mixture of experts, which generates the probability density function covering the set of data points given thus far. It also can delete an expert when the expert is found unnecessary after each observation."). It would have been obvious to a person of ordinary skill in the art before the effective filing date of the claimed invention to combine Cotman with the system of He to remove experts from accelerators. One of ordinary skill in the art would have been motivated to incorporate Cotman into the system of He for the purpose of implementing expert analysis of image data (Cotman paragraph [0002]).

Claims 9 and 10 are rejected under 35 U.S.C. 103 as being unpatentable over He in view of Koren (US 2019/0042538 A1).

As per claim 9, He does not teach wherein at least some accelerators of the plurality of accelerators have a greater memory capacity than other accelerators and at least some accelerators have a greater processing capability than other accelerators. However, Koren teaches these limitations (Koren [0046]-[0056], which disclose that the element determines whether the memory device is full, where the memory is row memory or column memory; and [0050]-[0056], which disclose increasing processing speed of an accelerator by functioning in a second mode). It would have been obvious to a person of ordinary skill in the art before the effective filing date of the claimed invention to combine Koren with the system of He to use accelerators with varying capabilities. One of ordinary skill in the art would have been motivated to incorporate Koren into the system of He for the purpose of handling different layers of a neural network by a processor (Koren paragraph [0003]).

As per claim 10, Koren teaches wherein the machine learning model comprises a plurality of dense layers and a plurality of sparse layers, each sparse layer further comprising one or more experts of the plurality of experts, and wherein the machine learning model is distributed on the plurality of accelerators such that the plurality of dense layers of the machine learning model is distributed on one or more accelerators having the greater processing capability and the plurality of sparse layers of the machine learning model is distributed on one or more accelerators having the greater memory capability. (Koren [0029]-[0033] disclose that when a dense layer is provided and the row data set and column data set have few zero elements, efficient calculations are provided, where the accelerator function speeds processing within dense layers. They also disclose that when the element determines a layer is sparse, the element has the accelerator operate in a second mode where indexes of compressed data are received and determined by the processing element; see also [0033]-[0043], which disclose that each processing element receives row data sets including row elements and column data sets including column elements, where each processing element also has a multiplier for multiplying the data sets and an accumulator for accumulating the data. The accelerator operates in both a first mode and a second mode, where the second mode is for calculating sparse layers containing zero values.)

Claim 11 is rejected under 35 U.S.C. 103 as being unpatentable over He in view of Chen ("TA-MoE: Topology-Aware Large Scale Mixture-of-Expert Training", 36th Conference on Neural Information Processing Systems (NeurIPS 2022); hereinafter Chen).

As per claim 11, He does not teach that the plurality of experts is distributed evenly across the plurality of accelerators in the new distribution. However, Chen teaches this limitation (Chen page 2: "From the perspective of model design, BASE Layer [12] and the work of expert choice routing [28] assigned an equal number of tokens to each expert by delicate designs of the gate"). It would have been obvious to a person of ordinary skill in the art before the effective filing date of the claimed invention to combine Chen with the system of He to achieve an even distribution of experts across the accelerators. One of ordinary skill in the art would have been motivated to incorporate Chen into the system of He for the purpose of improving the performance of MoE.
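The "equal number of tokens to each expert" idea the examiner cites from Chen (attributed there to BASE Layer and expert-choice routing) can be illustrated with a minimal sketch: cap each expert at an equal capacity and assign token-expert pairs greedily by affinity score. The function name and the greedy policy are illustrative assumptions, not the claimed method and not Chen's actual gate design.

```python
# Illustrative sketch of capacity-balanced routing: each expert receives at
# most num_tokens // num_experts tokens, assigned in order of affinity.
def balanced_assign(scores, num_experts):
    """scores[t][e] = affinity of token t for expert e; returns {token: expert}."""
    capacity = len(scores) // num_experts
    counts = [0] * num_experts
    assignment = {}
    # Consider (token, expert) pairs from highest affinity down.
    pairs = sorted(
        ((s, t, e) for t, row in enumerate(scores) for e, s in enumerate(row)),
        reverse=True,
    )
    for s, t, e in pairs:
        if t not in assignment and counts[e] < capacity:
            assignment[t] = e
            counts[e] += 1
    return assignment

# All four tokens prefer expert 0, but the capacity cap forces an even split.
scores = [[0.9, 0.1], [0.8, 0.2], [0.7, 0.3], [0.6, 0.4]]
a = balanced_assign(scores, num_experts=2)
# Each expert ends up with exactly 2 of the 4 tokens.
```

The point of the sketch is the contrast with He's approach: here balance is enforced at routing time by the gate, whereas He keeps the skewed routing and rebalances by replicating popular experts.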
Claim 15 is rejected under 35 U.S.C. 103 as being unpatentable over He et al. (He-2022.pdf: "FasterMoE: modeling and optimizing training of large-scale dynamic pre-trained models”. In Proceedings of the 27th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming. 120–134 (hereinafter He) in view of Schmitt (US 2023/0222017 A1). As per claim 15, He does not teach identifying a real-time processing imbalance of the set of tokens is based on determining that an accelerator has dropped an input token. However, Schmitt teaches identifying a real-time processing imbalance of the set of tokens is based on determining that an accelerator has dropped an input token. (Schmitt [0052] In at least one embodiment, profiling comprises collecting data concerning operation of said program. In at least one embodiment, a GPU comprises hardware and/or software-based capabilities for collecting such operational information, which can include performance-related metrics. In at least one embodiment, examples of said metrics may include, but are not limited to, GPU idle, GPU busy, L1/L2 cache hit rates, FP pipeline utilization, warp stall reason, vertex shader busy, geometry shader busy, stream out busy, pixel shader busy, stream out busy, vertex count, and so on. It will be appreciated that these examples are intended to be illustrative rather than limiting. [0187] In at least one embodiment, a parallel processor or GPGPU as described herein is communicatively coupled to host/processor cores to accelerate graphics operations, machine-learning operations, pattern analysis operations, and various general purpose GPU (GPGPU) functions). It would have been obvious to a person in the ordinary skill in the art before the filing date of the claimed invention to combine Schmitt with the system of He to identify a real-time processing imbalance. 
One having ordinary skill in the art would have been motivated to use Schmitt into the system of He for the purpose of improving obtaining information about the performance of code on a graphics processing unit can be improved. (Schmitt paragraph 02) Claim 20 is rejected under 35 U.S.C. 103 as being unpatentable over He et al. (He-2022.pdf: "FasterMoE: modeling and optimizing training of large-scale dynamic pre-trained models”. In Proceedings of the 27th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming. 120–134 (hereinafter He) in view of Trawczynski (US 2016/0180487 A1). As per claim 20, He does not teach based on the historical processing record, identifying an accelerator that is determined to be overloaded; and interchanging a first expert from the overloaded accelerator with a second expert from an underutilized accelerator to balance a future processing of a set of input tokens by the plurality of experts on the plurality of accelerators. However, Trawczynski teaches based on the historical processing record, identifying an accelerator that is determined to be overloaded; and interchanging a first expert from the overloaded accelerator with a second expert from an underutilized accelerator to balance a future processing of a set of input tokens by the plurality of experts on the plurality of accelerators. (Trawczynski [0014] As used herein, the term “processing load” refers to an amount of work done by a GPU for a given amount of time wherein as the GPU does more work in the given amount of time, the processing load increases. In some embodiments, the processing load includes at least two components: a current processing load and an expected future processing load. The current processing load refers to the processing load the GPU is currently experiencing when the current processing load is measured, or the processing, load the GPU has experienced in the relatively recent past. 
In some embodiments, the current processing load is identified based on the amount of activity at one or more individual modules of the GPU, such as based on the percentage of idle cycles, over a given amount of time, in an arithmetic logic unit (ALU) or a texture mapping unit (TMU) of the GPU. The expected future processing load refers to the processing load the GPU is expected to experience in the relatively near future. In some embodiments, the expected future processing load is identified based on a number of threads (also referred to as wavefronts), scheduled for execution at the GPU) It would have been obvious to a person in the ordinary skill in the art before the filing date of the claimed invention to combine Trawczynski with the system of He to move experts between accelerators. One having ordinary skill in the art would have been motivated to use Trawczynski into the system of He for the purpose of load balancing a GPU. (Trawczynski paragraph 12) Conclusion The prior art made of record and not relied upon is considered pertinent to applicant's disclosure. US 20240249464 A1 – discloses a multi-factor prediction of computing resources for algorithm execution. For example, a method comprises obtaining a set of factors associated with an algorithm configured to transform one or more two-dimensional images into one or more three-dimensional models. The method further comprises computing an estimated computing power value based on the set of factors. The method then comprises scheduling execution of the algorithm on one or more computing resources based on the estimated computing power value. US 20230376697 A1 – discloses dialogue response prediction can leverage a plurality of machine-learned language models to generate a plurality of candidate outputs, which can be processed by a dialogue management model to determine a predicted dialogue response. 
The plurality of machine-learned language models can include a plurality of experts trained on different intents, emotions, and/or tasks. The particular candidate output may be selected by the dialogue management model based on semantics determined from a language representation. The language representation can be a representation generated by processing the conversation history of a conversation to determine conversation semantics.

US 20230236902 A1 – discloses methods for dynamic GPU-enabled VM provisioning across cloud service providers. An example method can include providing a VM pool that includes a GPU-optimized VM and a non-GPU-optimized VM operating in different clouds. A control plane can receive an indication that a user has submitted a machine-learning workload request, determine whether a GPU-optimized VM is available, and instruct the non-GPU-optimized VM to send the workload to the GPU-optimized VM in a peer-to-peer manner. The GPU-optimized VM computes the workload and returns a result to the requesting VM. The control plane can instantiate a new GPU-optimized VM (or terminate it when the workload is complete) to dynamically maintain a desired number of available GPU-optimized VMs.

US 20220344049 A1 – discloses a decentralized training platform for training an Artificial Intelligence (AI) model where training data (e.g., medical images) is distributed across multiple sites (nodes) and, due to confidentiality, legal, or other reasons, the data at each site cannot be shared or leave the site and so cannot be copied to a central location for training. The method comprises training a teacher model locally at each node, then moving each of the teacher models to a central node and using these to train a student model on a transfer dataset. This may be facilitated by setting up the cloud service using inter-region peering connections between the nodes to make the nodes appear as a single cluster.
In one variation, the student model may be trained at each node using the multiple trained teacher models. In another variation, multiple student models are trained, where each student model is trained by each teacher model at the node where that teacher model was trained; once the plurality of student models are trained, an ensemble model is generated from the plurality of trained student models. Loss function weighting and node under-sampling to enable load balancing may be used to improve accuracy and time/cost efficiency.

Any inquiry concerning this communication or earlier communications from the examiner should be directed to MEHRAN KAMRAN, whose telephone number is (571) 272-3401. The examiner can normally be reached 9-5. If attempts to reach the examiner by telephone are unsuccessful, the examiner's supervisor, April Blair, can be reached at (571) 270-1014. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.

Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.

/MEHRAN KAMRAN/
Primary Examiner, Art Unit 2196

Prosecution Timeline

Feb 13, 2023
Application Filed
Dec 16, 2025
Non-Final Rejection — §102, §103
Apr 15, 2026
Applicant Interview (Telephonic)
Apr 15, 2026
Examiner Interview Summary

Precedent Cases

Applications granted by this same examiner with similar technology

Patent 12591444
Hardware Virtual Machine for Controlling Access to Physical Memory Space
2y 5m to grant Granted Mar 31, 2026
Patent 12585486
SYSTEMS AND METHODS FOR DEPLOYING A CONTAINERIZED NETWORK FUNCTION (CNF) BASED ON INFORMATION REGARDING THE CNF
2y 5m to grant Granted Mar 24, 2026
Patent 12585497
AMBIENT COOPERATIVE CANCELLATION WITH GREEN THREADS
2y 5m to grant Granted Mar 24, 2026
Patent 12572394
METHODS, SYSTEMS AND APPARATUS TO DYNAMICALLY FACILITATE BOUNDARYLESS, HIGH AVAILABILITY SYSTEM MANAGEMENT
2y 5m to grant Granted Mar 10, 2026
Patent 12561158
DEPLOYMENT OF A VIRTUALIZED SERVICE ON A CLOUD INFRASTRUCTURE BASED ON INTEROPERABILITY REQUIREMENTS BETWEEN SERVICE FUNCTIONS
2y 5m to grant Granted Feb 24, 2026
Study what changed to get past this examiner. Based on 5 most recent grants.


Prosecution Projections

1-2
Expected OA Rounds
90%
Grant Probability
99%
With Interview (+14.3%)
2y 10m
Median Time to Grant
Low
PTA Risk
Based on 484 resolved cases by this examiner. Grant probability derived from career allow rate.
