Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Specification
The disclosure is objected to because of the following informalities: (1) several paragraphs recite “per-patch latency,” which appears to be a typo for “per-batch latency”; and (2) there is no brief description of Figure 11.
Appropriate correction is required.
Claim Objections
Claims 1, 8, and 15 are objected to because of the following informalities: they recite “per-patch latency,” which appears to be a typo for “per-batch latency.” Appropriate correction is required.
Claim Rejections - 35 USC § 112
The following is a quotation of 35 U.S.C. 112(b):
(b) CONCLUSION.—The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the inventor or a joint inventor regards as the invention.
The following is a quotation of 35 U.S.C. 112 (pre-AIA), second paragraph:
The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the applicant regards as his invention.
Claims 1-20 are rejected under 35 U.S.C. 112(b) or 35 U.S.C. 112 (pre-AIA), second paragraph, as being indefinite for failing to particularly point out and distinctly claim the subject matter which the inventor or a joint inventor (or for applications subject to pre-AIA 35 U.S.C. 112, the applicant), regards as the invention.
Claim 1 recites the limitation “the thread optimizer system” in lines 10-11. There is insufficient antecedent basis for this limitation in the claim. For the purpose of examination, this element has been interpreted as “the thread optimization system.” Claims 2-20 depend on claim 1 or recite commensurate subject matter; therefore, they are rejected for the same reason.
Claim Rejections - 35 USC § 103
In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis (i.e., changing from AIA to pre-AIA) for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.
The factual inquiries for establishing a background for determining obviousness under 35 U.S.C. 103 are summarized as follows:
1. Determining the scope and contents of the prior art.
2. Ascertaining the differences between the prior art and the claims at issue.
3. Resolving the level of ordinary skill in the pertinent art.
4. Considering objective evidence present in the application indicating obviousness or nonobviousness.
Claims 1, 3-5, 7, 8, 10-12, 14, 15, and 17-19 are rejected under 35 U.S.C. 103 as being unpatentable over Lee (US 11,687,771) in view of Cui (US 2020/0042362).
Regarding claim 1, Lee teaches: A thread optimization system for a model serving system, the thread optimization system comprising: a processor; and a memory in communication with the processor, the memory comprising executable instructions that, when executed by the processor alone or in combination with other processors, cause the thread optimization system to perform functions of:
estimating a batch size for (col. 7:52-56, “At 103, the analyzer 100 generates a baseline resource allocation state. A window, or grouping, size may additionally be generated to reduce the complexity of performing a causal resource profile for different allocations of computing resources, as described in connection with FIG. 5”), the model serving system serving a deep learning model and including a plurality of threads corresponding to a total thread count for processing inferences for the deep learning model (col. 8:67 and col. 9:1-2, “For this example, the kernel 201 may have 12 thread blocks that execute on GPU cores 1-4 that are distributed on cores 1-3 as depicted in FIG. 2A”);
automatically determining, using an optimizer component of the thread optimizer system (col. 7:13-17, “FIG. 1 depicts a functional block diagram of a multipath neural network analyzer 100 that optimizes allocation or partitioning of computing resources to the different layers in different paths of a multipath neural network according to the subject matter disclosed herein”), a first optimal configuration (col. 7:25-30, “The analyzer 100 may analyze the different layers in the different paths of the multipath neural network 101 using causal resource profiling and determine an optimal allocation or partitioning of computing resources so that the multipath neural network 101 executes, or runs, in a minimal amount of time”) that defines a number of inference instances (col. 8:65-67, “the modified kernel-launching function is used to set the min/max core arguments for kernel 201 to be ⅔”), [and] a number of threads per inference instance (col. 8:67 and col. 9:1-2, “the kernel 201 may have 12 thread blocks that execute on GPU cores 1-4 that are distributed on cores 1-3 as depicted in FIG. 2A”), (col. 10:56-59, “the convolution layer may take a little longer to execute based on the reduced memory allocation, but the overall execution time for both paths may be significantly reduced”), the first optimal configuration being determined with reference to a plurality of predetermined model profiles that define single-inference average batch latencies for different combinations of thread counts and batch sizes (col. 7:52-56, “At 103, the analyzer 100 generates a baseline resource allocation state. A window, or grouping, size may additionally be generated to reduce the complexity of performing a causal resource profile for different allocations of computing resources, as described in connection with FIG. 5”), the predetermined model profiles being used as input to a dynamic programming algorithm that identifies optimal configurations that minimize the average per-batch latency (col. 6:59-62, “resource utilization of individual layers may be updated or reconfigured if a performance analysis indicates a performance improvement based on a particular computing resource allocation”); and
allocating compute resources based on the number of inference instances, the number of threads per inference instance, (col. 8:6-14, “At 109, as better computing-resource allocations are determined for the different layers and/or paths and better scheduling allocation is determined for the neural network 101, the analyzer 100 updates the baseline resource allocation of layers at 103 with the better allocation and the better scheduling allocation will be used when the neural network 101 runs”).
Lee does not teach; however, Cui discloses: estimating a batch size for inference requests (¶ 3, “the data includes training data to build and optimize deep learning models, as well as model parameters of deep learning models which are utilized for inference processing”) of the model serving system (¶ 18, “FIG. 1 schematically illustrates a computing system comprising a deep learning computing platform, which implements a self-adaptive batch dataset partitioning control method to optimize load balancing among a set of accelerator resources for a distributed deep learning model training task, according to an embodiment of the invention”); automatically determining a sub-batch size per inference instance for processing a batch of inference requests of the batch size (¶ 27, “Each iteration of the backpropagation process is performed on a mini-batch of data, wherein a mini-batch of data comprises a subset (or portion) of a total dataset of model training data”); and allocating compute resources based on the sub-batch size per inference instance (¶ 47, “The accelerators 320-1, 320-2, 320-3, and 320-4 execute a model training task by processing the respective sub-batch datasets 390-1, 390-2, 390-3, and 390-4 using the model computation graph 375”).
It would have been obvious to a person having ordinary skill in the art, before the effective filing date of the claimed invention, to have applied the known technique of estimating a batch size for inference requests of the model serving system and automatically determining a sub-batch size per inference instance for processing a batch of inference requests of the batch size, as taught by Cui, in the same way to the determined first optimal configuration, as taught by Lee. Both inventions are in the field of resource allocation for deep learning tasks, and combining them would have predictably resulted in “accelerated data processing in a high-performance computing environment,” as indicated by Cui (¶ 1).
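For illustration only, the following Python sketch shows the kind of configuration search recited in claim 1, i.e., selecting a number of inference instances, a number of threads per inference instance, and a sub-batch size from predetermined model profiles under a total thread budget so as to minimize per-batch latency. The profile values, function name, and search strategy are hypothetical assumptions and are not drawn from Lee or Cui.

# Illustrative sketch only (not from Lee or Cui): choose a configuration
# (instances, threads per instance, sub-batch size) that minimizes the
# profiled per-batch latency, given a table of single-inference latencies.

# Hypothetical profile table: (threads_per_instance, batch_size) -> average latency (ms)
PROFILES = {
    (1, 4): 40.0, (2, 4): 24.0, (4, 4): 15.0,
    (1, 8): 70.0, (2, 8): 42.0, (4, 8): 26.0,
    (1, 16): 130.0, (2, 16): 80.0, (4, 16): 48.0,
}

def best_configuration(batch_size: int, total_threads: int):
    """Exhaustively search (instances, threads per instance, sub-batch size) combinations."""
    best = None
    for threads, profiled_batch in PROFILES:
        instances = total_threads // threads
        if instances == 0:
            continue
        # Sub-batch each instance would process if the batch is split evenly.
        sub_batch = -(-batch_size // instances)  # ceiling division
        if profiled_batch != sub_batch:
            continue  # consider only configurations covered by the profile table
        latency = PROFILES[(threads, profiled_batch)]
        if best is None or latency < best[3]:
            best = (instances, threads, sub_batch, latency)
    return best

# Example: a batch of 16 requests and a total budget of 8 threads.
print(best_configuration(batch_size=16, total_threads=8))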
Regarding claim 3, Cui teaches: The thread optimization system of claim 1, wherein the functions further comprise: dispatching inference requests (¶ 23, “The deep learning computing platform 50 comprises a software platform to support deep learning tasks such as DL model training and inference processing (or classification), which are executed on the distributed computing system 100”) to the inference instances based on the optimal configuration using a dispatching component of the thread optimizing system (¶ 32, “[t]he self-adaptive batch dataset partitioning control module 53 implements an iterative batch size tuning process which is configured to determine an optimal job partition ratio for partitioning mini-batch datasets into sub-batch datasets for processing by a set of hybrid accelerator resources during a data-parallel DL model training process”).
Regarding claim 4, Cui teaches: The thread optimization system of claim 1, wherein the functions further comprise: triggering a configuration change when a different batch size for inference requests is detected (¶ 17, “for each iteration of a backpropagation process, a mini-batch of data samples is partitioned and evenly distributed to a plurality of worker nodes”), the configuration change including: automatically determining, using the optimizer component of the thread optimizer system, a second optimal configuration that defines a number of inference instances, a number of threads per inference instance, and a sub-batch size per inference instance for processing a batch of inference requests of the different batch size using intra-operator parallelism (¶ 13, “implementing an iterative batch size tuning process which is configured to determine an optimal job partition ratio for partitioning mini-batch datasets into sub-batch datasets for processing by a set of hybrid accelerator resources, wherein the sub-batch datasets are partitioned into optimal batch sizes for processing by respective accelerator resources to minimize a time for completing the deep learning model training process”); and allocating the compute resources based on the number of inference instances, the number of threads per inference instance, and the sub-batch size per inference instance indicated by the second optimal configuration (¶ 32, “The sub-batch datasets are partitioned into optimal batch sizes for processing by respective accelerator resources to minimize a time for completing the deep learning model training process”).
Regarding claim 5, Lee teaches: The thread optimization system of claim 4, wherein the second optimal configuration indicates a change in the number of threads per inference instance, and wherein the functions further comprise: activating a first set of inference instances as active instances for the first optimal configuration (col. 2:50-54, “The first kernel is allocated a minimum number of cores of the compute unit and maximum number of cores of the compute unit in which the minimum number of cores of the compute unit is allocated”), and activating a second set of inference instances as passive instances for the first optimal configuration (col. 7:9-12, “For dynamic allocations, resources may be assigned during run-time to provide flexibility based on network changes, such as changes in topology, inputs, batch size, etc”); in response to triggering the configuration change, configuring the second set of inference instances based on the second optimal configuration and swapping the second set of inference instances with the first set of inference instances such that the second set of inference instances are the active instances and the first set of inference instances are the passive instances (col. 7:42-48, “the analyzer 100 may dynamically update allocation of computing resources for one or more layers of one or more paths of the multipath neural network during execution of the multipath neural network to provide flexibility based on network changes, such as changes in topology, inputs, batch size, etc”).
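For illustration only, the following Python sketch shows an active/passive instance swap of the kind addressed in claim 5, in which a passive set of instances is reconfigured for a second optimal configuration and then promoted to active. The class names and configuration fields are hypothetical assumptions and are not drawn from Lee or Cui.

# Illustrative sketch only (not from Lee or Cui): reconfigure the passive
# instance set and swap it with the active set on a configuration change.
class InstancePool:
    def __init__(self, config):
        self.config = config  # e.g., {"instances": 4, "threads": 2, "sub_batch": 4}

    def reconfigure(self, config):
        self.config = config

class ThreadOptimizer:
    def __init__(self, first_config):
        self.active = InstancePool(first_config)   # serves traffic
        self.passive = InstancePool(first_config)  # standby set

    def apply_configuration_change(self, new_config):
        # Configure the passive set first, then promote it to active.
        self.passive.reconfigure(new_config)
        self.active, self.passive = self.passive, self.active

opt = ThreadOptimizer({"instances": 4, "threads": 2, "sub_batch": 4})
opt.apply_configuration_change({"instances": 2, "threads": 4, "sub_batch": 8})
print(opt.active.config)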
Regarding claim 7, Cui teaches: The thread optimization system of claim 1, wherein the batch size is estimated using a batch-size estimator of the thread optimization system (¶ 13, “an iterative batch size tuning process which is configured to determine an optimal job partition ratio for partitioning mini-batch datasets into sub-batch datasets for processing by a set of hybrid accelerator resources, wherein the sub-batch datasets are partitioned into optimal batch sizes for processing by respective accelerator resources to minimize a time for completing the deep learning model training process”), the batch-size estimator being integrated into a base serving system of the model serving system and configured to intercept incoming inference requests and estimate the batch size based on the intercepted inference requests (¶ 68, “The service controller 640 comprises a computing resource scheduling and provisioning module 642, a request queue 644, and a deep learning system 646 (which supports DLaaS)”).
Claims 8, 10-12, 14, 15, and 17-19 recite subject matter commensurate with claims 1, 3-5, and 7; therefore, they are rejected for the same reasons.
Claims 2, 9, and 16 are rejected under 35 U.S.C. 103 as being unpatentable over Lee and Cui, as applied above, and further in view of Zhang (US 10,698,693).
Regarding claim 2, Lee teaches: the dynamic programming algorithm is a 2-dimensional (col. 11:22-35, “The goal of the definition is to find the parallel scheduling and implementation of all layers that minimize the overall execution time of the multipath neural network, as in Eq. (1): for ∀t ∈ ℝ, ∀r ∈ Resources, Σ_(l=1)^(K) w_(r,l)·α_(l,t) < M_r, (1) in which K is a total number of layers, w_(r,l) is utilization of resource r by layer l, M_r is maximum available resource r available for execution of the multipath neural network, and α_(l,t) is 1 if layer l is executing at time t, and 0 if not”), wherein the predetermined model profiles correspond to fill items for the (col. 8:65-67 and col. 9:1-2, “Initially, the modified kernel-launching function is used to set the min/max core arguments for kernel 201 to be ⅔. For this example, the kernel 201 may have 12 thread blocks that execute on GPU cores 1-4 that are distributed on cores 1-3 as depicted in FIG. 2A.”).
Lee and Cui do not teach; however, Zhang discloses: the dynamic programming algorithm is a 2-dimensional knapsack problem (col. 3:35-37, “These technologies generally involve solving large-scale knapsack problems (KPs) in a scalable distributed paradigm via synchronous coordinate descent (SCD)”), and wherein a goal of the knapsack problem is to find a set of fill items that minimizes the average batch latency across model instances while keeping a total weight of the fill items to a size of the knapsack (col. 14:23-27, “For the special case of K=1, it can be shown that Algorithm 4 will converge to a solution with a total reward that is at most max.sub.i,jp.sub.i,j less than that of the optimal solution, since the algorithm essentially produces a rounded integer solution after solving a fractional knapsack problem”).
It would have been obvious to a person having ordinary skill in the art, before the effective filing date of the claimed invention, to have applied the known technique of the dynamic programming algorithm being a 2-dimensional knapsack problem, wherein a goal of the knapsack problem is to find a set of fill items that minimizes the average batch latency across model instances while keeping a total weight of the fill items to a size of the knapsack, as taught by Zhang, in the same way to the dynamic programming algorithm, as taught by Lee and Cui. Both inventions are in the field of optimizing resource allocation, and combining them would have predictably resulted in “solving a knapsack problem (KP) subject to multiple global constraints and local constraints,” as indicated by Zhang (col. 1:55-57).
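For illustration only, the following Python sketch shows a small 2-dimensional knapsack solved by dynamic programming, where each fill item is a profiled instance configuration with a thread cost (one knapsack dimension) and a sub-batch capacity (the other dimension). The item values are hypothetical, and the summed-latency objective is a simplified stand-in for the claimed average-batch-latency objective; nothing here is drawn from Lee, Cui, or Zhang.

# Illustrative sketch only: a two-constraint knapsack-style dynamic program.
INF = float("inf")

# Hypothetical fill items: (threads_needed, sub_batch_covered, profiled_latency_ms)
ITEMS = [(1, 2, 35.0), (2, 4, 24.0), (4, 8, 26.0)]

def solve(thread_budget: int, batch_size: int) -> float:
    # dp[t][b]: minimal summed latency using at most t threads to cover at least b requests.
    dp = [[INF] * (batch_size + 1) for _ in range(thread_budget + 1)]
    for t in range(thread_budget + 1):
        dp[t][0] = 0.0  # covering nothing costs nothing
    for threads, capacity, latency in ITEMS:
        for t in range(threads, thread_budget + 1):
            for b in range(1, batch_size + 1):
                remaining = max(0, b - capacity)
                if dp[t - threads][remaining] + latency < dp[t][b]:
                    dp[t][b] = dp[t - threads][remaining] + latency
    return dp[thread_budget][batch_size]

# Example: cover a batch of 16 requests with at most 8 threads.
print(solve(thread_budget=8, batch_size=16))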
Claims 9 and 16 recite subject matter commensurate with claim 2; therefore, they are rejected for the same reasons.
Claims 6, 13, and 20 are rejected under 35 U.S.C. 103 as being unpatentable over Lee and Cui, as applied above, and further in view of Hu (US 11,599,798).
Regarding claim 6, Lee and Cui do not teach; however, Hu discloses: generating the predetermined model profiles using a profiler component of the thread optimization system (col. 8:5-7, “profiling needs the sub-batch size to obtain accurate measurements, profiling and sub-batch size selection are performed iteratively until the sub-batch size converges”), the predetermined model profiles being generated using only powers of 2 for the batch sizes of the predetermined model profiles (col. 10:62-64, “In practice, we find that adding a regularization step which tunes b to be a power of 2 can usually improve the GPU performance”).
It would have been obvious to a person having ordinary skill in the art, before the effective filing date of the claimed invention, to have applied the known technique of generating the predetermined model profiles using a profiler component, the predetermined model profiles being generated using only powers of 2 for the batch sizes, as taught by Hu, in the same way to the predetermined model profiles, as taught by Lee and Cui. Both inventions are in the field of optimizing resource allocation, and combining them would have predictably resulted in a method that “significantly improve[s] the accuracy of NN results,” as indicated by Hu (col. 1:48-49).
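For illustration only, the following Python sketch builds a profile table limited to power-of-2 batch sizes for each candidate thread count, in line with the limitation addressed in claim 6. The timing helper, function names, and values are hypothetical assumptions and are not drawn from Hu.

# Illustrative sketch only (not from Hu): profile only power-of-2 batch sizes.
import time

def profile_model(run_inference, thread_counts, max_batch=64, repeats=3):
    """Return {(threads, batch_size): average latency in seconds} for power-of-2 batch sizes."""
    profiles = {}
    batch_sizes = [2 ** i for i in range(max_batch.bit_length()) if 2 ** i <= max_batch]
    for threads in thread_counts:
        for batch in batch_sizes:
            start = time.perf_counter()
            for _ in range(repeats):
                run_inference(batch, threads)
            profiles[(threads, batch)] = (time.perf_counter() - start) / repeats
    return profiles

# Stand-in for a real model invocation.
def fake_inference(batch, threads):
    time.sleep(0.001 * batch / threads)

print(profile_model(fake_inference, thread_counts=[1, 2, 4], max_batch=16))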
Claims 13 and 20 recite subject matter commensurate with claim 6; therefore, they are rejected for the same reasons.
Conclusion
The prior art made of record and not relied upon is considered pertinent to applicant's disclosure.
Saxena (US 11,263,052) discloses “computing, based at least in part on the set of batch sizes, one or more node counts corresponding to a number of nodes that can be used for processing said job” (abstract), which relates to the disclosed optimization based on batch size.
Fan (US 2020/0104397) discloses “automatic selection of degrees of parallelism for efficient execution of queries in a database system are performed by systems and devices” (abstract), which relates to the disclosed determination of an optimal number of model inferences.
Any inquiry concerning this communication or earlier communications from the examiner should be directed to JACOB D DASCOMB whose telephone number is (571)272-9993. The examiner can normally be reached M-F 9:00-5:00.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Pierre Vital, can be reached at (571) 272-4215. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.
/JACOB D DASCOMB/ Primary Examiner, Art Unit 2198