Detailed Action
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Claims 1-30 are pending.
Response to Arguments
Regarding 35 U.S.C. 101:
Applicant’s amendments and arguments regarding the rejection of claims 1-30 under 35 U.S.C. 101 have been fully considered and are found to be persuasive. The rejections of claims 1-30 under 35 U.S.C. 101 are withdrawn as the claims are found to integrate the judicial exception of resource allocation and task scheduling into a practical application.
Regarding Prior Art Rejections:
Applicant’s amendments and arguments regarding the rejection of claims 1-4, 6-11, 13-18, 21-25, and 29-30 under 35 U.S.C. 102 and claims 5, 12, 19-20, and 26-27 under 35 U.S.C. 103 have been fully considered and are moot due to new grounds of rejection necessitated by amendment.
Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.
Claims 1-4, 6-11, 13-18, 21-25, and 29-30 are rejected under 35 U.S.C. 103 as being unpatentable over Kannan et al. US 11797280 B1 in view of Kursun US 20200175350 A1.
Kannan is cited in a previous action.
Regarding claim 1, Kannan teaches the invention substantially as claimed including:
One or more processors (col 9, lines 66-67, a processor), comprising:
circuitry to cause one or more neural networks having a first partitioning to be dynamically repartitioned (col 9, line 64, circuit devices; Process 700 may begin at block 702 by performing an initial partitioning of the neural network model into partitions for execution on multiple processing integrated circuit devices, in which each partition corresponds to a processing integrated circuit device, col 10, lines 13-17) by:
detecting a change to performance of the one or more neural networks (A similar analysis of the I/O transfer latency gain and the change in execution latencies of partitions P1 and P2 can be performed for each of the nodes along the initial partition boundary 502, col 8, lines 23-26); and
causing computational load to perform one or more additional inferences to be distributed according to a second partitioning (Fig. 7, block 710; At block 710, instructions for programming the processing integrated circuit devices to execute the neural network model can be generated according to the adjusted partitions, col 11, lines 59-61; Executing a neural network on a set of input data can be referred to as inference or performing inference, col 18, lines 61-63), the second partitioning generated based, at least in part, on the detected change to performance of the one or more neural networks (Fig. 7, block 708; At block 708, the partitions can be adjusted or optimized by moving computations from a source partition to a target partition to change execution latencies of the partitions and/or the amount of data being transferred from the source partition to the target partition, col 10, lines 56-61).
Kannan does not explicitly teach the changes detected based at least in part on one or more performance metrics obtained from performance of first and second inferences by the one or more neural networks having the first partitioning.
However, Kursun teaches the changes detected based at least in part on one or more performance metrics obtained from performance of first and second inferences by the one or more neural networks having the first partitioning ([0092] the system may use construction logic to perform online reconfiguration of the neural network. The system may continuously monitor the neural network for real-time changes in the data patterns and/or adversarial interaction patterns. For instance, the system may continuously assess the rate of successful detection of unauthorized users, the rate of false positives, decision-making speed, breadth of compatibility with different channels, and the like. The system may then determine that an improvement in one or more of these above dimensions could be made by modifying the structure of the neural network).
It would have been obvious to a person having ordinary skill in the art before the effective filing date of the invention to have combined Kursun’s continuous monitoring of neural network data patterns with the existing system of Kannan. A person of ordinary skill in the art would have been motivated to make this combination to provide the resulting system with the advantage of real time neural network adjustments using recent execution data (see Kursun [0002] As the structure of neural networks increase in depth and complexity, manual exploration of said neural networks becomes increasingly difficult, or in some cases, impracticable. Furthermore, conventional methods do not provide an efficient way to optimize the performance of the neural network once it has been implemented into the production environment. Accordingly, there is a need for a scalable and efficient way to construct deep neural networks and perform real-time reconfiguration of the neural networks once constructed).
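For illustration only, the following sketch (hypothetical Python prepared by the examiner; the names and the simple rebalancing policy are assumptions and are not drawn from Kannan or Kursun) shows how performance metrics obtained from a first and a second inference under a first partitioning could be compared to detect a change, with a second partitioning then generated based on that detected change:

def detect_performance_change(first_metric_ms, second_metric_ms, tolerance=0.10):
    # Performance metrics (e.g., inference latencies) observed while the one or more
    # neural networks run under the first partitioning.
    return abs(second_metric_ms - first_metric_ms) > tolerance * first_metric_ms

def generate_second_partitioning(first_partitioning, per_device_latency_ms):
    # Hypothetical rebalancing: move one layer from the slowest device to the fastest
    # device so that additional inferences are distributed differently.
    new_partitioning = {device: list(layers) for device, layers in first_partitioning.items()}
    slowest = max(per_device_latency_ms, key=per_device_latency_ms.get)
    fastest = min(per_device_latency_ms, key=per_device_latency_ms.get)
    if slowest != fastest and new_partitioning[slowest]:
        new_partitioning[fastest].append(new_partitioning[slowest].pop())
    return new_partitioning

# Example: the latency metric drifted from 18 ms to 25 ms between two inferences,
# so a second partitioning is generated from the first.
first_partitioning = {0: ["layer0", "layer1", "layer2", "layer3"], 1: ["layer4", "layer5"]}
if detect_performance_change(18.0, 25.0):
    second_partitioning = generate_second_partitioning(
        first_partitioning, per_device_latency_ms={0: 20.0, 1: 5.0})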
Regarding claim 2, Kannan and Kursun teach the one or more processors of claim 1.
Kannan further teaches wherein the one or more performance metrics of the one or more neural networks include an inferencing request metric (col 10, lines 57-59 At block 708, the partitions can be adjusted or optimized by moving computations from a source partition to a target partition to change execution latencies of the partitions; col 18, lines 57-63 In some examples, the accelerator 902 can implement a neural network processing engine. In these examples, the accelerator 902, for a set of input data 950, can execute a neural network to perform a task for which the neural network was trained. Executing a neural network on a set of input data can be referred to as inference or performing inference).
Regarding claim 3, Kannan and Kursun teach the one or more processors of claim 1.
Kannan further teaches wherein the circuitry is to cause the one or more neural networks to be dynamically partitioned on a plurality of graphics processing units (GPUs) (col 2, lines 44-46 To improve throughput, multiple processing integrated circuit devices (e.g., processing cores) can be used together to execute a neural network; col 2, lines 53-54 Each processing core can be a processor, GPU).
Regarding claim 4, Kannan and Kursun teach the one or more processors of claim 1. Kannan further teaches wherein the circuitry is to cause the one or more neural networks to be dynamically partitioned on a first one or more graphics processing units (GPUs) of a first computer system (col 3, lines 24-36 the neural network model is partitioned into N number of partitions with each processing core executing one of the partitions ... processing core 220-1), and a second one or more GPUs of a second computer system (col 3, line 37 processing core 220-2).
Regarding claim 6, Kannan and Kursun teach the one or more processors of claim 1.
Kannan further teaches wherein the circuitry is to allocate the dynamically partitioned one or more neural networks on one or more inference nodes (col 3, lines 24-26 the neural network model is partitioned into N number of partitions with each processing core executing one of the partitions; col 18, lines 61-63 Executing a neural network on a set of input data can be referred to as inference or performing inference). Examiner notes: because the processors into which the neural networks are partitioned perform inferences, they are considered inference nodes.
Regarding claim 7, Kannan and Kursun teach the one or more processors of claim 1.
Kannan further teaches wherein the circuitry is to cause the one or more neural networks to be dynamically partitioned based, at least in part, on one or more performance metrics of one or more graphics processing units (GPUs) (col 2, lines 53-54 Each processing core can be a processor, GPU; col 6, lines 4-11 The compute latency of a partition can be the total number of compute clock cycles to perform computations for the nodes in the partition. In some implementations, this can be the sum of the compute clock cycles to perform computations for each node in the partition. The compute clock cycles to perform computations for a node can be an estimation provided by a compiler for the hardware architecture of the target processing integrated circuit device).
Regarding claim 8, Kannan and Kursun teach the one or more processors of claim 1.
Kannan further teaches wherein the one or more performance metrics include one or more inferencing request metrics (col 18, lines 61-63 Executing a neural network on a set of input data can be referred to as inference or performing inference; col 10, lines 57-59 At block 708, the partitions can be adjusted or optimized by moving computations from a source partition to a target partition to change execution latencies of the partitions). Examiner notes: execution latencies of a partition result from the processing of an inference request. Kannan further teaches the circuitry is to cause the one or more neural networks to be dynamically partitioned also based, at least in part, on one or more graphics processing unit metrics (col 2, lines 53-54 Each processing core can be a processor, GPU; col 6, lines 17-27 The weight loading latency of a partition can be the total number of weight loading clock cycles for loading weights used by the nodes in the partition during execution of the neural network model. The number of weight loading clock cycles can be computed based on the total number of weights used by the nodes in the partition, the size or capacity of the on-chip memory (e.g., cache memory) of the processing integrated circuit device or processing core allocated for weights storage).
Regarding claim 9, it recites a system with limitations substantially similar to those of claim 1. Therefore, it is rejected for the same reasons as claim 1.
Kannan further teaches a system comprising one or more processors (col 12, line 18 four processing cores), and one or more memories to store one or more of the one or more performance metrics (Fig 8, #806 The storage device; col 5, line 64 an execution latency can be calculated for each partition).
Regarding claim 10, Kannan and Kursun teach the system of claim 9.
Kannan further teaches wherein the one or more performance metrics of the one or more neural networks include an inferencing request throughput or an inferencing request latency (col 10, lines 57-59 At block 708, the partitions can be adjusted or optimized by moving computations from a source partition to a target partition to change execution latencies of the partitions; col 18, lines 61-63 Executing a neural network on a set of input data can be referred to as inference or performing inference).
Regarding claim 11, Kannan and Kursun teach the system of claim 9.
Kannan further teaches wherein the one or more processors are also to cause the one or more neural networks to be dynamically partitioned based, at least in part, on one or more memory metrics (col 6, lines 1-2 The execution latency of a partition may include: ... the weight loading latency; col 6, lines 17-25 The weight loading latency of a partition can be the total number of weight loading clock cycles for loading weights used by the nodes in the partition during execution of the neural network model. The number of weight loading clock cycles can be computed based on ... the size or capacity of the on-chip memory (e.g., cache memory) of the processing integrated circuit device or processing core allocated for weights storage; col 10, lines 36-40 process 700 calculates, for each partition, an execution latency by aggregating compute clock cycles to perform computations in the partition, and weight loading clock cycles determined based on a number of weights used in the partition; col 10, lines 57-59 At block 708, the partitions can be adjusted or optimized by moving computations from a source partition to a target partition to change execution latencies of the partitions).
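For illustration only, a brief sketch (hypothetical Python; the cost model is the examiner's assumption rather than Kannan's disclosed calculation) of how a per-partition execution latency could aggregate compute clock cycles with weight-loading clock cycles that depend on the number of weights and the on-chip memory capacity, consistent with the passages cited above:

def execution_latency_cycles(compute_cycles_per_node, num_weights, on_chip_weight_capacity,
                             cycles_per_weight_load=2):
    # Compute latency: total compute clock cycles over the nodes in the partition.
    compute_cycles = sum(compute_cycles_per_node)
    # Weight-loading latency: weights that do not fit in on-chip memory are assumed
    # (as a simplification) to be reloaded once from off-chip memory.
    spilled_weights = max(0, num_weights - on_chip_weight_capacity)
    weight_loading_cycles = (num_weights + spilled_weights) * cycles_per_weight_load
    return compute_cycles + weight_loading_cycles

# Example: a three-node partition whose weights exceed the on-chip capacity.
latency = execution_latency_cycles([1200, 800, 1500], num_weights=50_000,
                                   on_chip_weight_capacity=32_768)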
Regarding claim 13, Kannan and Kursun teach the system of claim 9.
Kannan further teaches wherein the one or more processors are to allocate the dynamically partitioned one or more neural networks on two or more inference nodes (col 3, lines 19-26 FIG. 2 illustrates an example of a system 200 that implements a pipelined architecture to improve both throughput and latency. Similar to system 100, system 200 includes multiple processing cores 220-1 to 220-N. However, instead of having each of the processing cores 220-1 to 220-N execute the full neural network model, the neural network model is partitioned into N number of partitions with each processing core executing one of the partitions; col 18, lines 61-63 Executing a neural network on a set of input data can be referred to as inference or performing inference). Examiner notes: the processing cores into which the neural networks are partitioned are inference nodes because they perform inferencing.
Regarding claim 14, Kannan and Kursun teach the system of claim 9.
Kannan further teaches wherein the one or more processors are included in an inferencing system that is to receive inferencing requests via a network (col 24, lines 60-63 In some examples, the support systems 1174 can be responsible for taking instructions from the host processor 1172 when programs executing on the host processor 1172 request the execution of a neural network; col 18, lines 61-63 Executing a neural network on a set of input data can be referred to as inference or performing inference; col 25, lines 31-37 In some examples, the user device may be in communication with the service provider computer over one or more networks. Additionally, the user device may be part of the distributed system managed by, controlled by, or otherwise part of the service provider computer (e.g., a console device integrated with the service provider computers)).
Regarding claim 15, Kannan and Kursun teach the system of claim 9.
Kannan further teaches wherein the one or more processors are to cause the one or more neural networks to be dynamically partitioned based, at least in part, on one or more requests to perform operations using the one or more neural networks (col 24, lines 60-67 In some examples, the support systems 1174 can be responsible for taking instructions from the host processor 1172 when programs executing on the host processor 1172 request the execution of a neural network. For example, the host processor 1172 can provide the support systems 1174 with a set of input data and a task that is to be performed on the set of input data. In this example, the support systems 1174 can identify a neural network that can perform the task, and can program the acceleration engine 1160 to execute the neural network on the set of input data. In some examples, the support systems 1174 only needs to select an appropriate neural network processing engine of the neural network processor. In some examples, the support systems 1174 may need to load the data for the neural network onto the acceleration engine 1160 before the acceleration engine 1160 can start executing the neural network. In these and other examples, the support systems 1174 can further receive the output of executing the neural network, and provide the output back to the host processor 1172; col 10, lines 57-61 the partitions can be adjusted or optimized by moving computations from a source partition to a target partition to change execution latencies of the partitions and/or the amount of data being transferred from the source partition to the target partition). Examiner notes: when a program sends a request to perform inferencing of neural networks, the rebalancing techniques described in Kannan occur to optimize for latency and throughput.
Regarding claim 16, it recites a method with limitations substantially similar to those of claim 1. Therefore, it is rejected for the same reasons as claim 1.
Regarding claim 17, Kannan and Kursun teach the method of claim 16.
Kannan further teaches wherein the one or more performance metrics of the one or more neural networks include one or more inferencing request performance metrics (col 10, lines 57-59 At block 708, the partitions can be adjusted or optimized by moving computations from a source partition to a target partition to change execution latencies of the partitions; col 18, lines 61-63 Executing a neural network on a set of input data can be referred to as inference or performing inference).
Regarding claim 18, Kannan and Kursun teach the method of claim 16.
Kannan further teaches wherein dynamically partitioning the one or more neural networks includes partitioning the one or more neural networks in response to a first inferencing request (col 5, lines 4-21 During an initial phase of the partitioning process, the compiler may start at the output nodes (e.g., nodes N21 and N22) and traverse backwards from the output towards the input to aggregate the amount of compute load for each traversed node. When the aggregated amount of compute load reaches a compute threshold, a partition boundary can be created to form a partition of neural network model 300. By way of example, referring to FIG. 3, a first partition P1 is formed with nodes N22, N21, N20, N19, and N18 because the aggregated compute load of these nodes reached a compute threshold. Continuing to traverse backwards, a second partition P2 is formed with nodes N17, N16, N15, N14, N13, and N12 because the aggregated compute load of these nodes reached a compute threshold. In this manner, an initial partitioning of neural network model 300 can be performed. It should be noted that the number of nodes per partition may differ because the amount of compute for each node can be different; col 18, lines 61-63 Executing a neural network on a set of input data can be referred to as inference or performing inference). Examiner notes: It is understood that initial partitioning and execution of a neural network are the result of an inferencing request; and repartitioning the one or more neural networks based, at least in part, on the one or more performance metrics of the one or more neural networks in response to a second inferencing request (col 10, lines 57-61 the partitions can be adjusted or optimized by moving computations from a source partition to a target partition to change execution latencies of the partitions and/or the amount of data being transferred from the source partition to the target partition; col 18, lines 61-63 Executing a neural network on a set of input data can be referred to as inference or performing inference). Examiner notes: It is understood that repartitioning and execution of a neural network is the result of an inferencing request.
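For illustration only, a minimal sketch (hypothetical Python; variable names are the examiner's own) of the initial partitioning approach quoted above from col 5, lines 4-21, in which nodes are traversed from the output toward the input and a partition boundary is created when the aggregated compute load reaches a threshold:

def initial_partitioning(nodes_output_to_input, compute_load, compute_threshold):
    # nodes_output_to_input: node identifiers ordered from the output node(s) back
    # toward the input; compute_load: estimated compute load per node.
    partitions, current, aggregated = [], [], 0
    for node in nodes_output_to_input:
        current.append(node)
        aggregated += compute_load[node]
        if aggregated >= compute_threshold:   # create a partition boundary
            partitions.append(current)
            current, aggregated = [], 0
    if current:                               # remaining nodes form the last partition
        partitions.append(current)
    return partitions

# Hypothetical usage with made-up per-node loads (not the values of Kannan's FIG. 3):
nodes = ["N22", "N21", "N20", "N19", "N18", "N17", "N16", "N15", "N14", "N13", "N12"]
parts = initial_partitioning(nodes, {n: 1 for n in nodes}, compute_threshold=5)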
Regarding claim 21, Kannan and Kursun teach the method of claim 16.
Kannan further teaches wherein dynamically partitioning includes partitioning a first model that includes a first one or more neural networks, and partitioning a second model that includes a second one or more neural networks, and wherein a first partition of the first model is to be processed using a graphics processing unit (GPU), and a second partition of the second model is to be processed using the GPU. (For method of partitioning see claim 16; col 2, lines 44-46 To improve throughput, multiple processing integrated circuit devices (e.g., processing cores) can be used together to execute a neural network; col 2, lines 53-54 Each processing core can be a ... GPU; col 28, lines 10-15 In some instances, each core in a single or multi-core processor may also include multiple executing logical processors (or executing threads). In such a core (e.g., those with multiple logical processors), several stages of the execution pipeline and also lower level caches may also be shared). Examiner notes: It is understood that the neural network partitioning method may be run to partition a second neural network model onto an available GPU containing an existing first partition of a first neural network model, and because of multi-core processing, the GPU may also execute the second partition of the second neural network model.
Regarding claim 22, it recites a non-transitory machine-readable medium with limitations substantially similar to those of claim 1. Therefore, it is rejected for the same reasons as claim 1.
Kannan further teaches a non-transitory machine-readable medium having stored thereon a set of instructions which if performed by one or more processors (claim 18 computer readable medium having stored therein instructions that, when executed by one or more processors).
Regarding claim 23, Kannan and Kursun teach the non-transitory machine-readable medium of claim 22.
Kannan further teaches wherein the one or more performance metrics of the one or more neural networks are based, at least in part, on one or more requests to perform inferencing operations received over a network (col 10, lines 57-59 At block 708, the partitions can be adjusted or optimized by moving computations from a source partition to a target partition to change execution latencies of the partitions; col 18, lines 57-63 Executing a neural network on a set of input data can be referred to as inference or performing inference; col 26, lines 16-24 In various examples, the network 1200 can be used to process data. For example, input data can be received ... from other networks 1208 with which the network 1200 can communicate. In this example, the input data can be directed to a node in the network 1200 that includes an acceleration engine, for the acceleration engine to operate on and produce a result. The result can then be transferred to the node or other network from which the input data was received).
Regarding claim 24, Kannan and Kursun teach the non-transitory machine-readable medium of claim 22.
Kannan further teaches wherein the one or more performance metrics of the one or more neural networks include a throughput metric or a latency metric (col 10, lines 57-61 At block 708, the partitions can be adjusted or optimized by moving computations from a source partition to a target partition to change execution latencies of the partitions and/or the amount of data being transferred from the source partition to the target partition).
Regarding claim 25, Kannan and Kursun teach the non-transitory machine-readable medium of claim 22.
Kannan further teaches wherein the one or more performance metrics of the one or more neural networks include one or more inferencing request performance metrics (col 10, lines 57-59 At block 708, the partitions can be adjusted or optimized by moving computations from a source partition to a target partition to change execution latencies of the partitions; col 18, lines 57-63 In some examples, the accelerator 902 can implement a neural network processing engine. In these examples, the accelerator 902, for a set of input data 950, can execute a neural network to perform a task for which the neural network was trained. Executing a neural network on a set of input data can be referred to as inference or performing inference), and the set of instructions which if performed by the one or more processors is also to cause the one or more processors to dynamically partition the one or more neural networks based, at least in part, on a set of available computing devices (col 3, lines 35-39 For example, if N number of processing cores are available as shown, processing core 220-1 may only need access to weights W.sub.1, processing core 220-2 may only need access to weights W.sub.2, processing core 220-N may only need access to weights W.sub.N, and so on.; col 10, lines 57-59 At block 708, the partitions can be adjusted or optimized by moving computations from a source partition to a target partition).
Regarding claim 29, Kannan and Kursun teach the processor of claim 1.
Kannan further teaches wherein the one or more performance metrics of the one or more neural networks are one or more measured values obtained by using the one or more neural networks to perform inferencing (col 5, line 67 to col 6, line 2 The execution latency of a partition may include three components: the compute latency; col 6, lines 4-6 The compute latency of a partition can be the total number of compute clock cycles to perform computations for the nodes in the partition; col 6, lines 12-16 the neural network model can be executed on the actual hardware to obtain a latency measurement for each node, and the actual measured latency can be used to compute the total number of compute clock cycles for a partition; col 18, lines 61-63 Executing a neural network on a set of input data can be referred to as inference or performing inference).
Regarding claim 30, Kannan and Kursun teach the processor of claim 1.
Kannan further teaches wherein the circuitry is to cause each of the one or more neural networks to be dynamically partitioned based, at least in part, on one or more measured values obtained by using the one or more neural networks to perform inferencing (col 5, lines 32-36 the execution latencies computed for the initial partitioning 502 indicates that partition P1 has a longer execution latency than partition P2. Thus, it may be beneficial to adjust the partitions to achieve a more balanced execution latencies between P1 and P2; col 10, lines 36-40 At block 704, process 700 calculates, for each partition, an execution latency by aggregating compute clock cycles to perform computations in the partition, and weight loading clock cycles determined based on a number of weights used in the partition; col 10, lines 57-59 At block 708, the partitions can be adjusted or optimized by moving computations from a source partition to a target partition to change execution latencies of the partitions).
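For illustration only, a short sketch (hypothetical Python; the balancing heuristic is the examiner's assumption and not Kannan's exact procedure) of adjusting partitions by moving a computation from a source partition to a target partition when doing so reduces the larger of the two measured execution latencies, in the manner of block 708 discussed above:

def adjust_boundary(p1_nodes, p2_nodes, node_latency_ms):
    # p1_nodes and p2_nodes list node identifiers in two adjacent partitions, with the
    # last element of p1_nodes sitting next to the partition boundary.
    def latency(nodes):
        return sum(node_latency_ms[n] for n in nodes)
    while p1_nodes and latency(p1_nodes) > latency(p2_nodes):
        candidate = p1_nodes[-1]
        new_max = max(latency(p1_nodes[:-1]), latency(p2_nodes) + node_latency_ms[candidate])
        if new_max >= max(latency(p1_nodes), latency(p2_nodes)):
            break  # moving the computation would not reduce the worst-case latency
        p2_nodes.insert(0, p1_nodes.pop())  # move the computation to the target partition
    return p1_nodes, p2_nodes

# Example with made-up latencies: P1 is initially slower than P2, so a boundary node migrates.
per_node = {n: 2.0 for n in ["N22", "N21", "N20", "N19", "N18"]}
per_node.update({n: 1.0 for n in ["N17", "N16", "N15", "N14", "N13", "N12"]})
p1, p2 = adjust_boundary(["N22", "N21", "N20", "N19", "N18"],
                         ["N17", "N16", "N15", "N14", "N13", "N12"], per_node)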
Claims 5 and 27 are rejected under 35 U.S.C. 103 as being unpatentable over Kannan et al. US 11797280 B1 in view of Kursun US 20200175350 A1, as applied to claims 1 and 22 respectively above, and in further view of Rhu et al. IEEE: vDNN: Virtualized deep neural networks for scalable, memory-efficient neural network design.
Rhu is cited in a previous action.
Regarding claim 5, Kannan and Kursun teach the processor of claim 1.
Kannan further teaches wherein the circuitry is to generate one or more representations of a corresponding one or more of the dynamically partitioned one or more neural networks (col 4, lines 17-19 To assist with the partitioning process, a neural network model can be represented as a directed acyclic hypergraph by a compiler).
Kannan does not specifically teach generating virtual representations of one or more neural networks.
However, Rhu teaches generating virtual representations of one or more neural networks (Rhu Section I In this paper, we propose virtualized Deep Neural Network (vDNN), a runtime memory management solution that virtualizes the memory usage of deep neural networks across both GPU and CPU memories. Our vDNN allows ML practitioners to deploy larger and deeper networks beyond the physical capacity of available GPUs, enabling them to focus more on their algorithms while the system architecture and runtime system transparently manage the allocation, placement, movement, and release of their data).
It would have been obvious to a person having ordinary skill in the art before the effective filing date of the invention to have combined Rhu with Kannan and Kursun because Rhu's teaching of virtual representations would improve Kannan's system beyond the physical capacity of available resources (Rhu Section I, Our vDNN allows ML practitioners to deploy larger and deeper networks beyond the physical capacity of available GPUs, enabling them to focus more on their algorithms while the system architecture and runtime system transparently manage the allocation, placement, movement, and release of their data).
Regarding claim 27, it recites a non-transitory machine-readable medium with limitations substantially similar to those of claim 5. Therefore, it is rejected for the same reasons as claim 5 above.
Claim 12 is rejected under 35 U.S.C. 103 as being unpatentable over Kannan et al. US 11797280 B1 in view of Kursun US 20200175350 A1, as applied to claim 9 above, in view of Rhu et al. IEEE: vDNN: Virtualized deep neural networks for scalable, memory-efficient neural network design, and further in view of Hu et al. US 20220058477 A1.
Hu is cited in a previous action.
Regarding claim 12, Kannan and Kursun teach the system of claim 9.
Kannan further teaches requests to use the partitioned one or more neural networks (col 3, lines 24-26 the neural network model is partitioned into N number of partitions with each processing core executing one of the partitions; col 13, lines 42-45 the acceleration engine 812 can be programmed to execute a particular neural network, such as one that performs image recognition or one that performs machine translation; col 15, lines 39-42 Once the acceleration engine 812 has finished, the acceleration engine 812 can notify the driver 822, and the driver 822 can deliver a result back to the application that requested the result).
Kannan does not specifically teach wherein requests are to be routed via a corresponding one or more non-partitioned virtual neural network models.
However, Rhu teaches virtual neural network models as previously cited.
It would have been obvious to a person having ordinary skill in the art before the effective filing date of the invention to have combined Rhu with Kannan and Kursun because Rhu's teaching of virtual neural network models would improve the memory utilization and efficiency of Kannan's system (Rhu Section I Our vDNN allows ML practitioners to deploy larger and deeper networks beyond the physical capacity of available GPUs, enabling them to focus more on their algorithms while the system architecture and runtime system transparently manage the allocation, placement, movement, and release of their data).
Kannan, Kursun, and Rhu do not specifically teach wherein requests are to be routed via a corresponding one or more non-partitioned virtual neural network models.
However, Hu teaches wherein requests are to be routed via a corresponding one or more non-partitioned neural network models (Hu [0086], The client device 804 may make a request to the cloud services provider 812 for tuned hyperparameters. In one example, the client device 804 may make a request to the cloud services provider 812 for a trained neural network model, where the trained neural network model is a large neural network model. The cloud services provider 812 may route the request to a specific tenant, such as tenant 828 to fulfill the request. In some examples, the client device 804 may be interacting directly with a tenant, such as tenant A 820. Accordingly, the request may be fulfilled by a web service or application 836 that exposes or otherwise makes available the tuned hyperparameters via a neural network training server 840. In some examples, the neural network training server 840 may be the same as the hyperparameter tuning server 304 and/or the neural network training server 404. Accordingly, a client device 804 may provide a neural network with the request, a dataset with the request, or both the neural network model and the dataset with the request. Accordingly, the neural network training server 840 may generate the tuned hyperparameters as previously discussed and provide the tuned hyperparameters back to the requesting client device 804).
It would have been obvious to a person having ordinary skill in the art before the effective filing date of the invention to have combined Hu with Kannan, Kursun, and Rhu because Hu's teaching of routing requests would improve the performance of Kannan's system (Hu [0001] training the models and/or using the models can be computationally expensive and utilize so much energy that multiple tuning and training passes are often impractical for very large models. Accordingly, there is ample opportunity for improvements in computer hardware and software to implement neural networks).
Claims 19 and 26 are rejected under 35 U.S.C. 103 as being unpatentable over Kannan et al. US 11797280 B1 in view of Kursun US 20200175350 A1, as applied to claims 16 and 22 respectively above, in view of Almeida et al. Pub No. US 20220083386 A1.
Almeida is cited in a previous action.
Regarding claim 19, Kannan and Kursun teach the method of claim 16.
Kannan and Kursun do not specifically teach wherein dynamically partitioning the one or more neural networks is also performed based, at least in part, on one or more graphics processing unit (GPU) power metrics.
However, Almeida teaches wherein dynamically partitioning the one or more neural networks is also performed based, at least in part, on one or more graphics processing unit (GPU) power metrics ([0005], In a first approach of the present techniques, there is provided a method for distributing neural network execution using an electronic user device, the method comprising: ... partitioning the neural network into a number of partitions based on the determined subset of computing resources that are able to satisfy the at least one optimisation constraint; [0024] The identified computing resources within the electronic user device may comprise one or more of: ... a graphics processing unit (GPU) ... ; [0061] The partitions may also be determined based on the computing capability/processing power of a computing resource; [0054] The method may comprise obtaining at least one optimization constraint to be satisfied when executing the neural network (step S102). This step may comprise obtaining one or more of: ... an energy constraint (which could be specified in terms of cost)).
It would have been obvious to a person having ordinary skill in the art before the effective filing date of the invention to have combined Almeida with Kannan and Kursun because Almeida's teaching of dynamically partitioning based on GPU power consumption would improve the flexibility of Kannan's system to utilize resources (Almeida [0003-0004] However, on-device execution of a neural network may not always be possible, because of the processing capability of the device. Alternatively, a neural network could be executed using cloud computing or edge computing, and the results could be provided to the user device).
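For illustration only, a small sketch (hypothetical Python; the constraint check and partitioning policy are the examiner's assumptions rather than Almeida's disclosure) of selecting the subset of computing resources that satisfy a GPU power constraint before partitioning, consistent with the paragraphs cited above:

def resources_within_power_budget(resource_power_watts, max_power_watts):
    # resource_power_watts: mapping of resource name (e.g., "gpu") -> power metric in watts.
    return [name for name, watts in resource_power_watts.items() if watts <= max_power_watts]

def partition_layers(layers, eligible_resources):
    # One partition per eligible resource, assigning layers round-robin as a simple policy.
    if not eligible_resources:
        return {}
    assignment = {name: [] for name in eligible_resources}
    for i, layer in enumerate(layers):
        assignment[eligible_resources[i % len(eligible_resources)]].append(layer)
    return assignment

# Example: only resources within the power budget participate in the partitioning.
eligible = resources_within_power_budget({"cpu": 15.0, "gpu": 45.0, "npu": 5.0},
                                         max_power_watts=20.0)
partitions = partition_layers([f"layer{i}" for i in range(6)], eligible)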
Regarding claim 26, it recites a non-transitory machine-readable medium with limitations substantially similar to those of claim 19. Therefore, it is rejected for the same reasons as claim 19 above.
Claim 20 is rejected under 35 U.S.C. 103 as being unpatentable over Kannan et al. US 11797280 B1 in view of Kursun US 20200175350 A1, as applied to claim 16 above, in view of Jeong et al. IEEE: Optimal Partitioning of Distributed Neural Networks for Various Communication Environments, and further in view of Shah et al. US 20180300621 A1.
Shah is cited in a previous action.
Regarding claim 20, Kannan and Kursun teach the method of claim 16.
Kannan further teaches wherein the one or more neural networks are a first one or more neural networks (col 3, lines 24-25 the neural network model is partitioned).
Kannan does not specifically teach that dynamically partitioning the one or more neural networks is based, at least in part, on a second one or more neural networks that use the one or more performance metrics as one or more inputs.
However, Jeong teaches that dynamically partitioning the one or more neural networks is based, at least in part, on a second one or more neural networks (Jeong Fig. 1(a), Distributed NN, and Fig. 1(b), NoNN, depict student neural networks being partitioned by a teacher neural network).
It would have been obvious to a person having ordinary skill in the art before the effective filing date of the invention to have combined Jeong with Kannan and Kursun to obtain a neural network model that is able to partition other neural networks and thereby address device memory constraints (Jeong Section I However, normally, the student network is still too large to be solely executed on a single device. So, the layers are divided into multiple subsets, each of which is executed on a separate device as shown in the figure).
Kannan, Kursun, and Jeong do not teach one or more neural networks that use the one or more performance metrics as one or more inputs.
However, Shah teaches one or more neural networks that use the one or more performance metrics of the first one or more neural networks as one or more inputs (0028 A neural network application inputs the time series of performance metrics and a probe matrix to a neural network, and the neural network discovers runtime dependencies, such as precedence relationships, among the performance metrics using the probe matrix).
It would have been obvious to a person having ordinary skill in the art before the effective filing date of the invention to have combined Shah with Kannan, Kursun, and Jeong because Shah's teaching of using performance metrics as neural network inputs would improve the performance of Kannan, Kursun, and Jeong's partitioning system for resource efficiency (Shah Background and 0007 An important mathematical operation during neural network processing is performing a convolution between matrices. However, conventional convolution operations can require significant memory usage in computer systems or devices having memory size constraints, such as cache or prefetch memory found in central processing units (CPUs)/graphics processing unit (GPUs), or in devices with limited memory, such as mobile devices or Internet-of-Things (IoT) devices).
Claim 28 is rejected under 35 U.S.C. 103 as being unpatentable over Kannan et al. US 11797280 B1 in view of Kursun US 20200175350 A1, as applied to claim 22 above, in view of Mathur et al. US 20210158939 A1.
Mathur is cited in a previous action.
Regarding claim 28, Kannan and Kursun teach the non-transitory machine-readable medium of claim 22.
Kannan and Kursun do not specifically teach wherein the one or more performance metrics are based, at least in part, on one or more medical image inferencing requests.
However, Mathur teaches wherein the one or more performance metrics are based, at least in part, on one or more medical image inferencing requests ([0123] can receive a request (e.g., an image processing request 106) to process a medical image using at least one medical image inferencing models (e.g., one or more of the internal algorithms 134 and/or the external algorithms 136). At 402, the system identifies a workflow comprising a medical image inferencing model that is applicable to the medical image).
It would have been obvious to a person having ordinary skill in the art before the effective filing date of the invention to have combined Mathur with Kannan and Kursun because Mathur's teaching of medical image inferencing would improve the field of use of Kannan and Kursun's system to include healthcare (Mathur [0003] The healthcare industry has innumerable opportunities to leverage artificial intelligence (AI), machine learning (ML), and other analytical models to achieve more accurate, proactive, and comprehensive patient care).
Conclusion
Applicant's amendment necessitated the new ground(s) of rejection presented in this Office action. Accordingly, THIS ACTION IS MADE FINAL. See MPEP § 706.07(a). Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a).
A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action. In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any nonprovisional extension fee (37 CFR 1.17(a)) pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action. In no event, however, will the statutory period for reply expire later than SIX MONTHS from the mailing date of this final action.
Any inquiry concerning this communication or earlier communications from the
examiner should be directed to HARRISON LI whose telephone number is (703) 756-1469. The
examiner can normally be reached Monday-Friday 9:00am-5:30pm ET.
Examiner interviews are available via telephone, in-person, and video conferencing
using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is
encouraged to use the USPTO Automated Interview Request (AIR) at
http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner's supervisor, Aimee Li, can be reached on 571-272-4169.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.
/H.L./
Examiner, Art Unit 2195
/Aimee Li/Supervisory Patent Examiner, Art Unit 2195