Prosecution Insights
Last updated: April 19, 2026
Application No. 18/160,680

DEPLOYING NEURAL NETWORK MODELS ON RESOURCE-CONSTRAINED DEVICES

Status: Non-Final OA (§103)
Filed: Jan 27, 2023
Examiner: AKBARI, FARAZ TIMA
Art Unit: 2196
Tech Center: 2100 — Computer Architecture & Software
Assignee: Sony Group Corporation
OA Round: 1 (Non-Final)
Grant Probability: 0% (At Risk)
Estimated OA Rounds: 1-2
Estimated Time to Grant: 3y 3m
Grant Probability with Interview: 0%

Examiner Intelligence

Career Allow Rate: 0% (0 granted / 2 resolved; -55.0% vs TC avg) — this examiner has granted none of their resolved cases
Interview Lift: +0.0% (minimal lift, with vs. without interview; based on resolved cases with interview)
Typical Timeline: 3y 3m average prosecution
Career History: 38 total applications across all art units; 36 currently pending

Statute-Specific Performance

§101: 13.0% (-27.0% vs TC avg)
§103: 71.2% (+31.2% vs TC avg)
§102: 1.1% (-38.9% vs TC avg)
§112: 14.7% (-25.3% vs TC avg)
Baseline is the Tech Center average estimate. Based on career data from 2 resolved cases.

Office Action

§103
Notice of Pre-AIA or AIA Status

The present application, filed on or after March 16, 2013, is being examined under the first-inventor-to-file provisions of the AIA. This Office action is in response to the claims filed 1/27/2023. Claims 1-20 are pending.

Claim Rejections - 35 USC § 103

The following is a quotation of 35 U.S.C. 103, which forms the basis for all obviousness rejections set forth in this Office action:

A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

Claims 1-4 and 19-20 are rejected under 35 U.S.C. 103 as being unpatentable over Paek et al. (US 11175898 B2) in view of Banitalebi Dehkordi et al. (US 20220414432 A1), hereinafter referred to as Paek and Banitalebi, respectively.

Regarding Claim 1, Paek discloses:

A method, comprising: storing, on a persistent storage of a first electronic device, a model file that includes a neural network model (Col. 4, Line 67: "A memory 240 includes neural network model document files 244." A memory 240 including neural network model document files 244 corresponds to Applicant's storing, on a persistent storage of a first electronic device, a model file that includes a neural network model, as it is known in the art that a memory of an electronic device may include a persistent storage.);

determining constraint information associated with a deployment of the neural network model on the first electronic device (Col. 2, Lines 26-30: "when deploying a given deep neural network for execution on a target platform and/or target processor on the target platform, depending on the available hardware, resource constraints (e.g., memory and/or computing)." Considering the resource constraints of a target processor when deploying a deep neural network corresponds to Applicant's determining constraint information associated with a deployment of the neural network model on the first electronic device.);

receiving an input associated with a machine learning task (Col. 6, Lines 58-60: "a neural network (NN) is a computing model that uses a collection of connected nodes to process input data based on machine learning techniques." The neural network model processing input data based on machine learning techniques corresponds to Applicant's receiving an input associated with a machine learning task.);

a second operation to generate an intermediate result by an application of the sub-model on the input (Col. 11, Lines 57-58: "intermediate data layer 408 uses the output of intermediate data layer 406." The output of an intermediate data layer corresponds to Applicant's second operation to generate an intermediate result by an application of the sub-model on the input, as the layers that are inherent in the sub-models later disclosed by Banitalebi operate on inputs.);

a third operation to unload the sub-model from the working memory of the first electronic device (Col. 12, Lines 7-10: "a memory allocation may be required to hold the output until whatever intermediate data layer needs it has used the output." Holding the output until the intermediate data layer that needs it has used it corresponds to Applicant's third operation to unload the sub-model from the working memory of the first electronic device, as there must inherently be a means by which the memory is un-allocated once it is no longer needed.);

repeating the execution of the first set of operations for a next sub-model of the plurality of sub-models to generate an output, wherein the intermediate result is the input for the next sub-model (Col. 11, Lines 46-60: "Convolutional neural network 400 also illustrates the dependencies between different intermediate data layers. Thus, intermediate data layer 404 and intermediate data layer 406 both use the output of intermediate data layer 402; intermediate data layer 408 uses the output of intermediate data layer 406; and intermediate data layer 410 uses the output of intermediate data layer 408 and intermediate data layer 404." Dependencies between subsequent intermediate data layers, where, for example, the output of layer 406 is used by layer 408, correspond to Applicant's repeating the execution of the first set of operations for a next sub-model of the plurality of sub-models to generate an output, wherein the intermediate result is the input for the next sub-model, since, as later disclosed by Banitalebi, each sub-model consists of layers and therefore may implement this intermediate-layer pipeline.);

and controlling a first display device to render the output (Col. 17, Lines 17-19: "The output device interface 806 may enable, for example, the display of images generated by electronic system 800." Displaying images generated by electronic system 800 via the output device interface 806 corresponds to Applicant's controlling a first display device to render the output.).
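For orientation, the claimed "first set of operations" (load a sub-model into working memory, apply it to the current input, unload it, then repeat with the intermediate result as the next sub-model's input) can be illustrated with a minimal sketch. This is purely illustrative and not taken from the record; the `SubModel` class, the toy layers, and the two-part partition are all hypothetical.

```python
# Illustrative sketch only (not from the application or the cited art):
# sequentially executing sub-models of a partitioned network so that at
# most one sub-model occupies working memory at a time.

from typing import Callable, List

class SubModel:
    """Hypothetical sub-model: a contiguous run of layers."""
    def __init__(self, layers: List[Callable[[float], float]]):
        self.layers = layers
        self.loaded = False

    def load(self) -> None:      # first operation: load into working memory
        self.loaded = True

    def apply(self, x: float) -> float:  # second operation: intermediate result
        assert self.loaded, "sub-model must be loaded before use"
        for layer in self.layers:
            x = layer(x)
        return x

    def unload(self) -> None:    # third operation: free working memory
        self.loaded = False

def run_partitioned(sub_models: List[SubModel], x: float) -> float:
    # Repeat the first set of operations for each sub-model; the
    # intermediate result of one sub-model is the input to the next.
    for sm in sub_models:
        sm.load()
        x = sm.apply(x)
        sm.unload()
    return x

# Toy stand-ins for NN layers.
parts = [SubModel([lambda v: v + 1, lambda v: v * 2]),
         SubModel([lambda v: v - 3])]
print(run_partitioned(parts, 5))  # ((5 + 1) * 2) - 3 = 9
```

The point of the sketch is only the control flow: load/apply/unload per sub-model, with the intermediate result carried forward, which is the pipeline the rejection maps onto Paek's intermediate data layers.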
Paek does not explicitly disclose: determining a partition of the neural network model based on the constraint information and the model file; extracting a plurality of sub-models from the neural network model based on the partition; or executing a first set of operations for a sub-model of the plurality of sub-models, wherein the first set of operations comprises a first operation to load the sub-model in a working memory of the first electronic device.

However, Banitalebi discloses:

determining a partition of the neural network model based on the constraint information and the model file ([0014]: "In one or more of the preceding aspects, the selecting may be further based on a memory constraint for the first device."; [0040]: "An deep learning model splitting module 10 (hereinafter splitting module 10) is configured to receive, as an input a trained deep learning model for an inference task, and automatically process the trained deep learning model to divide (i.e. split) it into first and second deep learning models." Dividing or splitting a deep learning model into first and second deep learning models corresponds to Applicant's determining a partition of the neural network model based on the constraint information and the model file; it is known to one of ordinary skill in the art that a deep learning model is a variant of a neural network model, and this splitting is inherently performed based on the previously mentioned input model file and memory constraints.);

extracting a plurality of sub-models from the neural network model based on the partition ([0040], quoted above. The first and second deep learning models created as a result of the split correspond to Applicant's extracting a plurality of sub-models from the neural network model based on the partition.);

executing a first set of operations for a sub-model of the plurality of sub-models, wherein the first set of operations comprises: a first operation to load the sub-model in a working memory of the first electronic device ([0041]: "the deep learning model that is provided as input to the splitting module 10 is a trained DNN 11, and the resulting first and second deep learning models that are generated by the splitting module 10 are an edge DNN 30 that is configured to for deployment on a target edge device 88 and a cloud DNN 40 that is configured for deployment on a target cloud device 86." The resulting split models being configured for deployment corresponds to Applicant's first operation to load the sub-model, i.e., the split model, in a working memory of the first electronic device, i.e., deploying it.).

Paek and Banitalebi are analogous to the claimed invention because both are in the field of managing neural networks under computing-resource constraints. Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify Paek's system, which stores a model file, determines constraint information, receives input for a machine learning task, generates intermediate results from intermediate layers that use previous outputs as inputs, and unloads memory once it is no longer needed, to incorporate the teachings of Banitalebi: partitioning the neural network model based on the constraint information and the model file, extracting sub-models based on the partition, and executing a set of operations including loading a sub-model into working memory. Doing so would allow for improved performance via decreased latency and more flexible deployment, as described in Banitalebi.

Regarding Claim 2, Paek-Banitalebi teaches the method of Claim 1 as described above, and Banitalebi further discloses wherein each sub-model of the plurality of sub-models includes a subset of a set of NN layers of the neural network model ([0010]: "identifying a first set of one or more neural network layers from the trained neural network for inclusion in the first neural network and a second set of one or more neural network layers from the trained neural network for inclusion in the second neural network." Identifying corresponding sets of neural network layers for inclusion in the split neural network models corresponds to Applicant's sub-models each including a subset of a set of NN layers of the neural network model.).

Regarding Claim 3, Paek-Banitalebi teaches the method of Claim 1 as described above, and Paek further discloses wherein the first set of operations further comprises a fourth operation to store the intermediate result in the persistent storage (Col. 11, Lines 57-58: "intermediate data layer 408 uses the output of intermediate data layer 406"; Col. 15, Lines 50-52: "different memory allocation portions are designated for different intermediate data layers." Memory allocation being performed for intermediate data layers that have outputs corresponds to Applicant's fourth operation to store the intermediate result in the persistent storage.).

Regarding Claim 4, Paek-Banitalebi teaches the method of Claim 1 as described above, and Paek further discloses wherein the constraint information includes at least one of: a size of the working memory of the first electronic device, a processing capability of the first electronic device to perform a count of multiply-accumulate (MAC) operations per second, a network communication capability indicative of a transmission bandwidth of the first electronic device and a reception bandwidth of the first electronic device, and an indication that the input includes personal or sensitive data (Col. 12, Lines 44-47: "the total amount of memory available for allocation may be determined based at least in part on an amount of available memory of a given target device." This corresponds to Applicant's constraint information including a size of the working memory of the first electronic device. Because the claim recites "at least one of" the listed items, this teaching satisfies the claim.).

Regarding Claim 19, Paek discloses A first electronic device, comprising: a memory configured to store a model file that includes a neural network model (Col. 4, Line 67- A memory 240 includes neural network model document files 244.
A memory 240 including neural network model document files 244 corresponds to Applicant's memory configured to store a model file that includes a neural network model, as it is known in the art that a memory of an electronic device may include a persistent storage.); and circuitry configured to perform the recited operations (Col. 6, Lines 33-37: "specialized (e.g., dedicated) hardware has been developed that is optimized for performing particular operations from a given NN. A given electronic device may include a neural processor, which can be implemented as circuitry that performs various machine learning operations." A neural processor implemented as circuitry that performs machine learning operations corresponds to Applicant's circuitry.).

The remaining limitations of Claim 19 (determining constraint information; receiving an input associated with a machine learning task; the second operation to generate an intermediate result; the third operation to unload the sub-model; repeating the first set of operations for a next sub-model, wherein the intermediate result is the input for the next sub-model; and controlling a first display device to render the output) parallel the limitations of Claim 1 and are taught by Paek on the same citations and for the same reasons given above for Claim 1. Likewise, Paek does not explicitly disclose determining the partition, extracting the plurality of sub-models, or the first operation of loading a sub-model into working memory; Banitalebi teaches these limitations as set forth for Claim 1, the references are analogous art for the reasons given above, and the same rationale for combining Paek and Banitalebi applies to Claim 19.

Regarding Claim 20, Paek discloses A non-transitory computer-readable medium having stored thereon computer-executable instructions that, when executed by an electronic device, cause the electronic device to perform operations, the operations comprising (Col. 19, Lines 23-30- The tangible computer-readable storage medium also can be non-transitory in nature.
The computer-readable storage medium can be any storage medium that can be read, written, or otherwise accessed by a general purpose or special purpose computing device, including any processing electronics and/or processing circuitry capable of executing instructions. Such a non-transitory computer-readable storage medium readable by a computing device to execute instructions corresponds to Applicant's non-transitory computer-readable medium having stored thereon computer-executable instructions that, when executed by an electronic device, cause the electronic device to perform operations.).

The operations recited in Claim 20 mirror the method of Claim 1. Paek teaches the storing, constraint-determining, input-receiving, second-operation, third-operation, repeating, and display-control limitations on the same citations and for the same reasons given above for Claim 1; Paek does not explicitly disclose the partition, extraction, and loading limitations, which Banitalebi teaches on the citations given above for Claim 1. The references are analogous art in the field of managing neural networks under computing-resource constraints.
Therefore, it would have been obvious to someone of ordinary skill in the art prior to the effective filing date of the claimed invention to have modified Paek to incorporate the teachings of Banitalebi to modify the neural network model system that stores a model file, determines constraint information, receives input for a machine learning task, generates an intermediate result with intermediate layers that use previous output as input, and unloads from memory once completed to partition the neural network model based on the constraint information and model file, extract sub-models based on the partition, and execute a set of operations including loading the sub-model into working memory, allowing for improved performance via decreased latency and more flexible deployment, as described in Banitalebi. Claims 5-18 are rejected under 35 U.S.C. 103 as being unpatentable over Paek et al. (US 11175898 B2) in view of Banitalebi Dehkordi et al. (US 20220414432 A1), and further in view of Muthusamy et al. (US 20200349413 A1), hereinafter referred to as Paek, Banitalebi, and Muthusamy, respectively. Regarding Claim 5, Paek-Banitalebi as described in Claim 4, Paek further discloses determining a memory footprint of each NN layer of a set of NN layers of the neural network model, wherein the memory footprint is indicative of a memory required to load a corresponding NN layer on the working memory of the first electronic device as part of a sub-model of the plurality of sub-models (Col. 7, Lines 55-58- the memory allocations 344 correspond to code for allocating memory portions based on a determined size of each layer of the NN and/or based on an amount of memory available at the target device. 
Please note that determining the size of each layer of the NN and allocating memory accordingly at the target device corresponds to Applicant’s determining a memory footprint of each NN layer of a set of NN layers of the neural network model, with the memory footprint being indicative of a memory required to load a corresponding NN layer on the working memory of the first electronic device as part of a sub-model of the plurality of sub-models. This is because, as previously stated by Banitalebi in the combination, the sub-models are each comprised of NN layers.); Paek-Banitalebi does not explicitly disclose and grouping adjoining NN layers of the set of NN layers into a plurality of subsets of NN layers based on the determined memory footprint of each NN layer, wherein a memory footprint of each subset is less than or equal to the size of the working memory of the first electronic device, and the partition of the neural network model is further determined based on the grouping of the adjoining NN layers of the set of NN layers. However, Muthusamy discloses and grouping adjoining NN layers of the set of NN layers into a plurality of subsets of NN layers based on the determined memory footprint of each NN layer, wherein a memory footprint of each subset is less than or equal to the size of the working memory of the first electronic device, and the partition of the neural network model is further determined based on the grouping of the adjoining NN layers of the set of NN layers ([0029] a monolithic grouping strategy that has constraints on total size (e.g. Cloud Functions). In the monolithic grouping strategy, all model outputs are available every time when running the models and there is a high inferencing throughput. The dotted boxes in FIGS. 4-6 represent groups, which are the minimum units of deployment granularity. All layers in a group must be deployed together to one compute unit.
Please note that a grouping strategy to have layers in a group corresponds to Applicant’s grouping adjoining NN layers of the set of NN layers based on the determined memory footprint of each NN layer, wherein a memory footprint of each subset is less than or equal to the size of the working memory of the first electronic device, since the group is deployed together to a compute unit based on constraints on total size, so each subset inherently has a memory footprint less than or equal to the size of the working memory. Furthermore, as they are all deployed together when grouped, this corresponds to them being adjoining, and there is inherently a partition of the neural network model when the NN layers of the set of NN layers are grouped to separate them from other groups.). Paek-Banitalebi and Muthusamy are both considered to be analogous to the claimed invention because they are in the same field of managing resources for neural network models. Therefore, it would have been obvious to someone of ordinary skill in the art prior to the effective filing date of the claimed invention to have modified Paek-Banitalebi to incorporate the teachings of Muthusamy to modify the system as described in Claim 4 that determines a memory footprint of each NN layer to group adjoining NN layers into subsets based on the memory footprint of each layer, allowing for improved memory management and resource usage, as described in Muthusamy.
Regarding Claim 6, Paek-Banitalebi-Muthusamy as described in Claim 5, Banitalebi further discloses partitioning the neural network model based on the plurality of subsets of NN layers, wherein each subset of the plurality of subsets of NN layers corresponds to a sub-model of the plurality of sub-models, and the plurality of sub-models is extracted further based on the partitioning ([0010] identifying a first set of one or more neural network layers from the trained neural network for inclusion in the first neural network and a second set of one or more neural network layers from the trained neural network for inclusion in the second neural network. Please note that identifying corresponding sets of neural network layers for inclusion in the split neural network models corresponds to Applicant’s partitioning the neural network model based on the plurality of subsets of NN layers, wherein each subset of the plurality of subsets of NN layers corresponds to a sub-model of the plurality of sub-models, and the plurality of sub-models is extracted further based on the partitioning. This is because, as previously stated by Muthusamy, each subset of the plurality of subsets belongs to a model, which could include the previously mentioned sub-models of Banitalebi, and therefore the partitions between subsets would also allow for the extraction of sub-models that they correspond to.). Regarding Claim 7, Paek-Banitalebi-Muthusamy as described in Claim 5, Paek further discloses wherein the memory footprint of each NN layer is determined based on a size of the corresponding NN layer (Col. 7, Lines 55-58- the memory allocations 344 correspond to code for allocating memory portions based on a determined size of each layer of the NN and/or based on an amount of memory available at the target device. 
Please note that determining the size of each layer of the NN and allocating memory accordingly at the target device corresponds to Applicant’s determining a memory footprint of each NN layer based on a size of the corresponding NN layer.), a size of an input to be received by the corresponding NN layer, a size of an output to be generated by the corresponding NN layer (Col. 5, Lines 5-8- information including descriptions of input and output feature(s), […] may be included in a given neural network model document file. Please note that information including descriptions of input and output features corresponds to Applicant’s input and output sizes for corresponding NN layers that may be considered in determining the memory footprint.), and a size of a buffer to be allocated to the corresponding NN layer (Col. 12, Lines 7-10- a memory allocation may be required to hold the output until whatever intermediate data layer needs it has used the output. Please note that the memory allocation holding the output until the intermediate data layer that needs it has used it corresponds to the buffer to be allocated to the corresponding NN layer, and in making the allocation, there must necessarily be a size.), and the memory footprint of each subset of the plurality of subsets of NN layers is a sum of memory footprints of adjoining NN layers of the set of NN layers that are grouped into a corresponding subset of the plurality of subsets (Col. 7, Lines 55-58- the memory allocations 344 correspond to code for allocating memory portions based on a determined size of each layer of the NN and/or based on an amount of memory available at the target device.
Please note that as the size of each layer of the NN is determined and memory is allocated accordingly at the target device, a set of layers as grouped in the subset disclosed by Muthusamy would be able to have the memory footprint of each subset of the plurality of subsets of NN layers determined via the sum of footprints of adjoining layers of the set that are grouped; the sum of memory footprints of the adjoining grouped layers of the subset would be obvious as a measure of the memory footprint of the subset.). Regarding Claim 8, Paek-Banitalebi-Muthusamy as described in Claim 5, Muthusamy further discloses wherein the determined memory footprint of each NN layer of the set of NN layers of the neural network model is less than or equal to the size of the working memory of the first electronic device ([0029] a monolithic grouping strategy that has constraints on total size (e.g. Cloud Functions). In the monolithic grouping strategy, all model outputs are available every time when running the models and there is a high inferencing throughput. The dotted boxes in FIGS. 4-6 represent groups, which are the minimum units of deployment granularity. All layers in a group must be deployed together to one compute unit. Please note that the constraints of total size being considered when deploying layers in a group to one compute unit corresponds to Applicant’s wherein the determined memory footprint of each NN layer of the set of NN layers of the neural network model is less than or equal to the size of the working memory of the first electronic device, since the group of NN layers is deployed together to a compute unit based on constraints on total size, so each layer of the subset inherently has a memory footprint less than or equal to the size of the working memory.).
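The claim 5-8 mapping above describes, in effect, a packing procedure: compute a per-layer memory footprint from the layer, input, output, and buffer sizes, then group adjoining layers while the running sum fits the working memory. As a minimal illustrative sketch only (not part of the record; the names, the greedy strategy, and the example figures are all hypothetical), such a grouping could look like:

```python
from dataclasses import dataclass

@dataclass
class Layer:
    name: str
    weights: int    # bytes of parameters ("size of the corresponding NN layer")
    input_sz: int   # bytes of the layer's input
    output_sz: int  # bytes of the layer's output
    buffer_sz: int  # bytes of scratch buffer allocated to the layer

def footprint(layer: Layer) -> int:
    # Claim 7: the per-layer footprint combines the layer size with
    # its input, output, and buffer sizes.
    return layer.weights + layer.input_sz + layer.output_sz + layer.buffer_sz

def group_layers(layers: list[Layer], working_memory: int) -> list[list[Layer]]:
    """Greedily group adjoining layers so each subset's summed
    footprint fits the working memory (claims 5, 7, and 8)."""
    subsets: list[list[Layer]] = []
    current: list[Layer] = []
    total = 0
    for layer in layers:
        fp = footprint(layer)
        # Claim 8 premise: every single layer already fits on its own.
        assert fp <= working_memory
        if current and total + fp > working_memory:
            subsets.append(current)
            current, total = [], 0
        current.append(layer)
        total += fp
    if current:
        subsets.append(current)
    return subsets
```

Because only adjacent layers are merged, the order of layers is preserved across subsets, which is what makes each subset a contiguous partition of the model.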
Regarding Claim 9, Paek-Banitalebi as described in Claim 4, Paek further discloses determining a count of MAC operations associated with each NN layer of a set of NN layers of the neural network model (Col. 6, Lines 33-43- specialized (e.g., dedicated) hardware has been developed that is optimized for performing particular operations from a given NN. A given electronic device may include a neural processor, which can be implemented as circuitry that performs various machine learning operations based on computations including multiplication, adding and accumulation. Such computations may be arranged to perform, for example, convolution of input data. A neural processor, in an example, is specifically configured to perform machine learning algorithms, typically by operating on predictive models such as NNs. Please note that the neural processor circuitry performing machine learning operations based on computations including multiplication, adding, and accumulating, by operating on NN predictive models corresponds to Applicant’s determining a count of MAC operations associated with each NN layer of a set of NN layers of the neural network model, because, as will be disclosed, the NN model consists of NN layers, and since it must follow processing resource constraints, it would be obvious that each layer of the set that performs MAC operations factors into the determination of meeting the constraints.); wherein a count of MAC operations associated with each subset is less than or equal to the processing capability of the first electronic device (Col. 2, Lines 26-31- when deploying a given deep neural network for execution on a target platform and/or target processor on the target platform, depending on the available hardware, resource constraints (e.g., memory and/or computing) can be encountered that may limit the execution of a given neural network.
Please note that the constraints of computing resource availability being considered when deploying layers in a group to one compute unit corresponds to Applicant’s wherein the count of MAC operations associated with each subset is less than or equal to the processing capability of the first electronic device, since each subset is deployed together to a compute unit based on constraints on processing, so each subset inherently has a processing requirement less than or equal to the processing capability of the target platform’s hardware.). Paek-Banitalebi does not explicitly disclose and grouping adjoining NN layers of the set of NN layers into a plurality of subsets of NN layers based on the determined count of MAC operations associated with each NN layer, and the partition of the neural network model is further determined based on the plurality of subsets of NN layers. However, Muthusamy discloses and grouping adjoining NN layers of the set of NN layers into a plurality of subsets of NN layers based on the determined count of MAC operations associated with each NN layer, and the partition of the neural network model is further determined based on the plurality of subsets of NN layers ([0029] a monolithic grouping strategy that has constraints on total size (e.g. Cloud Functions). In the monolithic grouping strategy, all model outputs are available every time when running the models and there is a high inferencing throughput. The dotted boxes in FIGS. 4-6 represent groups, which are the minimum units of deployment granularity. All layers in a group must be deployed together to one compute unit.
Please note that a grouping strategy to have layers in a group corresponds to Applicant’s grouping adjoining NN layers of the set of NN layers based on the determined count of MAC operations of each NN layer, since the group is deployed together to a compute unit based on constraints on total size, and the constraint on size could be based upon the maximum processing capacity for a deployed model as previously disclosed by Paek. Furthermore, as they are all deployed together when grouped, this corresponds to them being adjoining, and there is inherently a partition of the neural network model when the NN layers of the set of NN layers are grouped to separate them from other groups.). Paek-Banitalebi and Muthusamy are both considered to be analogous to the claimed invention because they are in the same field of managing resources for neural network models. Therefore, it would have been obvious to someone of ordinary skill in the art prior to the effective filing date of the claimed invention to have modified Paek-Banitalebi to incorporate the teachings of Muthusamy to modify the system as described in Claim 4 that determines a count of MAC operations associated with each NN layer where each count of operations associated with each subset is less than or equal to the processing capability of the device to group adjoining NN layers into subsets based on the determined MAC operations of each layer, allowing for improved computing resource management, as described in Muthusamy.
Regarding Claim 10, Paek-Banitalebi-Muthusamy as described in Claim 9, Banitalebi further discloses partitioning the neural network model based on the plurality of subsets of NN layers, wherein each subset of the plurality of subsets of NN layers corresponds to a sub-model of the plurality of sub-models, and the plurality of sub-models is extracted further based on the partitioning ([0010] identifying a first set of one or more neural network layers from the trained neural network for inclusion in the first neural network and a second set of one or more neural network layers from the trained neural network for inclusion in the second neural network. Please note that identifying corresponding sets of neural network layers for inclusion in the split neural network models corresponds to Applicant’s partitioning the neural network model based on the plurality of subsets of NN layers, wherein each subset of the plurality of subsets of NN layers corresponds to a sub-model of the plurality of sub-models, and the plurality of sub-models is extracted further based on the partitioning. This is because, as previously stated by Muthusamy, each subset of the plurality of subsets belongs to a model, which could include the previously mentioned sub-models of Banitalebi, and therefore the partitions between subsets would also allow for the extraction of sub-models that they correspond to.). Regarding Claim 11, Paek-Banitalebi-Muthusamy as described in Claim 9, Paek further discloses wherein the count of MAC operations associated with each subset is a sum of counts of MAC operations associated with adjoining NN layers of the set of NN layers that may be grouped into a corresponding subset of the plurality of subsets of NN layers (Col. 7, Lines 55-58- the memory allocations 344 correspond to code for allocating memory portions based on a determined size of each layer of the NN and/or based on an amount of memory available at the target device.
Please note that as the size of each layer of the NN may be determined according to constraints such as the previously mentioned compute resource constraints, a set of layers as grouped in the subset disclosed by Muthusamy would be able to have the count of MAC operations of each subset of the plurality of subsets of NN layers determined via the sum of counts of MAC operations of adjoining layers of the set that are grouped; the sum of MAC operations of the adjoining grouped layers of the subset would be obvious as a measure of the MAC operations of the subset.). Regarding Claim 12, Paek-Banitalebi-Muthusamy as described in Claim 9, Paek further discloses wherein the determined count of MAC operations associated with each NN layer of the set of NN layers of the neural network model is less than or equal to the processing capability of the first electronic device (Col. 2, Lines 26-31- when deploying a given deep neural network for execution on a target platform and/or target processor on the target platform, depending on the available hardware, resource constraints (e.g., memory and/or computing) can be encountered that may limit the execution of a given neural network. Please note that the constraints of computing resource availability being considered when deploying layers in a group to one compute unit corresponds to Applicant’s wherein the determined count of MAC operations associated with each NN layer of the set of NN layers of the neural network model is less than or equal to the processing capability of the first electronic device, since the group of NN layers is deployed together to a compute unit based on constraints on processing, so each layer of the subset inherently has a processing requirement less than or equal to the processing capability of the target platform’s hardware.).
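The claim 9-12 mapping applies the same grouping idea to compute rather than memory: sum the MAC counts of adjoining layers and keep each subset within the device's processing capability. A hedged sketch (hypothetical names and figures; the per-layer MAC formula shown is the standard one for a fully connected layer and is not taken from the cited references):

```python
def dense_macs(in_features: int, out_features: int) -> int:
    # A fully connected layer performs one multiply-accumulate
    # operation per weight per inference.
    return in_features * out_features

def group_by_macs(layer_macs: list[int], capability: int) -> list[list[int]]:
    """Group adjoining layers so each subset's summed MAC count stays
    within the device's processing capability (claims 9, 11, and 12)."""
    subsets: list[list[int]] = []
    current: list[int] = []
    total = 0
    for macs in layer_macs:
        # Claim 12 premise: each individual layer fits on its own.
        assert macs <= capability
        if current and total + macs > capability:
            subsets.append(current)
            current, total = [], 0
        current.append(macs)
        total += macs
    if current:
        subsets.append(current)
    return subsets
```

The structure mirrors the memory-footprint grouping: only the per-layer cost metric and the budget differ, which is why the examiner maps both limitations to the same grouping disclosure.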
Regarding Claim 13, Paek-Banitalebi as described in Claim 4, Banitalebi further discloses determining a size of a working memory of a second electronic device ([0041] (i) Edge device constraints 22: one or more parameters that define the computational abilities (e.g., memory size, CPU bit processing size) of the target edge device 88 that will be used to implement the edge DNN 30. Please note that the memory size of the target edge device corresponds to Applicant’s determining the size of a working memory of a second electronic device.); determining a network communication capability indicative of a transmission bandwidth of the second electronic device and a reception bandwidth of the second electronic device ([0010] optimize, within an accuracy constraint, an overall latency of: […] transmission of the feature map output from the first device to the second device, and execution of the second neural network on the second device to generate an inference output based on the feature map output from the first device.; [0041] (iv) Network constraints 28: one or more parameters that specify information about the communication network links that exist between the cloud device 86 and the edge device 88.
Please note that optimizing the overall latency of the second device and considering network constraints 28 about communication network links that exist between the cloud device 86 and the edge device 88 corresponds to Applicant’s determining a network communication capability indicative of a transmission bandwidth of the second electronic device and a reception bandwidth of the second electronic device, as it is known in the art that the specified parameters about the network link may include bandwidth for transmission and reception.); Paek-Banitalebi does not explicitly disclose and determining a subset of adjoining NN layers of a set of NN layers of the neural network model by grouping the adjoining NN layers based on at least one of the size of the working memory of the second electronic device, the network communication capability of the first electronic device, and the network communication capability of the second electronic device, wherein the determined subset of the adjoining NN layers is a sub-model of the plurality of sub-models. However, Muthusamy discloses and determining a subset of adjoining NN layers of a set of NN layers of the neural network model by grouping the adjoining NN layers based on at least one of the size of the working memory of the second electronic device, the network communication capability of the first electronic device, and the network communication capability of the second electronic device, wherein the determined subset of the adjoining NN layers is a sub-model of the plurality of sub-models ([0029] a monolithic grouping strategy that has constraints on total size (e.g. Cloud Functions). In the monolithic grouping strategy, all model outputs are available every time when running the models and there is a high inferencing throughput. The dotted boxes in FIGS. 4-6 represent groups, which are the minimum units of deployment granularity. All layers in a group must be deployed together to one compute unit. 
Please note that a grouping strategy to have layers in a group corresponds to Applicant’s grouping adjoining NN layers of the set of NN layers based on the size of the working memory of the second electronic device, as the group is deployed together to a compute unit based on constraints on total size, i.e., the size of the working memory of the second electronic device. Furthermore, as they are all deployed together when grouped, this corresponds to them being adjoining, and there is inherently a partition of the neural network model when the NN layers of the set of NN layers are grouped to separate them from other groups. As the claim states “at least one of” the determinations of subsets of adjoining NN layers, this is interpreted as fulfilling the requirements of the claim.). Paek-Banitalebi and Muthusamy are both considered to be analogous to the claimed invention because they are in the same field of managing resources for neural network models. Therefore, it would have been obvious to someone of ordinary skill in the art prior to the effective filing date of the claimed invention to have modified Paek-Banitalebi to incorporate the teachings of Muthusamy to modify the system as described in Claim 4 that determines a size of the working memory of the second electronic device and determines its transmission and reception bandwidth to group adjoining NN layers into subsets based on the size of the working memory of the second electronic device, allowing for improved memory management and resource usage, as described in Muthusamy. Regarding Claim 14, Paek-Banitalebi-Muthusamy as described in Claim 13, Paek further discloses wherein a memory footprint of the determined subset is a sum of memory footprints of the adjoining NN layers of the set of NN layers, and a memory footprint of the subset is less than or equal to the size of the working memory of the second electronic device (Col.
7, Lines 55-58- the memory allocations 344 correspond to code for allocating memory portions based on a determined size of each layer of the NN and/or based on an amount of memory available at the target device. Please note that as the size of each layer of the NN is determined and memory is allocated accordingly at the target device, a set of layers as grouped in the subset disclosed by Muthusamy would be able to have the memory footprint of each subset of the plurality of subsets of NN layers determined via the sum of footprints of adjoining layers of the set that are grouped; the sum of memory footprints of the adjoining grouped layers of the subset would be obvious as a measure of the memory footprint of the subset. Furthermore, as it is also based on the amount of memory available at the target device, this corresponds to the subset memory footprint being less than or equal to the size of the working memory of the second electronic device.). Regarding Claim 15, Paek-Banitalebi-Muthusamy as described in Claim 13, Muthusamy further discloses detecting personal or sensitive data in the input ([0073] Security provides identity verification for cloud consumers and tasks, as well as protection for data and other resources. Please note that the security providing protection for data throughout the system corresponds to Applicant’s detecting personal or sensitive data in the input.); Banitalebi further discloses and determining, based on the detection, one or more NN layers of the set of NN layers that receive the input, wherein the adjoining NN layers in the determined subset are subsequent to each of the determined one or more NN layers of the set of NN layers ([0041] divide the trained DNN 11 into edge DNN 30 and cloud DNN 40 based on a set of constraints 20 that are received by the splitting module 10 as inputs. 
These constraints may include, for example: […] (ii) Cloud device constraints 24: one or more parameters that define the computational abilities of the target cloud device 86 that will be used to implement the cloud DNN 40. Please note that, since the splitting may occur based on cloud device constraints 24 of parameters of the target cloud device, such as the protected data of the cloud disclosed by Muthusamy that may be processed as input by the system, this corresponds to determining NN layers of the set of NN layers that receive the input based on the detection, as in dividing each layer of the neural network based on the constraints the system would necessarily be able to identify which receive protected data as input to satisfy the constraint. Additionally, since the layers are adjoining within the previously obtained subsets of Muthusamy, this corresponds to the adjoining NN layers in the determined subset being subsequent to each of the determined NN layers of the set of NN layers.).
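The limitations of claims 13, 14, and 16 reduce to two feasibility checks before offloading work to a second device: the sub-model's footprint must fit that device's working memory, and the bandwidth the transfer requires must not exceed either the sender's transmission bandwidth or the receiver's reception bandwidth. A minimal sketch (all names, and the deadline-based bandwidth model, are illustrative assumptions, not from the record):

```python
def can_offload(submodel_bytes: int, intermediate_bytes: int,
                submodel_footprint: int, second_working_memory: int,
                first_tx_bw: float, second_rx_bw: float,
                deadline_s: float) -> bool:
    """Feasibility check in the style of claims 13/14/16 before sending a
    sub-model and its intermediate result to a second device."""
    # Claim 14 style check: the sub-model's footprint must fit within
    # the second device's working memory.
    if submodel_footprint > second_working_memory:
        return False
    # Claim 16 style check: the bandwidth the transfer requires must not
    # exceed the sender's transmission bandwidth or the receiver's
    # reception bandwidth (here modeled as payload size over a deadline).
    required_bw = (submodel_bytes + intermediate_bytes) / deadline_s
    return required_bw <= first_tx_bw and required_bw <= second_rx_bw
```

A scheduler would call such a predicate per candidate partition point and fall back to local execution when it returns False.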
Regarding Claim 16, Paek-Banitalebi-Muthusamy as described in Claim 13, Banitalebi further discloses transmitting the extracted sub-model and the intermediate result to the second electronic device, wherein a bandwidth required for the transmission is less than or equal to the transmission bandwidth of the first electronic device, and a bandwidth required for a reception of the extracted sub-model and the intermediate result, by the second electronic device, is less than or equal to the reception bandwidth of the second electronic device ([0010] optimize, within an accuracy constraint, an overall latency of: […] transmission of the feature map output from the first device to the second device, and execution of the second neural network on the second device to generate an inference output based on the feature map output from the first device.; [0041] (iv) Network constraints 28: one or more parameters that specify information about the communication network links that exist between the cloud device 86 and the edge device 88. Please note that transmitting the feature map output from the first device to the second device corresponds to Applicant’s transmitting the extracted sub-model and the intermediate result to the second electronic device. Furthermore, it is obvious to one of ordinary skill in the art that optimizing the overall latency of this process between the devices corresponds to Applicant’s bandwidth required for the transmission being less than or equal to the transmission bandwidth of the first electronic device, and a bandwidth required for a reception of the extracted sub-model and the intermediate result, by the second electronic device, being less than or equal to the reception bandwidth of the second electronic device.
This is because an “optimized latency” would inherently have the bandwidth required for the transmission being less than or equal to the transmission bandwidth of the first device, i.e., up to its maximum, and the bandwidth required for reception of the feature map output corresponding to the extracted sub-model and intermediate result by the second device to be less than or equal to the reception bandwidth of the second electronic device, i.e., up to its respective maximum.). Regarding Claim 17, Paek-Banitalebi-Muthusamy as described in Claim 13, Banitalebi further discloses controlling the second electronic device to execute a second set of operations for the received sub-model, wherein the second set of operations comprises: a fifth operation to load the sub-model in a working memory of the second electronic device ([0041] a cloud DNN 40 that is configured for deployment on a target cloud device 86. Please note that a cloud DNN 40 being configured for deployment on a target cloud device 86 corresponds to Applicant’s controlling the second electronic device to execute a second set of operations for the received sub-model, wherein the second set of operations comprises: a fifth operation to load the sub-model in a working memory of the second electronic device. This is because an inherent aspect of deployment of a neural network model is being loaded into memory.); a sixth operation to generate a result by an application of the sub-model on the output ([0088] execution of the second neural network on the second device to generate an inference output. Please note that generating an inference output on the second device by a second neural network corresponds to Applicant’s sixth operation to generate a result by an application of the sub-model on the output.); Paek further discloses and an eighth operation to render the result on a second display device (Col.
17, Lines 17-20- The output device interface 806 may enable, for example, the display of images generated by electronic system 800. Output devices that may be used with the output device interface 806. Please note that displaying images generated by electronic system 800 on the output device interface 806, where there may possibly be multiple output devices used with the interface, corresponds to Applicant’s eighth operation to render the result on a second display device, i.e., on a device distinct from the first.). A seventh operation to unload the sub-model from the working memory of the second electronic device (Col. 3, Lines 4-6- deallocation techniques, which are often performed during running of the neural network model. Please note that deallocation performed during running of the neural network model corresponds to Applicant’s seventh operation to unload the sub-model from the working memory of the second electronic device.); Regarding Claim 18, Paek-Banitalebi-Muthusamy as described in Claim 17, Banitalebi further discloses the second set of operations further comprises a ninth operation to transmit the result to the first electronic device, a bandwidth required for the transmission is less than or equal to the transmission bandwidth of the second electronic device, and a bandwidth required for a reception of the result, by the first electronic device, is less than or equal to the reception bandwidth of the first electronic device ([0010] optimize, within an accuracy constraint, an overall latency of: […] transmission of the feature map output from the first device to the second device, and execution of the second neural network on the second device to generate an inference output based on the feature map output from the first device.; [0041] (iv) Network constraints 28: one or more parameters that specify information about the communication network links that exist between the cloud device 86 and the edge device 88.
Please note that execution of the second neural network on the second device to generate an inference output based on the feature map output from the first device corresponds to Applicant’s transmitting the result to the first electronic device. Furthermore, it is obvious to one of ordinary skill in the art that optimizing the overall latency of this process between the devices corresponds to Applicant’s bandwidth required for the transmission being less than or equal to the transmission bandwidth of the second electronic device, and a bandwidth required for a reception of the result, by the first electronic device, being less than or equal to the reception bandwidth of the first electronic device. This is because an “optimized latency” would inherently have the bandwidth required for the transmission being less than or equal to the transmission bandwidth of the second device, i.e., up to its maximum, and the bandwidth required for reception of the inference output corresponding to the result by the first device to be less than or equal to the reception bandwidth of the first electronic device, i.e., up to its respective maximum.).
Conclusion
The prior art made of record and not relied upon is considered pertinent to applicant's disclosure. Seok et al. (US 20240169201 A1) discloses performing ML tasks on resource-constrained devices, a pipeline in which results are stored before feeding them to the next stage, storing weight data and intermediate input and output data, a maximum throughput, and multiply and accumulate operations (see [0006, 0010-0011, 0014, 0039, 0055, 0060-0061]). Kierat et al. (US 20240119267 A1) discloses storage to store forward and output weight and I/O data for a neural network and for each layer, loading the neural network into processors, an intermediate representation of a model, transmitting neural networks, and storing intermediate data in buffers (see [00179-0182, 0198, 0313, 0433]).
Any inquiry concerning this communication or earlier communications from the examiner should be directed to FARAZ T AKBARI, whose telephone number is (571) 272-4166. The examiner can normally be reached Monday-Thursday, 9:30am-7:30pm ET.

Examiner interviews are available via telephone, in person, and video conferencing using a USPTO-supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.

If attempts to reach the examiner by telephone are unsuccessful, the examiner's supervisor, April Blair, can be reached at (571) 270-1014. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.

Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.

/FARAZ T AKBARI/
Examiner, Art Unit 2196

/APRIL Y BLAIR/
Supervisory Patent Examiner, Art Unit 2196

Prosecution Timeline

Jan 27, 2023
Application Filed
Jan 15, 2026
Non-Final Rejection — §103 (current)


Prosecution Projections

1-2
Expected OA Rounds
0%
Grant Probability
0%
With Interview (+0.0%)
3y 3m
Median Time to Grant
Low
PTA Risk
Based on 2 resolved cases by this examiner. Grant probability derived from career allow rate.
