Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Continued Examination Under 37 CFR 1.114
A request for continued examination under 37 CFR 1.114, including the fee set forth in 37 CFR 1.17(e), was filed in this application after final rejection. Since this application is eligible for continued examination under 37 CFR 1.114, and the fee set forth in 37 CFR 1.17(e) has been timely paid, the finality of the previous Office action has been withdrawn pursuant to 37 CFR 1.114. Applicant's submission filed on 12 November 2025 has been entered.
Claim Rejections - 35 USC § 103
The text of those sections of Title 35, U.S. Code not included in this action can be found in a prior Office action.
Claim(s) 1-5, 7-8, 10-17, and 19 is/are rejected under 35 U.S.C. 103 as being unpatentable over Ponomarev et al. (NPL from IDS: Latency Estimation Tool and Investigation of Neural Networks Inference on Mobile GPU, published Aug. 2021, hereinafter "Ponomarev") in view of Tan et al. (NPL: Efficient Execution of Deep Neural Networks on Mobile Devices with NPU, published May 2021, hereinafter “Tan”), further in view of Hermans et al. (NPL: Training and Analyzing Deep Recurrent Neural Networks, published Dec. 2013, hereinafter "Hermans").
Regarding claim 1, Ponomarev teaches a method to estimate a latency of a layer of a
neural network executed on a neural processing unit (NPU) including an on-chip memory (Ponomarev, Pg. 4 Bullets 4-5), the method comprising:
measuring, by the host processing device, a total latency for the inference operation for the selected layer and [[an]] the auxiliary layer based on hardware-level timing signals (Ponomarev, Page 3 Paragraph 2— “We study inference TensorFlow Lite models on mobile GPU and propose an open-source LETI tool allowing to reconstruct models from parametrization and model latency”, Page 4 Bullet 5 — “we evaluate latency of TFLite models on CPU, GPU, or NPU of Android devices”, and in Page 8 Section 4.1 Paragraph 2 — “After that, we fill all the required layers and compute total latency as the sum of corresponding blocks. The speed of single layer is measured directly with TensorFlow-Benchmark.” — teaches measuring, by the host processing device (on mobile devices), a total latency for the inference operation for the selected layer and auxiliary layer (computes total latency as the sum of corresponding blocks, where the corresponding blocks may be the corresponding layers) based on hardware-level timing signals (latency is evaluated on CPU, GPU, or NPU of devices, and thus measurement of latency is based on hardware-level timing));
storing the estimated latency in a hardware-specific latency lookup table (LUT) used by the host processing device to determine end-to-end latency and to deploy neural network architectures on the NPU (Ponomarev, Page 4 Bullet 4 – “we evaluate the latency of TFLite models on the CPU, GPU or NPU of the Android devices”, Page 8 Section 4.1 Paragraph 1 – “We use a single layer as a block. After that, we initialize inputs for each block with the correct input tensor (for the first one, it is image shape 224 x 224 x 3; for the second one, it is the shape of the first block’s output). After that, we convert these blocks as standalone TFLite models and deploy them on the device for evaluation. We evaluate each block’s inference time within 300 runs and put the value into the lookup table”, Fig. 2, and in Page 8 Table 2 – teaches storing the estimated latency in a hardware-specific latency lookup table (evaluates each block’s inference time and puts value into LUT) used by the host processing device (latencies for mobile CPU, as stated in Table 2 description) to determine end-to-end latency (Table 2 description describes latency as a sum of block’s latencies, thus determines end-to-end latency) and to deploy neural network architectures on the NPU (deploys models on NPU of devices)).
Ponomarev fails to explicitly teach measuring, by the host processing device, an overhead latency for the inference operation, the overhead latency including data transfer latency between a dynamic random access memory (DRAM) of the host processing device and a static random access memory (SRAM) of the NPU; and subtracting, by the host processing device, the overhead latency from the total latency to generate an estimate of the latency of the layer.
However, analogous to the field of the claimed invention, Tan teaches:
executing, by the NPU, an inference operation over the selected layer and [[an]] the auxiliary layer to obtain data associated with a model (Tan, Section 4.2.2 Paragraph 1 – “Since NPU has its own memory space, all data must be moved from the main memory to NPU before the model can be executed on NPU.” and in Section 4.2.2 Paragraph 5 – “The layer processing time for the l-th layer can be computed as t_p^l = (T_last^l − t_d^l) − (T_last^{l+1} − t_d^{l+1}).” – teaches executing, by the NPU (model executed on NPU), an inference operation over the selected layer and an auxiliary layer to obtain data associated with a model (performs an inference operation over a layer and an auxiliary layer to obtain data, such as processing time information, associated with a model));
measuring, by the host processing device, an overhead latency for the inference operation, the overhead latency including data transfer latency between a dynamic random access memory (DRAM) of the host processing device and a static random access memory (SRAM) of the NPU (Tan, Section 4.2.2 Paragraph 1 – “Since NPU has its own memory space, all data must be moved from the main memory to NPU before the model can be executed on NPU. Since the NPU processing time is short and a large amount of data is transmitted between the main memory and NPU, the data transmission time and the layer processing time may be at similar level. To better estimate the processing time, we cannot ignore this data transmission time, especially when many layers are run on NPU. However, the current SDK for NPU does not include tools for measuring the data transmission time or the layer processing time of different layers. They can only measure the processing time of running the whole DNN model. To address this problem, we propose a method to compute the layer processing time and the data transmission time, and use them to model the data processing time of running the DNN model with a layer combination” – teaches measuring, by the host processing device (NPU can only measure processing time of whole model, thus host processing device computes the layer processing time), an overhead latency for the inference operation (running the DNN model) wherein the overhead latency includes data transfer latency (data transmission time) between a dynamic random access memory of a host processing device and a static random access memory of the NPU (NPU has its own memory space, main memory moves data to NPU, computes data transmission time)); [[and]]
subtracting, by the host processing device, the overhead latency from the total latency to generate an estimate of the latency of the layer (Tan, Section 4.2.2, Paragraph 3 – “let t_d^l denote the data transmission time of moving the input data of the l-th layer from the main memory to NPU”, Section 4.2.2, Paragraph 4 – “let T_last^l denote the processing time from the l-th layer to the last layer of the DNN model and it can be computed as T_last^l = t_d^l + Σ_{i=l}^{n} t_p^i + t_r^n.” and in Section 4.2.2 Paragraph 5 – “The layer processing time for the l-th layer can be computed as t_p^l = (T_last^l − t_d^l) − (T_last^{l+1} − t_d^{l+1}).” – teaches subtracting, by the host processing device, the overhead latency (data transmission time) from the total latency (T_last^l and T_last^{l+1}), to generate an estimate of the latency of the layer (determines latency of the l-th layer))[[.]]; and
Therefore, it would have been obvious to a person of ordinary skill in the art, before the effective filing date of the claimed invention, to combine the overhead and layer latency measurements of Tan with the total latency measurement and look-up tables of Ponomarev to estimate a true latency for an individual layer of a neural network based on another layer. Doing so would better estimate the processing time of layer by accounting for the data transmission time or layer processing time of different layers (Tan, Section 4.2.2).
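The subtraction scheme credited to Tan in the mapping above can be sketched in a few lines. This is a hypothetical illustration only, not code from Tan; the list-based data layout and function name are assumptions made for clarity.

```python
# Hypothetical sketch of Tan's per-layer recovery:
#   t_p^l = (T_last^l - t_d^l) - (T_last^{l+1} - t_d^{l+1})
# T_last[l] is the measured time from layer l to the end of the model;
# t_d[l] is the transfer time of moving layer l's input to the NPU memory.

def layer_processing_time(T_last, t_d, l):
    # Subtract each measurement's own transfer overhead, then difference
    # the two "suffix" timings to isolate layer l's processing time.
    return (T_last[l] - t_d[l]) - (T_last[l + 1] - t_d[l + 1])
```

With T_last = [10.0, 7.0, 3.0] and t_d = [1.0, 0.5, 0.2], the first layer's processing time is (10.0 − 1.0) − (7.0 − 0.5) = 2.5.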
The combination of Ponomarev and Tan fails to explicitly teach adding, by a host processing device, an auxiliary layer to a selected layer of the neural network. However, analogous to the field of the claimed invention, Hermans teaches:
adding, by a host processing device, an auxiliary layer to a selected layer of the neural network (Hermans, Section 2.3, Paragraph 2 — “We add layers one by one and at all times an output layer only exists at the current top layer.” — teaches adding, by a host processing device (as in Section 3), an auxiliary layer to a selected layer of a neural network (adds layers one by one and at all times an output layer exists at the top layer));
Therefore, it would have been obvious to a person of ordinary skill in the art, before the effective filing date of the claimed invention, to combine the supporting layer of Hermans with the consecutive layer latency measurements of Ponomarev and Tan in order to measure the latency of an auxiliary layer and a selected layer. Doing so would enable measuring performance at each layer (Hermans, Section 3.2).
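The sum-of-blocks lookup-table approach attributed to Ponomarev above can be illustrated with a short sketch. This is a hypothetical illustration, not code from the LETI tool; the `measure` callback, the use of a median over repeated runs, and the dict-based table are all assumptions.

```python
# Hypothetical sketch of a per-layer latency LUT (not Ponomarev's code):
# each layer is benchmarked as a standalone block over many runs, the
# value is stored in a lookup table, and end-to-end latency of a
# candidate architecture is predicted as the sum of its blocks.
from statistics import median

def build_lut(measure, layers, runs=300):
    # measure(layer) -> one timed run of that block; keep the median.
    return {layer: median(measure(layer) for _ in range(runs)) for layer in layers}

def end_to_end_latency(lut, architecture):
    # Predicted network latency = sum of the blocks' tabulated latencies.
    return sum(lut[layer] for layer in architecture)
```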
Claim 13 incorporates substantively all the limitations of claim 1 in a system and thus is rejected on the same grounds as above. Claim 13 introduces a further limitation regarding “store the estimated latency in a latency lookup table (LUT) used by the host computing device to predict network-level latency and to adjust neural-network deployment parameters on the neural processing circuit”. Ponomarev, at Page 3 Paragraph 2 – “We assume that our tool is potentially useful for NAS research because it can create all possible models from the desired parametrization and evaluate their TFLite versions on the device. To set up the desired search space, the researcher has to setup parametrization.”, Fig. 2, and Page 8 Table 2 – teaches storing the estimated latency in a latency lookup table used by the host computing device to predict network-level latency (estimated latency stored in LUT) and to adjust neural-network deployment parameters on the neural processing circuit (as in Fig. 2, tool creates all possible models from desired parametrization and deploys models on NPU of devices).
Regarding claim 8, Ponomarev teaches a method to estimate a latency of a layer of a neural network executed on a neural processing unit (NPU) (Ponomarev, Pg. 4, Bullets 4-5), the method comprising:
measuring, by the host processing device, a total latency for the inference operation for the selected layer and [[an]] the auxiliary layer based on hardware-level timing signals (Ponomarev, Page 3 Paragraph 2— “We study inference TensorFlow Lite models on mobile GPU and propose an open-source LETI tool allowing to reconstruct models from parametrization and model latency”, Page 4 Bullet 5 — “we evaluate latency of TFLite models on CPU, GPU, or NPU of Android devices”, and in Page 8 Section 4.1 Paragraph 2 — “After that, we fill all the required layers and compute total latency as the sum of corresponding blocks. The speed of single layer is measured directly with TensorFlow-Benchmark.” — teaches measuring, by the host processing device (on mobile devices), a total latency for the inference operation for the selected layer and auxiliary layer (computes total latency as the sum of corresponding blocks, where the corresponding blocks may be the corresponding layers) based on hardware-level timing signals (latency is evaluated on CPU, GPU, or NPU of devices, and thus measurement of latency is based on hardware-level timing));
modeling, by the host processing device, a latency based on a linear regression model of an input size that is input to the selected layer, and an output size that is output from the auxiliary layer, the model being fitted using measured NPU timing data (Ponomarev, Fig. 7 and Page 9 Section 4.2 Paragraph 1 – “The baseline is just to use FLOPs as a latency proxy. There are several ways how to fit it, for example, optimize least square error (linear regression).” & Page 8 Section 4.1 Paragraph 2 — “we initialize inputs for each block with correct input tensor (for the first one, it is image shape: 224x224x3, for the second it is the shape of the first block’s output)” – teaches modeling, by the host processing device, a latency for the inference operation that is associated with the layers by modeling the latency based on a linear regression of an input data size that is input to the selected layer and an output data size that is output from the auxiliary layer (linear regression model based on FLOPs, thus latency is based on complexity of selected layers), the model fitted using measured NPU timing data (as in Fig. 7, model is fitted using predicted and measured timing data on devices)); [[and]]
updating a latency lookup table (LUT) stored in memory to include the estimated latency, the LUT being used by a neural architecture search engine to configure NPU execution scheduling with reduced root-mean-square error between the estimated and the measured latencies (Ponomarev, Page 4 Bullet 4 – “we evaluate the latency of TFLite models on the CPU, GPU or NPU of the Android devices”, Page 8 Section 4.1 Paragraph 2 – “We use a single layer as a block. After that, we initialize inputs for each block with the correct input tensor (for the first one, it is image shape 224 x 224 x 3; for the second one, it is the shape of the first block’s output). After that, we convert these blocks as standalone TFLite models and deploy them on the device for evaluation. We evaluate each block’s inference time within 300 runs and put the value into the lookup table”, Fig. 2, and in Page 8 Table 2 – teaches updating a latency lookup table to include the estimated latency (evaluates each block’s inference time and puts value into LUT), the LUT being used by a neural architecture search engine to configure NPU execution scheduling (Fig. 2 shows the LUT being used by a NAS to configure execution scheduling) with reduced root-mean-square error between the estimated and the measured latencies (Table 2 shows the reduced error and relative error between the estimated and measured latencies)).
Ponomarev fails to explicitly teach measuring, by the host processing device, an overhead latency for the inference operation; and subtracting, by the host processing device, the overhead latency from the total latency to generate an estimate of the latency of the layer.
However, analogous to the field of the claimed invention, Tan teaches:
executing, by the NPU, an inference operation over the selected layer and [[an]] the auxiliary layer (Tan, Section 4.2.2 Paragraph 1 – “Since NPU has its own memory space, all data must be moved from the main memory to NPU before the model can be executed on NPU.” and in Section 4.2.2 Paragraph 5 – “The layer processing time for the l-th layer can be computed as t_p^l = (T_last^l − t_d^l) − (T_last^{l+1} − t_d^{l+1}).” – teaches executing, by the NPU (model executed on NPU), an inference operation over the selected layer and an auxiliary layer to obtain data associated with a model (performs an inference operation over a layer and an auxiliary layer to obtain data, such as processing time information, associated with a model));
measuring, by the host processing device, an overhead latency (Tan, Section 4.2.2 Paragraph 1 – “Since NPU has its own memory space, all data must be moved from the main memory to NPU before the model can be executed on NPU. Since the NPU processing time is short and a large amount of data is transmitted between the main memory and NPU, the data transmission time and the layer processing time may be at similar level. To better estimate the processing time, we cannot ignore this data transmission time, especially when many layers are run on NPU. However, the current SDK for NPU does not include tools for measuring the data transmission time or the layer processing time of different layers. They can only measure the processing time of running the whole DNN model. To address this problem, we propose a method to compute the layer processing time and the data transmission time, and use them to model the data processing time of running the DNN model with a layer combination” – teaches measuring, by the host processing device (NPU can only measure processing time of whole model, thus host processing device computes the layer processing time), an overhead latency for the inference operation (running the DNN model) wherein the overhead latency includes data transfer latency (data transmission time) between a dynamic random access memory of a host processing device and a static random access memory of the NPU (NPU has its own memory space, main memory moves data to NPU, computes data transmission time)); [[and]]
subtracting, by the host processing device, the overhead latency from the total latency to generate an estimate of the latency of the layer (Tan, Section 4.2.2, Paragraph 3 – “let t_d^l denote the data transmission time of moving the input data of the l-th layer from the main memory to NPU”, Section 4.2.2, Paragraph 4 – “let T_last^l denote the processing time from the l-th layer to the last layer of the DNN model and it can be computed as T_last^l = t_d^l + Σ_{i=l}^{n} t_p^i + t_r^n.” and in Section 4.2.2 Paragraph 5 – “The layer processing time for the l-th layer can be computed as t_p^l = (T_last^l − t_d^l) − (T_last^{l+1} − t_d^{l+1}).” – teaches subtracting, by the host processing device, the overhead latency (data transmission time) from the total latency (T_last^l and T_last^{l+1}), to generate an estimate of the latency of the layer (determines latency of the l-th layer))[[.]]; and
Therefore, it would have been obvious to a person of ordinary skill in the art, before the effective filing date of the claimed invention, to combine the overhead and layer latency measurements of Tan with the total latency measurement, regression models, and look-up tables of Ponomarev to model the overhead latency and update the LUT for more accurate latency calculation. Doing so would better estimate the processing time of layer by accounting for the data transmission time or layer processing time of different layers (Tan, Section 4.2.2).
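The linear-regression latency model discussed for claim 8 can be sketched as an ordinary-least-squares fit over measured timing data. The feature choice (input size, output size, plus an intercept) follows the claim language; the function names and the use of NumPy are assumptions for illustration, not the applicant's or Ponomarev's implementation.

```python
# Hypothetical sketch: fit latency ~ a*in_size + b*out_size + c by
# ordinary least squares over measured (size, latency) pairs, then use
# the fitted coefficients to estimate latency for unseen layer shapes.
import numpy as np

def fit_latency_model(in_sizes, out_sizes, latencies):
    # Design matrix: [input size, output size, 1 (intercept)].
    X = np.column_stack([in_sizes, out_sizes, np.ones(len(in_sizes))])
    coeffs, *_ = np.linalg.lstsq(X, np.asarray(latencies, dtype=float), rcond=None)
    return coeffs  # (a, b, c)

def predict_latency(coeffs, in_size, out_size):
    a, b, c = coeffs
    return a * in_size + b * out_size + c
```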
The combination of Ponomarev and Tan fails to explicitly teach adding, by a host processing device, an auxiliary layer to a selected layer of the neural network. However, analogous to the field of the claimed invention, Hermans teaches:
adding, by a host processing device, an auxiliary layer to a selected layer of the neural network (Hermans, Section 2.3, Paragraph 2 — “We add layers one by one and at all times an output layer only exists at the current top layer.” — teaches adding, by a host processing device (as in Section 3), an auxiliary layer to a selected layer of a neural network (adds layers one by one and at all times an output layer exists at the top layer));
Therefore, it would have been obvious to a person of ordinary skill in the art, before the effective filing date of the claimed invention, to combine the supporting layer of Hermans with the consecutive layer latency measurements of Ponomarev and Tan in order to measure the latency of an auxiliary layer and a selected layer. Doing so would enable measuring performance at each layer (Hermans, Section 3.2).
Regarding claim 2, the combination of Ponomarev, Tan, and Hermans teaches the method of claim 1, wherein the auxiliary layer comprises an averaging pooling layer, a convolutional Conv1x1 layer, or a convolutional Conv3x3 layer (Ponomarev, Page 4 Bullet 1 — “Firstly, we generate the set of parametrized architectures as in NAS-Benchmark. We verify uniqueness by the same hashing procedure as in Reference [10]. Thus, it is additional proof of the same parametrization and set of models. For NAS-Benchmark configuration at this step, we obtain 423,624 parametrized models/graphs with: max 7 vertices, max 9 edges with 3 possible layer values (except input and output): [‘conv3x3-bn-relu’, ‘conv1x1-bn-relu’, ‘maxpool3x3’].” — teaches wherein the auxiliary layer comprises an average pooling layer, a convolutional Conv 1x1 layer, or a Conv 3x3 layer (three possible layer types, thus the added layer would be one of a pooling layer, Conv 1x1 layer, or a Conv 3x3 layer)).
Claims 10 and 14 are similar to claim 2, hence similarly rejected.
Regarding claim 3, the combination of Ponomarev, Tan, and Hermans teaches the method of claim 1,
wherein the neural processing unit comprises a first memory, wherein the host processing device is coupled to the neural processing unit and the host processing device comprises a second memory (Tan, Section 4.2.2 Paragraph 1 – “Since NPU has its own memory space, all data must be moved from the main memory to NPU before the model can be executed on NPU. Since the NPU processing time is short and a large amount of data is transmitted between the main memory and NPU, the data transmission time and the layer processing time may be at similar level.” – teaches wherein the neural processing unit comprises a first memory (NPU has its own memory space), wherein the host processing device is coupled to the neural processing unit (data moves from main memory to NPU, thus host processing device is coupled to NPU) and the host processing device comprises a second memory (main memory is memory of host processing device)), and
wherein the overhead latency for the inference operation includes data processing by the host processing device and data transportation between the first memory of the neural processing unit and the second memory of the host processing device to execute the inference operation on the selected layer and the auxiliary layer of the neural network (Tan, Section 4.2.2 Paragraph 1 – “Since NPU has its own memory space, all data must be moved from the main memory to NPU before the model can be executed on NPU. Since the NPU processing time is short and a large amount of data is transmitted between the main memory and NPU, the data transmission time and the layer processing time may be at similar level. To better estimate the processing time, we cannot ignore this data transmission time, especially when many layers are run on NPU. However, the current SDK for NPU does not include tools for measuring the data transmission time or the layer processing time of different layers. They can only measure the processing time of running the whole DNN model. To address this problem, we propose a method to compute the layer processing time and the data transmission time, and use them to model the data processing time of running the DNN model with a layer combination”, Section 4.2.2 Paragraph 3 – “let t_d^l denote the data transmission time of moving the input data of the l-th layer from the main memory to NPU”, and in Section 4.2.2 Paragraph 5 – “The layer processing time for the l-th layer can be computed as t_p^l = (T_last^l − t_d^l) − (T_last^{l+1} − t_d^{l+1}).” – teaches wherein the overhead latency for the inference operation includes data processing by the host processing device (layer processing time) and data transportation between the first memory of the neural processing unit and the second memory of the host processing device to execute the inference operation on the selected layer and auxiliary layer of the neural network (computes data transmission time, which is the time to transmit data from main memory to NPU memory, of the inference operation over the consecutive layers t_d^l and t_d^{l+1})).
Therefore, it would have been obvious to a person of ordinary skill in the art, before the effective filing date of the claimed invention, to combine the first and second memories, overhead measurements, and layer latency measurements of Tan to further modify the method of Ponomarev, Tan, and Hermans to account for the latency of data transportation during the inference operation between the two layers. Doing so would better estimate the processing time of layer by accounting for the data transmission time or layer processing time of different layers (Tan, Section 4.2.2).
Claims 11 and 15 are similar to claim 3, hence similarly rejected.
Regarding claim 4, the combination of Ponomarev, Tan, and Hermans teaches the method of claim 1, wherein the method further comprises repeating a predetermined number of times executing the inference operation over the selected layer and the auxiliary layer (Ponomarev, Page 4 Bullets 4-5 — “we evaluate latency of TFLite models on CPU, GPU, or NPU of Android devices — with n >= 300 runs” — here, Ponomarev evaluates the total latency, layer by layer, repeated 300 times & in Page 8 Section 4.1 Paragraph 2 – “We use a single layer as a block. After that, we initialize inputs for each block with the correct input tensor (for the first one, it is image shape 224 x 224 x 3; for the second one, it is the shape of the first block’s output). After that, we convert these blocks as standalone TFLite models and deploy them on the device for evaluation. We evaluate each block’s inference time within 300 runs and put the value into the lookup table” – teaches executing an inference operation over selected layers), and measuring the total latency for the inference operation for the selected layer and the auxiliary layer (Ponomarev, Page 8 Table 2 — “Latency, measured by direct method and as a sum of block’s latencies for mobile CPU” — shows that the latency is measured as a sum of each layer’s latency).
Ponomarev fails to explicitly teach measuring the overhead latency for the inference operation that is associated with the auxiliary layer.
However, analogous to the field of the claimed invention, Tan teaches:
measuring the overhead latency for the inference operation that is associated with the auxiliary layer (Tan, Section 4.2.2 Paragraph 1 – “Since NPU has its own memory space, all data must be moved from the main memory to NPU before the model can be executed on NPU. Since the NPU processing time is short and a large amount of data is transmitted between the main memory and NPU, the data transmission time and the layer processing time may be at similar level. To better estimate the processing time, we cannot ignore this data transmission time, especially when many layers are run on NPU. However, the current SDK for NPU does not include tools for measuring the data transmission time or the layer processing time of different layers. They can only measure the processing time of running the whole DNN model. To address this problem, we propose a method to compute the layer processing time and the data transmission time, and use them to model the data processing time of running the DNN model with a layer combination”, Section 4.2.2 Paragraph 3 – “let t_d^l denote the data transmission time of moving the input data of the l-th layer from the main memory to NPU”, and in Section 4.2.2 Paragraph 5 – “The layer processing time for the l-th layer can be computed as t_p^l = (T_last^l − t_d^l) − (T_last^{l+1} − t_d^{l+1}).” – teaches measuring an overhead latency for the inference operation associated with the consecutive layer (t_d^{l+1})).
Therefore, it would have been obvious to a person of ordinary skill in the art, before the effective filing date of the claimed invention, to incorporate the measuring of overhead latency for inference operations associated with layers of Tan to further modify the method of Ponomarev, Tan, and Hermans in order to measure the overhead latency for an inference operation associated with the auxiliary layer of the model. Doing so would better estimate the processing time of layer by accounting for the data transmission time or layer processing time of different layers (Tan, Section 4.2.2).
Claims 12 and 16 are similar to claim 4, hence similarly rejected.
Regarding claim 5, the combination of Ponomarev, Tan, and Hermans teaches the method of claim 1, wherein measuring the overhead latency for the inference operation that is associated with the auxiliary layer further comprises modeling the overhead latency based on a linear regression of an input data size that is input to the selected layer, and an output data size that is output from the auxiliary layer (Ponomarev, Fig. 7 and Page 9 Section 4.2 Paragraph 1 – “The baseline is just to use FLOPs as a latency proxy. There are several ways how to fit it, for example, optimize least square error (linear regression).” & Page 8 Section 4.1 Paragraph 2 — “we initialize inputs for each block with correct input tensor (for the first one, it is image shape: 224x224x3, for the second it is the shape of the first block’s output)” – teaches measuring the latency for the inference operation that is associated with the layers by modeling the latency based on a linear regression of an input data size that is input to the selected layer and an output data size that is output from the auxiliary layer (linear regression model based on FLOPs, thus latency is based on complexity of selected layers)).
Therefore, it would have been obvious to a person of ordinary skill in the art, before the effective filing date of the claimed invention, to incorporate the linear regression for modeling latency of Ponomarev to further modify the overhead latency measurement of Ponomarev, Tan, and Hermans in order to model the overhead latency. Doing so would provide greater precision in the prediction of latency at a GPU or NPU (Ponomarev, Section 4.1).
Claim 17 is similar to claim 5, hence similarly rejected.
Regarding claim 7, the combination of Ponomarev, Tan, and Hermans teaches the method of claim 1, further comprising generating a lookup table containing an estimated latency for at least one layer of the neural network (Ponomarev, Page 8 Section 4.1 Paragraph 2 – “We use a single layer as a block. After that, we initialize inputs for each block with the correct input tensor (for the first one, it is image shape 224 x 224 x 3; for the second one, it is the shape of the first block’s output). After that, we convert these blocks as standalone TFLite models and deploy them on the device for evaluation. We evaluate each block’s inference time within 300 runs and put the value into the lookup table” and in Page 8 Table 2 – teaches generating a lookup table containing an estimated latency for at least one layer of the neural network (each block’s, or layer’s, inference time is put into lookup table)).
Claim 19 is similar to claim 7, hence similarly rejected.
Claim(s) 6, 9, and 18 is/are rejected under 35 U.S.C. 103 as being unpatentable over Ponomarev, Tan, and Hermans as applied to claims 1, 8, and 13 above, and further in view of Kim Y. et al. (NPL: CPU-Accelerator Co-Scheduling for CNN Acceleration at the Edge, published Nov. 2020, hereinafter “Kim Y.”).
Regarding claim 6, the combination of Ponomarev, Tan, and Hermans teaches the method of claim 1, wherein measuring the overhead latency for the inference operation that is associated with the auxiliary layer further comprises:
determining an input data size that is input to the selected layer, and an output data size that is output from the auxiliary layer (Ponomarev, Page 8 Section 4.1 Paragraph 2 — “we initialize inputs for each block with correct input tensor (for the first one, it is image shape: 224x224x3, for the second it is the shape of the first block’s output)” indicates determining an input data size to the selected layer and an output data size from the auxiliary layer);
The combination of Ponomarev, Tan, and Hermans fails to explicitly teach determining a first value for a first coefficient, a second value for a second coefficient and a third value for an intercept variable using a linear regression model; and determining the overhead latency based on the input data size, the output data size, the first coefficient, the second coefficient and the third value.
However, analogous to the field of the claimed invention, Kim Y. teaches:
determining a first value for a first coefficient, a second value for a second coefficient and a third value for an intercept variable using a linear regression model (Kim Y., Figs. 6-9 and Section III, Subsection B Paragraph 1 – “For latency estimation, we use a linear regression-based methodology. In general linear regression, we use a following form of the equation: Eq. (1) where X and Y are explanatory and dependent variables, respectively. With the pairs of X and Y values, we determine α and β values through the linear regression training (line fitting). Based on the above form of the equation, we extend it to estimate accelerator and CPU latencies when processing a single CONV layer.” – teaches determining a first value for a first coefficient, a second value for a second coefficient and a third value for an intercept variable using a linear regression model); and
determining the overhead latency based on the input data size, the output data size, the first coefficient, the second coefficient and the third value (Kim Y., Figs. 6-9 and Section III, Subsection B 1) b) “Transfer Latency” Paragraph 1 – “The transfer latency is also linearly proportional to the size of data to be transferred. Thus, the transfer latency can be estimated by the following equation: Eq. (4) where SizeData indicates a total size of the data (in the number of the elements) to be transferred. The αtran and βtran are also determined by the linear regression analysis. The SizeData can be calculated as follows: Eq (5)” – teaches determining the overhead latency (transfer latency, which may be aggregated with processing, or computation, latency as in Section III, Subsection B 1) d) “Latency Aggregation”) based on the input data size (in Eq. (5), SizeIFMs is the size of the input feature maps), the output data size (SizeOFM is the size of the output feature map), the coefficients and the third value (αtran and βtran)).
Therefore, it would have been obvious to a person of ordinary skill in the art, before the effective filing date of the claimed invention, to incorporate the regression-based latency modeling of Kim Y. into the overhead-latency and data-size determinations of Ponomarev, Tan, and Hermans to create an accurate representation of overhead latency based on a linear regression. Doing so would enable the latency regression model to be extended to other systems where an overlap exists between data transfer, computation, and coherence operations (Kim Y., Section III Subsection B 1) d)) and would enable determining the relationships between latency and data size by leveraging the linear regression-based latency model (Kim Y., Section III Subsection B 2)).
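For illustration only, the linear regression-based latency estimation that Kim Y. describes can be sketched as follows (a minimal Python sketch; the exact composition of SizeData in Kim Y.'s Eq. (5) is assumed here to be the sum of the input feature maps, the output feature map, and optional weights):

```python
def fit_linear(xs, ys):
    """Least-squares fit of y = alpha * x + beta, the general linear
    regression form of Kim Y.'s Eq. (1)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    alpha = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    beta = mean_y - alpha * mean_x
    return alpha, beta

def estimate_transfer_latency(alpha, beta, size_ifms, size_ofm, size_weights=0):
    """Transfer latency as a linear function of total transferred data
    size, in the spirit of Kim Y.'s Eqs. (4)-(5); the composition of
    SizeData below is an assumption for illustration."""
    size_data = size_ifms + size_ofm + size_weights
    return alpha * size_data + beta
```

With alpha and beta fitted from measured (data size, latency) pairs, the estimator returns an overhead latency for any layer's input and output sizes.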
Claims 9 and 18 recite limitations similar to those of claim 6 and are rejected for similar reasons.
Response to Arguments
Applicant’s arguments, see pp. 5-10 of Remarks, filed 12 November 2025, with respect to the rejection(s) of claim(s) 1, 8, and 13 under 35 U.S.C. 103 have been fully considered and are persuasive. Therefore, the rejection has been withdrawn. However, upon further consideration, a new ground(s) of rejection is made over Ponomarev in view of Tan, and further in view of Hermans. Ponomarev teaches “measuring, by the host processing device, a total latency…” and “storing the estimated latency in a hardware-specific latency lookup table…”. Tan teaches “executing, by the NPU, an inference operation over the selected layer and…”, “measuring, by the host processing device, an overhead latency…”, and “subtracting, by the host processing device, the overhead latency…”. Hermans teaches “adding, by a host processing device, an auxiliary…”.
Applicant argues on p. 6 of Remarks that Ponomarev fails to teach “measuring, by the host processing device, a total latency…”. Examiner respectfully disagrees. Ponomarev, in Page 8 Section 4.1 Paragraph 2 — “After that, we fill all the required layers and compute total latency as the sum of corresponding blocks” — teaches measuring a total latency for the inference operation of the corresponding blocks.
Applicant argues on p. 5 of Remarks that Hermans fails to teach adding an auxiliary layer. Examiner respectfully disagrees. Hermans teaches in Section 2.3, Paragraph 2 — “We add layers one by one and at all times an output layer only exists at the current top layer” — and, in Section 3.2, Hermans states that this enables checking the performance and contribution of each layer one by one. Thus, it would have been obvious to a person of ordinary skill in the art to incorporate the layer additions of Hermans into the layer latency estimations of Ponomarev and Tan in order to measure the influence each layer has on the latency. Doing so would enable measuring performance at each layer (Hermans, Section 3.2).
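For illustration only, the combination of Hermans' one-by-one layer additions with a subtraction of previously measured latency can be sketched as follows (a minimal Python sketch; the cumulative timings are assumed inputs, and attributing each increase to the newest layer is a simplification of the claimed overhead-subtraction step):

```python
def per_layer_latencies(cumulative_latencies):
    """Given the total latency measured after each layer is added one by
    one (Hermans-style incremental construction), attribute to each new
    layer the increase over the previously measured stack, i.e. subtract
    the latency of the layers already present."""
    per_layer = []
    previous = 0.0
    for total in cumulative_latencies:
        per_layer.append(total - previous)
        previous = total
    return per_layer
```

This isolates each layer's contribution to the total latency from measurements of progressively larger stacks.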
Applicant’s arguments with respect to claim(s) 6, 9, and 18 have been considered but are moot because the new ground of rejection does not rely on any reference applied in the prior rejection of record for any teaching or matter specifically challenged in the argument. Kim Y. et al. (NPL: CPU-Accelerator Co-Scheduling for CNN Acceleration at the Edge, published Nov. 2020) teaches “determining a first value for a first coefficient, a second value for a second coefficient…” and “determining the overhead latency based on the input data size, the output data size…”.
Conclusion
The prior art made of record and not relied upon is considered pertinent to applicant's disclosure.
Wess et al. (NPL: ANNETTE: Accurate Neural Network Execution Time Estimation With Stacked Models, published Dec. 2020) teaches methods for estimating the latency of a selected layer and a second, or auxiliary, layer to produce a total execution time for the two layers. This calculation may be used to measure the individual execution time for a single layer.
Any inquiry concerning this communication or earlier communications from the examiner should be directed to LOUIS C NYE whose telephone number is 571-272-0636. The examiner can normally be reached Monday - Friday 9:00AM - 5:00PM.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, MATT ELL can be reached at 571-270-3264. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.
/LOUIS CHRISTOPHER NYE/Examiner, Art Unit 2141
/MATTHEW ELL/Supervisory Patent Examiner, Art Unit 2141