Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Information Disclosure Statement
Acknowledgment is made of the Information Disclosure Statement dated 01/22/2026. All of the cited references have been considered.
Drawings
The drawings have been received on 05/07/2021. These drawings are accepted.
Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.
Claims 1-10, 13-15, and 17-19 are rejected under 35 U.S.C. 103 as being unpatentable over Hong et al. (US20220121912A1), hereinafter Hong, in view of Gadelrab et al. (US20210279635A1), hereinafter Gadelrab.
Claim 1 is rejected over Hong and Gadelrab.
Regarding claim 1, Hong teaches a method, comprising: executing a first deep learning accelerator (DLA) model using a first DLA chip of a DLA package, wherein the first DLA chip has a first computational capability and [the first DLA model has a first maximum accuracy value; and] (See Figure 10 of Hong, which shows three physically distinct NPU units, each containing NPU cores with an on-chip SRAM.
[Image: Figure 10 of Hong (media_image1.png, greyscale)]
Hong [0107]: “in the case of a layer that requires a great operation time, distributedly processing corresponding operations over a plurality of NPUs 1020 may be beneficial (or result in a faster run time) rather than processing the operations using a single NPU 1020, even when accounting for migration cost of data between the NPUs 1020. Therefore, in this case, a kernel that uses the plurality of NPUs 1020 may be selected.”
Hong [0105]: “In the electronic device of FIG. 10, a model may be executed through at least two NPUs 1020. In this case, the NPUs 1020 may exchange data through a peripheral component interconnect express (PCIe) bus. Also, when capacity of an on-chip memory, for example, an SRAM, or an off-chip memory in the NPU 1020 is insufficient, the host DRAM 1030 may be additionally used.”
Hong [0023]: “generating a plurality of kernels for each of a first neural network model and a second neural network model; and running a kernel of the first model on a number of cores of an available resource of an accelerator; in response to a start of the running, running a kernel of the second model on a remaining number of cores of the available resource of an accelerator; generating one or more output feature maps based on the running of the kernels.”; and
Hong [0024]: “The running of the kernel of the second model may include running the kernel of the second model in response to determining that a collision will not occur between memory access patterns during the running of the kernel of the first model and the running of the kernel of the second model.”
Hong [0025]: “For each of the kernel of the first model and the kernel of the second model, the kernel may be selected for running from among the plurality of kernels based on kernel information including any one or more of: a number of accelerator cores”; and
Hong [0056]: “in response to a plurality of requests received at the host processor 110, the accelerator 120 may execute a plurality of neural networks according to kernels generated by the host processor 110. Here, the plurality of neural networks executed on the accelerator 120 may be neural networks in different structures … Herein, a neural network may also be referred to as a model for clarity of description.”; and
Hong [0058]: “FIG. 2 illustrates an example of a resource allocation process for executing a multi-model on a multi-core accelerator.”;
Hong [0055]: “The neural network may provide an optimal output corresponding to an input by mapping an input and an output having a nonlinear relationship based on deep learning. The deep learning may be a machine learning scheme for solving a given problem from a big data set …”;)
[providing signaling indicative of the first results directly to a second DLA chip of the DLA package,] wherein the second DLA chip has a second computational capability that is greater than the first computational capability; and (Hong [0096]: “five of the eight cores are allocated to run layer 1 of model 1 and the remaining three of the eight cores are in an idle status. In an “after” situation in which layer 1 of model 1 starts to run in the above situation, layer 1 of model 2 may run by selecting kernel 2 capable of maximally using the remaining three cores.”; and
Hong [0097]: “FIG. 8 illustrates another example of dynamically allocating resources.”; and
Hong [0056]: “the plurality of neural networks executed on the accelerator 120 may be neural networks in different structures.”; and
Hong [0048]: “The request may be for data inference based on the neural network and may cause the accelerator 120 to execute, that is, run (hereinafter, the terms “execute” and “run” and modifications thereof may be interchangeably used) the neural network and to acquire a data inference result for, for example, an object recognition, a pattern recognition, a computer vision, a voice recognition, a machine translation, a machine interpretation, a recommendation service, a personalized service, a video processing, and/or autonomous driving.”; Note: Different applications have different computational capabilities, features, or functions, requiring different accelerator/core performance; and
Hong [0077]: “The scheduler may select a candidate kernel estimated to have the best accelerator performance when running each candidate kernel based on a current situation of the accelerator and kernel information of candidate kernels. A criterion of performance may include any one or any combination of a throughput of each model, a latency, a fairness, a power consumption amount, and a utilization rate of the accelerator based on a situation or a selection of a user.”; and
Hong [0119]: “For example, a workload that uses a relatively large computational amount may be allocated to a relatively large number of processing elements and processed thereby, and a workload that uses a relatively small computational amount may be allocated to a relatively small number of processing elements and processed thereby.”;)
Hong does not teach the first DLA model has a first maximum accuracy value; and
providing signaling indicative of the first results directly to a second DLA chip of the DLA package, [wherein the second DLA chip has a second computational capability that is greater than the first computational capability; and]
executing a second DLA model using the second DLA chip and the first results as inputs, wherein the second DLA model has a second maximum accuracy value that is greater than the first maximum accuracy value.
However, Gadelrab teaches the first DLA model has a first maximum accuracy value; and (Gadelrab [0120]: “If the system determines, at block 514, that high efficiency model accuracy is acceptable and overflow/underflow statistics are within the threshold value, the system may return to block 502 and execute a subsequent inference using the high efficiency model.”; and
Gadelrab [0121]: “Otherwise, at block 516, the system exits the high efficiency mode and executes subsequent inferences using the high accuracy mode (e.g., as discussed above with respect to FIG. 4)”;)
providing signaling indicative of the first results directly to a second DLA chip of the DLA package, [wherein the second DLA chip has a second computational capability that is greater than the first computational capability; and]
executing a second DLA model using the second DLA chip and the first results as inputs, wherein the second DLA model has a second maximum accuracy value that is greater than the first maximum accuracy value. (Gadelrab [0113]: “As illustrated, executing inferences in a high efficiency mode begins at block 502, where a system receives an inference request. The inference request generally includes input on which the information is to be performed.”; and
Gadelrab [0114]: “Generally, the executed inference may be performed using a selected high efficiency model (e.g., a machine learning model and weights quantized to a data type that involves less complex computation”;
Gadelrab [0118]: “At block 514, the system determines whether the high efficiency model accuracy and overflow/underflow statistics are within an acceptable range.”;
Gadelrab [0120]: “If the system determines, at block 514, that high efficiency model accuracy is acceptable and overflow/underflow statistics are within the threshold value, the system may return to block 502 and execute a subsequent inference using the high efficiency model.”;
Gadelrab [0121]: “Otherwise, at block 516, the system exits the high efficiency mode and executes subsequent inferences using the high accuracy mode (e.g., as discussed above with respect to FIG. 4)”; and
Gadelrab [0053]: “To allow for inferences to be performed on efficient cores while maintaining a sufficient degree of accuracy in the results generated by executing inferences using a quantized machine learning model, embodiments described herein provide various techniques for adaptively quantizing parameters used by a machine learning model and switching between performing inferences using a high accuracy model on high performance cores and performing inferences using quantized models on efficient cores.”;
Gadelrab [0057]: “The quantized weight information may be in a format for which computation is less intensive than computation using the received weight information.”;)
It would have been obvious to a person having ordinary skill in the art, before the effective filing date of the claimed invention, to combine the accelerators executing neural networks of Hong with the switching between different models having different resource requirements of Gadelrab, in order to improve resource efficiency by adaptively optimizing and switching the use of computational (core) resources based on acceptable computational and performance results (Gadelrab, [0053]). Hong and Gadelrab are analogous art because they both concern optimizing neural network models.
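For illustration only, the control flow quoted above from Gadelrab [0118]-[0121] (blocks 502, 514, and 516) can be sketched as follows. This is a minimal sketch and not Gadelrab's implementation: the names `Stats`, `run_inferences`, `efficient_model`, and `accurate_model`, the two threshold parameters, and the choice to remain in the high accuracy mode once it is entered are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Any, Callable, Iterable, List, Tuple


@dataclass
class Stats:
    """Hypothetical per-inference statistics checked at block 514."""
    accuracy: float        # assumed accuracy estimate of the high efficiency model
    overflow_rate: float   # assumed overflow/underflow statistic


def run_inferences(
    inputs: Iterable[Any],
    efficient_model: Callable[[Any], Tuple[Any, Stats]],
    accurate_model: Callable[[Any], Any],
    min_accuracy: float,
    max_overflow_rate: float,
) -> List[Tuple[str, Any]]:
    """Run each request, switching modes when block 514's check fails."""
    mode = "high_efficiency"
    trace: List[Tuple[str, Any]] = []
    for x in inputs:                       # block 502: receive an inference request
        if mode == "high_efficiency":
            result, stats = efficient_model(x)
            trace.append((mode, result))
            acceptable = (
                stats.accuracy >= min_accuracy
                and stats.overflow_rate <= max_overflow_rate
            )                              # block 514: accuracy and overflow check
            if not acceptable:
                mode = "high_accuracy"     # block 516: subsequent inferences switch
        else:
            trace.append((mode, accurate_model(x)))
    return trace
```

In this sketch the request that fails the check is still answered by the high efficiency model, and only subsequent requests are served in the high accuracy mode, which follows the wording "executes subsequent inferences" in the quoted paragraph [0121].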
Claim 2 is rejected over Hong and Gadelrab with the incorporation of claim 1.
Regarding claim 2, Hong does not teach further comprising, responsive to an accuracy value of second results from executing the second DLA model using the second DLA chip being at least a second threshold accuracy value that is greater than the first threshold accuracy value:
providing signaling indicative of the second results directly to the first DLA chip; and
executing the first DLA model using the first DLA chip and the second results as inputs.
However, Gadelrab teaches further comprising, responsive to an accuracy value of second results from executing the second DLA model using the second DLA chip being at least a second threshold accuracy value that is greater than the first threshold accuracy value:
providing signaling indicative of the second results directly to the first DLA chip; and
executing the first DLA model using the first DLA chip and the second results as inputs. (Gadelrab [0113]: “As illustrated, executing inferences in a high efficiency mode begins at block 502, where a system receives an inference request. The inference request generally includes input on which the information is to be performed.”; and
Gadelrab [0114]: “Generally, the executed inference may be performed using a selected high efficiency model (e.g., a machine learning model and weights quantized to a data type that involves less complex computation”; and
Gadelrab [0118]: “At block 514, the system determines whether the high efficiency model accuracy and overflow/underflow statistics are within an acceptable range.”;
Gadelrab [0120]: “If the system determines, at block 514, that high efficiency model accuracy is acceptable and overflow/underflow statistics are within the threshold value, the system may return to block 502 and execute a subsequent inference using the high efficiency model.”; and
Gadelrab [0121]: “Otherwise, at block 516, the system exits the high efficiency mode and executes subsequent inferences using the high accuracy mode (e.g., as discussed above with respect to FIG. 4)”; and
Gadelrab [0053]: “To allow for inferences to be performed on efficient cores while maintaining a sufficient degree of accuracy in the results generated by executing inferences using a quantized machine learning model, embodiments described herein provide various techniques for adaptively quantizing parameters used by a machine learning model and switching between performing inferences using a high accuracy model on high performance cores and performing inferences using quantized models on efficient cores.”;
Gadelrab [0057]: “The quantized weight information may be in a format for which computation is less intensive than computation using the received weight information.”;)
It would have been obvious to a person having ordinary skill in the art, before the effective filing date of the claimed invention, to combine the accelerators executing neural networks of Hong with the switching between different models having different resource requirements of Gadelrab, in order to improve resource efficiency by adaptively optimizing and switching the use of computational (core) resources based on acceptable computational and performance results (Gadelrab, [0053]). Hong and Gadelrab are analogous art because they both concern optimizing neural network models.
Claim 3 is rejected over Hong and Gadelrab with the incorporation of claim 1.
Regarding claim 3, Hong does not teach further comprising, subsequent to providing signaling indicative of the second results directly to the first DLA chip, reducing power provided to the second DLA chip.
However, Gadelrab teaches further comprising, subsequent to providing signaling indicative of the second results directly to the first DLA chip, reducing power provided to the second DLA chip. (Gadelrab [0118]: “At block 514, the system determines whether the high efficiency model accuracy and overflow/underflow statistics are within an acceptable range.”;
Gadelrab [0120]: “If the system determines, at block 514, that high efficiency model accuracy is acceptable and overflow/underflow statistics are within the threshold value, the system may return to block 502 and execute a subsequent inference using the high efficiency model.”; and
Gadelrab [0121]: “Otherwise, at block 516, the system exits the high efficiency mode and executes subsequent inferences using the high accuracy mode (e.g., as discussed above with respect to FIG. 4)”; and
Gadelrab [0053]: “To allow for inferences to be performed on efficient cores while maintaining a sufficient degree of accuracy in the results generated by executing inferences using a quantized machine learning model, embodiments described herein provide various techniques for adaptively quantizing parameters used by a machine learning model and switching between performing inferences using a high accuracy model on high performance cores and performing inferences using quantized models on efficient cores.”;)
It would have been obvious to a person having ordinary skill in the art, before the effective filing date of the claimed invention, to combine the accelerators executing neural networks of Hong with the switching between different models having different resource requirements of Gadelrab, in order to improve resource efficiency by adaptively optimizing and switching the use of computational (core) resources based on acceptable computational and performance results (Gadelrab, [0053]). Hong and Gadelrab are analogous art because they both concern optimizing neural network models.
Claim 4 is rejected over Hong and Gadelrab with the incorporation of claim 1.
Regarding claim 4, Hong teaches providing signaling indicative of the second results directly to a third DLA chip of the DLA package, wherein the third DLA chip has a third computational capability that is greater than the second computational capability of the second DLA chip; and
executing a third DLA model using the third DLA chip and the second results as inputs, [wherein the third DLA model has a third maximum accuracy value that is greater than the second maximum accuracy value of the second DLA model.] (Hong [0096]: “five of the eight cores are allocated to run layer 1 of model 1 and the remaining three of the eight cores are in an idle status. In an “after” situation in which layer 1 of model 1 starts to run in the above situation, layer 1 of model 2 may run by selecting kernel 2 capable of maximally using the remaining three cores.”; and
Hong [0097]-[0098]: “FIG. 8 illustrates another example of dynamically allocating resources.”; Note: See paragraphs previously applied to claim 1 above, [0048], [0056], [0077], and [0119]; and
Hong [0059]: “FIG. 2 illustrates an example in which a compiler 210 may generate a plurality of kernels in a design time and a scheduler 220 may allocate one of the plurality of kernels to an accelerator 230 based on a status of the accelerator 230 in a run time. In FIG. 2, it may be assumed that each of models 1 to 3 is requested at a specific time for clarity of description.”;)
Hong does not teach further comprising, responsive to an accuracy value of second results from executing the second DLA model using the second DLA chip being at most the first threshold accuracy value:
[executing a third DLA model using the third DLA chip and the second results as inputs,] wherein the third DLA model has a third maximum accuracy value that is greater than the second maximum accuracy value of the second DLA model.
However, Gadelrab teaches further comprising, responsive to an accuracy value of second results from executing the second DLA model using the second DLA chip being at most the first threshold accuracy value: (Gadelrab [0120]: “If the system determines, at block 514, that high efficiency model accuracy is acceptable and overflow/underflow statistics are within the threshold value, the system may return to block 502 and execute a subsequent inference using the high efficiency model.”; and
Gadelrab [0121]: “Otherwise, at block 516, the system exits the high efficiency mode and executes subsequent inferences using the high accuracy mode (e.g., as discussed above with respect to FIG. 4)”; and
Gadelrab [0053]: “To allow for inferences to be performed on efficient cores while maintaining a sufficient degree of accuracy in the results generated by executing inferences using a quantized machine learning model, embodiments described herein provide various techniques for adaptively quantizing parameters used by a machine learning model and switching between performing inferences using a high accuracy model on high performance cores and performing inferences using quantized models on efficient cores.”;)
[executing a third DLA model using the third DLA chip and the second results as inputs,] wherein the third DLA model has a third maximum accuracy value that is greater than the second maximum accuracy value of the second DLA model. (Gadelrab [0120]: “If the system determines, at block 514, that high efficiency model accuracy is acceptable and overflow/underflow statistics are within the threshold value, the system may return to block 502 and execute a subsequent inference using the high efficiency model.”; and
Gadelrab [0121]: “Otherwise, at block 516, the system exits the high efficiency mode and executes subsequent inferences using the high accuracy mode (e.g., as discussed above with respect to FIG. 4)”; and
Gadelrab [0053]: “To allow for inferences to be performed on efficient cores while maintaining a sufficient degree of accuracy in the results generated by executing inferences using a quantized machine learning model, embodiments described herein provide various techniques for adaptively quantizing parameters used by a machine learning model and switching between performing inferences using a high accuracy model on high performance cores and performing inferences using quantized models on efficient cores.”;)
It would have been obvious to a person having ordinary skill in the art, before the effective filing date of the claimed invention, to combine the accelerators executing neural networks of Hong with the switching between different models having different resource requirements of Gadelrab, in order to improve resource efficiency by adaptively optimizing and switching the use of computational (core) resources based on acceptable computational and performance results (Gadelrab, [0053]). Hong and Gadelrab are analogous art because they both concern optimizing neural network models.
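For illustration only, the core allocation example quoted above from Hong [0096] (five of eight cores busy, three idle, and a kernel chosen to maximally use the remaining three) can be sketched as follows. This is a minimal sketch under stated assumptions: the function name `select_kernel`, the kernel names, and the per-kernel core counts are hypothetical, and the sketch considers only core count, whereas Hong [0077] lists further criteria such as throughput, latency, fairness, and power consumption.

```python
from typing import Dict, Optional


def select_kernel(
    candidate_cores: Dict[str, int],
    total_cores: int,
    busy_cores: int,
) -> Optional[str]:
    """Pick the candidate kernel that uses the most idle cores without exceeding them.

    candidate_cores maps a (hypothetical) kernel name to the number of
    accelerator cores that kernel was compiled to use.
    """
    idle = total_cores - busy_cores
    # Keep only kernels that fit in the cores that are currently idle.
    fitting = {name: n for name, n in candidate_cores.items() if n <= idle}
    if not fitting:
        return None  # nothing can start until cores are released
    # Among the fitting kernels, prefer the one that leaves the fewest cores idle.
    return max(fitting, key=lambda name: fitting[name])


# Eight cores, five already running layer 1 of model 1 (as in Hong [0096]).
choice = select_kernel({"kernel 1": 8, "kernel 2": 3, "kernel 3": 1}, 8, 5)
```

With the assumed core counts above, `choice` is "kernel 2", the three-core candidate.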
Claim 5 is rejected over Hong and Gadelrab.
Regarding claim 5, Hong teaches an apparatus, comprising: (Hong [0020]: “a data processing apparatus includes: one or more processors configured to: receive a request for executing a neural network model on an accelerator; generate a plurality of candidate kernels for each of a plurality of layers comprised in the model”;)
a deep learning accelerator (DLA) package, comprising:
a first DLA chip comprising a first quantity of DLA cores; and (See Figure 10 of Hong, which shows three physically distinct NPU units, each containing NPU cores with an on-chip SRAM.
[Image: Figure 10 of Hong (media_image1.png, greyscale)]
Hong [0107]: “in the case of a layer that requires a great operation time, distributedly processing corresponding operations over a plurality of NPUs 1020 may be beneficial (or result in a faster run time) rather than processing the operations using a single NPU 1020, even when accounting for migration cost of data between the NPUs 1020. Therefore, in this case, a kernel that uses the plurality of NPUs 1020 may be selected.”
Hong [0105]: “In the electronic device of FIG. 10, a model may be executed through at least two NPUs 1020. In this case, the NPUs 1020 may exchange data through a peripheral component interconnect express (PCIe) bus. Also, when capacity of an on-chip memory, for example, an SRAM, or an off-chip memory in the NPU 1020 is insufficient, the host DRAM 1030 may be additionally used.”
Hong [0056]: “the plurality of neural networks executed on the accelerator 120 may be neural networks in different structures”; and
Hong [0048]: “The request may be for data inference based on the neural network and may cause the accelerator 120 to execute, that is, run (hereinafter, the terms “execute” and “run” and modifications thereof may be interchangeably used) the neural network and to acquire a data inference result for, for example, an object recognition, a pattern recognition, a computer vision, a voice recognition, a machine translation, a machine interpretation, a recommendation service, a personalized service, a video processing, and/or autonomous driving” Note: Different applications have different computational capabilities, features, or functions, requiring different accelerator/core performance; and
Hong [0077]: “The scheduler may select a candidate kernel estimated to have the best accelerator performance when running each candidate kernel based on a current situation of the accelerator and kernel information of candidate kernels. A criterion of performance may include any one or any combination of a throughput of each model, a latency, a fairness, a power consumption amount, and a utilization rate of the accelerator based on a situation or a selection of a user.”; and
Hong [0119]: “For example, a workload that uses a relatively large computational amount may be allocated to a relatively large number of processing elements and processed thereby, and a workload that uses a relatively small computational amount may be allocated to a relatively small number of processing elements and processed thereby.”; Note: See paragraphs [0023]-[0025] and [0051] to see that each accelerator core (of the plurality of accelerator cores on one accelerator chip) may include one or more processing elements (PEs) configured to perform operations according to the neural network).
a second DLA chip comprising a second quantity of DLA cores that is greater than the first quantity of DLA cores, (See Figure 10 of Hong to see three physically distinct NPU units each containing NPU cores with an on-chip SRAM.
Hong [0107]: “in the case of a layer that requires a great operation time, distributedly processing corresponding operations over a plurality of NPUs 1020 may be beneficial (or result in a faster run time) rather than processing the operations using a single NPU 1020, even when accounting for migration cost of data between the NPUs 1020. Therefore, in this case, a kernel that uses the plurality of NPUs 1020 may be selected.”
Hong [0105]: “In the electronic device of FIG. 10, a model may be executed through at least two NPUs 1020. In this case, the NPUs 1020 may exchange data through a peripheral component interconnect express (PCIe) bus. Also, when capacity of an on-chip memory, for example, an SRAM, or an off-chip memory in the NPU 1020 is insufficient, the host DRAM 1030 may be additionally used.”
Hong [0096]: “five of the eight cores are allocated to run layer 1 of model 1 and the remaining three of the eight cores are in an idle status. In an “after” situation in which layer 1 of model 1 starts to run in the above situation, layer 1 of model 2 may run by selecting kernel 2 capable of maximally using the remaining three cores.”; and
Hong [0097]: “FIG. 8 illustrates another example of dynamically allocating resources.”; and
Hong [0056]: “the plurality of neural networks executed on the accelerator 120 may be neural networks in different structures.”; and
Hong [0048]: “The request may be for data inference based on the neural network and may cause the accelerator 120 to execute, that is, run (hereinafter, the terms “execute” and “run” and modifications thereof may be interchangeably used) the neural network and to acquire a data inference result for, for example, an object recognition, a pattern recognition, a computer vision, a voice recognition, a machine translation, a machine interpretation, a recommendation service, a personalized service, a video processing, and/or autonomous driving.”; Note: Different applications have different computational capabilities, features, or functions, requiring different accelerator/core performance; and
Hong [0077]: “The scheduler may select a candidate kernel estimated to have the best accelerator performance when running each candidate kernel based on a current situation of the accelerator and kernel information of candidate kernels. A criterion of performance may include any one or any combination of a throughput of each model, a latency, a fairness, a power consumption amount, and a utilization rate of the accelerator based on a situation or a selection of a user.”; and
Hong [0119]: “For example, a workload that uses a relatively large computational amount may be allocated to a relatively large number of processing elements and processed thereby, and a workload that uses a relatively small computational amount may be allocated to a relatively small number of processing elements and processed thereby.”;)
wherein the first DLA chip is coupled to the second DLA chip, (Hong [0051]: “The accelerator core 131 may include one or more processing elements (PEs) configured to perform operations according to the neural network. Although FIG. 1 illustrates that a single accelerator core 131 is included in the accelerator chip 130 for clarity of description, a plurality of accelerator cores may be included in the accelerator chip 130 and may process operations”; and Hong [0134]: “components in a described system, architecture, device, or circuit are combined in a different manner, and/or replaced or supplemented by other components or their equivalents.”;)
wherein the first DLA chip is configured to execute a first DLA model having a first computational capability, and (Hong [0056]: “the plurality of neural networks executed on the accelerator 120 may be neural networks in different structures”; and
Hong [0048]: “The request may be for data inference based on the neural network and may cause the accelerator 120 to execute, that is, run (hereinafter, the terms “execute” and “run” and modifications thereof may be interchangeably used) the neural network and to acquire a data inference result for, for example, an object recognition, a pattern recognition, a computer vision, a voice recognition, a machine translation, a machine interpretation, a recommendation service, a personalized service, a video processing, and/or autonomous driving”; Note: Different applications have different computational capabilities, features, or functions, requiring different accelerator/core performance; and
Hong [0077]: “The scheduler may select a candidate kernel estimated to have the best accelerator performance when running each candidate kernel based on a current situation of the accelerator and kernel information of candidate kernels. A criterion of performance may include any one or any combination of a throughput of each model, a latency, a fairness, a power consumption amount, and a utilization rate of the accelerator based on a situation or a selection of a user.”; and
Hong [0119]: “For example, a workload that uses a relatively large computational amount may be allocated to a relatively large number of processing elements and processed thereby, and a workload that uses a relatively small computational amount may be allocated to a relatively small number of processing elements and processed thereby.”; Note: See paragraphs [0023]-[0025] and [0051] to see that each accelerator core (of the plurality of accelerator cores on one accelerator chip) may include one or more processing elements (PEs) configured to perform operations according to the neural network).
Hong does not appear to explicitly teach wherein the second DLA chip is configured to execute a second DLA model having a second computational capability that is greater than the first computational capability.
However, Gadelrab teaches wherein the second DLA chip is configured to execute a second DLA model having a second computational capability that is greater than the first computational capability. (See Fig. 5 and paragraphs [0112]-[0121] of Gadelrab;
(Gadelrab [0113]: “As illustrated, executing inferences in a high efficiency mode begins at block 502, where a system receives an inference request. The inference request generally includes input on which the information is to be performed.”; and
Gadelrab [0114]: “Generally, the executed inference may be performed using a selected high efficiency model (e.g., a machine learning model and weights quantized to a data type that involves less complex computation”;
Gadelrab [0118]: “At block 514, the system determines whether the high efficiency model accuracy and overflow/underflow statistics are within an acceptable range.”;
Gadelrab [0120]: “If the system determines, at block 514, that high efficiency model accuracy is acceptable and overflow/underflow statistics are within the threshold value, the system may return to block 502 and execute a subsequent inference using the high efficiency model.”;
Gadelrab [0121]: “Otherwise, at block 516, the system exits the high efficiency mode and executes subsequent inferences using the high accuracy mode (e.g., as discussed above with respect to FIG. 4)”; and
Gadelrab [0053]: “To allow for inferences to be performed on efficient cores while maintaining a sufficient degree of accuracy in the results generated by executing inferences using a quantized machine learning model, embodiments described herein provide various techniques for adaptively quantizing parameters used by a machine learning model and switching between performing inferences using a high accuracy model on high performance cores and performing inferences using quantized models on efficient cores.”;
Gadelrab [0054]: “execution of inferences may begin using a high accuracy model on high performance cores (e.g., using a machine learning model with floating point weights), and inferences may be performed in parallel on one or more high efficiency cores using one or more sets of quantized weights (e.g., weights quantized to reduced-size representations, such as 16-bit integer, to 8-bit integer, 4-bit integer, etc., relative to the floating point weights included in a high accuracy model)”
Gadelrab [0057]: “The quantized weight information may be in a format for which computation is less intensive than computation using the received weight information.”;)
Gadelrab [0103]: “the system executes an inference using the high accuracy representation of the machine learning model on high accuracy hardware. High accuracy hardware may be processors or processing cores that can perform floating point operations, such as cores designated as high performance cores in a heterogeneous multicore processor (e.g. “big” cores in a BIG.little architecture), graphics processing units, tensor processing units, neural processing units, and/or other high performance processing units”;
Gadelrab [0110]: “The selected high efficiency model may be executed on high efficiency processors that use less power than the high accuracy hardware discussed above. These processors may include, for example, processors designed as high efficiency cores in a heterogeneous multicore processor (e.g., “little” cores in a BIG.little architecture), integer processing modules on a processor, or the like.”;)
It would have been obvious before the effective filing date to combine the accelerators executing neural networks of Hong with the switching between different models having different resource requirements of Gadelrab to improve resource efficiency by adaptively optimizing/switching the use of computational (core) resources based on acceptable computational and performance results (Gadelrab, [0053]). Hong and Gadelrab are analogous art because they both concern optimizing neural network models.
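For illustration only, the mode switching quoted above from Gadelrab [0113]-[0121] (blocks 502, 514, and 516 of FIG. 5) can be sketched as follows. This is a minimal sketch and not Gadelrab's implementation: the function name `run_inferences`, the callables `efficient_model`, `accurate_model`, and `accuracy_of`, and the parameter `min_accuracy` are hypothetical names introduced here, and the overflow/underflow statistics checked at block 514 are omitted for brevity.

```python
from typing import Callable, Iterable, List, Tuple


def run_inferences(
    inputs: Iterable[float],
    efficient_model: Callable[[float], float],     # hypothetical quantized model
    accurate_model: Callable[[float], float],      # hypothetical floating point model
    accuracy_of: Callable[[float, float], float],  # hypothetical accuracy metric
    min_accuracy: float,                           # hypothetical acceptance threshold
) -> List[Tuple[str, float]]:
    """Start in the high efficiency mode and leave it once accuracy is unacceptable."""
    mode = "high_efficiency"
    results: List[Tuple[str, float]] = []
    for x in inputs:  # block 502: receive an inference request
        if mode == "high_efficiency":
            y = efficient_model(x)
            results.append((mode, y))
            # Block 514: is the high efficiency model accuracy acceptable?
            if accuracy_of(x, y) < min_accuracy:
                # Block 516: subsequent inferences use the high accuracy mode.
                mode = "high_accuracy"
        else:
            results.append((mode, accurate_model(x)))
    return results
```

In this sketch, the inference that fails the check is still reported from the high efficiency model, and only subsequent inferences use the high accuracy model, which tracks the "subsequent inferences" wording of Gadelrab [0121].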
Claim 6 is rejected over Hong and Gadelrab with the incorporation of claim 5.
Regarding claim 6, Hong teaches wherein the first DLA chip is further configured to provide, to the second DLA chip, signaling indicative of results from executing the first DLA model, and wherein the second DLA chip is further configured to execute the second DLA model using the results as inputs. (Hong [0051]: “The accelerator core 131 may include one or more processing elements (PEs) configured to perform operations according to the neural network. Although FIG. 1 illustrates that a single accelerator core 131 is included in the accelerator chip 130 for clarity of description, a plurality of accelerator cores may be included in the accelerator chip 130 and may process operations”; and Hong [0134]: “components in a described system, architecture, device, or circuit are combined in a different manner, and/or replaced or supplemented by other components or their equivalents.”; and
Hong [0096]: “five of the eight cores are allocated to run layer 1 of model 1 and the remaining three of the eight cores are in an idle status. In an “after” situation in which layer 1 of model 1 starts to run in the above situation, layer 1 of model 2 may run by selecting kernel 2 capable of maximally using the remaining three cores.”; and
Hong [0097]: “FIG. 8 illustrates another example of dynamically allocating resources.”;
Hong [0056]: “the plurality of neural networks executed on the accelerator 120 may be neural networks in different structures.”;
Hong [0048]: “The request may be for data inference based on the neural network and may cause the accelerator 120 to execute, that is, run (hereinafter, the terms “execute” and “run” and modifications thereof may be interchangeably used) the neural network and to acquire a data inference result for, for example, an object recognition, a pattern recognition, a computer vision, a voice recognition, a machine translation, a machine interpretation, a recommendation service, a personalized service, a video processing, and/or autonomous driving.”; Note: Different applications have different computational capabilities, features, or functions, requiring different accelerator/core performance;
Hong [0077]: “The scheduler may select a candidate kernel estimated to have the best accelerator performance when running each candidate kernel based on a current situation of the accelerator and kernel information of candidate kernels. A criterion of performance may include any one or any combination of a throughput of each model, a latency, a fairness, a power consumption amount, and a utilization rate of the accelerator based on a situation or a selection of a user.”;
Hong [0119]: “For example, a workload that uses a relatively large computational amount may be allocated to a relatively large number of processing elements and processed thereby, and a workload that uses a relatively small computational amount may be allocated to a relatively small number of processing elements and processed thereby.”;)
Claim 7 is rejected over Hong and Gadelrab with the incorporation of claim 5.
Regarding claim 7, Hong does not teach wherein the DLA package is configured to, in response to execution of the first DLA model yielding results having a confidence value less than a threshold confidence value:
pause execution of the first DLA model using the first DLA chip;
execute the second DLA model using the second DLA chip.
However, Gadelrab teaches wherein the DLA package is configured to, in response to execution of the first DLA model yielding results having a confidence value less than a threshold confidence value:
pause execution of the first DLA model using the first DLA chip;
execute the second DLA model using the second DLA chip. (Gadelrab [0113]: “As illustrated, executing inferences in a high efficiency mode begins at block 502, where a system receives an inference request. The inference request generally includes input on which the information is to be performed.”; and
Gadelrab [0114]: “Generally, the executed inference may be performed using a selected high efficiency model (e.g., a machine learning model and weights quantized to a data type that involves less complex computation”;
Gadelrab [0118]: “At block 514, the system determines whether the high efficiency model accuracy and overflow/underflow statistics are within an acceptable range.”;
Gadelrab [0120]: “If the system determines, at block 514, that high efficiency model accuracy is acceptable and overflow/underflow statistics are within the threshold value, the system may return to block 502 and execute a subsequent inference using the high efficiency model.”;
Gadelrab [0121]: “Otherwise, at block 516, the system exits the high efficiency mode and executes subsequent inferences using the high accuracy mode (e.g., as discussed above with respect to FIG. 4)”; and
Gadelrab [0053]: “To allow for inferences to be performed on efficient cores while maintaining a sufficient degree of accuracy in the results generated by executing inferences using a quantized machine learning model, embodiments described herein provide various techniques for adaptively quantizing parameters used by a machine learning model and switching between performing inferences using a high accuracy model on high performance cores and performing inferences using quantized models on efficient cores.”;
Gadelrab [0054]: “execution of inferences may begin using a high accuracy model on high performance cores (e.g., using a machine learning model with floating point weights), and inferences may be performed in parallel on one or more high efficiency cores using one or more sets of quantized weights (e.g., weights quantized to reduced-size representations, such as 16-bit integer, to 8-bit integer, 4-bit integer, etc., relative to the floating point weights included in a high accuracy model)”
Gadelrab [0057]: “The quantized weight information may be in a format for which computation is less intensive than computation using the received weight information.”;)
It would have been obvious before the effective filing date to combine the accelerators executing neural networks of Hong with the switching between different models having different resource requirements of Gadelrab to improve resource efficiency by adaptively optimizing/switching the use of computational (core) resources based on acceptable computational and performance results (Gadelrab, [0053]). Hong and Gadelrab are analogous art because they both concern optimizing neural network models.
Claim 8 is rejected over Hong and Gadelrab with the incorporation of claim 5.
Regarding claim 8, Hong does not teach wherein the DLA package is further configured to, in response to execution of the second DLA model yielding results having at least the threshold confidence value:
pause execution of the second DLA model; and
resume execution of the first DLA model using the first DLA chip.
However, Gadelrab teaches wherein the DLA package is further configured to, in response to execution of the second DLA model yielding results having at least the threshold confidence value:
pause execution of the second DLA model; and
resume execution of the first DLA model using the first DLA chip. (Gadelrab [0120]: “If the system determines, at block 514, that high efficiency model accuracy is acceptable and overflow/underflow statistics are within the threshold value, the system may return to block 502 and execute a subsequent inference using the high efficiency model.”
Gadelrab [0121]: “Otherwise, at block 516, the system exits the high efficiency mode and executes subsequent inferences using the high accuracy mode (e.g., as discussed above with respect to FIG. 4)”;
Gadelrab [0053]: “To allow for inferences to be performed on efficient cores while maintaining a sufficient degree of accuracy in the results generated by executing inferences using a quantized machine learning model, embodiments described herein provide various techniques for adaptively quantizing parameters used by a machine learning model and switching between performing inferences using a high accuracy model on high performance cores and performing inferences using quantized models on efficient cores.”;)
It would have been obvious before the effective filing date to combine the accelerators executing neural networks of Hong with the switching between different models having different resource requirements of Gadelrab to improve resource efficiency by adaptively optimizing/switching the use of computational (core) resources based on acceptable computational and performance results (Gadelrab, [0053]). Hong and Gadelrab are analogous art because they both concern optimizing neural network models.
Claim 9 is rejected over Hong and Gadelrab with the incorporation of claim 5.
Regarding claim 9, Hong teaches wherein the first DLA chip is directly coupled to the second DLA chip. (Hong [0051]: “The accelerator core 131 may include one or more processing elements (PEs) configured to perform operations according to the neural network. Although FIG. 1 illustrates that a single accelerator core 131 is included in the accelerator chip 130 for clarity of description, a plurality of accelerator cores may be included in the accelerator chip 130 and may process operations”; Hong [0134]: “components in a described system, architecture, device, or circuit are combined in a different manner, and/or replaced or supplemented by other components or their equivalents.”; and Hong [0105]: “In the electronic device of FIG. 10, a model may be executed through at least two NPUs 1020. In this case, the NPUs 1020 may exchange data through a peripheral component interconnect express (PCIe) bus. Also, when capacity of an on-chip memory, for example, an SRAM, or an off-chip memory in the NPU 1020 is insufficient, the host DRAM 1030 may be additionally used.”)
Claim 10 is rejected over Hong and Gadelrab with the incorporation of claim 5.
Regarding claim 10, Hong teaches wherein the first DLA chip comprises a first application specific integrated circuit (ASIC), and wherein the second DLA chip comprises a second ASIC. (Hong [0004]: “Dedicated hardware for AI may be implemented by, for example, a central processing unit (CPU) and a graphics processing unit (GPU) and may also be implemented by a field programmable gate array (FPGA) and an application specific integrated circuit (ASIC).”;)
Claim 13 is rejected over Hong and Gadelrab.
Regarding claim 13, Hong teaches a system, comprising:
a deep learning accelerator (DLA) package comprising: (Hong [0123]: “The examples described herein may comprehensively apply to a server-oriented product line applied with at least one system on a chip (SoC) each in which a plurality of accelerator cores and shared memories in a hierarchical structure are connected in a cluster-based structure, for accelerating AI processing.”;)
a first deep learning accelerator (DLA) chip comprising a first plurality of DLA cores and configured to execute a first DLA model; and (Hong [0023]: “generating a plurality of kernels for each of a first neural network model and a second neural network model; and running a kernel of the first model on a number of cores of an available resource of an accelerator; in response to a start of the running, running a kernel of the second model on a remaining number of cores of the available resource of an accelerator; generating one or more output feature maps based on the running of the kernels.”
Hong [0024]: “The running of the kernel of the second model may include running the kernel of the second model in response to determining that a collision will not occur between memory access patterns during the running of the kernel of the first model and the running of the kernel of the second model.”
Hong [0025]: “For each of the kernel of the first model and the kernel of the second model, the kernel may be selected for running from among the plurality of kernels based on kernel information including any one or more of: a number of accelerator cores”; and
Hong [0056]: “in response to a plurality of requests received at the host processor 110, the accelerator 120 may execute a plurality of neural networks according to kernels generated by the host processor 110. Here, the plurality of neural networks executed on the accelerator 120 may be neural networks in different structures … Herein, a neural network may also be referred to as a model for clarity of description.”;
Hong [0058]: “FIG. 2 illustrates an example of a resource allocation process for executing a multi-model on a multi-core accelerator.”;
Hong [0055]: “The neural network may provide an optimal output corresponding to an input by mapping an input and an output having a nonlinear relationship based on deep learning. The deep learning may be a machine learning scheme for solving a given problem from a big data set …”;)
a second DLA chip coupled to the first DLA chip and comprising a second plurality of DLA cores greater in quantity than the first plurality of DLA cores and configured to execute a second DLA model on results from execution of the first DLA model; and (Hong [0051]: “The accelerator core 131 may include one or more processing elements (PEs) configured to perform operations according to the neural network. Although FIG. 1 illustrates that a single accelerator core 131 is included in the accelerator chip 130 for clarity of description, a plurality of accelerator cores may be included in the accelerator chip 130 and may process operations”; Hong [0134]: “components in a described system, architecture, device, or circuit are combined in a different manner, and/or replaced or supplemented by other components or their equivalents.”; and Hong [0105]: “In the electronic device of FIG. 10, a model may be executed through at least two NPUs 1020. In this case, the NPUs 1020 may exchange data through a peripheral component interconnect express (PCIe) bus. Also, when capacity of an on-chip memory, for example, an SRAM, or an off-chip memory in the NPU 1020 is insufficient, the host DRAM 1030 may be additionally used.”;
Hong [0096]: “five of the eight cores are allocated to run layer 1 of model 1 and the remaining three of the eight cores are in an idle status. In an “after” situation in which layer 1 of model 1 starts to run in the above situation, layer 1 of model 2 may run by selecting kernel 2 capable of maximally using the remaining three cores.”;
Hong [0097]: “FIG. 8 illustrates another example of dynamically allocating resources.”;
Hong [0056]: “the plurality of neural networks executed on the accelerator 120 may be neural networks in different structures.”;
Hong [0048]: “The request may be for data inference based on the neural network and may cause the accelerator 120 to execute, that is, run (hereinafter, the terms “execute” and “run” and modifications thereof may be interchangeably used) the neural network and to acquire a data inference result for, for example, an object recognition, a pattern recognition, a computer vision, a voice recognition, a machine translation, a machine interpretation, a recommendation service, a personalized service, a video processing, and/or autonomous driving.” Note: Different applications have different computational capabilities, features, or functions, requiring different accelerator/core performance;
Hong [0077]: “The scheduler may select a candidate kernel estimated to have the best accelerator performance when running each candidate kernel based on a current situation of the accelerator and kernel information of candidate kernels. A criterion of performance may include any one or any combination of a throughput of each model, a latency, a fairness, a power consumption amount, and a utilization rate of the accelerator based on a situation or a selection of a user.”; and
Hong [0119]: “For example, a workload that uses a relatively large computational amount may be allocated to a relatively large number of processing elements and processed thereby, and a workload that uses a relatively small computational amount may be allocated to a relatively small number of processing elements and processed thereby.”;)
Hong does not teach control circuitry coupled to the DLA package and configured to maintain the second DLA chip in a low power state during execution of the first DLA model by the first DLA chip.
However, Gadelrab teaches control circuitry coupled to the DLA package and configured to maintain the second DLA chip in a low power state during execution of the first DLA model by the first DLA chip. (Gadelrab [0120]: “If the system determines, at block 514, that high efficiency model accuracy is acceptable and overflow/underflow statistics are within the threshold value, the system may return to block 502 and execute a subsequent inference using the high efficiency model.”
Gadelrab [0121]: “Otherwise, at block 516, the system exits the high efficiency mode and executes subsequent inferences using the high accuracy mode (e.g., as discussed above with respect to FIG. 4)”;)
It would have been obvious before the effective filing date to combine the accelerators executing neural networks of Hong with the switching between different models having different resource requirements of Gadelrab to improve resource efficiency by adaptively optimizing/switching the use of computational (core) resources based on acceptable computational and performance results (Gadelrab, [0053]). Hong and Gadelrab are analogous art because they both concern optimizing neural network models.
Claim 14 is rejected over Hong and Gadelrab with the incorporation of claim 13.
Regarding claim 14, Hong does not teach wherein the control circuitry is further configured to provide first signaling to the DLA package to cause the second DLA chip to exit the low power state in response to results from execution of the first DLA model having a confidence value less than a threshold confidence value, and
wherein the first DLA chip is further configured to provide second signaling to the second DLA chip indicative of the results from execution of the first DLA model in response to the results from execution of the first DLA model having a confidence value less than the threshold confidence value.
However, Gadelrab teaches wherein the control circuitry is further configured to provide first signaling to the DLA package to cause the second DLA chip to exit the low power state in response to results from execution of the first DLA model having a confidence value less than a threshold confidence value, and (Gadelrab [0113]: “As illustrated, executing inferences in a high efficiency mode begins at block 502, where a system receives an inference request. The inference request generally includes input on which the information is to be performed.”;
Gadelrab [0114]: “Generally, the executed inference may be performed using a selected high efficiency model (e.g., a machine learning model and weights quantized to a data type that involves less complex computation”;
Gadelrab [0118]: “At block 514, the system determines whether the high efficiency model accuracy and overflow/underflow statistics are within an acceptable range.”;
Gadelrab [0120]: “If the system determines, at block 514, that high efficiency model accuracy is acceptable and overflow/underflow statistics are within the threshold value, the system may return to block 502 and execute a subsequent inference using the high efficiency model.”
Gadelrab [0121]: “Otherwise, at block 516, the system exits the high efficiency mode and executes subsequent inferences using the high accuracy mode (e.g., as discussed above with respect to FIG. 4)”;
Gadelrab [0053]: “To allow for inferences to be performed on efficient cores while maintaining a sufficient degree of accuracy in the results generated by executing inferences using a quantized machine learning model, embodiments described herein provide various techniques for adaptively quantizing parameters used by a machine learning model and switching between performing inferences using a high accuracy model on high performance cores and performing inferences using quantized models on efficient cores.”;
Gadelrab [0057]: “The quantized weight information may be in a format for which computation is less intensive than computation using the received weight information.”;)
wherein the first DLA chip is further configured to provide second signaling to the second DLA chip indicative of the results from execution of the first DLA model in response to the results from execution of the first DLA model having a confidence value less than the threshold confidence value. (Gadelrab [0120]: “If the system determines, at block 514, that high efficiency model accuracy is acceptable and overflow/underflow statistics are within the threshold value, the system may return to block 502 and execute a subsequent inference using the high efficiency model.”
Gadelrab [0121]: “Otherwise, at block 516, the system exits the high efficiency mode and executes subsequent inferences using the high accuracy mode (e.g., as discussed above with respect to FIG. 4)”;)
It would have been obvious before the effective filing date to combine the accelerators executing neural networks of Hong with the switching between different models having different resource requirements of Gadelrab to improve resource efficiency by adaptively optimizing/switching the use of computational (core) resources based on acceptable computational and performance results (Gadelrab, [0053]). Hong and Gadelrab are analogous art because they both concern optimizing neural network models.
Claim 15 is rejected over Hong and Gadelrab with the incorporation of claim 13.
Regarding claim 15, Hong teaches wherein the first DLA chip comprises a first array of multiply and accumulate circuits (MACs) of a first size corresponding to processing requirements of the first DLA model, and (Hong [0110]: “One of the plurality of processing elements, a processing element 1110 may include the LV0 memory 1111, an LV0 direct memory access (DMA) 1113, a multiplier-accumulator (MAC) 1115, and an LV0 controller 1117.”)
wherein the second DLA chip comprises a second array of MACs of a second size corresponding to processing requirements of the second DLA model, wherein the second array of MACs is larger than the first array of MACs. (Hong [0110]: “One of the plurality of processing elements, a processing element 1110 may include the LV0 memory 1111, an LV0 direct memory access (DMA) 1113, a multiplier-accumulator (MAC) 1115, and an LV0 controller 1117.”;
Hong [0119]: “For example, a workload that uses a relatively large computational amount may be allocated to a relatively large number of processing elements and processed thereby, and a workload that uses a relatively small computational amount may be allocated to a relatively small number of processing elements and processed thereby.”; Note: See paragraphs [0023]-[0025] and [0051] to see that each accelerator core (of the plurality of accelerator cores on one accelerator chip) may include one or more processing elements (PEs) configured to perform operations according to the neural network.)
Claim 17 is rejected over Hong and Gadelrab with the incorporation of claim 13.
Regarding claim 17, Hong teaches wherein the first DLA chip is directly coupled to the second DLA chip. (Hong [0051]: “The accelerator core 131 may include one or more processing elements (PEs) configured to perform operations according to the neural network. Although FIG. 1 illustrates that a single accelerator core 131 is included in the accelerator chip 130 for clarity of description, a plurality of accelerator cores may be included in the accelerator chip 130 and may process operations”; Hong [0134]: “components in a described system, architecture, device, or circuit are combined in a different manner, and/or replaced or supplemented by other components or their equivalents.”; and Hong [0105]: “In the electronic device of FIG. 10, a model may be executed through at least two NPUs 1020. In this case, the NPUs 1020 may exchange data through a peripheral component interconnect express (PCIe) bus. Also, when capacity of an on-chip memory, for example, an SRAM, or an off-chip memory in the NPU 1020 is insufficient, the host DRAM 1030 may be additionally used.”)
Claim 18 is rejected over Hong and Gadelrab.
Regarding claim 18, Hong teaches a non-transitory machine-readable medium storing instructions executable by a processing resource to: (Hong [0019]: “A non-transitory computer-readable record medium may store instructions that, when executed by one or more processors, configure the one or more processors to perform the method.”;)
wherein the first DLA chip comprises a first plurality of DLA cores; (Hong [0056]: “the plurality of neural networks executed on the accelerator 120 may be neural networks in different structures”; and
Hong [0048]: “The request may be for data inference based on the neural network and may cause the accelerator 120 to execute, that is, run (hereinafter, the terms “execute” and “run” and modifications thereof may be interchangeably used) the neural network and to acquire a data inference result for, for example, an object recognition, a pattern recognition, a computer vision, a voice recognition, a machine translation, a machine interpretation, a recommendation service, a personalized service, a video processing, and/or autonomous driving”; Note: Different applications have different computational capabilities, features, or functions, requiring different accelerator/core performance; and
Hong [0077]: “The scheduler may select a candidate kernel estimated to have the best accelerator performance when running each candidate kernel based on a current situation of the accelerator and kernel information of candidate kernels. A criterion of performance may include any one or any combination of a throughput of each model, a latency, a fairness, a power consumption amount, and a utilization rate of the accelerator based on a situation or a selection of a user.”; and
Hong [0119]: “For example, a workload that uses a relatively large computational amount may be allocated to a relatively large number of processing elements and processed thereby, and a workload that uses a relatively small computational amount may be allocated to a relatively small number of processing elements and processed thereby.”; Note: See paragraphs [0023]-[0025] and [0051] to see that each accelerator core (of the plurality of accelerator cores on one accelerator chip) may include one or more processing elements (PEs) configured to perform operations according to the neural network).
wherein the second DLA chip comprises a second plurality of DLA cores that is greater in quantity than the first plurality of DLA cores. (Hong [0096]: “five of the eight cores are allocated to run layer 1 of model 1 and the remaining three of the eight cores are in an idle status. In an “after” situation in which layer 1 of model 1 starts to run in the above situation, layer 1 of model 2 may run by selecting kernel 2 capable of maximally using the remaining three cores.”; and
Hong [0097]: “FIG. 8 illustrates another example of dynamically allocating resources.”; and
Hong [0056]: “the plurality of neural networks executed on the accelerator 120 may be neural networks in different structures.”;
Hong [0048]: “The request may be for data inference based on the neural network and may cause the accelerator 120 to execute, that is, run (hereinafter, the terms “execute” and “run” and modifications thereof may be interchangeably used) the neural network and to acquire a data inference result for, for example, an object recognition, a pattern recognition, a computer vision, a voice recognition, a machine translation, a machine interpretation, a recommendation service, a personalized service, a video processing, and/or autonomous driving.”; Note: Different applications have different computational capabilities, features, or functions, requiring different accelerator/core performance; and
Hong [0077]: “The scheduler may select a candidate kernel estimated to have the best accelerator performance when running each candidate kernel based on a current situation of the accelerator and kernel information of candidate kernels. A criterion of performance may include any one or any combination of a throughput of each model, a latency, a fairness, a power consumption amount, and a utilization rate of the accelerator based on a situation or a selection of a user.”;
Hong [0119]: “For example, a workload that uses a relatively large computational amount may be allocated to a relatively large number of processing elements and processed thereby, and a workload that uses a relatively small computational amount may be allocated to a relatively small number of processing elements and processed thereby.”;)
Hong does not teach determine whether execution of a computational layer of a first deep learning accelerator (DLA) model on representative data, using a first DLA chip, yields results having at least a threshold confidence value,
responsive to determining that execution of the computational layer of the first DLA model yields results having less than the threshold confidence value, execute a second DLA model, using a second DLA chip, on results from execution of the computational layer of the first DLA model,
However, Gadelrab teaches determine whether execution of a computational layer of a first deep learning accelerator (DLA) model on representative data, using a first DLA chip, yields results having at least a threshold confidence value, (Gadelrab [0120]: “If the system determines, at block 514, that high efficiency model accuracy is acceptable and overflow/underflow statistics are within the threshold value, the system may return to block 502 and execute a subsequent inference using the high efficiency model.”
Gadelrab [0121]: “Otherwise, at block 516, the system exits the high efficiency mode and executes subsequent inferences using the high accuracy mode (e.g., as discussed above with respect to FIG. 4)”;
Gadelrab [0053]: “To allow for inferences to be performed on efficient cores while maintaining a sufficient degree of accuracy in the results generated by executing inferences using a quantized machine learning model, embodiments described herein provide various techniques for adaptively quantizing parameters used by a machine learning model and switching between performing inferences using a high accuracy model on high performance cores and performing inferences using quantized models on efficient cores.”;)
responsive to determining that execution of the computational layer of the first DLA model yields results having less than the threshold confidence value, execute a second DLA model, using a second DLA chip, on results from execution of the computational layer of the first DLA model, (Gadelrab [0113]: “As illustrated, executing inferences in a high efficiency mode begins at block 502, where a system receives an inference request. The inference request generally includes input on which the information is to be performed.”;
Gadelrab [0114]: “Generally, the executed inference may be performed using a selected high efficiency model (e.g., a machine learning model and weights quantized to a data type that involves less complex computation”;
Gadelrab [0118]: “At block 514, the system determines whether the high efficiency model accuracy and overflow/underflow statistics are within an acceptable range.”;
Gadelrab [0120]: “If the system determines, at block 514, that high efficiency model accuracy is acceptable and overflow/underflow statistics are within the threshold value, the system may return to block 502 and execute a subsequent inference using the high efficiency model.”;
Gadelrab [0121]: “Otherwise, at block 516, the system exits the high efficiency mode and executes subsequent inferences using the high accuracy mode (e.g., as discussed above with respect to FIG. 4)”;
Gadelrab [0053]: “To allow for inferences to be performed on efficient cores while maintaining a sufficient degree of accuracy in the results generated by executing inferences using a quantized machine learning model, embodiments described herein provide various techniques for adaptively quantizing parameters used by a machine learning model and switching between performing inferences using a high accuracy model on high performance cores and performing inferences using quantized models on efficient cores.”;
Gadelrab [0057]: “The quantized weight information may be in a format for which computation is less intensive than computation using the received weight information.”;)
It would have been obvious before the effective filing date to combine the accelerators executing neural networks of Hong with the switching between different models having different resource requirements of Gadelrab to improve resource efficiency by adaptively optimizing/switching the use of computational (core) resources based on acceptable computational and performance results (Gadelrab, [0053]). Hong and Gadelrab are analogous art because they both concern optimizing neural network models.
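For illustration only, the mode-switching flow quoted above from Gadelrab [0113]-[0121] (blocks 502 through 516) may be sketched as follows. This is a minimal sketch and not Gadelrab's implementation: the names `ModeSwitcher`, `efficient_model`, `accurate_model`, and `max_error` are hypothetical, the accuracy check is assumed to be a comparison against the high accuracy result, and the overflow/underflow statistics are omitted.

```python
from dataclasses import dataclass
from typing import Callable, Tuple


@dataclass
class ModeSwitcher:
    """Hypothetical stand-in for the high efficiency / high accuracy flow."""
    efficient_model: Callable[[float], float]  # stand-in for the quantized model
    accurate_model: Callable[[float], float]   # stand-in for the high accuracy model
    max_error: float                           # assumed accuracy tolerance
    high_efficiency: bool = True

    def infer(self, x: float) -> Tuple[str, float]:
        # Once high efficiency mode has been exited, use the high accuracy model.
        if not self.high_efficiency:
            return "high_accuracy", self.accurate_model(x)
        # Block 502 onward: run the inference on the high efficiency model.
        result = self.efficient_model(x)
        # Block 514 (assumed metric): deviation from the high accuracy result.
        error = abs(result - self.accurate_model(x))
        if error > self.max_error:
            # Block 516: subsequent inferences use the high accuracy mode.
            self.high_efficiency = False
        return "high_efficiency", result
```

In this sketch the inference that fails the check is still returned from the high efficiency model, and only subsequent inferences switch, which follows the wording of the quoted paragraph [0121].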
Claim 19 is rejected over Hong and Gadelrab with the incorporation of claim 18.
Regarding claim 19, Hong does not teach further storing instructions to determine whether execution of the computational layer of the first DLA model on the representative data yields results having at least the threshold confidence value at a compile time.
However, Gadelrab teaches further storing instructions to determine whether execution of the computational layer of the first DLA model on the representative data yields results having at least the threshold confidence value at a compile time. (Gadelrab [0120]: “If the system determines, at block 514, that high efficiency model accuracy is acceptable and overflow/underflow statistics are within the threshold value, the system may return to block 502 and execute a subsequent inference using the high efficiency model.”;
Gadelrab [0121]: “Otherwise, at block 516, the system exits the high efficiency mode and executes subsequent inferences using the high accuracy mode (e.g., as discussed above with respect to FIG. 4)”;
Gadelrab [0053]: “To allow for inferences to be performed on efficient cores while maintaining a sufficient degree of accuracy in the results generated by executing inferences using a quantized machine learning model, embodiments described herein provide various techniques for adaptively quantizing parameters used by a machine learning model and switching between performing inferences using a high accuracy model on high performance cores and performing inferences using quantized models on efficient cores.”;)
It would have been obvious before the effective filing date to combine the accelerators executing neural networks of Hong with the switching between different models having different resource requirements of Gadelrab to improve resource efficiency by adaptively optimizing/switching the use of computational (core) resources based on acceptable computational and performance results (Gadelrab, [0053]). Hong and Gadelrab are analogous art because they both concern optimizing neural network models.
Claims 11, 12, 21, 22, 23, 24 and 25 are rejected under 35 U.S.C. 103 as being unpatentable over Hong and Gadelrab in view of Teerapittayanon (BranchyNet: Fast Inference via Early Exiting from Deep Neural Networks); hereinafter Teerapittayanon
Claim 11 is rejected over Hong, Gadelrab and Teerapittayanon.
Regarding claim 11, Hong teaches a non-transitory machine-readable medium storing instructions executable by a processing resource to: (Hong [0019]: “A non-transitory computer-readable record medium may store instructions that, when executed by one or more processors, configure the one or more processors to perform the method.”;)
Hong does not teach determine whether a first confidence value of first results [from execution of a first computational layer] of a first deep learning accelerator (DLA) model is at least a threshold confidence value, wherein a first DLA chip of a DLA package executes the first DLA model;
from execution of a first computational layer
responsive to the first confidence value being at least the threshold confidence value, execute a second computational layer of the first DLA model using the first results as inputs; and
responsive to the first confidence value being less than the threshold confidence value:
provide signaling indicative of a wake up request to a second DLA chip of the DLA package comprising a greater quantity of DLA cores than the first DLA chip; and
executing a computational layer of the second DLA model using the first results as inputs and the second DLA chip.
However, Gadelrab teaches determine whether a first confidence value of first results [from execution of a first computational layer] of a first deep learning accelerator (DLA) model is at least a threshold confidence value, wherein a first DLA chip of a DLA package executes the first DLA model; (Gadelrab [0120]: “If the system determines, at block 514, that high efficiency model accuracy is acceptable and overflow/underflow statistics are within the threshold value, the system may return to block 502 and execute a subsequent inference using the high efficiency model.”;
Gadelrab [0121]: “Otherwise, at block 516, the system exits the high efficiency mode and executes subsequent inferences using the high accuracy mode (e.g., as discussed above with respect to FIG. 4)”;)
responsive to the first confidence value being less than the threshold confidence value:
provide signaling indicative of a wake up request to a second DLA chip of the DLA package comprising a greater quantity of DLA cores than the first DLA chip; and (Gadelrab [0113]: “As illustrated, executing inferences in a high efficiency mode begins at block 502, where a system receives an inference request. The inference request generally includes input on which the information is to be performed.”; and
Gadelrab [0114]: “Generally, the executed inference may be performed using a selected high efficiency model (e.g., a machine learning model and weights quantized to a data type that involves less complex computation”;
Gadelrab [0118]: “At block 514, the system determines whether the high efficiency model accuracy and overflow/underflow statistics are within an acceptable range.”;
Gadelrab [0120]: “If the system determines, at block 514, that high efficiency model accuracy is acceptable and overflow/underflow statistics are within the threshold value, the system may return to block 502 and execute a subsequent inference using the high efficiency model.”;
Gadelrab [0121]: “Otherwise, at block 516, the system exits the high efficiency mode and executes subsequent inferences using the high accuracy mode (e.g., as discussed above with respect to FIG. 4)”; and
Gadelrab [0053]: “To allow for inferences to be performed on efficient cores while maintaining a sufficient degree of accuracy in the results generated by executing inferences using a quantized machine learning model, embodiments described herein provide various techniques for adaptively quantizing parameters used by a machine learning model and switching between performing inferences using a high accuracy model on high performance cores and performing inferences using quantized models on efficient cores.”;
Gadelrab [0057]: “The quantized weight information may be in a format for which computation is less intensive than computation using the received weight information.”;)
executing a computational layer of the second DLA model using the first results as inputs and the second DLA chip. (Gadelrab [0113]: “As illustrated, executing inferences in a high efficiency mode begins at block 502, where a system receives an inference request. The inference request generally includes input on which the information is to be performed.”; and
Gadelrab [0114]: “Generally, the executed inference may be performed using a selected high efficiency model (e.g., a machine learning model and weights quantized to a data type that involves less complex computation”;
Gadelrab [0118]: “At block 514, the system determines whether the high efficiency model accuracy and overflow/underflow statistics are within an acceptable range.”;
Gadelrab [0120]: “If the system determines, at block 514, that high efficiency model accuracy is acceptable and overflow/underflow statistics are within the threshold value, the system may return to block 502 and execute a subsequent inference using the high efficiency model.”;
Gadelrab [0121]: “Otherwise, at block 516, the system exits the high efficiency mode and executes subsequent inferences using the high accuracy mode (e.g., as discussed above with respect to FIG. 4)”; and
Gadelrab [0053]: “To allow for inferences to be performed on efficient cores while maintaining a sufficient degree of accuracy in the results generated by executing inferences using a quantized machine learning model, embodiments described herein provide various techniques for adaptively quantizing parameters used by a machine learning model and switching between performing inferences using a high accuracy model on high performance cores and performing inferences using quantized models on efficient cores.”;
Gadelrab [0057]: “The quantized weight information may be in a format for which computation is less intensive than computation using the received weight information.”;)
It would have been obvious before the effective filing date to combine the accelerators executing neural networks of Hong with the switching between different models having different resource requirements of Gadelrab to improve resource efficiency by adaptively optimizing/switching the use of computational (core) resources based on acceptable computational and performance results (Gadelrab, [0053]). Hong and Gadelrab are analogous art because they both concern optimizing neural network models.
Teerapittayanon teaches from execution of a first computational layer (Figure 1 of Teerapittayanon discloses the portion of the plurality of layers (exit 3) comprising more layers than either of exits 1 or 2 (the different plurality of layers). Further, section I. of Teerapittayanon teaches by exiting at earlier stages BranchyNet significantly reduces the runtime and energy use of inference. The resource threshold is not defined and, for purposes of examination, the energy use of the network exiting at exit 3 is considered the threshold.)
responsive to the first confidence value being at least the threshold confidence value, execute a second computational layer of the first DLA model using the first results as inputs; and (Teerapittayanon [page 2]: “At each exit point, BranchyNet uses the entropy of a classification result (e.g., by softmax) as a measure of confidence in the prediction. If the entropy of a test sample is below a learned threshold value, meaning that the classifier is confident in the prediction, the sample exits the network with the prediction result at this exit point, and is not processed by the higher network layers. If the entropy value is above the threshold, then the classifier at this exit point is deemed not confident, and the sample continues to the next exit point in the network. If the sample reaches the last exit point, which is the last layer of the baseline neural network, it always performs classification”;)
It would have been obvious before the effective filing date to combine the accelerators executing neural networks of Hong with the early exit branch computational layers of Teerapittayanon to reduce computation costs and save energy (Teerapittayanon, page 2). Hong and Teerapittayanon are analogous art because they both concern optimizing neural network models.
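For illustration only, the exit decision quoted above from Teerapittayanon (page 2) may be sketched as follows. This is a minimal sketch under stated assumptions and not the BranchyNet code: the function names and the per-exit `thresholds` are hypothetical, and the logits at every exit point are supplied up front for simplicity, whereas in the quoted description a sample that exits early is not processed by the higher network layers.

```python
import math
from typing import List, Sequence, Tuple


def softmax(logits: Sequence[float]) -> List[float]:
    # Subtract the maximum for numerical stability.
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs: Sequence[float]) -> float:
    # Entropy of the classification result; lower means more confident.
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def early_exit(exit_logits: Sequence[Sequence[float]],
               thresholds: Sequence[float]) -> Tuple[int, int]:
    """Return (index of the exit point taken, predicted class)."""
    last = len(exit_logits) - 1
    for i, logits in enumerate(exit_logits):
        probs = softmax(logits)
        # Exit when entropy is below the threshold; the last exit always classifies.
        if i == last or entropy(probs) < thresholds[i]:
            return i, max(range(len(probs)), key=probs.__getitem__)
    raise ValueError("at least one exit point is required")
```

A uniform two-class result has entropy ln 2 (about 0.693), so with a hypothetical threshold of 0.5 such a sample would continue to the next exit point.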
Claim 12 is rejected over Hong, Gadelrab and Teerapittayanon with the incorporation of claim 11.
Regarding claim 12, Hong does not teach determine whether a second confidence value of second results from execution of the second computational layer of the first DLA model is at least the threshold confidence value;
responsive to the second confidence value being at least the threshold confidence value, execute a third computational layer of the first DLA model using the second results as inputs; and
responsive to the second confidence value being less than the threshold confidence value:
provide the signaling indicative of the wake up request to the second DLA chip; and
execute the computational layer of the second DLA model using the second results as inputs.
However, Teerapittayanon teaches determine whether a second confidence value of second results from execution of the second computational layer of the first DLA model is at least the threshold confidence value;
responsive to the second confidence value being at least the threshold confidence value, execute a third computational layer of the first DLA model using the second results as inputs; and
responsive to the second confidence value being less than the threshold confidence value: (Teerapittayanon [page 2]: “At each exit point, BranchyNet uses the entropy of a classification result (e.g., by softmax) as a measure of confidence in the prediction. If the entropy of a test sample is below a learned threshold value, meaning that the classifier is confident in the prediction, the sample exits the network with the prediction result at this exit point, and is not processed by the higher network layers. If the entropy value is above the threshold, then the classifier at this exit point is deemed not confident, and the sample continues to the next exit point in the network. If the sample reaches the last exit point, which is the last layer of the baseline neural network, it always performs classification”; Note: Figure 1 of Teerapittayanon discloses the portion of the plurality of layers (exit 3) comprising more layers than either of exits 1 or 2 (the different plurality of layers). Further, section I. of Teerapittayanon teaches by exiting at earlier stages BranchyNet significantly reduces the runtime and energy use of inference. The resource threshold is not defined and, for purposes of examination, the energy use of the network exiting at exit 3 is considered the threshold.)
It would have been obvious before the effective filing date to combine the accelerators executing neural networks of Hong with the early exit branch computational layers of Teerapittayanon to reduce computation costs and save energy (Teerapittayanon, page 2). Hong and Teerapittayanon are analogous art because they both concern optimizing neural network models.
Gadelrab teaches provide the signaling indicative of the wake up request to the second DLA chip; and (Gadelrab [0113]: “As illustrated, executing inferences in a high efficiency mode begins at block 502, where a system receives an inference request. The inference request generally includes input on which the information is to be performed.”; and
Gadelrab [0114]: “Generally, the executed inference may be performed using a selected high efficiency model (e.g., a machine learning model and weights quantized to a data type that involves less complex computation”;
Gadelrab [0118]: “At block 514, the system determines whether the high efficiency model accuracy and overflow/underflow statistics are within an acceptable range.”;
Gadelrab [0120]: “If the system determines, at block 514, that high efficiency model accuracy is acceptable and overflow/underflow statistics are within the threshold value, the system may return to block 502 and execute a subsequent inference using the high efficiency model.”;
Gadelrab [0121]: “Otherwise, at block 516, the system exits the high efficiency mode and executes subsequent inferences using the high accuracy mode (e.g., as discussed above with respect to FIG. 4)”;
Gadelrab [0053]: “To allow for inferences to be performed on efficient cores while maintaining a sufficient degree of accuracy in the results generated by executing inferences using a quantized machine learning model, embodiments described herein provide various techniques for adaptively quantizing parameters used by a machine learning model and switching between performing inferences using a high accuracy model on high performance cores and performing inferences using quantized models on efficient cores.”;
Gadelrab [0057]: “The quantized weight information may be in a format for which computation is less intensive than computation using the received weight information.”;)
execute the computational layer of the second DLA model using the second results as inputs. (Gadelrab [0113]: “As illustrated, executing inferences in a high efficiency mode begins at block 502, where a system receives an inference request. The inference request generally includes input on which the information is to be performed.”;
Gadelrab [0114]: “Generally, the executed inference may be performed using a selected high efficiency model (e.g., a machine learning model and weights quantized to a data type that involves less complex computation”;
Gadelrab [0118]: “At block 514, the system determines whether the high efficiency model accuracy and overflow/underflow statistics are within an acceptable range.”;
Gadelrab [0120]: “If the system determines, at block 514, that high efficiency model accuracy is acceptable and overflow/underflow statistics are within the threshold value, the system may return to block 502 and execute a subsequent inference using the high efficiency model.”;
Gadelrab [0121]: “Otherwise, at block 516, the system exits the high efficiency mode and executes subsequent inferences using the high accuracy mode (e.g., as discussed above with respect to FIG. 4)”;
Gadelrab [0053]: “To allow for inferences to be performed on efficient cores while maintaining a sufficient degree of accuracy in the results generated by executing inferences using a quantized machine learning model, embodiments described herein provide various techniques for adaptively quantizing parameters used by a machine learning model and switching between performing inferences using a high accuracy model on high performance cores and performing inferences using quantized models on efficient cores.”;
Gadelrab [0057]: “The quantized weight information may be in a format for which computation is less intensive than computation using the received weight information.”;)
It would have been obvious before the effective filing date to combine the accelerators executing neural networks of Hong with the switching between different models having different resource requirements of Gadelrab to improve resource efficiency by adaptively optimizing/switching the use of computational (core) resources based on acceptable computational and performance results (Gadelrab, [0053]). Hong and Gadelrab are analogous art because they both concern optimizing neural network models.
Claim 21 is rejected over Hong, Gadelrab and Teerapittayanon.
Regarding claim 21, Hong teaches executing, at the compile time and using a first DLA chip of the DLA package, a first number of computational layers of a first DLA model on representative data; (Hong [0023]: “generating a plurality of kernels for each of a first neural network model and a second neural network model; and running a kernel of the first model on a number of cores of an available resource of an accelerator; in response to a start of the running, running a kernel of the second model on a remaining number of cores of the available resource of an accelerator; generating one or more output feature maps based on the running of the kernels.”;
Hong [0024]: “The running of the kernel of the second model may include running the kernel of the second model in response to determining that a collision will not occur between memory access patterns during the running of the kernel of the first model and the running of the kernel of the second model.”;
Hong [0025]: “For each of the kernel of the first model and the kernel of the second model, the kernel may be selected for running from among the plurality of kernels based on kernel information including any one or more of: a number of accelerator cores”;
Hong [0056]: “in response to a plurality of requests received at the host processor 110, the accelerator 120 may execute a plurality of neural networks according to kernels generated by the host processor 110. Here, the plurality of neural networks executed on the accelerator 120 may be neural networks in different structures … Herein, a neural network may also be referred to as a model for clarity of description.”;
Hong [0058]: “FIG. 2 illustrates an example of a resource allocation process for executing a multi-model on a multi-core accelerator.”; and
Hong [0055]: “The neural network may provide an optimal output corresponding to an input by mapping an input and an output having a nonlinear relationship based on deep learning. The deep learning may be a machine learning scheme for solving a given problem from a big data set …”);
executing a second DLA model, using the second DLA chip of the DLA package, on results from execution of the first number of computational layers of the first DLA model on the representative data, (Hong [0096]: “five of the eight cores are allocated to run layer 1 of model 1 and the remaining three of the eight cores are in an idle status. In an “after” situation in which layer 1 of model 1 starts to run in the above situation, layer 1 of model 2 may run by selecting kernel 2 capable of maximally using the remaining three cores.”; and
Hong [0097]: “FIG. 8 illustrates another example of dynamically allocating resources.”; and
Hong [0056]: “the plurality of neural networks executed on the accelerator 120 may be neural networks in different structures.”; and
Hong [0048]: “The request may be for data inference based on the neural network and may cause the accelerator 120 to execute, that is, run (hereinafter, the terms “execute” and “run” and modifications thereof may be interchangeably used) the neural network and to acquire a data inference result for, for example, an object recognition, a pattern recognition, a computer vision, a voice recognition, a machine translation, a machine interpretation, a recommendation service, a personalized service, a video processing, and/or autonomous driving.”; Note: Different applications have different computational capabilities, features, or functions, requiring different accelerator/core performance; and
Hong [0077]: “The scheduler may select a candidate kernel estimated to have the best accelerator performance when running each candidate kernel based on a current situation of the accelerator and kernel information of candidate kernels. A criterion of performance may include any one or any combination of a throughput of each model, a latency, a fairness, a power consumption amount, and a utilization rate of the accelerator based on a situation or a selection of a user.”; and
Hong [0119]: “For example, a workload that uses a relatively large computational amount may be allocated to a relatively large number of processing elements and processed thereby, and a workload that uses a relatively small computational amount may be allocated to a relatively small number of processing elements and processed thereby.”)
wherein the first DLA chip comprises a first plurality of DLA cores and (Hong [0056]: “the plurality of neural networks executed on the accelerator 120 may be neural networks in different structures”; and
Hong [0048]: “The request may be for data inference based on the neural network and may cause the accelerator 120 to execute, that is, run (hereinafter, the terms “execute” and “run” and modifications thereof may be interchangeably used) the neural network and to acquire a data inference result for, for example, an object recognition, a pattern recognition, a computer vision, a voice recognition, a machine translation, a machine interpretation, a recommendation service, a personalized service, a video processing, and/or autonomous driving”; Note: Different applications have different computational capabilities, features, or functions, requiring different accelerator/core performance; and
Hong [0077]: “The scheduler may select a candidate kernel estimated to have the best accelerator performance when running each candidate kernel based on a current situation of the accelerator and kernel information of candidate kernels. A criterion of performance may include any one or any combination of a throughput of each model, a latency, a fairness, a power consumption amount, and a utilization rate of the accelerator based on a situation or a selection of a user.”; and
Hong [0119]: “For example, a workload that uses a relatively large computational amount may be allocated to a relatively large number of processing elements and processed thereby, and a workload that uses a relatively small computational amount may be allocated to a relatively small number of processing elements and processed thereby.”; Note: See paragraphs [0023]-[0025] and [0051] to see that each accelerator core (of the plurality of accelerator cores on one accelerator chip) may include one or more processing elements (PEs) configured to perform operations according to the neural network).
the second DLA chip comprises a second plurality of DLA cores greater in quantity than the first plurality of DLA cores; and (Hong [0096]: “five of the eight cores are allocated to run layer 1 of model 1 and the remaining three of the eight cores are in an idle status. In an “after” situation in which layer 1 of model 1 starts to run in the above situation, layer 1 of model 2 may run by selecting kernel 2 capable of maximally using the remaining three cores.”; and
Hong [0097]: “FIG. 8 illustrates another example of dynamically allocating resources.”; and
Hong [0056]: “the plurality of neural networks executed on the accelerator 120 may be neural networks in different structures.”; and
Hong [0048]: “The request may be for data inference based on the neural network and may cause the accelerator 120 to execute, that is, run (hereinafter, the terms “execute” and “run” and modifications thereof may be interchangeably used) the neural network and to acquire a data inference result for, for example, an object recognition, a pattern recognition, a computer vision, a voice recognition, a machine translation, a machine interpretation, a recommendation service, a personalized service, a video processing, and/or autonomous driving.”; Note: Different applications have different computational capabilities, features, or functions, requiring different accelerator/core performance; and
Hong [0077]: “The scheduler may select a candidate kernel estimated to have the best accelerator performance when running each candidate kernel based on a current situation of the accelerator and kernel information of candidate kernels. A criterion of performance may include any one or any combination of a throughput of each model, a latency, a fairness, a power consumption amount, and a utilization rate of the accelerator based on a situation or a selection of a user.”; and
Hong [0119]: “For example, a workload that uses a relatively large computational amount may be allocated to a relatively large number of processing elements and processed thereby, and a workload that uses a relatively small computational amount may be allocated to a relatively small number of processing elements and processed thereby.”;)
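For illustration only, the core allocation quoted above from Hong [0096] (five of eight cores running layer 1 of model 1, with a kernel of model 2 selected to use the remaining three) may be sketched as follows. This is a minimal sketch under stated assumptions: the function `select_kernel` and the per-kernel core counts are hypothetical, and core count is used as the only selection criterion, whereas Hong [0077] describes selection based on estimated accelerator performance.

```python
from typing import Dict, Optional


def select_kernel(candidate_cores: Dict[str, int], idle_cores: int) -> Optional[str]:
    """Pick the candidate kernel using the most cores without exceeding idle_cores."""
    fitting = {name: n for name, n in candidate_cores.items() if n <= idle_cores}
    if not fitting:
        return None  # no candidate kernel fits in the idle cores
    return max(fitting, key=fitting.get)


total_cores = 8
model1_cores = 5                    # cores running layer 1 of model 1
idle = total_cores - model1_cores   # three cores remain idle

# Hypothetical core counts for the candidate kernels of model 2.
candidates = {"kernel 1": 1, "kernel 2": 3, "kernel 3": 6}
chosen = select_kernel(candidates, idle)
```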
Hong does not teach determining which computational layers of a first deep learning accelerator (DLA) model to execute on data received by a DLA package subsequent to a compile time by:
determining whether results from execution of the second DLA model on results from execution of the first number of computational layers of the first DLA model have a confidence value that is at least a threshold confidence value.
However, Teerapittayanon teaches determining which computational layers of a first deep learning accelerator (DLA) model to execute on data received by a DLA package subsequent to a compile time by:
determining whether results from execution of the second DLA model on results from execution of the first number of computational layers of the first DLA model have a confidence value that is at least a threshold confidence value. (Teerapittayanon [page 2]: “At each exit point, BranchyNet uses the entropy of a classification result (e.g., by softmax) as a measure of confidence in the prediction. If the entropy of a test sample is below a learned threshold value, meaning that the classifier is confident in the prediction, the sample exits the network with the prediction result at this exit point, and is not processed by the higher network layers. If the entropy value is above the threshold, then the classifier at this exit point is deemed not confident, and the sample continues to the next exit point in the network. If the sample reaches the last exit point, which is the last layer of the baseline neural network, it always performs classification”; Note: Figure 1 of Teerapittayanon discloses the portion of the plurality of layers (exit 3) comprising more layers than either of exits 1 or 2 (the different plurality of layers). Further, section I. of Teerapittayanon teaches by exiting at earlier stages BranchyNet significantly reduces the runtime and energy use of inference. The resource threshold is not defined and, for purposes of examination, the energy use of the network exiting at exit 3 is considered the threshold.)
It would have been obvious before the effective filing date to combine the accelerators executing neural networks of Hong with the early exit branch computational layers of Teerapittayanon to reduce computation costs and save energy (Teerapittayanon, page 2). Hong and Teerapittayanon are analogous art because they both concern optimizing neural network models.
Gadelrab teaches determining whether results from execution of the second DLA model on results from execution of the first number of computational layers of the first DLA model have a confidence value that is at least a threshold confidence value. (Gadelrab [0113]: “As illustrated, executing inferences in a high efficiency mode begins at block 502, where a system receives an inference request. The inference request generally includes input on which the information is to be performed.”; and
Gadelrab [0114]: “Generally, the executed inference may be performed using a selected high efficiency model (e.g., a machine learning model and weights quantized to a data type that involves less complex computation”;
Gadelrab [0118]: “At block 514, the system determines whether the high efficiency model accuracy and overflow/underflow statistics are within an acceptable range.”;
Gadelrab [0120]: “If the system determines, at block 514, that high efficiency model accuracy is acceptable and overflow/underflow statistics are within the threshold value, the system may return to block 502 and execute a subsequent inference using the high efficiency model.”;
Gadelrab [0121]: “Otherwise, at block 516, the system exits the high efficiency mode and executes subsequent inferences using the high accuracy mode (e.g., as discussed above with respect to FIG. 4)”;
Gadelrab [0053]: “To allow for inferences to be performed on efficient cores while maintaining a sufficient degree of accuracy in the results generated by executing inferences using a quantized machine learning model, embodiments described herein provide various techniques for adaptively quantizing parameters used by a machine learning model and switching between performing inferences using a high accuracy model on high performance cores and performing inferences using quantized models on efficient cores.”;
Gadelrab [0057]: “The quantized weight information may be in a format for which computation is less intensive than computation using the received weight information.”;)
It would have been obvious before the effective filing date to combine the accelerators executing neural networks of Hong with the switching between different models having different resource requirements of Gadelrab to improve resource efficiency by adaptively optimizing/switching the use of computational (core) resources based on acceptable computational and performance results (Gadelrab, [0053]). Hong and Gadelrab are analogous art because they both concern optimizing neural network models.
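Note: For illustration, the mode switching described in the cited paragraphs of Gadelrab ([0113], [0118], [0120], [0121]) can be sketched as follows. This is a minimal sketch under stated assumptions and not code from the reference; the models are represented as plain callables and `is_acceptable` is a hypothetical stand-in for the accuracy and overflow/underflow check of block 514.

```python
def run_inferences(requests, efficient_model, accurate_model, is_acceptable):
    """Serve requests in high efficiency mode until a check fails.

    efficient_model / accurate_model: hypothetical callables standing in for
        the quantized model and the high accuracy model.
    is_acceptable: hypothetical callable (input, output) -> bool standing in
        for the block 514 check.
    """
    results = []
    high_efficiency = True
    for x in requests:
        if high_efficiency:
            y = efficient_model(x)
            results.append(("efficient", y))
            # If the check fails, subsequent inferences use the high accuracy mode.
            if not is_acceptable(x, y):
                high_efficiency = False
        else:
            results.append(("accurate", accurate_model(x)))
    return results

# Toy example: rounding stands in for quantization error (assumed values).
demo = run_inferences(
    [1.1, 2.4, 3.0],
    efficient_model=lambda x: round(x),
    accurate_model=lambda x: x,
    is_acceptable=lambda x, y: abs(x - y) <= 0.25,
)
```

In this sketch the inference that fails the check is still returned from the efficient model, and only later requests are served by the accurate model, consistent with the quoted language "executes subsequent inferences using the high accuracy mode."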
Claim 22 is rejected over Hong, Gadelrab and Teerapittayanon with the incorporation of claim 21.
Regarding claim 22, Hong does not teach further comprising, responsive to determining that the results from execution of the second DLA model on the results from execution of the first number of computational layers of the first DLA model have a confidence value that is at least the threshold confidence value:
executing, subsequent to the compile time and using the first DLA chip, the first number of computational layers of the first DLA model on data received by the DLA package; and
executing the second DLA model on results from execution of the first number of computational layers of the first DLA model.
However, Gadelrab teaches further comprising, responsive to determining that the results from execution of the second DLA model on the results from execution of the first number of computational layers of the first DLA model have a confidence value that is at least the threshold confidence value:
executing, subsequent to the compile time and using the first DLA chip, the first number of computational layers of the first DLA model on data received by the DLA package; and
executing the second DLA model on results from execution of the first number of computational layers of the first DLA model. (Gadelrab [0113]: “As illustrated, executing inferences in a high efficiency mode begins at block 502, where a system receives an inference request. The inference request generally includes input on which the information is to be performed.”;
Gadelrab [0114]: “Generally, the executed inference may be performed using a selected high efficiency model (e.g., a machine learning model and weights quantized to a data type that involves less complex computation”;
Gadelrab [0118]: “At block 514, the system determines whether the high efficiency model accuracy and overflow/underflow statistics are within an acceptable range.”;
Gadelrab [0120]: “If the system determines, at block 514, that high efficiency model accuracy is acceptable and overflow/underflow statistics are within the threshold value, the system may return to block 502 and execute a subsequent inference using the high efficiency model.”;
Gadelrab [0121]: “Otherwise, at block 516, the system exits the high efficiency mode and executes subsequent inferences using the high accuracy mode (e.g., as discussed above with respect to FIG. 4)”;
Gadelrab [0053]: “To allow for inferences to be performed on efficient cores while maintaining a sufficient degree of accuracy in the results generated by executing inferences using a quantized machine learning model, embodiments described herein provide various techniques for adaptively quantizing parameters used by a machine learning model and switching between performing inferences using a high accuracy model on high performance cores and performing inferences using quantized models on efficient cores.”;
Gadelrab [0057]: “The quantized weight information may be in a format for which computation is less intensive than computation using the received weight information.”;)
It would have been obvious before the effective filing date to combine the accelerators executing neural networks of Hong with the switching between different models having different resource requirements of Gadelrab to improve resource efficiency by adaptively optimizing/switching the use of computational (core) resources based on acceptable computational and performance results (Gadelrab, [0053]). Hong and Gadelrab are analogous art because they both concern optimizing neural network models.
Claim 23 is rejected over Hong, Gadelrab and Teerapittayanon with the incorporation of claim 21.
Regarding claim 23, Hong teaches executing, using the first DLA chip, [a second number of computational layers of the first DLA model on the representative data, wherein the second number of computational layers includes an additional computational layer of the first DLA model or excludes a computational layer of the first number of computational layers]; (Hong [0023]: “generating a plurality of kernels for each of a first neural network model and a second neural network model; and running a kernel of the first model on a number of cores of an available resource of an accelerator; in response to a start of the running, running a kernel of the second model on a remaining number of cores of the available resource of an accelerator; generating one or more output feature maps based on the running of the kernels.”
Hong [0024]: “The running of the kernel of the second model may include running the kernel of the second model in response to determining that a collision will not occur between memory access patterns during the running of the kernel of the first model and the running of the kernel of the second model.”
Hong [0025]: “For each of the kernel of the first model and the kernel of the second model, the kernel may be selected for running from among the plurality of kernels based on kernel information including any one or more of: a number of accelerator cores;”
Hong [0056]: “in response to a plurality of requests received at the host processor 110, the accelerator 120 may execute a plurality of neural networks according to kernels generated by the host processor 110. Here, the plurality of neural networks executed on the accelerator 120 may be neural networks in different structures … Herein, a neural network may also be referred to as a model for clarity of description.”;
Hong [0058]: “FIG. 2 illustrates an example of a resource allocation process for executing a multi-model on a multi-core accelerator.”;
Hong [0055]: “The neural network may provide an optimal output corresponding to an input by mapping an input and an output having a nonlinear relationship based on deep learning. The deep learning may be a machine learning scheme for solving a given problem from a big data set …”;)
executing, using the second DLA chip, the second DLA model on results from execution of [the second number of computational layers of] the first DLA model on the representative data; and (Hong [0051]: “The accelerator core 131 may include one or more processing elements (PEs) configured to perform operations according to the neural network. Although FIG. 1 illustrates that a single accelerator core 131 is included in the accelerator chip 130 for clarity of description, a plurality of accelerator cores may be included in the accelerator chip 130 and may process operations”; and Hong [0134]: “components in a described system, architecture, device, or circuit are combined in a different manner, and/or replaced or supplemented by other components or their equivalents.”;
Hong [0096]: “five of the eight cores are allocated to run layer 1 of model 1 and the remaining three of the eight cores are in an idle status. In an “after” situation in which layer 1 of model 1 starts to run in the above situation, layer 1 of model 2 may run by selecting kernel 2 capable of maximally using the remaining three cores.”;
Hong [0097]: “FIG. 8 illustrates another example of dynamically allocating resources.”; and
Hong [0056]: “the plurality of neural networks executed on the accelerator 120 may be neural networks in different structures.”;
Hong [0048]: “The request may be for data inference based on the neural network and may cause the accelerator 120 to execute, that is, run (hereinafter, the terms “execute” and “run” and modifications thereof may be interchangeably used) the neural network and to acquire a data inference result for, for example, an object recognition, a pattern recognition, a computer vision, a voice recognition, a machine translation, a machine interpretation, a recommendation service, a personalized service, a video processing, and/or autonomous driving.”; Note: Different applications have different computational capabilities, features, or functions, requiring different accelerator/core performance; and
Hong [0077]: “The scheduler may select a candidate kernel estimated to have the best accelerator performance when running each candidate kernel based on a current situation of the accelerator and kernel information of candidate kernels. A criterion of performance may include any one or any combination of a throughput of each model, a latency, a fairness, a power consumption amount, and a utilization rate of the accelerator based on a situation or a selection of a user.”; and
Hong [0119]: “example, a workload that uses a relatively large computational amount may be allocated to a relatively large number of processing elements and processed thereby, and a workload that uses a relatively small computational amount may be allocated to a relatively small number of processing elements and processed thereby.”;)
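Note: For illustration, the dynamic allocation described in Hong [0096] (five of eight cores running layer 1 of model 1, with a kernel of model 2 selected to use the remaining three cores) can be sketched as follows. This is a minimal sketch and not code from the reference; the kernel names and the per-kernel core counts are assumptions made for this example.

```python
def select_kernel(candidate_kernels, idle_cores):
    """Pick the candidate kernel that uses the most cores without exceeding
    the idle cores; return None if no candidate fits.

    candidate_kernels: hypothetical mapping of kernel name -> cores required.
    """
    fitting = {name: cores for name, cores in candidate_kernels.items()
               if cores <= idle_cores}
    if not fitting:
        return None
    return max(fitting, key=fitting.get)

TOTAL_CORES = 8
cores_used_by_model_1 = 5                       # layer 1 of model 1
idle = TOTAL_CORES - cores_used_by_model_1      # three idle cores

# Assumed candidate kernels for layer 1 of model 2.
chosen = select_kernel({"kernel 1": 1, "kernel 2": 3, "kernel 3": 8}, idle)
```

With these assumed core counts, the sketch selects "kernel 2", the candidate that maximally uses the three remaining cores.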
Hong does not teach executing, using the first DLA chip, a second number of computational layers of the first DLA model on the representative data, wherein the second number of computational layers includes an additional computational layer of the first DLA model or excludes a computational layer of the first number of computational layers;
executing, using the second DLA chip, the second DLA model on results from execution of the second number of computational layers of the first DLA model on the representative data; and
determining whether results from execution of the second DLA model on results from execution of the second number of computational layers of the first DLA model have a confidence value that is at least the threshold confidence value.
However, Gadelrab teaches executing, using the first DLA chip, a second number of computational layers of the first DLA model on the representative data, wherein the second number of computational layers includes an additional computational layer of the first DLA model or excludes a computational layer of the first number of computational layers;
executing, using the second DLA chip, the second DLA model on results from execution of the second number of computational layers of the first DLA model on the representative data; and
determining whether results from execution of the second DLA model on results from execution of the second number of computational layers of the first DLA model have a confidence value that is at least the threshold confidence value. (Gadelrab [0114]: “Generally, the executed inference may be performed using a selected high efficiency model (e.g., a machine learning model and weights quantized to a data type that involves less complex computation”;
Gadelrab [0118]: “the system determines whether the high efficiency model accuracy and overflow/underflow statistics are within an acceptable range.”;
Gadelrab [0120]: “If the system determines, at block 514, that high efficiency model accuracy is acceptable and overflow/underflow statistics are within the threshold value, the system may return to block 502 and execute a subsequent inference using the high efficiency model.”;
Gadelrab [0121]: “Otherwise, at block 516, the system exits the high efficiency mode and executes subsequent inferences using the high accuracy mode (e.g., as discussed above with respect to FIG. 4)”; and
Gadelrab [0053]: “To allow for inferences to be performed on efficient cores while maintaining a sufficient degree of accuracy in the results generated by executing inferences using a quantized machine learning model, embodiments described herein provide various techniques for adaptively quantizing parameters used by a machine learning model and switching between performing inferences using a high accuracy model on high performance cores and performing inferences using quantized models on efficient cores.”;
Gadelrab [0057]: “The quantized weight information may be in a format for which computation is less intensive than computation using the received weight information.”;)
It would have been obvious before the effective filing date to combine the accelerators executing neural networks of Hong with the switching between different models having different resource requirements of Gadelrab to improve resource efficiency by adaptively optimizing/switching the use of computational (core) resources based on acceptable computational and performance results (Gadelrab, [0053]). Hong and Gadelrab are analogous art because they both concern optimizing neural network models.
Hong does not teach a second number of computational layers of the first DLA model on the representative data, wherein the second number of computational layers includes an additional computational layer of the first DLA model or excludes a computational layer of the first number of computational layers.
However, Teerapittayanon teaches a second number of computational layers of the first DLA model on the representative data, wherein the second number of computational layers includes an additional computational layer of the first DLA model or excludes a computational layer of the first number of computational layers;
[the second number of computational layers of] (Teerapittayanon [page 2]: “At each exit point, BranchyNet uses the entropy of a classification result (e.g., by softmax) as a measure of confidence in the prediction. If the entropy of a test sample is below a learned threshold value, meaning that the classifier is confident in the prediction, the sample exits the network with the prediction result at this exit point, and is not processed by the higher network layers. If the entropy value is above the threshold, then the classifier at this exit point is deemed not confident, and the sample continues to the next exit point in the network. If the sample reaches the last exit point, which is the last layer of the baseline neural network, it always performs classification”; Note: Figure 1 of Teerapittayanon discloses the portion of the plurality of layers (exit 3) comprising more layers than either of exits 1 or 2 (the different plurality of layers). Further, section I. of Teerapittayanon teaches by exiting at earlier stages BranchyNet significantly reduces the runtime and energy use of inference. The resource threshold is not defined and, for purposes of examination, the energy use of the network exiting at exit 3 is considered the threshold.)
It would have been obvious before the effective filing date to combine the accelerators executing neural networks of Hong with the early exit branch computational layers of Teerapittayanon to reduce computation costs and save energy (Teerapittayanon, page 2). Hong and Teerapittayanon are analogous art because they both concern optimizing neural network models.
Claim 24 is rejected over Hong, Gadelrab and Teerapittayanon with the incorporation of claim 21.
Regarding claim 24, Hong does not teach further comprising, responsive to determining that the results from execution of the second DLA model on the results from execution [of the second number of computational layers] of the first DLA model have a confidence value that is at least the threshold confidence value:
executing, subsequent to the compile time and using the first DLA chip, [the second number of computational layers of] the first DLA model on data received by the DLA package.
However, Gadelrab teaches further comprising, responsive to determining that the results from execution of the second DLA model on the results from execution [of the second number of computational layers] of the first DLA model have a confidence value that is at least the threshold confidence value:
executing, subsequent to the compile time and using the first DLA chip, [the second number of computational layers of] the first DLA model on data received by the DLA package. (Gadelrab [0113]: “As illustrated, executing inferences in a high efficiency mode begins at block 502, where a system receives an inference request. The inference request generally includes input on which the information is to be performed.”;
Gadelrab [0114]: “Generally, the executed inference may be performed using a selected high efficiency model (e.g., a machine learning model and weights quantized to a data type that involves less complex computation”;
Gadelrab [0118]: “the system determines whether the high efficiency model accuracy and overflow/underflow statistics are within an acceptable range.”;
Gadelrab [0120]: “If the system determines, at block 514, that high efficiency model accuracy is acceptable and overflow/underflow statistics are within the threshold value, the system may return to block 502 and execute a subsequent inference using the high efficiency model.”;
Gadelrab [0121]: “Otherwise, at block 516, the system exits the high efficiency mode and executes subsequent inferences using the high accuracy mode (e.g., as discussed above with respect to FIG. 4)”;
Gadelrab [0053]: “To allow for inferences to be performed on efficient cores while maintaining a sufficient degree of accuracy in the results generated by executing inferences using a quantized machine learning model, embodiments described herein provide various techniques for adaptively quantizing parameters used by a machine learning model and switching between performing inferences using a high accuracy model on high performance cores and performing inferences using quantized models on efficient cores.”;
Gadelrab [0057]: “The quantized weight information may be in a format for which computation is less intensive than computation using the received weight information.”;)
It would have been obvious before the effective filing date to combine the accelerators executing neural networks of Hong with the switching between different models having different resource requirements of Gadelrab to improve resource efficiency by adaptively optimizing/switching the use of computational (core) resources based on acceptable computational and performance results (Gadelrab, [0053]). Hong and Gadelrab are analogous art because they both concern optimizing neural network models.
Teerapittayanon teaches the second number of computational layers of (Teerapittayanon [page 2]: “At each exit point, BranchyNet uses the entropy of a classification result (e.g., by softmax) as a measure of confidence in the prediction. If the entropy of a test sample is below a learned threshold value, meaning that the classifier is confident in the prediction, the sample exits the network with the prediction result at this exit point, and is not processed by the higher network layers. If the entropy value is above the threshold, then the classifier at this exit point is deemed not confident, and the sample continues to the next exit point in the network. If the sample reaches the last exit point, which is the last layer of the baseline neural network, it always performs classification”; Note: Figure 1 of Teerapittayanon discloses the portion of the plurality of layers (exit 3) comprising more layers than either of exits 1 or 2 (the different plurality of layers). Further, section I. of Teerapittayanon teaches by exiting at earlier stages BranchyNet significantly reduces the runtime and energy use of inference. The resource threshold is not defined and, for purposes of examination, the energy use of the network exiting at exit 3 is considered the threshold.)
It would have been obvious before the effective filing date to combine the accelerators executing neural networks of Hong with the early exit branch computational layers of Teerapittayanon to reduce computation costs and save energy (Teerapittayanon, page 2). Hong and Teerapittayanon are analogous art because they both concern optimizing neural network models.
Claim 25 is rejected over Hong, Gadelrab and Teerapittayanon with the incorporation of claim 21.
Regarding claim 25, Hong does not teach further comprising, responsive to determining that the results from execution of the second DLA model on the results from execution of the first number of computational layers of the first DLA model have a confidence value that is less than the threshold confidence value:
executing a number of computational layers of the second DLA model, using the second DLA chip, on the results from execution [of the second number of computational layers] of the first DLA model on the representative data,
However, Gadelrab teaches further comprising, responsive to determining that the results from execution of the second DLA model on the results from execution of the first number of computational layers of the first DLA model have a confidence value that is less than the threshold confidence value:
executing a number of computational layers of the second DLA model, using the second DLA chip, on the results from execution [of the second number of computational layers] of the first DLA model on the representative data, (Gadelrab [0113]: “As illustrated, executing inferences in a high efficiency mode begins at block 502, where a system receives an inference request. The inference request generally includes input on which the information is to be performed.”;
Gadelrab [0114]: “Generally, the executed inference may be performed using a selected high efficiency model (e.g., a machine learning model and weights quantized to a data type that involves less complex computation”;
Gadelrab [0118]: “the system determines whether the high efficiency model accuracy and overflow/underflow statistics are within an acceptable range.”;
Gadelrab [0120]: “If the system determines, at block 514, that high efficiency model accuracy is acceptable and overflow/underflow statistics are within the threshold value, the system may return to block 502 and execute a subsequent inference using the high efficiency model.”;
Gadelrab [0121]: “Otherwise, at block 516, the system exits the high efficiency mode and executes subsequent inferences using the high accuracy mode (e.g., as discussed above with respect to FIG. 4)”;
Gadelrab [0053]: “To allow for inferences to be performed on efficient cores while maintaining a sufficient degree of accuracy in the results generated by executing inferences using a quantized machine learning model, embodiments described herein provide various techniques for adaptively quantizing parameters used by a machine learning model and switching between performing inferences using a high accuracy model on high performance cores and performing inferences using quantized models on efficient cores.”;
Gadelrab [0057]: “The quantized weight information may be in a format for which computation is less intensive than computation using the received weight information.”;)
It would have been obvious before the effective filing date to combine the accelerators executing neural networks of Hong with the switching between different models having different resource requirements of Gadelrab to improve resource efficiency by adaptively optimizing/switching the use of computational (core) resources based on acceptable computational and performance results (Gadelrab, [0053]). Hong and Gadelrab are analogous art because they both concern optimizing neural network models.
Teerapittayanon teaches wherein the number of computational layers includes an additional computational layer of the second DLA model or excludes a computational layer of the second DLA model executed on the results from execution of the first number of computational layers of the first DLA model. (Teerapittayanon [page 2]: “At each exit point, BranchyNet uses the entropy of a classification result (e.g., by softmax) as a measure of confidence in the prediction. If the entropy of a test sample is below a learned threshold value, meaning that the classifier is confident in the prediction, the sample exits the network with the prediction result at this exit point, and is not processed by the higher network layers. If the entropy value is above the threshold, then the classifier at this exit point is deemed not confident, and the sample continues to the next exit point in the network. If the sample reaches the last exit point, which is the last layer of the baseline neural network, it always performs classification”; Note: Figure 1 of Teerapittayanon discloses the portion of the plurality of layers (exit 3) comprising more layers than either of exits 1 or 2 (the different plurality of layers). Further, section I. of Teerapittayanon teaches by exiting at earlier stages BranchyNet significantly reduces the runtime and energy use of inference. The resource threshold is not defined and, for purposes of examination, the energy use of the network exiting at exit 3 is considered the threshold.)
It would have been obvious before the effective filing date to combine the accelerators executing neural networks of Hong with the early exit branch computational layers of Teerapittayanon to reduce computation costs and save energy (Teerapittayanon, page 2). Hong and Teerapittayanon are analogous art because they both concern optimizing neural network models.
Claim 16 is rejected under 35 U.S.C. 103 as being unpatentable over Hong and Gadelrab in view of Paramasivam et al. (US20220222513A1); hereinafter Paramasivam
Claim 16 is rejected over Hong, Gadelrab and Paramasivam with the incorporation of claim 13.
Regarding claim 16, Hong does not teach wherein the first array of MACs comprises a 256 by 256 array of MACs, and wherein the second array of MACs comprises a 512 by 512 array of MACs.
However, Paramasivam teaches wherein the first array of MACs comprises a 256 by 256 array of MACs, and (See Figure 5 of Paramasivam to see that NPU0 (first DLA chip) is a 256x256 synapse)
wherein the second array of MACs comprises a 512 by 512 array of MACs. (See Figure 5 of Paramasivam to see that NPU1 (second DLA chip) is a 1024x1024 synapse which is greater than NPU0 and contains a 512 by 512 array within)
It would have been obvious before the effective filing date to combine the accelerators executing neural networks of Hong with the neural processing unit 0 and a bigger neural processing unit 1 of Paramasivam for improved efficiency in performing neural network computations associated with one or more neural network applications (Paramasivam, [0044]). Hong and Paramasivam are analogous art because they both concern deep learning accelerators.
Claim 20 is rejected under 35 U.S.C. 103 as being unpatentable over Hong and Gadelrab in view of Moons et al. (ENVISION: A 0.26-to-10TOPS/W Subword-Parallel Dynamic-Voltage-Accuracy-Frequency-Scalable Convolutional Neural Network Processor in 28nm FDSOI); hereinafter Moons
Claim 20 is rejected over Hong, Gadelrab and Moons with the incorporation of claim 18.
Regarding claim 20, Hong does not teach further storing instructions to determine whether the execution of the second DLA model provides at least a threshold quantity of correct inferences per second per watt.
However, Moons teaches further storing instructions to determine whether the execution of the second DLA model provides at least a threshold quantity of correct inferences per second per watt. (Moons [page 1, column 2]: “Figure 14.5.6 shows a comparison with recent ConvNet ASICs. Envision scales efficiency on the AlexNet convolutional layers between 0.8-3.8TOPS/W, compared to 0.16TOPS/W [3] and 0.56-1.4TOPS/W [4]. Efficiency is 2TOPS/W average for VGG-16 and up to 10 TOPS/W peak.”)
It would have been obvious before the effective filing date to combine the accelerators executing neural networks of Hong with the TOPS/W efficiency metric of Moons to measure how effectively the energy consumption of a neural network is minimized (Moons, page 1, column 2). Hong and Moons are analogous art because they both concern improving energy consumption on deep learning accelerators.
Conclusion
Any inquiry concerning this communication or earlier communications from the examiner should be directed to DAVID H TRAN whose telephone number is (703)756-1525. The examiner can normally be reached M-F 9:30 am - 5:30 pm.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Viker Lamardo, can be reached at (571) 270-5871. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.
/DAVID H TRAN/Examiner, Art Unit 2147
/VIKER A LAMARDO/Supervisory Patent Examiner, Art Unit 2147