DETAILED ACTION
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Response to Amendment
(Submitted 8/12/2025)
In regard to 103 rejections
The examiner first notes that the applicant has made no amendments to the claims.
- The applicant, on Page 8, argues that the Office action falls short of establishing obviousness for the combination of references Wei, Raha and Kim, and specifically argues that the examiner merely presents blocks of quotations from the cited references without explaining how the claimed features are actually disclosed therein.
Examiner’s Response:
The examiner respectfully disagrees with the arguments. The examiner affirms that the broadest reasonable interpretation (BRI) was applied to the claims within the context of the references and was not misapplied.
The examiner understands the core of the invention to be to “provide a plurality of computational layers to perform computations for at least one FNN layer” and notes that the other limitations are simply rudimentary support for loading parameters and loading configurations. Under a BRI of these claim limitations, reference Wei teaches the computational layers, reference Raha provides some specificity on the parameters that are loaded, and reference Kim provides the specific inclusion of fully connected configurations. The examiner reaffirms that combining these references would be obvious to any person skilled in the art and does not require any extraordinary interpretation for an overall understanding of the invention.
- The applicant in Pages 8-9 argues with regard to reference Wei as follows:
Wei in view of Raha and Kim fails to teach "a plurality of computational layers including a multiply-accumulate (MAC) layer that includes a plurality of MAC units." The Office action appears to assert that Wei's computational layers in [0008] correspond to the claimed computational layers, and the applicant further argues that [0008] clearly states these layers are software, generated by the system to correspond to workload clusters and combined to form an AI model.
Examiner’s Response:
The examiner is unable to find support for the assertion that the layers are “software” and asks the applicant to identify where reference Wei says the layers are “software.” Wei recites in [0008]: “In a variation on this embodiment, the system generates a set of computational layers such that a computational layer corresponds to a respective workload cluster in the set of workload clusters. The system then combines the set of computational layers to form the synthetic AI model.” The examiner reaffirms that the reference merely suggests a computational task for a layer and makes no mention of whether such tasks are executed by software, hardware, or a combination of the two. The examiner once again affirms that Wei is a very strong primary reference for this invention and does teach the option for computational tasks. Further, the examiner notes that the applicant's position appears to be that, because the layers are software, they cannot include MAC units that are based on hardware and cannot be combined in this manner to form an AI model. The examiner submits that reference Wei teaches in [0034]: “benchmarking AI hardware by generating a synthetic AI model that represents the statistical characteristics of the workloads of a set of AI models corresponding to representative applications and their execution frequencies. The AI hardware can be a piece of hardware capable of efficiently processing AI-related operations, such as computing a layer of a neural network. The representative applications are the various applications that AI hardware, such as an AI accelerator, may run. Hence, the performance of the AI hardware is typically determined by benchmarking the AI hardware for the set of AI models. Benchmarking refers to the act of running a computer program, a set of programs, or other operations, to assess the relative performance of a software or hardware system. Benchmarking is typically performed by executing a number of standard tests and trials on the system”.
The examiner submits that it is perhaps known in the art that a synthetic AI model is an artificial intelligence system trained to generate high-quality, artificially created data that mimics real-world datasets. The examiner submits in general that the concept is a software-defined AI architecture designed to efficiently process complex, large-scale AI models by breaking the model into "computational layers," where the core calculation is the Multiply-Accumulate (MAC) operation, and allocating these layers to specific processing clusters. Further, the examiner submits that reference Wei teaches in [0088]: “the methods and processes described above can be included in hardware modules. For example, the hardware modules can include, but are not limited to, application-specific integrated circuit (ASIC) chips, field-programmable gate arrays (FPGAs), and other programmable-logic devices now known or later developed. When the hardware modules are activated, the hardware modules perform the methods and processes included within the hardware modules”. The examiner further submits that it is perhaps known in the art that a hardware-based approach using FPGAs (Field-Programmable Gate Arrays) can be used to combine computational layers and generate a synthetic AI model. As a matter of fact, this is a core part of creating specialized AI hardware accelerators for applications requiring low latency, high energy efficiency, and customization.
- The applicant on Page 9 argues that Wei in view of Raha and Kim fails to teach "load[ing] configurations for the plurality of computational layers" in claim 1 and argues that, although the Office action suggests that [0046] satisfies this limitation, an AI model is distinct from configurations for computational layers that contain MAC units.
Examiner’s Response:
The examiner respectfully disagrees with the argument and submits that reference Wei teaches in [0046]: “During operation, system 150 can determine AI models 130 based on the representative applications. In some embodiments, system 150 can maintain a list of representative applications (e.g., in a local storage device) and their corresponding AI models. This list can be generated during the configuration of system 150 (e.g., by an administrator). Furthermore, AI models 130 can be loaded onto the memory of device 120 such that system 150 may access a respective one of AI models 130. This allows system 150 to collect information associated with a respective layer of AI models 132, 134, and 136. Collected information can include one or more of: number of channels, number of filters, filter size, stride information, and padding information”.
The examiner submits that it is perhaps known in the art that an AI model is structured as a series of computational layers, and that a list or file generated during its configuration serves as the instructions for loading and running those layers. This configuration data is critical for reproducibility, flexibility, and managing complex model architectures. In this context, the number of channels, number of filters, filter size, stride, and padding are key configuration settings for computational layers in neural networks, most notably for convolutional layers. These hyperparameters control the behavior and output of the layer, defining its architecture and how it processes input data.
- The applicant on Page 9 argues that Wei does not teach the configurations including an FNN configuration. Specifically, the applicant argues that Wei in view of Raha and Kim fails to teach "the configurations including at least one fully-connected neural network (FNN) configuration for the MAC layer." The applicant states that the cited portions of Kim (col. 4, lines 35-46; col. 11, lines 62-65; col. 7, lines 52-57; col. 6, lines 28-31) at most mention a fully connected layer of a CNN but are entirely silent about a fully connected neural network (FNN).
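For illustration only, the relationship between these configuration settings and the output of a convolutional layer can be sketched as follows. This is a minimal sketch, assuming a square filter and equal padding on both sides; the class name `ConvLayerConfig`, the function `output_size`, and the example values are hypothetical and are not taken from Wei or from the application.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConvLayerConfig:
    """Hypothetical container for the settings listed in Wei [0046]."""
    num_channels: int  # input channels
    num_filters: int   # output channels
    filter_size: int   # edge length of an assumed square filter
    stride: int
    padding: int       # assumed equal on both sides


def output_size(input_size: int, cfg: ConvLayerConfig) -> int:
    """Output length along one spatial dimension of a convolution."""
    return (input_size + 2 * cfg.padding - cfg.filter_size) // cfg.stride + 1


# Example values are assumptions chosen only to show the effect of stride.
same_cfg = ConvLayerConfig(num_channels=3, num_filters=64, filter_size=3, stride=1, padding=1)
half_cfg = ConvLayerConfig(num_channels=3, num_filters=64, filter_size=3, stride=2, padding=1)

same = output_size(224, same_cfg)    # stride 1 keeps the spatial size
halved = output_size(224, half_cfg)  # stride 2 halves the spatial size
```

Changing only the stride setting changes the spatial size of the layer output, which is the sense in which these settings define how a layer processes its input.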
Examiner’s Response:
The examiner respectfully disagrees with the argument. The examiner submits that reference Kim teaches in [Col 12, lines 8-17]: “Referring to FIG. 10, the processor 120 may control the operation module 110 to perform operations according to a CNN algorithm and an RNN algorithm, using the network of the coupled structure as described above. For example, the processor 120 may control the operation module 110 to perform an operation according to the mesh topology network in a convolution layer and a pooling layer of the CNN algorithm and perform an operation according to the tree topology network in the fully connected layer of the CNN algorithm and in each layer of the RNN algorithm”. The examiner submits that this is a hybrid CNN-RNN model, in which it is not common to embed a fully-connected (FC) layer inside every layer of the RNN. The purpose of this hybrid architecture is to combine feature extraction (CNN) with sequence modeling (RNN). The most common configuration is to place the FC layer at the end of the entire network to make a final classification or regression based on the features processed by both the CNN and RNN. The RNN processes the sequence of feature vectors extracted by the CNN. In this configuration, the output of the CNN's final layer must be prepared for the RNN. After the RNN has processed the entire sequence, its final output is used for the final classification or regression task. This is where a fully-connected network configuration typically comes from. However, the examiner also submits that reference Wei teaches in [0070]: “Furthermore, to ensure transition among layers 352, 354, and 356, system 150 can incorporate a rectified linear unit (ReLU) layer and a normalization layer in a respective one of layers 352, 354, and 356. As a result, a respective one of these layers includes convolution, ReLU, and normalization layers. For example, layer 354 can include convolution layer 412, ReLU layer 414, and normalization layer 416.
System 150 then appends a fully connected layer 402 and a softmax layer 404 to SAI model 140. In this way, system 150 completes the construction of SAI model 140”. The examiner submits that it is perhaps also known in the art that a neural network with a fully connected layer and a softmax layer is a fully connected neural network, also known as an Artificial Neural Network (ANN). The term "fully connected" describes the architecture where every neuron in one layer connects to every neuron in the next layer, and the softmax layer is a common activation function used in the output of such networks for classification tasks. The examiner submits that reference Wei teaches in [0035]: “An AI model can be any model that uses AI-based techniques (e.g., a neural network). An AI model can be a deep learning model that represents the architecture of a deep learning representation. For example, a neural network can be based on a collection of connected units or nodes where each connection (e.g., a simplified version of a synapse) between artificial neurons can transmit a signal from one to another. The artificial neuron that receives the signal can process it and then signal artificial neurons connected to it” and teaches in [0079]: “FIG. 6B presents a flowchart 630 illustrating a method of a benchmarking system generating a synthetic AI model representing a set of AI models, in accordance with an embodiment of the present application. During operation, the system determines a layer of the SAI model corresponding to a respective cluster (operation 632). This layer can correspond to a convolution layer and the SAI model can be a synthetic neural network. The system can add additional layers, such as a ReLU layer and a normalization layer, to a respective layer of the SAI model (operation 634). The system can add final layers, which can include a fully connected layer and a softmax layer, to complete the SAI model (operation 636).”
The examiner further submits that reference Raha teaches in [0033]: “FIG. 1 illustrates an architecture of an example DNN 100, in accordance with various embodiments. For purpose of illustration, the DNN 100 in FIG. 1 is a Visual Geometry Group (VGG)-based convolutional neural network (CNN). In other embodiments, the DNN 100 may be other types of DNNs. The DNN 100 is trained to receive images and output classifications of objects in the images. In the embodiment of FIG. 1, the DNN 100 receives an input image 105 that includes objects 115, 125, and 135. The DNN 100 includes a sequence of layers comprising a plurality of convolutional layers 110 (individually referred to as “convolutional layer 110”), a plurality of pooling layers 120 (individually referred to as “pooling layer 120”), and a plurality of fully connected layers 130 (individually referred to as “fully connected layer 130”). In other embodiments, the DNN 100 may include fewer, more, or different layers.”
- On Page 10, with regard to Claims 6-8, 10, 16-18 and 20, the applicant argues that references Darvish and Lin are cited for alleged disclosures unrelated to the deficiencies of Wei in view of Raha and Kim and thus cannot cure these deficiencies.
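For illustration only, the combination of a fully connected layer and a softmax layer discussed above can be sketched as follows. This is a minimal sketch in plain Python; the function names and the example weights, biases, and inputs are hypothetical and do not come from Wei, Raha, Kim, or the application.

```python
import math


def fully_connected(x, weights, biases):
    """Every output neuron j is connected to every input i:
    out[j] = sum_i weights[j][i] * x[i] + biases[j]."""
    return [
        sum(w * xi for w, xi in zip(row, x)) + b
        for row, b in zip(weights, biases)
    ]


def softmax(z):
    """Convert raw scores into probabilities that sum to 1."""
    m = max(z)  # subtract the maximum for numerical stability
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical values: 2 inputs fully connected to 3 output classes.
x = [1.0, 2.0]
W = [[0.5, -1.0], [1.0, 1.0], [0.0, 0.0]]
b = [0.0, 0.5, 0.0]

logits = fully_connected(x, W, b)
probs = softmax(logits)
predicted_class = probs.index(max(probs))
```

Each output of `fully_connected` is a sum of products plus a bias, so the layer reduces to a sequence of multiply-accumulate steps followed by the softmax normalization.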
Examiner’s Response:
The examiner respectfully disagrees with the applicant’s argument. The applicant’s characterization of the references as “unrelated” is vague and does not provide any specifics. However, Darvish does teach specifics that relate to the invention.
The examiner submits that Darvish teaches in [0041] “a neural network accelerator is configured to performing training operations for layers of a neural network”, which indicates that the neural network is indeed performing computations in computational layers. The core purpose of a neural network accelerator is to speed up the computations required by a neural network's layers, whether for training or inference. Further, Darvish teaches in [0042] “The converted result can be used to generate an output tensor of the layer of the neural network, where the output tensor is in normal-precision floating-point format”, and the examiner submits that the generated output tensor of a neural network (NN) is a function of its computational layers. Further, Darvish teaches in [0047] “The quantization accelerator 186 can be programmed to execute a subgraph, an individual layer, or a plurality of layers of a neural network”.
The examiner submits that reference Kim teaches in [Col 13, lines 16-18] “Referring to FIG. 12, the RNN is an algorithm for performing deep learning for time-series data changed over time, and is used to process tasks” and teaches in [Col 13, lines 25-31] “A weight WO of the neural network may be stored in each PE of the operation module 110 in advance. The operation module 110 may output a value 122 in which the operation values are accumulated for each of the layers of the RNN, a past value operated in each layer may be temporarily stored in each PE and may be transferred to a current operation process”. These teachings are representative of operations performed by the computational layers. The examiner submits that Kim teaches MAC operations in [Col 14, lines 55-59] “the convolution operation may be performed by transferring the accumulation for values obtained by multiplying different data values of the input data with each of the plurality of elements to an adjacent processing element” in combination with [Col 13, lines 63-67] and [Col 14, lines 1-3] “Referring to FIG. 13D, at the next clock t+1, the first input values are moved to the PE included in the third row of the operation module 110 to derive products D1 to D8 of the first input values and the weight stored in the PE included in the third row, and a second accumulation is performed for C1 to C4 in the first row using the tree topology network connection. That is, C1 and C2 are summed and C3 and C4 are summed.” The examiner submits that reference Darvish also teaches MAC in [0074] “In some examples, a set of parallel multiply-accumulate (MAC) units in each convolutional layer can be used to speed up the computation.”
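For illustration only, the multiply-accumulate operation and the pairwise accumulation described in the passages quoted above ("C1 and C2 are summed and C3 and C4 are summed") can be sketched as follows. This is a minimal sketch, assuming plain Python lists stand in for the values held in processing elements; the function names and example values are hypothetical and do not reproduce the circuitry of Kim or Darvish.

```python
def mac(acc, a, b):
    """One multiply-accumulate step: add the product a * b to acc."""
    return acc + a * b


def dot_with_macs(inputs, weights):
    """Sequential accumulation: one MAC step per input/weight pair."""
    acc = 0
    for a, b in zip(inputs, weights):
        acc = mac(acc, a, b)
    return acc


def tree_accumulate(values):
    """Pairwise (tree-style) accumulation: adjacent partial sums are
    added level by level, e.g. (C1 + C2) and (C3 + C4), then their sum."""
    level = list(values)
    if not level:
        return 0
    while len(level) > 1:
        nxt = [level[i] + level[i + 1] for i in range(0, len(level) - 1, 2)]
        if len(level) % 2:  # carry an unpaired value to the next level
            nxt.append(level[-1])
        level = nxt
    return level[0]


# Hypothetical input values and weights.
inputs = [1, 2, 3, 4]
weights = [5, 6, 7, 8]
products = [a * b for a, b in zip(inputs, weights)]

sequential_result = dot_with_macs(inputs, weights)
tree_result = tree_accumulate(products)
```

Both orderings produce the same sum; the tree form needs only about log2(n) addition levels when the pairs at each level are added in parallel.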
Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.
Claims 1-5, 9, 11-15, and 19 are rejected under 35 U.S.C. 103 as being unpatentable over
Wei et al. (hereinafter Wei) US 2020/0042419 A1,
in view of Arnab Raha et al. (hereinafter Raha) US 2022/0083843 A1,
in view of Kyoung-Hoon Kim et al. (hereinafter Kim) US 11907826 B2.
In regard to claim 1: (Original)
Wei discloses:
- An apparatus, comprising: a controller configured to dispatch computing tasks of a neural network; a configuration buffer; a data buffer;
In [0085]:
FIG. 8 illustrates an exemplary apparatus that facilitates a benchmarking system for AI hardware,
In [0085]:
Apparatus 800 may be realized using one or more integrated circuits, and may include fewer or more units or apparatuses than those shown in FIG. 8. Further, apparatus 800 may be integrated in a computer system, or realized as a separate device that is capable of communicating with other computer systems and/or devices. Specifically, apparatus 800 can comprise units 802-814, which perform functions or operations similar to modules 720-732 of computer system 700 of FIG. 7, including: a collection unit 802; a workload unit 804; a clustering unit 806; a grouping unit 808; a synthesis unit 810; a performance unit 812; and a communication unit 814.
In [0046]:
During operation, system 150 can determine AI models 130 based on the representative applications. In some embodiments, system 150 can maintain a list of representative applications (e.g., in a local storage device) and their corresponding AI models. This list can be generated during the configuration of system 150 (e.g., by an administrator). Furthermore, AI models 130 can be loaded onto the memory of device 120 such that system 150 may access a respective one of AI models 130. This allows system 150 to collect information associated with a respective layer of AI models 132, 134, and 136. Collected information can include one or more of: number of channels, number of filters, filter size, stride information, and padding information.
In [0081]:
Computer system 700 includes a processor 702, a memory device 704, and a storage device 708. Memory device 704 can include a volatile memory device
In [0081]:
Storage device 708 can store an operating system 716, a benchmarking system 718, and data 736.
- and a plurality of computational layers including a multiply-accumulate (MAC) layer that includes a plurality of MAC units,
In [0008]:
The system generates a set of computational layers such that a computational layer corresponds to a respective workload cluster in the set of workload clusters. The system then combines the set of computational layers to form the synthetic AI model.
In [0056]:
System 150 then computes the workload associated with a respective layer of a respective one of AI models 130. For example, for a layer 220 of AI model 134, system 150 determines layer information 224, which can include number of filters, filter size, stride information, and padding information. In some embodiments, system 150 uses layer information 224 to determine the MAC operations associated with layer 220 and compute MAC time that indicates the time to execute the determined MAC operations. System 150 can use the computed MAC time as workload 222 for that layer
- wherein the controller is configured to: load configurations for the plurality of computational layers to perform computations for the neural network into the configuration buffer,
In [0046]:
During operation, system 150 can determine AI models 130 based on the representative applications. In some embodiments, system 150 can maintain a list of representative applications (e.g., in a local storage device) and their corresponding AI models. This list can be generated during the configuration of system 150 (e.g., by an administrator). Furthermore, AI models 130 can be loaded onto the memory of device 120 such that system 150 may access a respective one of AI models 130. This allows system 150 to collect information associated with a respective layer of AI models 132, 134, and 136. Collected information can include one or more of: number of channels, number of filters, filter size, stride information, and padding information.
In [0059]:
Workload table 240 can include a respective workload computed by system 150. Workload table 240 can map a respective workload to a corresponding AI model identifier, a layer identifier of the layer corresponding to the workload, and an execution frequency of the AI model
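For illustration only, the idea in the passages quoted above of deriving a MAC count from layer information, converting it to a MAC time, and recording it in a workload table can be sketched as follows. This is a minimal sketch: the MAC-count formula is the standard count for a dense convolution, while the throughput figure, the layer dimensions, the execution frequency, and all names are assumptions and are not taken from Wei.

```python
def conv_mac_ops(out_h, out_w, num_filters, num_channels, filter_size):
    """MAC operations for one dense convolution layer with a square filter:
    one MAC per output element, per input channel, per filter tap."""
    return out_h * out_w * num_filters * num_channels * filter_size * filter_size


def mac_time_seconds(mac_ops, macs_per_second):
    """Time to execute the MAC operations at an assumed throughput."""
    return mac_ops / macs_per_second


MACS_PER_SECOND = 1e12  # assumed hardware throughput, not from Wei

# Hypothetical layer information keyed by model and layer identifiers.
layers = [
    {"model_id": "model_134", "layer_id": 220, "frequency": 0.5,
     "out_h": 112, "out_w": 112, "filters": 64, "channels": 3, "filter_size": 7},
]

workload_table = []
for layer in layers:
    ops = conv_mac_ops(layer["out_h"], layer["out_w"], layer["filters"],
                       layer["channels"], layer["filter_size"])
    workload_table.append({
        "model_id": layer["model_id"],
        "layer_id": layer["layer_id"],
        "frequency": layer["frequency"],
        "workload": mac_time_seconds(ops, MACS_PER_SECOND),
    })
```

Each row maps a workload value to a model identifier, a layer identifier, and an execution frequency, which mirrors the mapping the quoted passage attributes to workload table 240.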
Wei does not explicitly disclose:
- the configurations including at least one fully-connected neural network (FNN) configuration for the MAC layer to perform computations for at least one FNN layer;
- load parameters for the neural network into the data buffer, the parameters including weights and biases for the plurality of computational layers;
- and load input data into the data buffer; wherein the MAC layer is configured to:
apply the at least one FNN configuration to perform the computations for the at least one FNN layer, the at least one FNN configuration including settings for a FNN operation topology for the plurality of MAC units to perform the computations for the at least one FNN layer
However, Raha discloses:
- load parameters for the neural network into the data buffer, the parameters including weights and biases for the plurality of computational layers;
In [0042]:
FIG. 2 illustrates a hardware architecture 200 for a layer of a DNN, in accordance with various embodiments. The hardware architecture 200 includes a plurality of PEs 210 (individually referred to as “PE 210”) and column buffers 220 (individually referred to as “column buffer 220”). In other embodiments, the hardware architecture 200 includes other components, such as a static random-access memory (SRAM) for storing input and output of the layer. The hardware architecture 200 may also include a distribution unit for distributing data stored in the SRAM to the column buffers 220,
In [0090]:
A fully connected layer involves weights, biases, and neurons. It connects neurons in one layer to neurons in another layer. It is used to classify images between different category by training.
It would have been obvious to one of ordinary skill in the art before the effective filing date of the present application to combine Wei and Raha.
Wei teaches MAC units and buffers.
Raha teaches loading parameters.
One of ordinary skill would have been motivated to combine Wei and Raha to reduce resources by using a rearranging process (rearranging the weight vector) (Raha [0062]).
Wei and Raha do not explicitly disclose:
- the configurations including at least one fully-connected neural network (FNN) configuration for the MAC layer to perform computations for at least one FNN layer;
- and load input data into the data buffer; wherein the MAC layer is configured to: apply the at least one FNN configuration to perform the computations for the at least one FNN layer, the at least one FNN configuration including settings for a FNN operation topology for the plurality of MAC units to perform the computations for the at least one FNN layer
However, Kim discloses:
- the configurations including at least one fully-connected neural network (FNN) configuration for the MAC layer to perform computations for at least one FNN layer;
In [Col 4, lines 35-46]:
The operation module 110 is configured to include a plurality of PEs. The plurality of PEs are configured in an array structure of a predetermined pattern to parallel-process data between PEs which are synchronously adjacent to each other, and simultaneously perform the same function. A PE may perform an operation and exchange data between PEs, and may be synchronized with one clock to perform an operation. That is, the plurality of PEs may each perform the same operation for each of the clocks. Since the plurality of PEs share data with the PE which is adjacent thereto on the same path, a connection structure between the PEs may form a geometrically simple symmetrical structure,
In [Col 11, lines 62-65]:
operation module 110 forming the systolic array of the tree topology network structure is also required for the fully connected layer of CNN, both the operation module 110 of the mesh topology network structure and the operation module 110 of the tree topology network structure were required.
In [Col 7, lines 52-57]:
the processor 120 may control the operation module 110 to perform the convolution operation by transferring different data values of the input data, that is, accumulation for the values obtained by multiplying different element values of the feature map with each of the elements of the filter 70 to an adjacent PE,
In [Col 6, lines 28-31]:
When feature values are finally extracted from the convolution layer of the CNN, in a fully connected layer, the extracted feature values are input to the neural network to perform a classification
- and load input data into the data buffer; wherein the MAC layer is configured to: apply the at least one FNN configuration to perform the computations for the at least one FNN layer, the at least one FNN configuration including settings for a FNN operation topology for the plurality of MAC units to perform the computations for the at least one FNN layer
In [Col 2, lines 10-18]:
a processor configured to control the operation module to perform a convolution operation by applying a filter to input data, wherein the processor controls the operation module to perform the convolution operation by inputting each of a plurality of elements configuring a two-dimensional filter to the plurality of processing elements in a predetermined order and sequentially applying the plurality of elements to the input data.
In [Col 8, lines 49-53]:
In addition, the first accumulation values D0 to D7 derived at the previous clock may be moved to the PE adjacent to a lower end by one space and may be each temporarily stored in a memory included in the PE of the second row (to this end, each PE may include the memory)
In [Col 7, lines 52-57]:
the processor 120 may control the operation module 110 to perform the convolution operation by transferring different data values of the input data, that is, accumulation for the values obtained by multiplying different element values of the feature map with each of the elements of the filter 70 to an adjacent PE,
(BRI: accumulation for multiplication is the MAC function)
In [Col 4, lines 47-51]:
For example, PEs may be arranged in various forms of network structures such as a mesh topology network, a tree topology network, and the like. A structure of the mesh topology network and the tree topology network are described below with reference to FIGS. 3 and 9.
(BRI: the arrangement of processors in different network structures constitutes setting a network topology. A fully connected topology, also known as a mesh topology, means that every processor is directly connected to every other processor. This setup ensures direct communication paths between any two processors without relying on intermediate devices)
In [Col 6, lines 28-31]:
When feature values are finally extracted from the convolution layer of the CNN, in a fully connected layer, the extracted feature values are input to the neural network to perform a classification
(BRI: performing a classification is the computation in the FC layer)
In [Col 4, lines 47-51]:
For example, PEs may be arranged in various forms of network structures such as a mesh topology network, a tree topology network, and the like. A structure of the mesh topology network and the tree topology network are described below with reference to FIGS. 3 and 9.
It would have been obvious to one of ordinary skill in the art before the effective filing date of the present application to combine Wei, Raha and Kim.
Wei teaches MAC units and buffers.
Raha teaches loading parameters.
Kim teaches fully connected layers and configurations of an FNN to compute using MAC.
One of ordinary skill would have been motivated to combine Wei, Raha and Kim to reduce the load on the memory and provide increased operation efficiency (Kim [Col 11, lines 40-44]).
In regard to claim 2: (Original)
Wei does not explicitly disclose:
- wherein the configurations further include at least one convolutional neural network (CNN) configuration for the MAC layer to perform computations for at least one CNN layer, the at least one CNN configuration includes settings for a CNN operation topology and settings for cycle-by-cycle operations for the plurality of MAC units to perform the computations for the at least one CNN layer,
However, Raha discloses:
- wherein the configurations further include at least one convolutional neural network (CNN) configuration for the MAC layer to perform computations for at least one CNN layer, the at least one CNN configuration includes settings for a CNN operation topology and settings for cycle-by-cycle operations for the plurality of MAC units to perform the computations for the at least one CNN layer,
In [0033]:
the DNN 100 includes a sequence of layers comprising a plurality of convolutional layers 110 (individually referred to as “convolutional layer 110”), a plurality of pooling layers 120 (individually referred to as “pooling layer 120”),
In [0039]:
the pooling layers 120 may perform the pooling operation through average pooling (calculating the average value for each patch on the feature map), max pooling (calculating the maximum value for each patch of the feature map), or a combination of both. The size of the pooling operation is smaller than the size of the feature maps. In various embodiments, the pooling operation is 2×2 pixels applied with a stride of 2 pixels, so that the pooling operation reduces the size of a feature map by a factor of 2, e.g., the number of pixels or values in the feature map is reduced to one quarter the size.
In [0057]:
FIG. 5 includes a clock 530 to show the amount of time each PE 520 takes to perform an MAC operation on the weight vector assigned to the PE. The PE 520A takes five cycles to perform its MAC operation, versus two cycles for the PEs 520B-C, three cycles for the PE 520D, and seven cycles for the PE 520E.
In [0038]:
The pooling layers 120 downsample feature maps generated by the convolutional layers, e.g., by summarizing the presents of features in the patches of the feature maps. A pooling layer 120 is placed between two convolution layers 110: a preceding convolutional layer 110 (the convolution layer 110 preceding the pooling layer 120 in the sequence of layers) and a subsequent convolutional layer 110 (the convolution layer 110 subsequent to the pooling layer 120 in the sequence of layers)
In [0070]:
As the order of weights changed, the order of elements in the input feature map may also need to be changed. This is because input feature map and weights come into the DNN layer as a pair so if the indices of the weights are changed, the same change needs to be made to the elements in the input feature map.
In [0019]:
Most sparse DNN accelerators rarely achieve the maximum speedup that can be obtained from skipping computation on sparse data due to the various factors including: the underlying DNN dataflow, synchronization barriers during the drain phase (extraction of output points from the computation units to upper memory hierarchies) and the overheads associated with splitting work into multiple smaller tasks among multiple PEs.
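For illustration only, the 2×2 pooling with a stride of 2 described in the Raha [0039] passage quoted above can be sketched as follows. This is a minimal sketch, assuming a feature map stored as a list of lists with even height and width; the function name and the example values are hypothetical.

```python
def pool_2x2(feature_map, reduce_fn):
    """Apply a 2x2 pooling window with stride 2.
    reduce_fn summarizes the four values of each patch."""
    rows, cols = len(feature_map), len(feature_map[0])
    pooled = []
    for r in range(0, rows - 1, 2):
        pooled_row = []
        for c in range(0, cols - 1, 2):
            patch = [
                feature_map[r][c], feature_map[r][c + 1],
                feature_map[r + 1][c], feature_map[r + 1][c + 1],
            ]
            pooled_row.append(reduce_fn(patch))
        pooled.append(pooled_row)
    return pooled


def average(patch):
    return sum(patch) / len(patch)


# Hypothetical 4x4 feature map (16 values).
fm = [
    [1, 2, 3, 4],
    [5, 6, 7, 8],
    [9, 10, 11, 12],
    [13, 14, 15, 16],
]

max_pooled = pool_2x2(fm, max)      # max pooling
avg_pooled = pool_2x2(fm, average)  # average pooling
```

The 4×4 input with 16 values becomes a 2×2 output with 4 values, that is, each dimension is halved and the number of values is reduced to one quarter.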
Wei and Raha do not explicitly disclose:
- and the settings for the CNN operation topology include settings for operations in a row direction of an input data matrix, settings for operations in a column direction of the input data matrix, settings for operations in a row direction of a weight matrix and settings for operations in a column direction of the weight matrix.
However, Kim discloses:
- and the settings for the CNN operation topology include settings for operations in a row direction of an input data matrix, settings for operations in a column direction of the input data matrix, settings for operations in a row direction of a weight matrix and settings for operations in a column direction of the weight matrix.
In [Col 5, lines 33-43]:
Referring to FIG. 3, a plurality of PEs 20-1 to 20-n included in the operation module 110 form a systolic array of a mesh topology network structure. The plurality of PEs 20-1 to 20-n may share data through lines connecting the PEs 20-1 to 20-n which are adjacent to each other to perform the operation, using the mesh topology network structure. The mesh topology network structure is a structure in which respective PEs may be connected to the PEs which are adjacent thereto in a mesh network to exchange data, as illustrated in FIG. 3,
In [Col 4, lines 65-67] and [Col 5, lines 1-3]:
For example, the processor 120 may control the operation module 110 to perform a convolution operation based on a neural network by applying a filter to input data which is input to the operation module 110. In this case, the filter, which is a mask having a weight, is defined as a matrix. The filter is also referred to as a window or a kernel,
In [Col 5, lines 50-55]:
Referring to FIGS. 4A, 4B, and 4C, FIG. 4A illustrates the conventional method for performing a convolution operation according to the CNN by applying a two-dimensional filter 41 to two-dimension input data, where the input data and the filter 41 may be formed of matrix data including elements having one or more certain values.
In [Col 7, lines 8-19]:
A basic input order is repeated in the order of proceeding in one side direction based on a certain element in the two-dimensional filter, proceeding to an element which is adjacent to a corresponding element in a next row or a next column of the element positioned at the end of the proceeding direction, and proceeding in a direction opposite to one side direction in the adjacent element. In panel (b) of FIG. 6, elements ① to ④ are input to the operation module 110 in the order of the numbers. In panel (c) of FIG. 6, elements ① to ⑨ are input to the operation module 110 in the order of the numbers,
In [Col 8, lines 9-16]:
When the operation of the first elements is completed in the first row and the operation for the second elements starts, the processor 120 then shifts a plurality of values for the first elements in a predetermined direction to perform the accumulation for the values. In this case, the predetermined direction is the same as a direction in which the second elements are disposed based on the first elements in the two-dimensional filter.
In [Col 9, lines 39-49]:
Referring to FIG. 7J, third accumulation values Q1 to Q8 are illustrated in which the second accumulation values K1 to K8 are summed to M1 to M8. The third accumulation values Q1 to Q8 are equal to the value obtained by summing values obtained by each multiplying the elements ① to ④ of the filter 70 with elements I to IV of the feature map, and the third accumulation values Q1 to Q8 are output through an output terminal to become elements of a first row and first column of a new feature map derived by the convolution operation.
It would have been obvious to one of ordinary skill in the art before the effective filing date of the present application to combine Wei, Raha and Kim.
Wei teaches MAC units and buffers.
Raha teaches loading parameters.
Kim teaches fully connected layers and configurations of FNN to compute using MAC.
One of ordinary skill would have been motivated to combine Wei, Raha and Kim to reduce the load on the memory and provide increased operation efficiency (Kim [Col 11, lines 40-44]).
In regard to claim 3: (Original)
Wei discloses:
- wherein the plurality of MAC units are grouped into several groups, and each group includes one or more MAC units and is configured to perform convolutions for one output channel according to the at least one CNN configuration, the one or more MAC units in one group share a same batch of weights but have different input data elements.
In [0040]:
the system can determine the workload of a respective layer, and store the workload information in a workload table. The system then can cluster workloads of the layers (e.g., using k-means) based on the workload table,
(BRI: a cluster is a group)
In [0068]:
For example, suppose that SAI model 140 generates a synthetic image based on an input image. Suppose that the input image size is 224×224×3. The output image dimension can be calculated as (input image size−filter size)/stride+1. Suppose that workload 232 is 36602000 (e.g., a MAC value of 36602000). System 150 then determines channel number as 100, filter size as 11×11, and stride as 4 for input size 332. This leads to an output image size of 55. This can generate a workload of approximately 36602500, which is a close approximation of workload 232, for layer 352.
In [0056]:
System 150 then computes the workload associated with a respective layer of a respective one of AI models 130,
In [0056]:
system 150 uses layer information 224 to determine the MAC operations associated with layer 220,
In [0057]:
System 150 can determine the number of clusters based on a clustering parameter. The parameter can be based on how the workloads are distributed (e.g., based on a range of workloads that can be included in a cluster or a diameter of a cluster) or a predetermined number of clusters,
In [0058]:
the workloads in a cluster also incorporate the execution frequencies, the representative weight for a cluster can be closer to the workload of a layer with a high execution frequency,
In [0035]:
a neural network can be based on a collection of connected units or nodes where each connection (e.g., a simplified version of a synapse) between artificial neurons can transmit a signal from one to another. The artificial neuron that receives the signal can process it and then signal artificial neurons connected to it.
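The arithmetic quoted above from Wei [0068] can be checked with a short sketch. The function names and the workload model (output height × output width × filter height × filter width × channel number) are assumptions chosen for illustration because they reproduce the quoted figure of 36602500; the quoted passage does not state that formula. Note also that (224 − 11)/4 + 1 evaluates to 54.25, so the sketch assumes rounding up to reach the quoted output image size of 55.

```python
import math

def output_size(input_size: int, filter_size: int, stride: int) -> int:
    # (input image size - filter size) / stride + 1, as quoted from Wei [0068].
    # Assumption: a fractional result is rounded up (54.25 -> 55).
    return math.ceil((input_size - filter_size) / stride + 1)

def mac_workload(out_size: int, filter_size: int, channels: int) -> int:
    # Assumed workload model: one MAC per filter element, per output
    # position, per channel. Not stated in the quoted passage.
    return out_size * out_size * filter_size * filter_size * channels

out = output_size(224, 11, 4)
workload = mac_workload(out, 11, 100)
print(out, workload)
```

Under these assumptions the sketch yields an output size of 55 and a workload of 36602500, which matches the approximation of workload 232 given in the quoted passage.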
In regard to claim 4: (Original)
Wei does not explicitly disclose:
- wherein the settings for the FNN operation topology include settings for operations in a row direction of an input data matrix and a weight matrix, settings for operations in a column direction of the input data matrix and the weight matrix, and settings for operations of nodes of the at least one FNN layer in batches based on a number of MAC units in the MAC layer.
However, Raha discloses:
- wherein the settings for the FNN operation topology include settings for operations in a row direction of an input data matrix and a weight matrix, settings for operations in a column direction of the input data matrix and the weight matrix, and settings for operations of nodes of the at least one FNN layer in batches based on a number of MAC units in the MAC layer.
In [0089]:
The training dataset can be divided into one or more batches,
In [0017]:
DNN accelerators are suitable to execute these type of workloads as they consist of thousands of parallel MAC units that can simultaneously operate and produce the output in lesser time.
(BRI: batch processing is a common type of computing workload)
In [0041]:
In the embodiment of FIG. 1, N equals 3, as there are three objects 115, 125, and 135 in the input image. Each element of the vector indicates the probability for the input image 105 to belong to a class. To calculate the probabilities, the fully connected layers 130 multiply each input element by weight, makes the sum, and then applies an activation function (e.g., logistic if N=2, softmax if N>2). This is equivalent to multiplying the input vector by the matrix containing the weights,
In [0096]:
the weight vector represents a filter in the layer. The filter may be a matrix of weights that includes X rows and Y columns, where X and Y are integers. The weight vector may include Y sections, each section is a row in the matrix. For instance, the first row is the first section in the weight vector, the second row is the second section in the weight vector, and so on. In other embodiments, the weight vector may represent a portion of the filter or represent multiple filters of the layer. The layer may also include an activation. The activation can be represented by an activation vector that includes a sequence of activations.
It would have been obvious to one of ordinary skill in the art before the effective filing date of the present application to combine Wei and Raha.
Wei teaches MAC units and buffers.
Raha teaches loading parameters.
One of ordinary skill would have been motivated to combine Wei and Raha to reduce resources using a rearranging process (rearranging the weight vector) (Raha [0062]).
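The fully connected computation quoted from Raha [0041] (multiply each input element by a weight, sum, then apply an activation function, which is equivalent to multiplying the input vector by the weight matrix) can be illustrated with a minimal sketch. The names `fully_connected` and `softmax` and the sample values are hypothetical; N = 3 is used so that softmax applies, following the quoted passage.

```python
import math

def softmax(values):
    # Subtract the maximum before exponentiating for numerical stability.
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def fully_connected(inputs, weights, biases):
    # One multiply-accumulate chain per output node:
    # out[j] = bias[j] + sum_i(weights[j][i] * inputs[i])
    outputs = []
    for row, bias in zip(weights, biases):
        acc = bias
        for w, x in zip(row, inputs):
            acc += w * x
        outputs.append(acc)
    return outputs

x = [1.0, 2.0]                            # hypothetical input vector
W = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # hypothetical 3x2 weight matrix
b = [0.0, 0.0, 0.0]                       # hypothetical biases
logits = fully_connected(x, W, b)
probs = softmax(logits)
print(logits, probs)
```

Each row of `W` corresponds to one output node, so the inner loop is the row-times-vector product that the quoted passage describes as matrix multiplication.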
In regard to claim 5: (Original)
Wei discloses:
- wherein the plurality of computational layers further include a K-Means layer configured to cluster the input data into a plurality of clusters according to a K-Means configuration.
In [0052]:
computation load analysis unit 154 can calculate the workload of a layer based on the input parameters and algorithms applicable to the layer. In some embodiments, the workload of a layer can be calculated based on multiply-accumulate (MAC) time for the operations associated with the layer,
In [0040]:
Based on the collected information and the execution frequencies, the system can determine the workload of a respective layer, and store the workload information in a workload table. The system then can cluster workloads of the layers (e.g., using k-means) based on the workload table. The system can also group the input sizes of the layers. The system can determine a representative workload for a respective cluster and a representative input size for a respective input group. The system then matches a respective representative workload to a corresponding representative input size such that the input size can generate the corresponding workload. The system may adjust an input size to match the workload. The system can generate an SAI model that includes a layer corresponding each cluster.
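The clustering of layer workloads quoted from Wei [0040] can be sketched with a one-dimensional k-means loop. This is an illustrative sketch only: the function name `kmeans_1d`, the sample workloads, the initial centers, and the use of the cluster mean as the representative workload are all assumptions and are not taken from Wei.

```python
def kmeans_1d(values, centers, iterations=10):
    # Lloyd's algorithm on scalar values with caller-supplied initial centers.
    centers = list(centers)
    clusters = [[] for _ in centers]
    for _ in range(iterations):
        # Assignment step: each value joins the cluster of its nearest center.
        clusters = [[] for _ in centers]
        for v in values:
            nearest = min(range(len(centers)), key=lambda i: abs(v - centers[i]))
            clusters[nearest].append(v)
        # Update step: move each center to the mean of its cluster
        # (an empty cluster keeps its previous center).
        centers = [sum(c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return centers, clusters

workloads = [10, 12, 11, 100, 104, 96]    # hypothetical per-layer workloads
centers, clusters = kmeans_1d(workloads, centers=[10, 100])
print(centers, clusters)
```

Here each resulting center plays the role of a representative workload for its cluster; a frequency-weighted representative, as mentioned in Wei [0058], would require weighting the mean.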
In regard to claim 9: (Original)
Wei does not explicitly disclose:
- wherein the plurality of computational layers further include a pooling layer that includes a plurality of pooling units each configured to compare multiple input values, the pooling layer is configured to perform a max-pooling or a min-pooling according to a pooling configuration that includes settings for a pooling operation topology and settings for cycle-by-cycle operations for the plurality of pooling units.
However, Raha discloses:
- wherein the plurality of computational layers further include a pooling layer that includes a plurality of pooling units each configured to compare multiple input values, the pooling layer is configured to perform a max-pooling or a min-pooling according to a pooling configuration that includes settings for a pooling operation topology and settings for cycle-by-cycle operations for the plurality of pooling units.
In [0033]:
the DNN 100 includes a sequence of layers comprising a plurality of convolutional layers 110 (individually referred to as “convolutional layer 110”), a plurality of pooling layers 120 (individually referred to as “pooling layer 120”),
In [0039]:
the pooling layers 120 may perform the pooling operation through average pooling (calculating the average value for each patch on the feature map), max pooling (calculating the maximum value for each patch of the feature map), or a combination of both. The size of the pooling operation is smaller than the size of the feature maps. In various embodiments, the pooling operation is 2×2 pixels applied with a stride of 2 pixels, so that the pooling operation reduces the size of a feature map by a factor of 2, e.g., the number of pixels or values in the feature map is reduced to one quarter the size.
In [0057]:
FIG. 5 includes a clock 530 to show the amount of time each PE 520 takes to perform an MAC operation on the weight vector assigned to the PE. The PE 520A takes five cycles to perform its MAC operation, versus two cycles for the PEs 520B-C, three cycles for the PE 520D, and seven cycles for the PE 520E.
In [0038]:
The pooling layers 120 downsample feature maps generated by the convolutional layers, e.g., by summarizing the presents of features in the patches of the feature maps. A pooling layer 120 is placed between two convolution layers 110: a preceding convolutional layer 110 (the convolution layer 110 preceding the pooling layer 120 in the sequence of layers) and a subsequent convolutional layer 110 (the convolution layer 110 subsequent to the pooling layer 120 in the sequence of layers)
It would have been obvious to one of ordinary skill in the art before the effective filing date of the present application to combine Wei and Raha.
Wei teaches MAC units and buffers.
Raha teaches a pooling layer.
One of ordinary skill would have been motivated to combine Wei and Raha to reduce resources using a rearranging process (rearranging the weight vector) (Raha [0062]).
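The 2×2 max pooling with a stride of 2 quoted from Raha [0039] can be illustrated as follows. The function name `max_pool_2x2` and the sample feature map are hypothetical, and the sketch assumes a feature map with even height and width.

```python
def max_pool_2x2(feature_map):
    # Slide a 2x2 window with stride 2 and keep the maximum of each patch.
    rows, cols = len(feature_map), len(feature_map[0])
    pooled = []
    for r in range(0, rows - 1, 2):
        out_row = []
        for c in range(0, cols - 1, 2):
            patch = (feature_map[r][c], feature_map[r][c + 1],
                     feature_map[r + 1][c], feature_map[r + 1][c + 1])
            out_row.append(max(patch))  # comparison of multiple input values
        pooled.append(out_row)
    return pooled

fm = [[1, 3, 2, 0],
      [4, 2, 1, 5],
      [7, 0, 3, 3],
      [1, 8, 2, 6]]                     # hypothetical 4x4 feature map
pooled = max_pool_2x2(fm)
print(pooled)
```

The 4×4 input (16 values) becomes a 2×2 output (4 values), consistent with the quoted statement that the number of values is reduced to one quarter.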
In regard to claim 11: (Original)
Wei discloses:
- A method, comprising: loading configurations for computational layers to perform computations for a neural network into a configuration buffer
In [0082]:
Benchmarking system 718 can include instructions, which when executed by computer system 700 can cause computer system 700 to perform methods and/or processes described in this disclosure. Specifically, benchmarking system 718 can include instructions for collecting information associated with a respective layer of a one respective of representative AI models (collection module 720). Benchmarking system 718 can also include instructions for calculating the workload (i.e., the computational load) for a respective layer of a respective one of representative AI models (workload module 722).
In [0046]:
During operation, system 150 can determine AI models 130 based on the representative applications. In some embodiments, system 150 can maintain a list of representative applications (e.g., in a local storage device) and their corresponding AI models. This list can be generated during the configuration of system 150 (e.g., by an administrator). Furthermore, AI models 130 can be loaded onto the memory of device 120 such that system 150 may access a respective one of AI models 130. This allows system 150 to collect information associated with a respective layer of AI models 132, 134, and 136. Collected information can include one or more of: number of channels, number of filters, filter size, stride information, and padding information.
In [0081]:
Computer system 700 includes a processor 702, a memory device 704, and a storage device 708. Memory device 704 can include a volatile memory device
In [0081]:
Storage device 708 can store an operating system 716, a benchmarking system 718, and data 736.
- the computational layers including a multiply-accumulate (MAC) layer,
In [0008]:
The system generates a set of computational layers such that a computational layer corresponds to a respective workload cluster in the set of workload clusters. The system then combines the set of computational layers to form the synthetic AI model.
In [0056]:
System 150 then computes the workload associated with a respective layer of a respective one of AI models 130. For example, for a layer 220 of AI model 134, system 150 determines layer information 224, which can include number of filters, filter size, stride information, and padding information. In some embodiments, system 150 uses layer information 224 to determine the MAC operations associated with layer 220 and compute MAC time that indicates the time to execute the determined MAC operations. System 150 can use the computed MAC time as workload 222 for that layer.
Wei does not explicitly disclose:
- and the configurations including at least one fully-connected neural network (FNN) configuration for the MAC layer to perform computations for at least one FNN layer;
- loading parameters for the neural network into a data buffer, the parameters including weights and biases for the computational layers;
- loading input data into the data buffer;
- and activating the computational layers and applying the configurations to perform the computations for the neural network, including: applying the at least one FNN configuration to the MAC layer, the MAC layer including a plurality of MAC units, the at least one FNN configuration including settings for a FNN operation topology for the plurality of MAC units to perform the computations for the at least one FNN layer.
However, Raha discloses:
- loading parameters for the neural network into a data buffer, the parameters including weights and biases for the computational layers;
In [0042]:
FIG. 2 illustrates a hardware architecture 200 for a layer of a DNN, in accordance with various embodiments. The hardware architecture 200 includes a plurality of PEs 210 (individually referred to as “PE 210”) and column buffers 220 (individually referred to as “column buffer 220”). In other embodiments, the hardware architecture 200 includes other components, such as a static random-access memory (SRAM) for storing input and output of the layer. The hardware architecture 200 may also include a distribution unit for distributing data stored in the SRAM to the column buffers 220,
In [0090]:
A fully connected layer involves weights, biases, and neurons. It connects neurons in one layer to neurons in another layer. It is used to classify images between different category by training.
It would have been obvious to one of ordinary skill in the art before the effective filing date of the present application to combine Wei and Raha.
Wei teaches MAC units and buffers.
Raha teaches loading parameters.
One of ordinary skill would have been motivated to combine Wei and Raha to reduce resources using a rearranging process (rearranging the weight vector) (Raha [0062]).
Wei and Raha do not explicitly disclose:
- the configurations including at least one fully-connected neural network (FNN) configuration for the MAC layer to perform computations for at least one FNN layer;
- loading input data into the data buffer;
- and activating the computational layers and applying the configurations to perform the computations for the neural network, including: applying the at least one FNN configuration to the MAC layer, the MAC layer including a plurality of MAC units, the at least one FNN configuration including settings for a FNN operation topology for the plurality of MAC units to perform the computations for the at least one FNN layer.
However, Kim discloses:
- the configurations including at least one fully-connected neural network (FNN) configuration for the MAC layer to perform computations for at least one FNN layer;
In [Col 4, lines 35-46]:
The operation module 110 is configured to include a plurality of PEs. The plurality of PEs are configured in an array structure of a predetermined pattern to parallel-process data between PEs which are synchronously adjacent to each other, and simultaneously perform the same function. A PE may perform an operation and exchange data between PEs, and may be synchronized with one clock to perform an operation. That is, the plurality of PEs may each perform the same operation for each of the clocks. Since the plurality of PEs share data with the PE which is adjacent thereto on the same path, a connection structure between the PEs may form a geometrically simple symmetrical structure,
In [Col 11, lines 62-65]:
operation module 110 forming the systolic array of the tree topology network structure is also required for the fully connected layer of CNN, both the operation module 110 of the mesh topology network structure and the operation module 110 of the tree topology network structure were required.
In [Col 7, lines 52-57]:
the processor 120 may control the operation module 110 to perform the convolution operation by transferring different data values of the input data, that is, accumulation for the values obtained by multiplying different element values of the feature map with each of the elements of the filter 70 to an adjacent PE,
In [Col 6, lines 28-31]:
When feature values are finally extracted from the convolution layer of the CNN, in a fully connected layer, the extracted feature values are input to the neural network to perform a classification
- loading input data into the data buffer;
In [Col 2, lines 10-18]:
a processor configured to control the operation module to perform a convolution operation by applying a filter to input data, wherein the processor controls the operation module to perform the convolution operation by inputting each of a plurality of elements configuring a two-dimensional filter to the plurality of processing elements in a predetermined order and sequentially applying the plurality of elements to the input data.
In [Col 8, lines 49-53]:
In addition, the first accumulation values D0 to D7 derived at the previous clock may be moved to the PE adjacent to a lower end by one space and may be each temporarily stored in a memory included in the PE of the second row (to this end, each PE may include the memory)
- and activating the computational layers and applying the configurations to perform the computations for the neural network, including: applying the at least one FNN configuration to the MAC layer, the MAC layer including a plurality of MAC units, the at least one FNN configuration including settings for a FNN operation topology for the plurality of MAC units to perform the computations for the at least one FNN layer.
In [Col 7, lines 52-57]:
the processor 120 may control the operation module 110 to perform the convolution operation by transferring different data values of the input data, that is, accumulation for the values obtained by multiplying different element values of the feature map with each of the elements of the filter 70 to an adjacent PE,
(BRI: accumulation for multiplication is the MAC function)
In [Col 4, lines 47-51]:
For example, PEs may be arranged in various forms of network structures such as a mesh topology network, a tree topology network, and the like. A structure of the mesh topology network and the tree topology network are described below with reference to FIGS. 3 and 9.
(BRI: the arrangement of processors in different network structures constitutes setting a network topology. A fully connected topology, also known as a mesh topology, means that every processor is directly connected to every other processor. This setup ensures direct communication paths between any two processors without relying on intermediate devices)
In [Col 6, lines 28-31]:
When feature values are finally extracted from the convolution layer of the CNN, in a fully connected layer, the extracted feature values are input to the neural network to perform a classification
In [Col 12, lines 32-46]:
Referring to FIG. 11, in panel (a), a classification process by the fully connected layer of a CNN algorithm is illustrated. The fully connected layer derives a final result value by classifying feature data which is compressed and summarized by the convolution layer and the pooling layer of the CNN by a deep neural network. The feature data calculated by repeating the convolution layer and the pooling layer is input to the deep neural network of the fully connected layer as input values i_1 to i_1000, and each of the input values is connected to an edge having weight W_ij. A value obtained by summing values obtained by multiplying each input value with the weight is input to an activation function (e.g., a signoid function) to output activation values j_1 to j_800, where the activation values perform the same function as another weight in the next layer to output the final output value.
It would have been obvious to one of ordinary skill in the art before the effective filing date of the present application to combine Wei, Raha and Kim.
Wei teaches MAC units and buffers.
Raha teaches loading parameters.
Kim teaches fully connected layers and configurations of FNN to compute using MAC.
One of ordinary skill would have been motivated to combine Wei, Raha and Kim to reduce the load on the memory and provide increased operation efficiency (Kim [Col 11, lines 40-44]).
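The examiner's reading that accumulation of multiplied values is the MAC function (Kim [Col 7, lines 52-57]) can be illustrated with a direct convolution written as explicit multiply-accumulate steps. This is a sketch under stated assumptions: stride 1, no padding, no filter flip, and hypothetical names and sample data. It does not model the systolic shifting of partial sums between adjacent PEs described in Kim.

```python
def conv2d_mac(feature_map, kernel):
    # Each output element is the accumulation of products between the
    # filter elements and the feature-map elements under the window.
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(feature_map) - kh + 1
    out_w = len(feature_map[0]) - kw + 1
    output = []
    for r in range(out_h):
        row = []
        for c in range(out_w):
            acc = 0
            for i in range(kh):
                for j in range(kw):
                    acc += kernel[i][j] * feature_map[r + i][c + j]  # one MAC
            row.append(acc)
        output.append(row)
    return output

fm = [[1, 2, 3],
      [4, 5, 6],
      [7, 8, 9]]            # hypothetical 3x3 feature map
k = [[1, 0],
     [0, 1]]                # hypothetical 2x2 filter
result = conv2d_mac(fm, k)
print(result)
```

With a 2×2 filter, each output value is the sum of four products, which parallels the quoted description of summing the values obtained by multiplying four filter elements with four feature-map elements.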
In regard to claim 12: (Original)
Wei does not explicitly disclose:
- wherein the configurations further include at least one convolutional neural network (CNN) configuration for the MAC layer to perform computations for at least one CNN layer, the at least one CNN configuration includes settings for a CNN operation topology and settings for cycle-by-cycle operations for the plurality of MAC units to perform the computations for the at least one CNN layer,
However, Raha discloses:
- wherein the configurations further include at least one convolutional neural network (CNN) configuration for the MAC layer to perform computations for at least one CNN layer, the at least one CNN configuration includes settings for a CNN operation topology and settings for cycle-by-cycle operations for the plurality of MAC units to perform the computations for the at least one CNN layer,
In [0033]:
the DNN 100 includes a sequence of layers comprising a plurality of convolutional layers 110 (individually referred to as “convolutional layer 110”), a plurality of pooling layers 120 (individually referred to as “pooling layer 120”),
In [0039]:
the pooling layers 120 may perform the pooling operation through average pooling (calculating the average value for each patch on the feature map), max pooling (calculating the maximum value for each patch of the feature map), or a combination of both. The size of the pooling operation is smaller than the size of the feature maps. In various embodiments, the pooling operation is 2×2 pixels applied with a stride of 2 pixels, so that the pooling operation reduces the size of a feature map by a factor of 2, e.g., the number of pixels or values in the feature map is reduced to one quarter the size.
In [0057]:
FIG. 5 includes a clock 530 to show the amount of time each PE 520 takes to perform an MAC operation on the weight vector assigned to the PE. The PE 520A takes five cycles to perform its MAC operation, versus two cycles for the PEs 520B-C, three cycles for the PE 520D, and seven cycles for the PE 520E.
In [0038]:
The pooling layers 120 downsample feature maps generated by the convolutional layers, e.g., by summarizing the presents of features in the patches of the feature maps. A pooling layer 120 is placed between two convolution layers 110: a preceding convolutional layer 110 (the convolution layer 110 preceding the pooling layer 120 in the sequence of layers) and a subsequent convolutional layer 110 (the convolution layer 110 subsequent to the pooling layer 120 in the sequence of layers)
In [0070]:
As the order of weights changed, the order of elements in the input feature map may also need to be changed. This is because input feature map and weights come into the DNN layer as a pair so if the indices of the weights are changed, the same change needs to be made to the elements in the input feature map.
In [0019]:
Most sparse DNN accelerators rarely achieve the maximum speedup that can be obtained from skipping computation on sparse data due to the various factors including: the underlying DNN dataflow, synchronization barriers during the drain phase (extraction of output points from the computation units to upper memory hierarchies) and the overheads associated with splitting work into multiple smaller tasks among multiple PEs.
Wei and Raha do not explicitly disclose:
- and the settings for the CNN operation topology include settings for operations in a row direction of an input data matrix, settings for operations in a column direction of the input data matrix, settings for operations in a row direction of a weight matrix and settings for operations in a column direction of the weight matrix.
However, Kim discloses:
- and the settings for the CNN operation topology include settings for operations in a row direction of an input data matrix, settings for operations in a column direction of the input data matrix, settings for operations in a row direction of a weight matrix and settings for operations in a column direction of the weight matrix.
In [Col 5, lines 33-43]:
Referring to FIG. 3, a plurality of PEs 20-1 to 20-n included in the operation module 110 form a systolic array of a mesh topology network structure. The plurality of PEs 20-1 to 20-n may share data through lines connecting the PEs 20-1 to 20-n which are adjacent to each other to perform the operation, using the mesh topology network structure. The mesh topology network structure is a structure in which respective PEs may be connected to the PEs which are adjacent thereto in a mesh network to exchange data, as illustrated in FIG. 3,
In [Col 4, lines 65-67] and [Col 5, lines 1-3]:
For example, the processor 120 may control the operation module 110 to perform a convolution operation based on a neural network by applying a filter to input data which is input to the operation module 110. In this case, the filter, which is a mask having a weight, is defined as a matrix. The filter is also referred to as a window or a kernel,
In [Col 5, lines 50-55]:
Referring to FIGS. 4A, 4B, and 4C, FIG. 4A illustrates the conventional method for performing a convolution operation according to the CNN by applying a two-dimensional filter 41 to two-dimension input data, where the input data and the filter 41 may be formed of matrix data including elements having one or more certain values.
In [Col 7, lines 8-19]:
A basic input order is repeated in the order of proceeding in one side direction based on a certain element in the two-dimensional filter, proceeding to an element which is adjacent to a corresponding element in a next row or a next column of the element positioned at the end of the proceeding direction, and proceeding in a direction opposite to one side direction in the adjacent element. In panel (b) of FIG. 6, elements ① to ④ are input to the operation module 110 in the order of the numbers. In panel (c) of FIG. 6, elements ① to ⑨ are input to the operation module 110 in the order of the numbers,
In [Col 8, lines 9-16]:
When the operation of the first elements is completed in the first row and the operation for the second elements starts, the processor 120 then shifts a plurality of values for the first elements in a predetermined direction to perform the accumulation for the values. In this case, the predetermined direction is the same as a direction in which the second elements are disposed based on the first elements in the two-dimensional filter.
In [Col 9, lines 39-49]:
Referring to FIG. 7J, third accumulation values Q1 to Q8 are illustrated in which the second accumulation values K1 to K8 are summed to M1 to M8. The third accumulation values Q1 to Q8 are equal to the value obtained by summing values obtained by each multiplying the elements ① to ④ of the filter 70 with elements I to IV of the feature map, and the third accumulation values Q1 to Q8 are output through an output terminal to become elements of a first row and first column of a new feature map derived by the convolution operation.
It would have been obvious to one of ordinary skill in the art before the effective filing date of the present application to combine Wei, Raha and Kim.
Wei teaches MAC units and buffers.
Raha teaches loading parameters.
Kim teaches fully connected layers and configurations of FNN to compute using MAC.
One of ordinary skill would have been motivated to combine Wei, Raha and Kim to reduce the load on the memory and provide increased operation efficiency (Kim [Col 11, lines 40-44]).
In regard to claim 13: (Original)
Wei discloses:
- wherein the plurality of MAC units are grouped into several groups, and each group includes one or more MAC units and is configured to perform convolutions for one output channel according to the at least one CNN configuration, the one or more MAC units in one group share a same batch of weights but have different input data elements.
In [0040]:
the system can determine the workload of a respective layer, and store the workload information in a workload table. The system then can cluster workloads of the layers (e.g., using k-means) based on the workload table,
(BRI: a cluster is a group)
In [0068]:
For example, suppose that SAI model 140 generates a synthetic image based on an input image. Suppose that the input image size is 224×224×3. The output image dimension can be calculated as (input image size−filter size)/stride+1. Suppose that workload 232 is 36602000 (e.g., a MAC value of 36602000). System 150 then determines channel number as 100, filter size as 11×11, and stride as 4 for input size 332. This leads to an output image size of 55. This can generate a workload of approximately 36602500, which is a close approximation of workload 232, for layer 352.
In [0056]:
System 150 then computes the workload associated with a respective layer of a respective one of AI models 130,
In [0056]:
system 150 uses layer information 224 to determine the MAC operations associated with layer 220,
In [0057]:
System 150 can determine the number of clusters based on a clustering parameter. The parameter can be based on how the workloads are distributed (e.g., based on a range of workloads that can be included in a cluster or a diameter of a cluster) or a predetermined number of clusters,
In [0058]:
the workloads in a cluster also incorporate the execution frequencies, the representative weight for a cluster can be closer to the workload of a layer with a high execution frequency,
In [0035]:
a neural network can be based on a collection of connected units or nodes where each connection (e.g., a simplified version of a synapse) between artificial neurons can transmit a signal from one to another. The artificial neuron that receives the signal can process it and then signal artificial neurons connected to it.
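The arithmetic in the Wei [0068] passage quoted above for claim 13 can be checked with a short Python sketch. Two points are assumptions, not statements from Wei: the quoted formula gives a fractional value for a 224 input, so the sketch rounds up to reach the reported output size of 55, and the workload is taken to be output size squared times filter size squared times channel number, which happens to reproduce the quoted figure of approximately 36602500. The function names are hypothetical.

```python
import math


def output_size(input_size, filter_size, stride):
    """(input size - filter size) / stride + 1, as quoted; the result may be fractional."""
    return (input_size - filter_size) / stride + 1


def mac_workload(out_size, filter_size, channels):
    """Assumed MAC count: one multiply-accumulate per filter tap, per output element, per channel."""
    return out_size * out_size * filter_size * filter_size * channels


raw = output_size(224, 11, 4)          # fractional value from the quoted formula
out = math.ceil(raw)                   # assumption: round up to the reported size
workload = mac_workload(out, 11, 100)  # channel number 100, filter 11x11
print(raw, out, workload)
```

Under these assumptions the computed workload is within 500 of the target workload 232 value of 36602000 given in the passage.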
In regard to claim 14: (Original)
Wei does not explicitly disclose:
- wherein the settings for the FNN operation topology include settings for operations in a row direction of an input data matrix and a weight matrix, settings for operations in a column direction of the input data matrix and the weight matrix, and settings for operations of nodes of the at least one FNN layer in batches based on a number of MAC units in the MAC layer.
However, Raha discloses:
- wherein the settings for the FNN operation topology include settings for operations in a row direction of an input data matrix and a weight matrix, settings for operations in a column direction of the input data matrix and the weight matrix, and settings for operations of nodes of the at least one FNN layer in batches based on a number of MAC units in the MAC layer.
In [0089]:
The training dataset can be divided into one or more batches,
In [0017]:
DNN accelerators are suitable to execute these type of workloads as they consist of thousands of parallel MAC units that can simultaneously operate and produce the output in lesser time.
(BRI: batch processing is a common type of computing workload)
In [0041]:
In the embodiment of FIG. 1, N equals 3, as there are three objects 115, 125, and 135 in the input image. Each element of the vector indicates the probability for the input image 105 to belong to a class. To calculate the probabilities, the fully connected layers 130 multiply each input element by weight, makes the sum, and then applies an activation function (e.g., logistic if N=2, softmax if N>2). This is equivalent to multiplying the input vector by the matrix containing the weights,
In [0096]:
the weight vector represents a filter in the layer. The filter may be a matrix of weights that includes X rows and Y columns, where X and Y are integers. The weight vector may include Y sections, each section is a row in the matrix. For instance, the first row is the first section in the weight vector, the second row is the second section in the weight vector, and so on. In other embodiments, the weight vector may represent a portion of the filter or represent multiple filters of the layer. The layer may also include an activation. The activation can be represented by an activation vector that includes a sequence of activations.
It would have been obvious to one of ordinary skill in the art before the effective filing date of the present application to combine Wei and Raha.
Wei teaches MAC units and buffers.
Raha teaches operations in batches.
One of ordinary skill would have been motivated to combine Wei and Raha to reduce resource usage through a rearranging process (rearranging the weight vector) (Raha [0062]).
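As a hedged illustration of the claim 14 subject matter, the sketch below computes a fully connected layer as a matrix-vector product, as in the Raha [0041] passage quoted above, while processing the output nodes in batches whose size equals a number of MAC units. The function name fc_layer_batched, the parameter num_macs, and the sample weights are hypothetical; neither Raha nor the claim specifies this loop structure.

```python
def fc_layer_batched(x, weights, num_macs):
    """Compute y = W x, handling num_macs output nodes at a time.

    weights[n] is the weight row for output node n. Within a batch, every
    MAC unit receives the same input element on each step and accumulates
    its own product.
    """
    outputs = []
    for start in range(0, len(weights), num_macs):
        batch = weights[start:start + num_macs]
        acc = [0.0] * len(batch)
        for k, xk in enumerate(x):           # walk along the input vector
            for m, row in enumerate(batch):  # one accumulator per MAC unit
                acc[m] += row[k] * xk
        outputs.extend(acc)
    return outputs


x = [1.0, 2.0, 3.0]
weights = [[1, 0, 0], [0, 1, 0], [0, 0, 1], [1, 1, 1], [2, 0, -1]]
y = fc_layer_batched(x, weights, num_macs=2)
```

The result does not depend on the batch size; only the number of passes over the input vector changes.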
In regard to claim 15: (Original)
Wei discloses:
- wherein the plurality of computational layers further include a K-Means layer configured to cluster the input data into a plurality of clusters according to a K-Means configuration.
In [0052]:
computation load analysis unit 154 can calculate the workload of a layer based on the input parameters and algorithms applicable to the layer. In some embodiments, the workload of a layer can be calculated based on multiply-accumulate (MAC) time for the operations associated with the layer,
In [0040]:
Based on the collected information and the execution frequencies, the system can determine the workload of a respective layer, and store the workload information in a workload table. The system then can cluster workloads of the layers (e.g., using k-means) based on the workload table. The system can also group the input sizes of the layers. The system can determine a representative workload for a respective cluster and a representative input size for a respective input group. The system then matches a respective representative workload to a corresponding representative input size such that the input size can generate the corresponding workload. The system may adjust an input size to match the workload. The system can generate an SAI model that includes a layer corresponding each cluster.
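The clustering step in the Wei [0040] passage quoted above can be sketched as a one-dimensional k-means over layer workloads. This is a minimal sketch under stated assumptions: the workload values and starting centroids are invented for illustration, the function name kmeans_1d is hypothetical, and the execution-frequency weighting mentioned in Wei [0058] is not modeled.

```python
def kmeans_1d(values, centroids, iterations=10):
    """Plain k-means on scalars: assign each value to the nearest centroid, then re-average."""
    centroids = list(centroids)
    clusters = [[] for _ in centroids]
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for v in values:
            nearest = min(range(len(centroids)), key=lambda i: abs(v - centroids[i]))
            clusters[nearest].append(v)
        # An empty cluster keeps its previous centroid.
        centroids = [
            sum(c) / len(c) if c else centroids[i] for i, c in enumerate(clusters)
        ]
    return centroids, clusters


workloads = [10, 12, 11, 100, 105, 98]  # hypothetical per-layer workloads
representatives, groups = kmeans_1d(workloads, centroids=[10, 100])
```

Each final centroid plays the role of a representative workload for its cluster.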
In regard to claim 19: (Original)
Wei does not explicitly disclose:
- wherein the plurality of computational layers further include a pooling layer that includes a plurality of pooling units each configured to compare multiple input values, the pooling layer is configured to perform a max-pooling or a min-pooling according to a pooling configuration that includes settings for a pooling operation topology and settings for cycle-by-cycle operations for the plurality of pooling units.
However, Raha discloses:
- wherein the plurality of computational layers further include a pooling layer that includes a plurality of pooling units each configured to compare multiple input values, the pooling layer is configured to perform a max-pooling or a min-pooling according to a pooling configuration that includes settings for a pooling operation topology and settings for cycle-by-cycle operations for the plurality of pooling units.
In [0033]:
the DNN 100 includes a sequence of layers comprising a plurality of convolutional layers 110 (individually referred to as “convolutional layer 110”), a plurality of pooling layers 120 (individually referred to as “pooling layer 120”),
In [0039]:
the pooling layers 120 may perform the pooling operation through average pooling (calculating the average value for each patch on the feature map), max pooling (calculating the maximum value for each patch of the feature map), or a combination of both. The size of the pooling operation is smaller than the size of the feature maps. In various embodiments, the pooling operation is 2×2 pixels applied with a stride of 2 pixels, so that the pooling operation reduces the size of a feature map by a factor of 2, e.g., the number of pixels or values in the feature map is reduced to one quarter the size.
In [0057]:
FIG. 5 includes a clock 530 to show the amount of time each PE 520 takes to perform an MAC operation on the weight vector assigned to the PE. The PE 520A takes five cycles to perform its MAC operation, versus two cycles for the PEs 520B-C, three cycles for the PE 520D, and seven cycles for the PE 520E.
In [0038]:
The pooling layers 120 downsample feature maps generated by the convolutional layers, e.g., by summarizing the presence of features in the patches of the feature maps. A pooling layer 120 is placed between two convolution layers 110: a preceding convolutional layer 110 (the convolution layer 110 preceding the pooling layer 120 in the sequence of layers) and a subsequent convolutional layer 110 (the convolution layer 110 subsequent to the pooling layer 120 in the sequence of layers)
(BRI: In the context of neural networks, "frequency of operation" for a pooling unit refers to how often the pooling operation is performed, and this is typically determined by the stride and kernel size settings, which control the downsampling rate)
It would have been obvious to one of ordinary skill in the art before the effective filing date of the present application to combine Wei and Raha.
Wei teaches MAC units and buffers.
Raha teaches pooling layer and cycle by cycle operations.
One of ordinary skill would have been motivated to combine Wei and Raha to reduce resource usage through a rearranging process (rearranging the weight vector) (Raha [0062]).
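The 2×2, stride-2 pooling described in the Raha [0039] passage quoted above can be sketched as follows. The function name pool2d and the sample feature map are hypothetical. Raha's passage names average and max pooling; the min variant is shown only because the claim language recites it.

```python
def pool2d(fmap, size=2, stride=2, reduce_fn=max):
    """Slide a size x size window with the given stride and reduce each patch to one value."""
    out = []
    for r in range(0, len(fmap) - size + 1, stride):
        row = []
        for c in range(0, len(fmap[0]) - size + 1, stride):
            patch = [fmap[r + i][c + j] for i in range(size) for j in range(size)]
            row.append(reduce_fn(patch))  # compare the values in the patch
        out.append(row)
    return out


fmap = [
    [1, 3, 2, 4],
    [5, 6, 7, 8],
    [9, 2, 1, 0],
    [3, 4, 5, 6],
]
max_pooled = pool2d(fmap, reduce_fn=max)
min_pooled = pool2d(fmap, reduce_fn=min)
```

The 4×4 input has 16 values and each 2×2 output has 4, which matches the quoted reduction to one quarter of the values.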
Claims 6-8 and 16-18 are rejected under 35 U.S.C. 103 as being unpatentable over
Wei et al. (hereinafter Wei) US 2020/0042419 A1,
in view of Arnab Raha et al. (hereinafter Raha) US 2022/0083843 A1,
in view of Kyoung-Hoon Kim et al. (hereinafter Kim) US 11907826 B2,
further in view of Rouhani Darvish et al. (hereinafter Darvish) US 2020/0210840 A1.
In regard to claim 6: (Original)
Wei, Raha and Kim do not explicitly disclose:
- wherein the plurality of computational layers further include a quantization layer configured to transform data values from real numbers to quantized numbers and from quantized numbers to real numbers.
However, Darvish discloses:
- wherein the plurality of computational layers further include a quantization layer configured to transform data values from real numbers to quantized numbers and from quantized numbers to real numbers.
In [0074]:
In some examples, a set of parallel multiply-accumulate (MAC) units in each convolutional layer can be used to speed up the computation. Also, parallel multiplier units can be used in the fully-connected and dense-matrix multiplication stages,
In [0007]:
FIG. 3 is a diagram depicting certain aspects of converting a normal floating-point format to a quantized floating-point format,
In [0039]:
A given number can be represented using different precision (e.g., different quantized precision) formats. For example, a number can be represented in a higher precision format (e.g., float32) and a lower precision format (e.g., float16). Lowering the precision of a number can include reducing the number of bits used to represent the mantissa or exponent of the number,
In [0094]:
The de-quantization function converts quantized floating-point values to normal-precision floating-point values.
It would have been obvious to one of ordinary skill in the art before the effective filing date of the present application to combine Wei, Raha, Kim, and Darvish.
Wei teaches MAC units and buffers.
Raha teaches loading parameters.
Kim teaches fully connected layers and configurations of FNN to compute using MAC.
Darvish teaches quantization.
One of ordinary skill would have been motivated to combine Wei, Raha, Kim, and Darvish to improve the latency and throughput of DNN processing (Darvish [0027]).
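The precision reduction described in the Darvish [0039] and [0094] passages quoted above can be illustrated with a round trip through IEEE 754 half precision using Python's standard struct module. This is a sketch only: the function names are hypothetical, Python floats are 64-bit (not the float32 named in the passage), and half precision is one example of a lower-precision format, not Darvish's quantized floating-point format.

```python
import struct


def quantize_fp16(x):
    """Q(): encode a Python float as 2 bytes of IEEE 754 half precision."""
    return struct.pack("<e", x)


def dequantize_fp16(encoded):
    """Q^-1(): decode the 2 bytes back into a Python float."""
    return struct.unpack("<e", encoded)[0]


exact = dequantize_fp16(quantize_fp16(0.5))    # 0.5 is representable in half precision
rounded = dequantize_fp16(quantize_fp16(0.1))  # 0.1 is not, so a small error appears
print(exact, rounded, abs(rounded - 0.1))
```

The round-trip error on 0.1 reflects the fewer mantissa bits available in the lower-precision format.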
In regard to claim 7: (Original)
Wei, Raha and Kim do not explicitly disclose:
- wherein the quantization layer is configured to perform data transformation driven by another computational layer.
However, Darvish discloses:
- wherein the quantization layer is configured to perform data transformation driven by another computational layer.
In [0074]:
In some examples, a set of parallel multiply-accumulate (MAC) units in each convolutional layer can be used to speed up the computation. Also, parallel multiplier units can be used in the fully-connected and dense-matrix multiplication stages,
In [0098]:
The weight error term ∂W can be described mathematically as:
[Equation image media_image1.png (weight error term ∂W) not reproduced]
where ∂w_i is the weight error term for the layer i, ∂y_i is the output error term for the layer i, y_i is the output for the layer i, h( ) is a backward function of the layer, Q( ) is a quantization function, and Q^−1( ) is a dequantization or up-scale function. The backward function h( ) can be the backward function of ƒ( ) for a gradient with respect to W_(i−1) or a portion of the weight error equation 6. The weight error term of the layer can be the de-quantized representation of h( ) or the weight error term can include additional terms that are performed using normal-precision floating-point (after de-quantization) or using quantized floating-point (before de-quantization). The weight error term can include additional terms that are performed using normal-precision floating-point.
In regard to claim 8: (Original)
Wei, Raha and Kim do not explicitly disclose:
- wherein the quantization layer is configured to perform data transformation according to a quantization configuration.
However, Darvish discloses:
- wherein the quantization layer is configured to perform data transformation according to a quantization configuration.
In [0074]:
In some examples, a set of parallel multiply-accumulate (MAC) units in each convolutional layer can be used to speed up the computation. Also, parallel multiplier units can be used in the fully-connected and dense-matrix multiplication stages,
In [0110]:
operation of the quantization circuits 722, 724, and/or 742 or the de-quantization circuits 732, 734, and/or 754 that were previously configured to transform data from a normal precision floating-point format to a first block floating-point format can be reconfigured so that values received in the normal precision floating-point format are converted to the second block floating-point format prior to sending to the quantized layer 710.
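As a rough illustration of converting values to a block floating-point format under a changeable configuration, as in the Darvish [0110] passage quoted above, the sketch below quantizes a block of values to integer mantissas that share one exponent. It is an assumption-laden sketch: the function names, the use of mantissa width as the configuration parameter, and the rounding and clamping choices are hypothetical and are not taken from Darvish.

```python
import math


def bfp_quantize(values, mantissa_bits):
    """Quantize a block to signed integer mantissas that share a single exponent."""
    max_abs = max(abs(v) for v in values)
    if max_abs == 0:
        return [0] * len(values), 0
    # frexp gives max_abs = m * 2**exp with 0.5 <= m < 1.
    exp = math.frexp(max_abs)[1]
    shared_exp = exp - (mantissa_bits - 1)
    scale = 2.0 ** shared_exp
    limit = 2 ** (mantissa_bits - 1) - 1  # largest signed mantissa magnitude
    mantissas = [max(-limit, min(limit, round(v / scale))) for v in values]
    return mantissas, shared_exp


def bfp_dequantize(mantissas, shared_exp):
    """Convert shared-exponent mantissas back to ordinary floats."""
    return [m * 2.0 ** shared_exp for m in mantissas]


block = [1.0, 0.5, -0.25, 0.75]
wide = bfp_dequantize(*bfp_quantize(block, mantissa_bits=4))    # first configuration
narrow = bfp_dequantize(*bfp_quantize(block, mantissa_bits=2))  # second configuration
```

Changing only the configuration parameter changes how much of each value survives the round trip, which is the sense in which the conversion is reconfigurable in this sketch.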
In regard to claim 16: (Original)
Wei, Raha and Kim do not explicitly disclose:
- wherein the plurality of computational layers further include a quantization layer configured to transform data values from real numbers to quantized numbers and from quantized numbers to real numbers.
However, Darvish discloses:
- wherein the plurality of computational layers further include a quantization layer configured to transform data values from real numbers to quantized numbers and from quantized numbers to real numbers.
In [0074]:
In some examples, a set of parallel multiply-accumulate (MAC) units in each convolutional layer can be used to speed up the computation. Also, parallel multiplier units can be used in the fully-connected and dense-matrix multiplication stages,
In [0007]:
FIG. 3 is a diagram depicting certain aspects of converting a normal floating-point format to a quantized floating-point format,
In [0039]:
A given number can be represented using different precision (e.g., different quantized precision) formats. For example, a number can be represented in a higher precision format (e.g., float32) and a lower precision format (e.g., float16). Lowering the precision of a number can include reducing the number of bits used to represent the mantissa or exponent of the number,
In [0094]:
The de-quantization function converts quantized floating-point values to normal-precision floating-point values.
It would have been obvious to one of ordinary skill in the art before the effective filing date of the present application to combine Wei, Raha, Kim, and Darvish.
Wei teaches MAC units and buffers.
Raha teaches loading parameters.
Kim teaches fully connected layers and configurations of FNN to compute using MAC.
Darvish teaches quantization.
One of ordinary skill would have been motivated to combine Wei, Raha, Kim, and Darvish to improve the latency and throughput of DNN processing (Darvish [0027]).
In regard to claim 17: (Original)
Wei, Raha and Kim do not explicitly disclose:
- wherein the quantization layer is configured to perform data transformation driven by another computational layer.
However, Darvish discloses:
- wherein the quantization layer is configured to perform data transformation driven by another computational layer.
In [0074]:
In some examples, a set of parallel multiply-accumulate (MAC) units in each convolutional layer can be used to speed up the computation. Also, parallel multiplier units can be used in the fully-connected and dense-matrix multiplication stages,
In [0098]:
The weight error term ∂W can be described mathematically as:
[Equation image media_image1.png (weight error term ∂W) not reproduced]
where ∂w_i is the weight error term for the layer i, ∂y_i is the output error term for the layer i, y_i is the output for the layer i, h( ) is a backward function of the layer, Q( ) is a quantization function, and Q^−1( ) is a dequantization or up-scale function. The backward function h( ) can be the backward function of ƒ( ) for a gradient with respect to W_(i−1) or a portion of the weight error equation 6. The weight error term of the layer can be the de-quantized representation of h( ) or the weight error term can include additional terms that are performed using normal-precision floating-point (after de-quantization) or using quantized floating-point (before de-quantization). The weight error term can include additional terms that are performed using normal-precision floating-point.
In regard to claim 18: (Original)
Wei, Raha and Kim do not explicitly disclose:
- wherein the quantization layer is configured to perform data transformation according to a quantization configuration.
However, Darvish discloses:
- wherein the quantization layer is configured to perform data transformation according to a quantization configuration.
In [0074]:
In some examples, a set of parallel multiply-accumulate (MAC) units in each convolutional layer can be used to speed up the computation. Also, parallel multiplier units can be used in the fully-connected and dense-matrix multiplication stages,
In [0110]:
operation of the quantization circuits 722, 724, and/or 742 or the de-quantization circuits 732, 734, and/or 754 that were previously configured to transform data from a normal precision floating-point format to a first block floating-point format can be reconfigured so that values received in the normal precision floating-point format are converted to the second block floating-point format prior to sending to the quantized layer 710.
Claims 10 and 20 are rejected under 35 U.S.C. 103 as being unpatentable over
Wei et al. (hereinafter Wei) US 2020/0042419 A1,
in view of Arnab Raha et al. (hereinafter Raha) US 2022/0083843 A1,
in view of Kyoung-Hoon Kim et al. (hereinafter Kim) US 11907826 B2,
further in view of Dexu LIN et al. (hereinafter LIN) US 2018/0060278 A1.
In regard to claim 10: (Original)
Wei, Raha and Kim do not explicitly disclose:
- wherein the plurality of computational layers further include a lookup table layer configured to generate an output value for an activation function by looking up a segment of an activation function curve enclosing an input data value and performing an interpolation based on activation function values for an upper value and a lower value of the segment.
However, LIN discloses:
- wherein the plurality of computational layers further include a lookup table layer configured to generate an output value for an activation function by looking up a segment of an activation function curve enclosing an input data value and performing an interpolation based on activation function values for an upper value and a lower value of the segment.
In [0024]:
One or more embodiments of the present disclosure may be used to increase the speed and to reduce the memory requirements for repetitive computations. For example, as shown in FIG. 1, in a single layer neural network 100,
In [0024]:
The activation function 104 may be designed to force the result of the transfer function 102 into a finite range, for example, {−1, 1}. Examples of the activation function 104 in a neural network are the tanh(•) and sigmoid functions. In a typical multi-layer neural network, a single pass through all the layers may involve tens of thousands of instances where an activation function 104 is applied. Therefore, the computation of the activation function 104 in a neural network may be a significant contributor to network latency.
In [0025]:
One or more of the embodiments of the present disclosure may be used to calculate nonlinear functions more accurately and efficiently in hardware using look-up tables (LUTs) and interpolation or extrapolation. Determining the value of a nonlinear function ƒ(x) for any value x may require time and/or memory space. For certain applications, aspects of the present disclosure may reduce the computation time and/or memory requirements for calculating certain nonlinear functions. By way of example and not limitation, computation of ƒ(x)=tanh(x), useful as an activation function in a neural network.
In [0029]:
In one aspect, the LUTs corresponding to the exponentially spaced segments (e.g., the boundaries of segments restricted to powers of 2) of the non-linear function ƒ(x) may allow a reduced complexity look-up of the segment that includes value of the input variable x. Utilizing exponentially spaced segments may also allow for higher precision calculation of the nonlinear function ƒ(x) for a certain range of the input and lower precision calculation of the nonlinear function ƒ(x) for other values of the input. For example, for activation functions in neural networks (e.g., sigmoid or tanh), the most interesting region may be close to 0. In a neural network, the activation function forces the result of a transfer function into a finite range {−1, 1}.
It would have been obvious to one of ordinary skill in the art before the effective filing date of the present application to combine Wei, Raha, Kim, and LIN.
Wei teaches MAC units and buffers.
Raha teaches loading parameters.
Kim teaches fully connected layers and configurations of FNN to compute using MAC.
LIN teaches LUT.
One of ordinary skill would have been motivated to combine Wei, Raha, Kim, and LIN to reduce memory space and improve processing time (LIN [0043]).
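The look-up-and-interpolate approach described in the LIN passages quoted above can be sketched in Python for tanh. This is a minimal sketch under stated assumptions: it uses uniformly spaced segments rather than the exponentially spaced segments LIN describes, it clamps inputs outside the table range, and the function names, range, and segment count are hypothetical.

```python
import bisect
import math


def build_lut(fn, lo, hi, segments):
    """Tabulate fn at the boundaries of equally wide segments covering [lo, hi]."""
    step = (hi - lo) / segments
    xs = [lo + i * step for i in range(segments + 1)]
    ys = [fn(x) for x in xs]
    return xs, ys


def lut_eval(xs, ys, x):
    """Find the segment enclosing x and interpolate linearly between its two boundary values."""
    if x <= xs[0]:
        return ys[0]
    if x >= xs[-1]:
        return ys[-1]
    i = bisect.bisect_right(xs, x) - 1  # index of the segment's lower boundary
    x0, x1 = xs[i], xs[i + 1]
    t = (x - x0) / (x1 - x0)
    return ys[i] + t * (ys[i + 1] - ys[i])


xs, ys = build_lut(math.tanh, -4.0, 4.0, 64)
approx = lut_eval(xs, ys, 0.3)
error = abs(approx - math.tanh(0.3))
```

The table stores 65 values in this configuration, and each evaluation needs one search and one interpolation instead of a direct tanh computation.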
In regard to claim 20: (Original)
Wei, Raha and Kim do not explicitly disclose:
- further comprising generating an output value for an activation function using a lookup table layer of the plurality of computational layers, wherein the lookup table layer to look up a segment of an activation function curve enclosing an input data value and perform an interpolation based on activation function values for an upper value and a lower value of the segment.
However, LIN discloses:
- further comprising generating an output value for an activation function using a lookup table layer of the plurality of computational layers, wherein the lookup table layer to look up a segment of an activation function curve enclosing an input data value and perform an interpolation based on activation function values for an upper value and a lower value of the segment.
In [0024]:
One or more embodiments of the present disclosure may be used to increase the speed and to reduce the memory requirements for repetitive computations. For example, as shown in FIG. 1, in a single layer neural network 100,
In [0024]:
The activation function 104 may be designed to force the result of the transfer function 102 into a finite range, for example, {−1, 1}. Examples of the activation function 104 in a neural network are the tanh(•) and sigmoid functions. In a typical multi-layer neural network, a single pass through all the layers may involve tens of thousands of instances where an activation function 104 is applied. Therefore, the computation of the activation function 104 in a neural network may be a significant contributor to network latency.
In [0025]:
One or more of the embodiments of the present disclosure may be used to calculate nonlinear functions more accurately and efficiently in hardware using look-up tables (LUTs) and interpolation or extrapolation. Determining the value of a nonlinear function ƒ(x) for any value x may require time and/or memory space. For certain applications, aspects of the present disclosure may reduce the computation time and/or memory requirements for calculating certain nonlinear functions. By way of example and not limitation, computation of ƒ(x)=tanh(x), useful as an activation function in a neural network.
In [0029]:
In one aspect, the LUTs corresponding to the exponentially spaced segments (e.g., the boundaries of segments restricted to powers of 2) of the non-linear function ƒ(x) may allow a reduced complexity look-up of the segment that includes value of the input variable x. Utilizing exponentially spaced segments may also allow for higher precision calculation of the nonlinear function ƒ(x) for a certain range of the input and lower precision calculation of the nonlinear function ƒ(x) for other values of the input. For example, for activation functions in neural networks (e.g., sigmoid or tanh), the most interesting region may be close to 0. In a neural network, the activation function forces the result of a transfer function into a finite range {−1, 1}.
It would have been obvious to one of ordinary skill in the art before the effective filing date of the present application to combine Wei, Raha, Kim, and LIN.
Wei teaches MAC units and buffers.
Raha teaches loading parameters.
Kim teaches fully connected layers and configurations of FNN to compute using MAC.
LIN teaches LUT.
One of ordinary skill would have been motivated to combine Wei, Raha, Kim, and LIN to reduce memory space and improve processing time (LIN [0043]).
Conclusion
THIS ACTION IS MADE FINAL. Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a).
A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action. In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any nonprovisional extension fee (37 CFR 1.17(a)) pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action. In no event, however, will the statutory period for reply expire later than SIX MONTHS from the mailing date of this final action.
Any inquiry concerning this communication or earlier communications from the examiner should be directed to TIRUMALE KRISHNASWAMY RAMESH whose telephone number is (571) 272-4605. The examiner can normally be reached by phone.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Li B. Zhen, can be reached by phone at (571) 272-3768. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users.
To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.
/TIRUMALE K RAMESH/Examiner, Art Unit 2121
/Li B. Zhen/Supervisory Patent Examiner, Art Unit 2121