DETAILED ACTION
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Priority
Applicant’s claim for the benefit of prior-filed PCT international application No. PCT/US2019/019306, filed on February 22, 2019, which claims priority to U.S. provisional patent application No. 62/634,785, filed on February 23, 2018, is acknowledged.
Drawings
The drawings were received on 08/17/2020. These drawings are acceptable.
Information Disclosure Statement
The information disclosure statement (IDS) submitted on 10/26/2020 has been considered by the examiner.
Response to Arguments
Applicant's arguments filed 09/11/2025 have been fully considered; the remarks are directed to subject matter not previously examined. See the current action for the updated rejections.
Claim Rejections - 35 USC § 112
The following is a quotation of the first paragraph of 35 U.S.C. 112(a):
(a) IN GENERAL.—The specification shall contain a written description of the invention, and of the manner and process of making and using it, in such full, clear, concise, and exact terms as to enable any person skilled in the art to which it pertains, or with which it is most nearly connected, to make and use the same, and shall set forth the best mode contemplated by the inventor or joint inventor of carrying out the invention.
The following is a quotation of the first paragraph of pre-AIA 35 U.S.C. 112:
The specification shall contain a written description of the invention, and of the manner and process of making and using it, in such full, clear, concise, and exact terms as to enable any person skilled in the art to which it pertains, or with which it is most nearly connected, to make and use the same, and shall set forth the best mode contemplated by the inventor of carrying out his invention.
Claims 1-18 and 21-28 are rejected under 35 U.S.C. 112(a) or 35 U.S.C. 112 (pre-AIA), first paragraph, as failing to comply with the written description requirement. The claim(s) contains subject matter which was not described in the specification in such a way as to reasonably convey to one skilled in the relevant art that the inventor or a joint inventor, or for applications subject to pre-AIA 35 U.S.C. 112, the inventor(s), at the time the application was filed, had possession of the claimed invention. Specifically, the independent claims (claims 1, 9, and 22) recite limitations directed to “computing a sparse neural network having a plurality of output layers, each output layer having a neuron value, the system comprising: a plurality of processing engines (PEs) coupled in parallel in a fully connected neural network (NN),” as highlighted in exemplary claim 1, which limit the claimed invention to an embodiment neither specified nor supported by the broader recitation in the original disclosure.
Applicant alleges that support is provided by Figs. 2A-E and paragraphs 0009 and 00022. Applicant additionally states that a fully connected layer means every neuron in one layer is connected to every neuron in the next layer; see the bottom of page 10 and the top of page 11 of the remarks filed 08/17/2020.
Examiner notes that Applicant's remarks further highlight the deficiency of the original disclosure with regard to providing sufficient support for the amended claim limitation. Specifically, the original disclosure is silent regarding the configuration of the claimed processing elements in parallel in a fully connected layer or a fully connected neural network. The claimed coupling of the processing elements refers to the hardware configuration of the processing elements. While the processing elements are capable of performing neural network computations and of modeling the associated neurons and connections, the specific configuration recited in the claim amendments is not disclosed anywhere in the original disclosure.
Furthermore, the original specification discloses a compression process for pruning deep neural networks by citing the publication Han et al., "Deep Compression: Compressing Deep Neural Networks with Pruning, Trained Quantization and Huffman Coding," which discloses a process for pruning neural networks that compresses both convolutional and fully-connected layers. This is not the same as specifying a particular architecture of the parallel processing elements as recited in the amended claim limitations. One of ordinary skill in the art would understand that a pruning process is not a condition for specifying a parallel architecture. While the two elements can be combined, the original specification does not expressly support the specific combination recited in the newly amended claim language.
Thus, the limitation is considered new matter that is neither implicitly nor expressly supported by the original disclosure.
The dependent claims fail to resolve the noted deficiency and are accordingly rejected for the reasons noted above.
Claim Rejections - 35 USC § 102
The following is a quotation of the appropriate paragraphs of 35 U.S.C. 102 that form the basis for the rejections under this section made in this Office action:
A person shall be entitled to a patent unless –
(a)(1) the claimed invention was patented, described in a printed publication, or in public use, on sale, or otherwise available to the public before the effective filing date of the claimed invention.
(a)(2) the claimed invention was described in a patent issued under section 151, or in an application for patent published or deemed published under section 122(b), in which the patent or application, as the case may be, names another inventor and was effectively filed before the effective filing date of the claimed invention.
Claims 1-18, 21, and 27 are rejected under 35 U.S.C. 102(a)(1) and 102(a)(2) as being anticipated by Dally et al. (US 2018/0046900, hereinafter Dal).
Regarding independent claim 1 limitations, Dal teaches: a system for computing a sparse neural network having a plurality of output layers, each output layer having a neuron value, (in 0035-0036: Neural networks typically have significant redundancy and can be pruned dramatically during training without substantively affecting accuracy of the neural network… Eliminating weights results in a neural network with a substantial number of zero values [computing a sparse neural network having a plurality of output layers, each output layer having a neuron value], which can potentially reduce the computational requirements of inference… Since the multiplication of weights and activations is the key computation for inference, the combination of activations that are zero and weights that are zero can reduce the amount of computation required by over an order of magnitude. A sparse CNN (SCNN) [computing a sparse neural network having a plurality of output layers, each output layer having a neuron value] accelerator architecture described herein, exploits weight and/or activation sparsity to reduce energy consumption and improve processing throughput. The SCNN accelerator architecture couples an algorithmic dataflow that eliminates all multiplications with a zero operand while employing a compressed representation of both weights and activations through almost the entire computation…)
the system comprising: a plurality of processing engines (PEs) coupled in parallel in a fully connected neural network (NN); (in [0051] The SCNN 200 may be configured to implement CNN algorithms that are a cascaded set of pattern recognition filters trained with supervision. A CNN consists of a series of layers, which include convolutional layers, non-linear scalar operator layers, and layers that downsample the intermediate data, for example by pooling. The convolutional layers represent the core of the CNN computation and are characterized by a set of filters that are usually 1×1 or 3×3, and occasionally 5×5 or larger. The values of these filters are the weights that are trained using a training set for the network. Some deep neural networks (DNNs) also include fully-connected layers [a plurality of processing engines (PEs) coupled in parallel in a fully connected neural network (NN)], typically toward the end of the DNN…; And see Fig. 2A:
[Dal, Fig. 2A reproduced: block diagram of the SCNN 200]
[0044] FIG. 2A illustrates a block diagram of the SCNN 200, in accordance with one embodiment. SCNN 200 couples an algorithmic dataflow that eliminates all multiplications with a zero operand while transmitting a compact representation of weights and/or input activations between memory and logic blocks within the SCNN 200. The SCNN 200 includes a memory interface 205, layer sequencer 215, and an array of processing elements (PEs) 210. In one embodiment, the SCNN 200 is a processor and the PEs 210 are parallel processing units [a plurality of processing engines (PEs) coupled in parallel in a fully connected neural network (NN)].)
the system comprising: a plurality of processing engines (PEs) coupled in parallel in a fully connected neural network (NN); a main memory having memory cells into which are stored values for input neurons and coming weights; wherein each PE of said plurality of PEs comprises a local memory, a multiplexor, a multiplier and an integrator which are configured to receive values for a corresponding input neuron and coming weight from said main memory; (As depicted in Fig. 2A, in 0044-0046: FIG. 2A illustrates a block diagram of the SCNN 200, in accordance with one embodiment. SCNN 200 couples an algorithmic dataflow that eliminates all multiplications with a zero operand while transmitting a compact representation of weights and/or input activations between memory [claimed the system comprising: a plurality of processing engines (PEs); and a main memory having memory cells into which are stored values for input neurons and coming weights] and logic blocks within the SCNN 200. The SCNN 200 includes a memory interface 205, layer sequencer 215, and an array of processing elements (PEs) 210. In one embodiment, the SCNN 200 is a processor and the PEs 210 are parallel processing units. The memory interface 205 reads weight and activation data from a memory coupled to the SCNN 200 the memory interface 205 may also write weight and/or activation data from the SCNN 200 to the memory. … The memory may be implemented using dynamic random access memory (DRAM), or the like. In one embodiment, the memory interface 205 or the PEs 210 are configured to compact multi-bit data, such as the weights, input activations, and output activations… Each PE 210 includes a multiplier array that accepts a vector of weights (weight vector) [claimed wherein each PE of said plurality of PEs comprises a local memory, a multiplexor, a multiplier and an integrator which are configured to receive values for a corresponding input … coming weight from said main memory] and a vector of input activations (activation vector) [claimed wherein each PE of said plurality of PEs comprises a local memory, a multiplexor, a multiplier and an integrator which are configured to receive values for a corresponding input neuron … said main memory], where each multiplier within the array is configured to generate a product from one input activation value in the activation vector and one weight in the weight vector [claimed wherein each PE of said plurality of PEs comprises a local memory, a multiplexor, a multiplier and an integrator which are configured to receive a corresponding input neuron and coming weight from said main memory]…: claimed multiplexor, multiplier and integrator (e.g. accumulator), in 0138 and depicted in Figs. 3A-D)
wherein each said PE receives inputs of neuron weight, input neuron and local memory address from said main memory, with local memory address received by the local memory, and the input neuron received by the local memory and the multiplexor; … wherein output from said multiplexor is output as PE output, with this PE output also input to the multiplier which multiplies PE output and neuron weight and outputs a multiplier output, as a partial sum of the output neuron, to the integrator which integrates these partial results from the multiplier and outputs final results of the integration in which all corresponding neurons are calculated and summed; (Depicted in Figs 3A-D: And in 0072-0072: FIG. 3A illustrates a block diagram of a PE 210, in accordance with one embodiment. The PE 210 is configured to support the PTIS-sparse dataflow. Like, the PE 220 shown in FIG. 2C, the PE 210 includes a weight buffer 305, an input activations buffer 310, and an FxI multiplier array 325 [wherein each said PE receives inputs of neuron weight, input neuron and local memory address from said main memory, with local memory address received by the local memory, and the input neuron received by the local memory and the multiplexor]. Parallelism within a PE 210 is accomplished by processing a vector of F non-zero filter weights a vector of I non-zero input activations in within the FxI multiplier array 325. FxI products are generated each processing cycle by each PE 210 in the SCNN accelerator 200. In one embodiment F=I=4. In other embodiments, F and I may be any positive integer and the value of F may be greater than or less than I. The values of F and I may each be tuned to balance overall performance and circuit area. With typical density values of 30% for both weights and activations, 16 multiplies of the compressed sparse weight and input activation values is equivalent to 178 multiplies in a dense accelerator that processes weight and input activation values including zeros. The accumulator array 340 [wherein each said PE receives inputs of neuron weight, input neuron and local memory address from said main memory, with local memory address received by the local memory, and the input neuron received by the local memory and the multiplexor] may include one or more accumulation buffers and adders to store the products generated in the multiplier array 325 and sum the products into the partial sums [wherein output from said multiplexor is output as PE output, with this PE output also input to the multiplier which multiplies PE output and neuron weight and outputs a multiplier output, as a partial sum of the output neuron, to the integrator which integrates these partial results from the multiplier and outputs final results of the integration in which all corresponding neurons are calculated and summed;]. The PE 210 also includes position buffers 315 and 320, indices buffer 355, destination calculation unit 330, F*I arbitrated crossbar 335, and a postprocessing unit 345.
wherein each said PE stores the input neuron in the local memory, to be reused in computing a partial sum of the output neuron using different input weights; wherein if the local memory is initially empty, said multiplexor directly feeds the input neuron to said multiplier; (in 0048-0049: The layer sequencer 215 reads the weights and outputs weight vectors to be multiplied by the PEs 210. In one embodiment, the weights are in compact form and are read from off-chip DRAM only once and stored within the SCNN accelerator 200. In one embodiment, the layer sequencer 215 broadcasts a weight vector to each PE 210 and sequences through multiple activation vectors before broadcasting another weight vector. In one embodiment, the layer sequencer 215 broadcasts an input activation vector to each PE 210 [wherein each said PE stores the input neuron in the local memory, to be reused in computing a partial sum of the output neuron] and sequences through multiple weight vectors [using different input weights] before broadcasting another input activation vector. Products generated by the multipliers within each PE 210 are accumulated to produce intermediate values (e.g., partial sums) [claimed wherein each said PE stores the input neuron in the local memory, to be reused in computing a partial sum of the output neuron using different input weights] that become the output activations after one or more iterations. When the output activations for a neural network layer have been computed and stored in an output activation buffer, the layer sequencer 215 may proceed to process a next layer [wherein each said PE stores the input neuron in the local memory, to be reused in computing a partial sum of the output neuron using different input weights] by applying the output activations as input activations. Each PE 210 includes a multiplier array that accepts a vector of weights (weight vector) and a vector of input activations (activation vector), where each multiplier within the array is configured to generate a product from one input activation value in the activation vector and one weight in the weight vector. The weights and input activations in the vectors can all be multiplied by one another in the manner of a Cartesian product…)
wherein if the local memory is initially empty, said multiplexor directly feeds the input neuron to said multiplier; (Examiner notes that this limitation is a contingent clause that is not positively recited and does not appear to be required, because the local memory must already contain an address and other data per the previously recited limitation under the broadest reasonable interpretation.)
and wherein each said PE is configured to compute a neuron value of a corresponding output layer by multiplying the coming weight and input neuron from said neuron in sequence, generating partial results for integration, and outputting a final value. (in 0048-0049: The layer sequencer 215 reads the weights and outputs weight vectors to be multiplied by the PEs 210. In one embodiment, the weights are in compact form and are read from off-chip DRAM only once and stored within the SCNN accelerator 200. In one embodiment, the layer sequencer 215 broadcasts a weight vector to each PE 210 and sequences through multiple activation vectors before broadcasting another weight vector. In one embodiment, the layer sequencer 215 broadcasts an input activation vector to each PE 210 and sequences through multiple weight vectors before broadcasting another input activation vector. Products generated by the multipliers within each PE 210 are accumulated to produce intermediate values (e.g., partial sums) [claimed , generating partial results for integration, …] that become the output activations [claimed generating partial results for integration, and outputting a final value] after one or more iterations. When the output activations for a neural network layer have been computed and stored in an output activation buffer [claimed generating partial results for integration, and outputting a final value], the layer sequencer 215 may proceed to process a next layer by applying the output activations as input activations. Each PE 210 includes a multiplier array that accepts a vector of weights (weight vector) and a vector of input activations (activation vector), where each multiplier within the array is configured to generate a product from one input activation value in the activation vector and one weight in the weight vector. The weights and input activations in the vectors can all be multiplied by one another in the manner of a Cartesian product…)
and wherein processed neuron results from the PEs are output, along with a neuron index for each PE, to a parallel-to-serial first-in-first-out (FIFO) circuit whose neuron index and output neuron values are fed back and stored in the main memory for coming the next layer's result. (in 0083-0086: In one embodiment, the weight buffer 305 is a first-in first-out FIFO buffer (WFIFO) [and wherein processed neuron results from the PEs are output, along with a neuron index for each PE, to a parallel-to-serial first-in-first-out (FIFO) circuit whose neuron index and output neuron values are fed back and stored in the main memory for coming the next layer's result]. The weight buffer 305 should have enough storage capacity to hold all of the non-zero weights for one input channel within one tile (i.e., for the inner most nested "For" in TABLE 3). When possible, the weights and input activations are held in the weight buffer 305 and input activations buffer 310, respectively, and are never swapped out to DRAM. If the output activation volume of a neural network layer can serve as the input activation volume for the next neural network layer, then the output activations buffer 350 is logically swapped with the input activations buffer 310 between processing of the different neural network layers. Similarly, the indices buffer 355 is logically swapped with the buffer 320 between processing the different neural network layers [… to a parallel-to-serial first-in-first-out (FIFO) circuit whose neuron index and output neuron values are fed back and stored in the main memory for coming the next layer's result]… In one embodiment, the weight buffer 305 is a FIFO buffer that includes a tail pointer, a channel pointer, and a head pointer [and wherein processed neuron results from the PEs are output, along with a neuron index for each PE, to a parallel-to-serial first-in-first-out (FIFO) circuit whose neuron index and output neuron values are fed back and stored in the main memory for coming the next layer's result]]. The layer sequencer 215 controls the "input" side of the weight buffer 305, pushing weight vectors into the weight buffer 305. The tail pointer is not allowed to advance over the channel pointer. A full condition is signaled when the tail pointer will advance past the channel pointer when another write vector is stored. The buffer 315 may be implemented in the same manner as weight buffer 305 and is configured to store the positions associated with each weight vector. In one embodiment, the weight buffer 305 outputs a weight vector of F weights {w[0] ... w[F-ll} and the buffer 315 outputs the associated positions { x[0-] ... x[F-1 l}. Each position specifies r, s, and k for a weight. The output channel k is encoded relative to the tile.)
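For illustration only, and not asserted to be part of the Dal disclosure or of Applicant's claimed system, the following minimal Python sketch models the general zero-skipping multiply-accumulate dataflow summarized in the paragraphs cited above: a processing element multiplies stored non-zero weights by non-zero input activations, accumulates partial sums per output neuron, and emits (index, value) pairs in serial order. All names, shapes, and values are hypothetical.

# Illustrative sketch of a single PE: zero values are never stored, so only
# non-zero weight/activation products are computed and accumulated.
from collections import deque

def pe_compute(nonzero_weights, nonzero_inputs):
    """nonzero_weights: list of (output_index, weight) pairs.
    nonzero_inputs: list of (input_index, activation) pairs.
    Returns a FIFO (deque) of (output_index, neuron_value) results."""
    partial_sums = {}                          # accumulator keyed by output neuron index
    for out_idx, w in nonzero_weights:
        for _, a in nonzero_inputs:            # reuse each stored activation against every weight
            partial_sums[out_idx] = partial_sums.get(out_idx, 0.0) + w * a
    fifo = deque()                             # parallel-to-serial ordering of results
    for out_idx in sorted(partial_sums):
        fifo.append((out_idx, partial_sums[out_idx]))
    return fifo

# Example: two non-zero weights against two non-zero activations.
print(list(pe_compute([(0, 0.5), (3, -1.0)], [(1, 2.0), (4, 0.25)])))   # [(0, 1.125), (3, -2.25)]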
Regarding claim 2, the rejection of claim 1 is incorporated and Dal further teaches the system of claim 1, wherein the coming weights stored in the main memory are only non-zero weights. (as depicted in Fig. 1, in 0039: At step 105, a first vector comprising only non-zero weight values [claimed wherein the coming weights stored in the main memory are only non-zero weights] and first associated positions of the non-zero weight values within a three-dimensional (3D) space are received. In one embodiment, the first vector is received from a memory…)
Regarding claim 3, the rejection of claim 1 is incorporated and Dal further teaches the system of claim 1, wherein said sparse neural network is described through relative address coding. (in 0071: The PTIS-sparse technique is a natural extension of the PTIS-dense technique, with the PTIS-sparse technique exploiting sparsity in the weights and activations. The PTIS-sparse dataflow is specifically designed to operate on compressed-sparse (i.e., compacted) encodings [claimed wherein said sparse neural network is described through relative address coding] of the weights and input activations and to produce a compressedsparse encoding of the output activations… What is key is that decoding a sparse format ultimately yields a non-zero data value and a position indicating [claimed wherein said sparse neural network is described through relative address coding] the coordinates of the value in the weight or input activation matrices. In one embodiment, the position is defined by an index or an address [claimed wherein said sparse neural network is described through relative address coding], such as an address corresponding to one of the accumulation buffers 250 or adder units 255.)
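For illustration only, the following minimal Python sketch shows one generic form of relative (offset-based) address coding of a sparse weight vector, of the kind the cited paragraph describes for compressed-sparse encodings; the specific encoding format shown is hypothetical and is not asserted to be Dal's format.

# Illustrative sketch: encode a sparse vector as (zeros_skipped, value) pairs and decode it back.
def encode_relative(dense):
    encoded, zero_run = [], 0
    for v in dense:
        if v == 0:
            zero_run += 1
        else:
            encoded.append((zero_run, v))      # offset from the previous non-zero entry
            zero_run = 0
    return encoded

def decode_relative(encoded, length):
    dense, pos = [0.0] * length, -1
    for zero_run, v in encoded:
        pos += zero_run + 1                    # recover the absolute position from relative offsets
        dense[pos] = v
    return dense

weights = [0.0, 0.0, 1.5, 0.0, -0.25, 0.0, 0.0, 0.0, 2.0]
packed = encode_relative(weights)              # [(2, 1.5), (1, -0.25), (3, 2.0)]
assert decode_relative(packed, len(weights)) == weights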
Regarding claim 4, the rejection of claim 1 is incorporated and Dal further teaches the system of claim 1, wherein computation of zero in data flows of the sparse neural network is bypassed. (in 0044: FIG. 2A illustrates a block diagram of the SCNN 200, in accordance with one embodiment. SCNN 200 couples an algorithmic dataflow that eliminates all multiplications with a zero operand while transmitting a compact representation of weights and/or input activations between memory and logic blocks within the SCNN 200…)
Regarding claim 5, the rejection of claim 1 is incorporated and Dal further teaches the system of claim 1, wherein computed input neurons in each said PE are stored in PE memory for data reuse when computing a next output neuron. (in 0044-0049: … The SCNN 200 includes a memory interface 205, layer sequencer 215, and an array of processing elements (PEs) 210. In one embodiment, the SCNN 200 is a processor and the PEs 210 are parallel processing units… positions associated with the non-zero elements… The layer sequencer 215 reads the weights and outputs weight vectors to be multiplied by the PEs 210. In one embodiment, the weights are in compact form and are read from off-chip DRAM only once and stored within the SCNN accelerator 200. In one embodiment, the layer sequencer 215 broadcasts a weight vector to each PE 210 and sequences through multiple activation vectors before broadcasting another weight vector. In one embodiment, the layer sequencer 215 broadcasts an input activation vector to each PE 210 and sequences through [sequences through broadcasts inputs as claimed reuse for processing as claimed wherein computed input neurons in each said PE are stored in PE memory for data reuse when computing a next output neuron] multiple weight vectors before broadcasting another input activation vector. Products generated by the multipliers within each PE 210 are accumulated to produce intermediate values (e.g., partial sums) that become the output activations after one or more iterations…)
Regarding claim 6, the rejection of claim 5 is incorporated and Dal further teaches the system of claim 5, wherein data reuse is implemented to reduce power consumption during multiple output neurons' computation. (in 0036-0037: … Since the multiplication of weights and activations is the key computation for inference, the combination of activations that are zero and weights that are zero can reduce the amount of computation required by over an order of magnitude. A sparse CNN (SCNN) accelerator architecture described herein, exploits weight and/or activation sparsity to reduce energy consumption [claimed wherein data reuse is implemented to reduce power consumption during multiple output neurons' computation] and improve processing throughput. The SCNN accelerator architecture couples an algorithmic dataflow that eliminates all multiplications with a zero operand while employing a compressed representation of both weights and activations through almost the entire computation…; And the data reuse during sequences process of multiple output computations, in 0044-0049: … The SCNN 200 includes a memory interface 205, layer sequencer 215, and an array of processing elements (PEs) 210. In one embodiment, the SCNN 200 is a processor and the PEs 210 are parallel processing units… positions associated with the non-zero elements… The layer sequencer 215 reads the weights and outputs weight vectors to be multiplied by the PEs 210. In one embodiment, the weights are in compact form and are read from off-chip DRAM only once and stored within the SCNN accelerator 200. In one embodiment, the layer sequencer 215 broadcasts a weight vector to each PE 210 and sequences through multiple activation vectors before broadcasting another weight vector. In one embodiment, the layer sequencer 215 broadcasts an input activation vector to each PE 210 and sequences through [sequences through broadcasts inputs as claimed reuse for processing as claimed wherein data reuse is implemented… during multiple output neurons' computation] multiple weight vectors before broadcasting another input activation vector. Products generated by the multipliers within each PE 210 are accumulated to produce intermediate values (e.g., partial sums) that become the output activations [claimed wherein data reuse is implemented… during multiple output neurons' computation]after one or more iterations…; And in 0050: Importantly, only non-zero weights and input activations are transmitted to the multiplier array within each PE 210. Additionally, the input activation vectors may be reused within each PE 210 in an input stationary fashion against a number of weight vectors to reduce data accesses [claimed wherein data reuse is implemented… during multiple output neurons' computation]. The products generated by the multipliers are then summed together to generate the partial sums and the output activations [claimed wherein data reuse is implemented… during multiple output neurons' computation]….)
Regarding claim 7, the rejection of claim 5 is incorporated and Dal further teaches the system of claim 5, wherein computation of each output neuron first uses any input neuron stored in the PE memory, and if the input neuron is not stored in the PE memory, then the PE reads the neuron from the main memory. (As depicted in Fig. 2A, in 0044-0046: FIG. 2A illustrates a block diagram of the SCNN 200, in accordance with one embodiment. SCNN 200 couples an algorithmic dataflow that eliminates all multiplications with a zero operand while transmitting a compact representation of weights and/or input activations between memory and logic blocks within the SCNN 200. The SCNN 200 includes a memory interface 205, layer sequencer 215, and an array of processing elements (PEs) 210. In one embodiment, the SCNN 200 is a processor and the PEs 210 are parallel processing units. The memory interface 205 reads weight and activation data from a memory coupled to the SCNN 200 the memory interface 205 may also write weight and/or activation data from the SCNN 200 to the memory. … The memory may be implemented using dynamic random access memory (DRAM), or the like. In one embodiment, the memory interface 205 or the PEs 210 are configured to compact multi-bit data, such as the weights, input activations, and output activations… Each PE 210 includes a multiplier array that accepts a vector of weights (weight vector) [claimed wherein computation of each output neuron first uses any input neuron stored in the PE memory] and a vector of input activations (activation vector) [claimed wherein computation of each output neuron first uses any input neuron stored in the PE memory], where each multiplier within the array is configured to generate a product from one input activation value in the activation vector and one weight in the weight vector [claimed wherein computation of each output neuron first uses any input neuron stored in the PE memory]…)
Regarding claim 8, the rejection of claim 7 is incorporated and Dal further teaches the system of claim 7, wherein seldom-used stored input neurons are replaced with frequently-used input neurons. (in 0052: Sparsity in a layer of a CNN is defined as the fraction of zeros in the layer's weight and input activation matrices. The primary technique for creating weight sparsity is to prune the network during training. In one embodiment, any weight with an absolute value that is close to zero ( e.g. below a defined threshold) is set to zero [claimed wherein seldom-used stored input neurons are replaced with frequently-used input neurons.]. The pruning process has the effect of removing weights from the filters, and sometimes even forcing an output activation to always equal zero. The remaining network may be retrained, to regain the accuracy lost through naive pruning. The result is a smaller network with accuracy extremely close to the original network [replaces neurons with those with low weight values as claimed wherein seldom-used stored input neurons are replaced with frequently-used input neurons.]. The process can be iteratively repeated to reduce network size while maintaining accuracy.)
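For illustration only, the following minimal Python sketch shows generic magnitude-based pruning of the kind described in the cited paragraph, in which weights whose absolute value falls below a defined threshold are set to zero; the threshold value and matrix are hypothetical.

# Illustrative sketch: zero out near-zero weights to create weight sparsity.
def prune(weights, threshold=0.1):
    return [[0.0 if abs(w) < threshold else w for w in row] for row in weights]

print(prune([[0.02, 0.8], [-0.05, -0.4]]))     # [[0.0, 0.8], [0.0, -0.4]]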
Regarding independent claim 9 limitations, Dal teaches: a system for computing where the limitations are similar to claim 1 and thus considered rejected under the same rationale. Additionally, Dal teaches wherein a number of neuron outputs are used in determining an intermediate neuron which is then used with the remaining neurons for outputting the final value, toward reducing local memory storage requirements; (in As depicted in Fig 2A and Fig. 2B and in [0056] FIG. 2B ill illustrates input activations, weights, and output activations for a single CNN layer [wherein a number of neuron outputs are used in determining an intermediate neuron which is then used with the remaining neurons for outputting the final value comprising the determined neural for processing the final output of the neural network sequence of layers], in accordance with one embodiment. The set of computations for the complete layer [wherein a number of neuron outputs are used in determining an intermediate neuron which is then used with the remaining neurons for outputting the final value] can be formulated as a loop nest over the seven variables (N, K, C, W, H, R, and S). Because multiply-add operations are associative (modulo rounding errors, which are ignored in the context of the following description), all permutations of the seven loop variables are legal. TABLE 1 shows an example loop nest based on one such permutation. The nest may be concisely described as N.fwdarw.K.fwdarw.C.fwdarw.W.fwdarw.H.fwdarw.R.fwdarw.S. Each point in the seven-dimensional space formed from the variables represents a single multiply-accumulate operation. Note that for the remainder of the description, a batch size of 1 is assumed, which is a common batch size for inferencing tasks. And in [0112] To increase parallelism beyond a single PE 210, multiple PEs 210 can be operated in parallel with each working on a disjoint three-dimensional tile of input activations. Because of the end-to-end compression of activations, both the input and output activations of each tile may be stored local to the PE 210 that processes the tile, further reducing energy-hungry data transmission […toward reducing local memory storage requirements]. Overall, the SCNN accelerator 200 provides efficient compressed storage […toward reducing local memory storage requirements] and delivery of input operands to the F×I multiplier array 325, high reuse of the input operands in the F×I multiplier array 325, and that spends no processing cycles on multiplications with zero operands.)
Regarding claims 10-14, the limitations are similar to those in claims 4-8, respectively, and are rejected under the same rationale.
Regarding independent claim 15, Dal teaches a system for computing in which the limitations are similar to those of claim 1; claim 15 is thus rejected under the same rationale.
Regarding claims 16, 17, and 18, the limitations are similar to those in claims 7, 8, and 6, respectively, and are rejected under the same rationale.
Regarding claim 21, the rejection of claim 1 is incorporated, Dal further teaches the apparatus of claim 1, wherein said integrator integrates partial results from the multiplier to generate an intermediate neuron as an additional input neuron to compute the final output together with other un-computed neurons. Dal teaches in 0050-0051: … A CNN consists of a series of layers, which include convolutional layers, nonlinear scalar operator layers, and layers that downsample the intermediate data, for example by pooling. The convolutional layers [partial results from the multiplier to generate an intermediate neuron as an additional input neuron to compute the final output together] represent the core of the CNN computation and are characterized by a set of filters that are usually 1 x 1 or 3x3, and occasionally 5x5 or larger. The values of these filters are the weights that are trained using a training set for the network. Some deep neural networks (DNNs) also include fully-connected layers, typically toward the end of the DNN [with other un-computed neurons]. During classification, a new image (in the case of image recognition) is presented to the neural network, which classifies images into the training categories by computing [wherein said integrator integrates partial results from the multiplier to generate an intermediate neuron as an additional input neuron to compute the final output together with other un-computed neurons] in succession each of the layers in the neural network.; And in 0067: The multiplier outputs (e.g., products) are sent to the accumulation unit 245 [wherein said integrator integrates partial results from the multiplier to generate an intermediate neuron as an additional input neuron to compute the final output together with other un-computed neurons], which updates the partial sums [partial results from the multiplier to generate an intermediate neuron as an additional input neuron to compute the final output together] stored in the accumulation buffer 250. Each product is accumulated with a partial sum at the output coordinates in the output activation space that matches (i.e., equals) a position associated with the product. The output positions for the products are computed in parallel with the products (not shown in FIG. 2C). In one embodiment, coordinates defining the output positions are computed by a state machine in the accumulation unit 245. The number of adders in the adder unit 255 does not necessarily equal the number of multipliers in the FxI multiplier array 240…; And in 0103: The energy of accessing the accumulator array 340 may be reduced by combining products associated with the same output position. In one embodiment, to maximize the probability of combining, products are buffered at the accumulator units 368 in a combining buffer (e.g., a FIFO with 8 entries) and the products are only accumulated into the partial sum when the combining buffer becomes full [other un-computed neurons]. 
Addresses of arriving products [wherein said integrator integrates partial results from the multiplier to generate an intermediate neuron as an additional input neuron to compute the final output together with other un-computed neurons] are compared to entries in the combining buffer [with other un-computed neurons; as entries in the buffer that is not associated with a current arriving product] and when an address of an arriving product matches the address of a stored product [with other un-computed neurons; as entries in the buffer that is not associated with a current arriving product], the arriving product is summed with the stored product [wherein said integrator integrates partial results from the multiplier to generate an intermediate neuron as an additional input neuron to compute the final output together with other un-computed neurons]…)
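For illustration only, the following minimal Python sketch models the general behavior of a combining buffer as summarized in the cited paragraphs: products arriving with an address that matches a buffered product are summed in place, and the buffer is flushed into the accumulator when full. The buffer depth and data structures shown are hypothetical.

# Illustrative sketch: combine products that target the same output address before accumulation.
def combine_products(products, accumulator, depth=8):
    """products: iterable of (address, value); accumulator: dict mapping address to partial sum."""
    buffer = {}                                # address -> combined value
    for addr, val in products:
        if addr in buffer:                     # address match: combine with the stored product
            buffer[addr] += val
        else:
            if len(buffer) == depth:           # buffer full: flush into the accumulator
                for a, v in buffer.items():
                    accumulator[a] = accumulator.get(a, 0.0) + v
                buffer.clear()
            buffer[addr] = val
    for a, v in buffer.items():                # drain remaining entries
        accumulator[a] = accumulator.get(a, 0.0) + v
    return accumulator

print(combine_products([(5, 1.0), (5, 2.0), (9, 0.5)], {}))   # {5: 3.0, 9: 0.5}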
Regarding claim 27, the rejection of claim 15 is incorporated, and the claim limitations are similar to claim 21 limitations and are rejected under the same rationale.
Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.
Claims 1-18, 21, and 27 are rejected under 35 U.S.C. 103 as being unpatentable over Dally et al. (US 2018/0046900, hereinafter Dal) in view of Pu et al. (US 20190073585, hereinafter ‘Pu’).
Regarding independent claim 1 limitations, Dal teaches: a system for computing a sparse neural network having a plurality of output layers, each output layer having a neuron value, (in 0035-0036: Neural networks typically have significant redundancy and can be pruned dramatically during training without substantively affecting accuracy of the neural network… Eliminating weights results in a neural network with a substantial number of zero values [computing a sparse neural network having a plurality of output layers, each output layer having a neuron value], which can potentially reduce the computational requirements of inference… Since the multiplication of weights and activations is the key computation for inference, the combination of activations that are zero and weights that are zero can reduce the amount of computation required by over an order of magnitude. A sparse CNN (SCNN) [computing a sparse neural network having a plurality of output layers, each output layer having a neuron value] accelerator architecture described herein, exploits weight and/or activation sparsity to reduce energy consumption and improve processing throughput. The SCNN accelerator architecture couples an algorithmic dataflow that eliminates all multiplications with a zero operand while employing a compressed representation of both weights and activations through almost the entire computation…)
the system comprising: a plurality of processing engines (PEs) coupled in parallel in a fully connected neural network (NN); (in [0051] The SCNN 200 may be configured to implement CNN algorithms that are a cascaded set of pattern recognition filters trained with supervision. A CNN consists of a series of layers, which include convolutional layers, non-linear scalar operator layers, and layers that downsample the intermediate data, for example by pooling. The convolutional layers represent the core of the CNN computation and are characterized by a set of filters that are usually 1×1 or 3×3, and occasionally 5×5 or larger. The values of these filters are the weights that are trained using a training set for the network. Some deep neural networks (DNNs) also include fully-connected layers [a plurality of processing engines (PEs) coupled in parallel in a fully connected neural network (NN)], typically toward the end of the DNN…; And see Fig. 2A:
[Dal, Fig. 2A reproduced: block diagram of the SCNN 200]
[0044] FIG. 2A illustrates a block diagram of the SCNN 200, in accordance with one embodiment. SCNN 200 couples an algorithmic dataflow that eliminates all multiplications with a zero operand while transmitting a compact representation of weights and/or input activations between memory and logic blocks within the SCNN 200. The SCNN 200 includes a memory interface 205, layer sequencer 215, and an array of processing elements (PEs) 210. In one embodiment, the SCNN 200 is a processor and the PEs 210 are parallel processing units [a plurality of processing engines (PEs) coupled in parallel in a fully connected neural network (NN)].)
the system comprising: a plurality of processing engines (PEs) coupled in parallel in a fully connected neural network (NN); a main memory having memory cells into which are stored values for input neurons and coming weights; wherein each PE of said plurality of PEs comprises a local memory, a multiplexor, a multiplier and an integrator which are configured to receive values for a corresponding input neuron and coming weight from said main memory; (As depicted in Fig. 2A, in 0044-0046: FIG. 2A illustrates a block diagram of the SCNN 200, in accordance with one embodiment. SCNN 200 couples an algorithmic dataflow that eliminates all multiplications with a zero operand while transmitting a compact representation of weights and/or input activations between memory [claimed the system comprising: a plurality of processing engines (PEs); and a main memory having memory cells into which are stored values for input neurons and coming weights] and logic blocks within the SCNN 200. The SCNN 200 includes a memory interface 205, layer sequencer 215, and an array of processing elements (PEs) 210. In one embodiment, the SCNN 200 is a processor and the PEs 210 are parallel processing units. The memory interface 205 reads weight and activation data from a memory coupled to the SCNN 200 the memory interface 205 may also write weight and/or activation data from the SCNN 200 to the memory. … The memory may be implemented using dynamic random access memory (DRAM), or the like. In one embodiment, the memory interface 205 or the PEs 210 are configured to compact multi-bit data, such as the weights, input activations, and output activations… Each PE 210 includes a multiplier array that accepts a vector of weights (weight vector) [claimed wherein each PE of said plurality of PEs comprises a local memory, a multiplexor, a multiplier and an integrator which are configured to receive values for a corresponding input … coming weight from said main memory] and a vector of input activations (activation vector) [claimed wherein each PE of said plurality of PEs comprises a local memory, a multiplexor, a multiplier and an integrator which are configured to receive values for a corresponding input neuron … said main memory], where each multiplier within the array is configured to generate a product from one input activation value in the activation vector and one weight in the weight vector [claimed wherein each PE of said plurality of PEs comprises a local memory, a multiplexor, a multiplier and an integrator which are configured to receive a corresponding input neuron and coming weight from said main memory]…: claimed multiplexor, multiplier and integrator (e.g. accumulator), in 0138 and depicted in Figs. 3A-D)
wherein each said PE receives inputs of neuron weight, input neuron and local memory address from said main memory, with local memory address received by the local memory, and the input neuron received by the local memory and the multiplexor; … wherein output from said multiplexor is output as PE output, with this PE output also input to the multiplier which multiplies PE output and neuron weight and outputs a multiplier output, as a partial sum of the output neuron, to the integrator which integrates these partial results from the multiplier and outputs final results of the integration in which all corresponding neurons are calculated and summed; (Depicted in Figs 3A-D: And in 0072-0072: FIG. 3A illustrates a block diagram of a PE 210, in accordance with one embodiment. The PE 210 is configured to support the PTIS-sparse dataflow. Like, the PE 220 shown in FIG. 2C, the PE 210 includes a weight buffer 305, an input activations buffer 310, and an FxI multiplier array 325 [wherein each said PE receives inputs of neuron weight, input neuron and local memory address from said main memory, with local memory address received by the local memory, and the input neuron received by the local memory and the multiplexor]. Parallelism within a PE 210 is accomplished by processing a vector of F non-zero filter weights a vector of I non-zero input activations in within the FxI multiplier array 325. FxI products are generated each processing cycle by each PE 210 in the SCNN accelerator 200. In one embodiment F=I=4. In other embodiments, F and I may be any positive integer and the value of F may be greater than or less than I. The values of F and I may each be tuned to balance overall performance and circuit area. With typical density values of 30% for both weights and activations, 16 multiplies of the compressed sparse weight and input activation values is equivalent to 178 multiplies in a dense accelerator that processes weight and input activation values including zeros. The accumulator array 340 [wherein each said PE receives inputs of neuron weight, input neuron and local memory address from said main memory, with local memory address received by the local memory, and the input neuron received by the local memory and the multiplexor] may include one or more accumulation buffers and adders to store the products generated in the multiplier array 325 and sum the products into the partial sums [wherein output from said multiplexor is output as PE output, with this PE output also input to the multiplier which multiplies PE output and neuron weight and outputs a multiplier output, as a partial sum of the output neuron, to the integrator which integrates these partial results from the multiplier and outputs final results of the integration in which all corresponding neurons are calculated and summed;]. The PE 210 also includes position buffers 315 and 320, indices buffer 355, destination calculation unit 330, F*I arbitrated crossbar 335, and a postprocessing unit 345.
wherein each said PE stores the input neuron in the local memory, to be reused in computing a partial sum of the output neuron using different input weights; wherein if the local memory is initially empty, said multiplexor directly feeds the input neuron to said multiplier; (in 0048-0049: The layer sequencer 215 reads the weights and outputs weight vectors to be multiplied by the PEs 210. In one embodiment, the weights are in compact form and are read from off-chip DRAM only once and stored within the SCNN accelerator 200. In one embodiment, the layer sequencer 215 broadcasts a weight vector to each PE 210 and sequences through multiple activation vectors before broadcasting another weight vector. In one embodiment, the layer sequencer 215 broadcasts an input activation vector to each PE 210 [wherein each said PE stores the input neuron in the local memory, to be reused in computing a partial sum of the output neuron] and sequences through multiple weight vectors [using different input weights] before broadcasting another input activation vector. Products generated by the multipliers within each PE 210 are accumulated to produce intermediate values (e.g., partial sums) [claimed wherein each said PE stores the input neuron in the local memory, to be reused in computing a partial sum of the output neuron using different input weights] that become the output activations after one or more iterations. When the output activations for a neural network layer have been computed and stored in an output activation buffer, the layer sequencer 215 may proceed to process a next layer [wherein each said PE stores the input neuron in the local memory, to be reused in computing a partial sum of the output neuron using different input weights] by applying the output activations as input activations. Each PE 210 includes a multiplier array that accepts a vector of weights (weight vector) and a vector of input activations (activation vector), where each multiplier within the array is configured to generate a product from one input activation value in the activation vector and one weight in the weight vector. The weights and input activations in the vectors can all be multiplied by one another in the manner of a Cartesian product…)
wherein if the local memory is initially empty, said multiplexor directly feeds the input neuron to said multiplier; (Examiner notes that this limitation is a contingent clause that is not positively recited and does not appear to be required, because the local memory must already contain an address and other data per the previously recited limitation under the broadest reasonable interpretation.)
and wherein each said PE is configured to compute a neuron value of a corresponding output layer by multiplying the coming weight and input neuron from said neuron in sequence, generating partial results for integration, and outputting a final value. (in 0048-0049: The layer sequencer 215 reads the weights and outputs weight vectors to be multiplied by the PEs 210. In one embodiment, the weights are in compact form and are read from off-chip DRAM only once and stored within the SCNN accelerator 200. In one embodiment, the layer sequencer 215 broadcasts a weight vector to each PE 210 and sequences through multiple activation vectors before broadcasting another weight vector. In one embodiment, the layer sequencer 215 broadcasts an input activation vector to each PE 210 and sequences through multiple weight vectors before broadcasting another input activation vector. Products generated by the multipliers within each PE 210 are accumulated to produce intermediate values (e.g., partial sums) [claimed , generating partial results for integration, …] that become the output activations [claimed generating partial results for integration, and outputting a final value] after one or more iterations. When the output activations for a neural network layer have been computed and stored in an output activation buffer [claimed generating partial results for integration, and outputting a final value], the layer sequencer 215 may proceed to process a next layer by applying the output activations as input activations. Each PE 210 includes a multiplier array that accepts a vector of weights (weight vector) and a vector of input activations (activation vector), where each multiplier within the array is configured to generate a product from one input activation value in the activation vector and one weight in the weight vector. The weights and input activations in the vectors can all be multiplied by one another in the manner of a Cartesian product…)
and wherein processed neuron results from the PEs are output, along with a neuron index for each PE, to a parallel-to-serial first-in-first-out (FIFO) circuit whose neuron index and output neuron values are fed back and stored in the main memory for coming the next layer's result. (in 0083-0086: In one embodiment, the weight buffer 305 is a first-in first-out FIFO buffer (WFIFO) [and wherein processed neuron results from the PEs are output, along with a neuron index for each PE, to a parallel-to-serial first-in-first-out (FIFO) circuit whose neuron index and output neuron values are fed back and stored in the main memory for coming the next layer's result]. The weight buffer 305 should have enough storage capacity to hold all of the non-zero weights for one input channel within one tile (i.e., for the inner most nested "For" in TABLE 3). When possible, the weights and input activations are held in the weight buffer 305 and input activations buffer 310, respectively, and are never swapped out to DRAM. If the output activation volume of a neural network layer can serve as the input activation volume for the next neural network layer, then the output activations buffer 350 is logically swapped with the input activations buffer 310 between processing of the different neural network layers. Similarly, the indices buffer 355 is logically swapped with the buffer 320 between processing the different neural network layers [… to a parallel-to-serial first-in-first-out (FIFO) circuit whose neuron index and output neuron values are fed back and stored in the main memory for coming the next layer's result]… In one embodiment, the weight buffer 305 is a FIFO buffer that includes a tail pointer, a channel pointer, and a head pointer [and wherein processed neuron results from the PEs are output, along with a neuron index for each PE, to a parallel-to-serial first-in-first-out (FIFO) circuit whose neuron index and output neuron values are fed back and stored in the main memory for coming the next layer's result]]. The layer sequencer 215 controls the "input" side of the weight buffer 305, pushing weight vectors into the weight buffer 305. The tail pointer is not allowed to advance over the channel pointer. A full condition is signaled when the tail pointer will advance past the channel pointer when another write vector is stored. The buffer 315 may be implemented in the same manner as weight buffer 305 and is configured to store the positions associated with each weight vector. In one embodiment, the weight buffer 305 outputs a weight vector of F weights {w[0] ... w[F-ll} and the buffer 315 outputs the associated positions { x[0-] ... x[F-1 l}. Each position specifies r, s, and k for a weight. The output channel k is encoded relative to the tile.)
Additionally, Pu teaches the system comprising: a plurality of processing engines (PEs) coupled in parallel in a fully connected neural network (NN); (in [0038] The connections between layers of a neural network may be fully connected or locally connected. FIG. 2A illustrates an example of a fully connected neural network 202. In a fully connected neural network 202 [the system comprising: a plurality of processing engines (PEs) coupled in parallel in a fully connected neural network (NN)], a neuron in a first layer may communicate its output to every neuron in a second layer, so that each neuron in the second layer will receive input from every neuron in the first layer. FIG. 2B illustrates an example of a locally connected neural network 204. In a locally connected neural network 204, a neuron in a first layer may be connected to a limited number of neurons in the second layer. More generally, a locally connected layer of the locally connected neural network 204 may be configured so that each neuron in a layer will have the same or a similar connectivity pattern, but with connections strengths that may have different values (e.g., 210, 212, 214, and 216)…. [0051] The processing of each layer of a convolutional network may be considered a spatially invariant template or basis projection. If the input is first decomposed into multiple channels, such as the red, green, and blue channels of a color image, then the convolutional network trained on that input may be considered three-dimensional (3D), with two spatial dimensions along the axes of the image and a third dimension capturing color information. And in [0056] FIG. 7 is a block diagram illustrating a three-dimensional (3D) asynchronous network-on-chip (ANOC) including a multi-tier, multi-core ultra-low power neuromorphic accelerator [the system comprising: a plurality of processing engines (PEs) coupled in parallel in a fully connected neural network (NN)], in accordance with certain aspects of the present disclosure. A 3D ANOC accelerator 700 includes multiple tiers 702 (702-1, . . . , 702-N) in a homogeneous configuration that are stacked to reduce space consumption. In this homogeneous configuration, each of the multiple tiers 702 includes multiple cores 720. For example, each of the multiple cores 720 includes a processing element (PE) 730 [the system comprising: a plurality of processing engines (PEs) coupled in parallel in a fully connected neural network (NN)] (e.g., a neuron), a local power manager 740 (e.g., power management integrated circuit (PMIC)), a communications module 750, and a memory 760 (e.g., synapses).)
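For illustration only, the following minimal Python sketch shows a fully connected layer in the sense described in Pu paragraph 0038, where every neuron in one layer contributes to every neuron in the next; the matrix shapes and values are hypothetical.

# Illustrative sketch: each output neuron receives input from every input neuron.
def fully_connected(inputs, weights, bias):
    """inputs: length-N list; weights: M x N matrix; bias: length-M list; returns length-M outputs."""
    return [sum(w * x for w, x in zip(row, inputs)) + b for row, b in zip(weights, bias)]

print(fully_connected([1.0, -2.0], [[0.5, 0.25], [1.0, 1.0]], [0.0, 0.1]))   # [0.0, -0.9]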
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the teachings of the prior art for performing neural network operations using neuromorphic artificial intelligence computing accelerators as disclosed by Pu with the teachings of the prior art for performing neural network operations using high performance circuit architecture and parallel processing techniques as disclosed by Dal.
One of ordinary skill in the art would have been motivated to combine the methods disclosed by Pu and Dal as noted above. Doing so has the advantage of implementing accelerators to achieve performance improvements at reduced power within a smaller footprint than conventional processes (Pu, 0028).
Regarding independent claim 9 limitations, Dal teaches: a system for computing where the limitations are similar to claim 1 and thus considered rejected under the same rationale. Additionally, Dal teaches wherein a number of neuron outputs are used in determining an intermediate neuron which is then used with the remaining neurons for outputting the final value, toward reducing local memory storage requirements; (As depicted in Fig. 2A and Fig. 2B and in [0056]: FIG. 2B illustrates input activations, weights, and output activations for a single CNN layer [wherein a number of neuron outputs are used in determining an intermediate neuron which is then used with the remaining neurons for outputting the final value comprising the determined neuron for processing the final output of the neural network sequence of layers], in accordance with one embodiment. The set of computations for the complete layer [wherein a number of neuron outputs are used in determining an intermediate neuron which is then used with the remaining neurons for outputting the final value] can be formulated as a loop nest over the seven variables (N, K, C, W, H, R, and S). Because multiply-add operations are associative (modulo rounding errors, which are ignored in the context of the following description), all permutations of the seven loop variables are legal. TABLE 1 shows an example loop nest based on one such permutation. The nest may be concisely described as N→K→C→W→H→R→S. Each point in the seven-dimensional space formed from the variables represents a single multiply-accumulate operation. Note that for the remainder of the description, a batch size of 1 is assumed, which is a common batch size for inferencing tasks. And in [0112] To increase parallelism beyond a single PE 210, multiple PEs 210 can be operated in parallel with each working on a disjoint three-dimensional tile of input activations. Because of the end-to-end compression of activations, both the input and output activations of each tile may be stored local to the PE 210 that processes the tile, further reducing energy-hungry data transmission […toward reducing local memory storage requirements]. Overall, the SCNN accelerator 200 provides efficient compressed storage […toward reducing local memory storage requirements] and delivery of input operands to the F×I multiplier array 325, high reuse of the input operands in the F×I multiplier array 325, and that spends no processing cycles on multiplications with zero operands.)
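For illustration only, the following minimal Python sketch (not taken from Dal; the function names, the PE count, and the tile split along the width axis are illustrative assumptions) shows the idea of operating multiple PEs in parallel on disjoint tiles of the input activation volume, with each tile stored local to the PE that processes it.

```python
import numpy as np

PE_COUNT = 4  # illustrative number of processing elements

def split_into_tiles(activations, pe_count):
    """Partition an input activation volume (C, H, W) into disjoint
    tiles along the width axis, one tile per PE."""
    return np.array_split(activations, pe_count, axis=2)

def pe_local_compute(tile, weight):
    """Each PE operates only on its locally stored tile; a single scalar
    weight is applied here as a stand-in for the per-PE computation."""
    return tile * weight

activations = np.random.rand(3, 8, 16)           # C x H x W input activation volume
tiles = split_into_tiles(activations, PE_COUNT)  # disjoint tiles, one kept local per PE
outputs = [pe_local_compute(t, 0.5) for t in tiles]
```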
Additionally, Pu teaches the system comprising: a plurality of processing engines (PEs) coupled in parallel in a fully connected neural network (NN); … (in [0038] The connections between layers of a neural network may be fully connected or locally connected. FIG. 2A illustrates an example of a fully connected neural network 202. In a fully connected neural network 202 [the system comprising: a plurality of processing engines (PEs) coupled in parallel in a fully connected neural network (NN)], a neuron in a first layer may communicate its output to every neuron in a second layer, so that each neuron in the second layer will receive input from every neuron in the first layer. FIG. 2B illustrates an example of a locally connected neural network 204. In a locally connected neural network 204, a neuron in a first layer may be connected to a limited number of neurons in the second layer. More generally, a locally connected layer of the locally connected neural network 204 may be configured so that each neuron in a layer will have the same or a similar connectivity pattern, but with connections strengths that may have different values (e.g., 210, 212, 214, and 216)…. [0051] The processing of each layer of a convolutional network may be considered a spatially invariant template or basis projection. If the input is first decomposed into multiple channels, such as the red, green, and blue channels of a color image, then the convolutional network trained on that input may be considered three-dimensional (3D), with two spatial dimensions along the axes of the image and a third dimension capturing color information. And in [0056] FIG. 7 is a block diagram illustrating a three-dimensional (3D) asynchronous network-on-chip (ANOC) including a multi-tier, multi-core ultra-low power neuromorphic accelerator [the system comprising: a plurality of processing engines (PEs) coupled in parallel in a fully connected neural network (NN)], in accordance with certain aspects of the present disclosure. A 3D ANOC accelerator 700 includes multiple tiers 702 (702-1, . . . , 702-N) in a homogeneous configuration that are stacked to reduce space consumption. In this homogeneous configuration, each of the multiple tiers 702 includes multiple cores 720. For example, each of the multiple cores 720 includes a processing element (PE) 730 [the system comprising: a plurality of processing engines (PEs) coupled in parallel in a fully connected neural network (NN)] (e.g., a neuron), a local power manager 740 (e.g., power management integrated circuit (PMIC)), a communications module 750, and a memory 760 (e.g., synapses).)
… wherein a number of neuron outputs are used in determining an intermediate neuron which is then used with the remaining neurons for outputting the final value, toward reducing local memory storage requirements; (As depicted in Fig. 3 and in[0043] In the example of FIG. 3, the second set of feature maps 320 is convolved to generate a first feature vector 324 [wherein a number of neuron outputs are used in determining an intermediate neuron which is then used with the remaining neurons for outputting the final value, toward reducing local memory storage requirements]. Furthermore, the first feature vector 324 is further convolved to generate a second feature vector 328. Each feature of the second feature vector 328 may include a number that corresponds to a possible feature of the image 326, such as “sign,” “60,” and “100.” A softmax function (not shown) may convert the numbers in the second feature vector 328 to a probability. As such, an output 322 of the DCN 300 is a probability of the image 326 including one or more features; And in [0056] FIG. 7 is a block diagram illustrating a three-dimensional (3D) asynchronous network-on-chip (ANOC) including a multi-tier, multi-core ultra-low power neuromorphic accelerator, in accordance with certain aspects of the present disclosure. A 3D ANOC accelerator 700 includes multiple tiers 702 (702-1, . . . , 702-N) in a homogeneous configuration that are stacked to reduce space consumption [toward reducing local memory storage requirements;]. In this homogeneous configuration, each of the multiple tiers 702 includes multiple cores 720. For example, each of the multiple cores 720 includes a processing element (PE) 730 [wherein a number of neuron outputs are used in determining an intermediate neuron which is then used with the remaining neurons for outputting the final value, toward reducing local memory storage requirements] (e.g., a neuron), a local power manager 740 (e.g., power management integrated circuit (PMIC)), a communications module 750, and a memory 760 (e.g., synapses).)
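For illustration only, a minimal Python sketch of the softmax step that Pu describes at [0043], in which the numbers in a feature vector are converted to a probability, is provided below; the scores are made up for the sketch.

```python
import numpy as np

def softmax(v):
    """Convert raw feature scores to a probability distribution."""
    e = np.exp(v - np.max(v))  # subtract the max for numerical stability
    return e / e.sum()

# Illustrative second feature vector with scores for possible features
# such as "sign", "60", and "100" (values are made up for this sketch).
feature_vector = np.array([2.0, 0.5, -1.0])
probabilities = softmax(feature_vector)  # entries are non-negative and sum to 1.0
```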
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the teachings of the prior art for performing neural network operations using neuromorphic artificial intelligence computing accelerators as disclosed by Pu with the teachings of the prior art for performing neural network operations using high performance circuit architecture and parallel processing techniques as disclosed by Dal.
One of ordinary skill in the art would have been motivated to combine the methods disclosed by Pu and Dal as noted above. Doing so has the advantage of implementing accelerators to achieve performance improvements at reduced power within a smaller footprint than conventional processes (Pu, 0028).
Claims 2-8, 10-18, 21, and 27 are rejected under the same rationale noted above, as supported by the cited teachings in Dal.
Claims 1-18, 21, and 27 are rejected under 35 U.S.C. 103 as being unpatentable over Dally et al. (US 2018/0046900, hereinafter ‘Dal’) in view of Venkataramani et al. (US 20190303743, hereinafter ‘Ven’).
Regarding independent claim 1 limitations, Dal teaches: a system for computing a sparse neural network having a plurality of output layers, each output layer having a neuron value, (in 0035-0036: Neural networks typically have significant redundancy and can be pruned dramatically during training without substantively affecting accuracy of the neural network… Eliminating weights results in a neural network with a substantial number of zero values [computing a sparse neural network having a plurality of output layers, each output layer having a neuron value], which can potentially reduce the computational requirements of inference… Since the multiplication of weights and activations is the key computation for inference, the combination of activations that are zero and weights that are zero can reduce the amount of computation required by over an order of magnitude. A sparse CNN (SCNN) [computing a sparse neural network having a plurality of output layers, each output layer having a neuron value] accelerator architecture described herein, exploits weight and/or activation sparsity to reduce energy consumption and improve processing throughput. The SCNN accelerator architecture couples an algorithmic dataflow that eliminates all multiplications with a zero operand while employing a compressed representation of both weights and activations through almost the entire computation…)
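For illustration only, the following minimal Python sketch (not taken from Dal; the weight and activation values are illustrative) captures the idea of exploiting weight and activation sparsity by skipping every multiplication that has a zero operand.

```python
def sparse_dot(weights, activations):
    """Accumulate only the products whose weight and activation are both
    non-zero, mirroring the elimination of multiplications with a zero
    operand in a pruned (sparse) network."""
    total = 0.0
    for w, a in zip(weights, activations):
        if w != 0.0 and a != 0.0:  # skip any product with a zero operand
            total += w * a
    return total

# Pruned weight vector with many zeros and a sparse activation vector.
weights     = [0.0, 0.3, 0.0, 0.0, -0.7, 0.0]
activations = [1.2, 0.0, 0.0, 0.5,  2.0, 0.0]
neuron_value = sparse_dot(weights, activations)  # only one product (-0.7 * 2.0) is computed
```

Because most operand pairs in a pruned network contain at least one zero, most products are never computed, which is the source of the energy and throughput benefit described in the cited passage.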
the system comprising: a plurality of processing engines (PEs) coupled in parallel in a fully connected neural network (NN); (in [0051] The SCNN 200 may be configured to implement CNN algorithms that are a cascaded set of pattern recognition filters trained with supervision. A CNN consists of a series of layers, which include convolutional layers, non-linear scalar operator layers, and layers that downsample the intermediate data, for example by pooling. The convolutional layers represent the core of the CNN computation and are characterized by a set of filters that are usually 1×1 or 3×3, and occasionally 5×5 or larger. The values of these filters are the weights that are trained using a training set for the network. Some deep neural networks (DNNs) also include fully-connected layers [a plurality of processing engines (PEs) coupled in parallel in a fully connected neural network (NN)], typically toward the end of the DNN…; And see Fig. 2A:
[Fig. 2A of Dal: block diagram of the SCNN 200]
[0044] FIG. 2A illustrates a block diagram of the SCNN 200, in accordance with one embodiment. SCNN 200 couples an algorithmic dataflow that eliminates all multiplications with a zero operand while transmitting a compact representation of weights and/or input activations between memory and logic blocks within the SCNN 200. The SCNN 200 includes a memory interface 205, layer sequencer 215, and an array of processing elements (PEs) 210. In one embodiment, the SCNN 200 is a processor and the PEs 210 are parallel processing units [a plurality of processing engines (PEs) coupled in parallel in a fully connected neural network (NN)].)
the system comprising: a plurality of processing engines (PEs) coupled in parallel in a fully connected neural network (NN); a main memory having memory cells into which are stored values for input neurons and coming weights; wherein each PE of said plurality of PEs comprises a local memory, a multiplexor, a multiplier and an integrator which are configured to receive values for a corresponding input neuron and coming weight from said main memory; (As depicted in Fig. 2A, in 0044-0046: FIG. 2A illustrates a block diagram of the SCNN 200, in accordance with one embodiment. SCNN 200 couples an algorithmic dataflow that eliminates all multiplications with a zero operand while transmitting a compact representation of weights and/or input activations between memory [claimed the system comprising: a plurality of processing engines (PEs); and a main memory having memory cells into which are stored values for input neurons and coming weights] and logic blocks within the SCNN 200. The SCNN 200 includes a memory interface 205, layer sequencer 215, and an array of processing elements (PEs) 210. In one embodiment, the SCNN 200 is a processor and the PEs 210 are parallel processing units. The memory interface 205 reads weight and activation data from a memory coupled to the SCNN 200 the memory interface 205 may also write weight and/or activation data from the SCNN 200 to the memory. … The memory may be implemented using dynamic random access memory (DRAM), or the like. In one embodiment, the memory interface 205 or the PEs 210 are configured to compact multi-bit data, such as the weights, input activations, and output activations… Each PE 210 includes a multiplier array that accepts a vector of weights (weight vector) [claimed wherein each PE of said plurality of PEs comprises a local memory, a multiplexor, a multiplier and an integrator which are configured to receive values for a corresponding input … coming weight from said main memory] and a vector of input activations (activation vector) [claimed wherein each PE of said plurality of PEs comprises a local memory, a multiplexor, a multiplier and an integrator which are configured to receive values for a corresponding input neuron … said main memory], where each multiplier within the array is configured to generate a product from one input activation value in the activation vector and one weight in the weight vector [claimed wherein each PE of said plurality of PEs comprises a local memory, a multiplexor, a multiplier and an integrator which are configured to receive a corresponding input neuron and coming weight from said main memory]…: claimed multiplexor, multiplier and integrator (e.g. accumulator), in 0138 and depicted in Figs. 3A-D)
wherein each said PE receives inputs of neuron weight, input neuron and local memory address from said main memory, with local memory address received by the local memory, and the input neuron received by the local memory and the multiplexor; … wherein output from said multiplexor is output as PE output, with this PE output also input to the multiplier which multiplies PE output and neuron weight and outputs a multiplier output, as a partial sum of the output neuron, to the integrator which integrates these partial results from the multiplier and outputs final results of the integration in which all corresponding neurons are calculated and summed; (Depicted in Figs 3A-D: And in 0072-0072: FIG. 3A illustrates a block diagram of a PE 210, in accordance with one embodiment. The PE 210 is configured to support the PTIS-sparse dataflow. Like, the PE 220 shown in FIG. 2C, the PE 210 includes a weight buffer 305, an input activations buffer 310, and an FxI multiplier array 325 [wherein each said PE receives inputs of neuron weight, input neuron and local memory address from said main memory, with local memory address received by the local memory, and the input neuron received by the local memory and the multiplexor]. Parallelism within a PE 210 is accomplished by processing a vector of F non-zero filter weights a vector of I non-zero input activations in within the FxI multiplier array 325. FxI products are generated each processing cycle by each PE 210 in the SCNN accelerator 200. In one embodiment F=I=4. In other embodiments, F and I may be any positive integer and the value of F may be greater than or less than I. The values of F and I may each be tuned to balance overall performance and circuit area. With typical density values of 30% for both weights and activations, 16 multiplies of the compressed sparse weight and input activation values is equivalent to 178 multiplies in a dense accelerator that processes weight and input activation values including zeros. The accumulator array 340 [wherein each said PE receives inputs of neuron weight, input neuron and local memory address from said main memory, with local memory address received by the local memory, and the input neuron received by the local memory and the multiplexor] may include one or more accumulation buffers and adders to store the products generated in the multiplier array 325 and sum the products into the partial sums [wherein output from said multiplexor is output as PE output, with this PE output also input to the multiplier which multiplies PE output and neuron weight and outputs a multiplier output, as a partial sum of the output neuron, to the integrator which integrates these partial results from the multiplier and outputs final results of the integration in which all corresponding neurons are calculated and summed;]. The PE 210 also includes position buffers 315 and 320, indices buffer 355, destination calculation unit 330, F*I arbitrated crossbar 335, and a postprocessing unit 345.
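For illustration only, the following minimal Python sketch (not taken from Dal; the accumulator bank size and the destination function are illustrative stand-ins) shows how all F×I products of a weight vector and an input activation vector can be formed and added into per-output accumulators as partial sums.

```python
import numpy as np

def fxi_partial_sums(weight_vec, act_vec, accumulators, dest_index):
    """Form all F x I products of a weight vector and an activation vector
    and add each product into the accumulator selected by a (purely
    illustrative) destination function, building up running partial sums."""
    for f, w in enumerate(weight_vec):
        for i, a in enumerate(act_vec):
            accumulators[dest_index(f, i)] += w * a
    return accumulators

accumulators = np.zeros(8)                             # illustrative accumulator bank
dest_index = lambda f, i: (f + i) % len(accumulators)  # stand-in for the destination calculation unit
weight_vec = np.array([0.1, -0.2, 0.3, 0.05])          # F = 4 non-zero weights (Dal: F = I = 4 in one embodiment)
act_vec    = np.array([1.0, 0.6, 2.0, 0.5])            # I = 4 non-zero input activations
fxi_partial_sums(weight_vec, act_vec, accumulators, dest_index)
```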
wherein each said PE stores the input neuron in the local memory, to be reused in computing a partial sum of the output neuron using different input weights; wherein if the local memory is initially empty, said multiplexor directly feeds the input neuron to said multiplier; (in 0048-0049: The layer sequencer 215 reads the weights and outputs weight vectors to be multiplied by the PEs 210. In one embodiment, the weights are in compact form and are read from off-chip DRAM only once and stored within the SCNN accelerator 200. In one embodiment, the layer sequencer 215 broadcasts a weight vector to each PE 210 and sequences through multiple activation vectors before broadcasting another weight vector. In one embodiment, the layer sequencer 215 broadcasts an input activation vector to each PE 210 [wherein each said PE stores the input neuron in the local memory, to be reused in computing a partial sum of the output neuron] and sequences through multiple weight vectors [using different input weights] before broadcasting another input activation vector. Products generated by the multipliers within each PE 210 are accumulated to produce intermediate values (e.g., partial sums) [claimed wherein each said PE stores the input neuron in the local memory, to be reused in computing a partial sum of the output neuron using different input weights] that become the output activations after one or more iterations. When the output activations for a neural network layer have been computed and stored in an output activation buffer, the layer sequencer 215 may proceed to process a next layer [wherein each said PE stores the input neuron in the local memory, to be reused in computing a partial sum of the output neuron using different input weights] by applying the output activations as input activations. Each PE 210 includes a multiplier array that accepts a vector of weights (weight vector) and a vector of input activations (activation vector), where each multiplier within the array is configured to generate a product from one input activation value in the activation vector and one weight in the weight vector. The weights and input activations in the vectors can all be multiplied by one another in the manner of a Cartesian product…)
wherein if the local memory is initially empty, said multiplexor directly feeds the input neuron to said multiplier; (Examiner notes that this limitation is not positively recited and includes a contingent clause that appears not to be required, as the local memory is required to include an address and other data per the previously recited limitation under the broadest reasonable interpretation.)
and wherein each said PE is configured to compute a neuron value of a corresponding output layer by multiplying the coming weight and input neuron from said neuron in sequence, generating partial results for integration, and outputting a final value. (in 0048-0049: The layer sequencer 215 reads the weights and outputs weight vectors to be multiplied by the PEs 210. In one embodiment, the weights are in compact form and are read from off-chip DRAM only once and stored within the SCNN accelerator 200. In one embodiment, the layer sequencer 215 broadcasts a weight vector to each PE 210 and sequences through multiple activation vectors before broadcasting another weight vector. In one embodiment, the layer sequencer 215 broadcasts an input activation vector to each PE 210 and sequences through multiple weight vectors before broadcasting another input activation vector. Products generated by the multipliers within each PE 210 are accumulated to produce intermediate values (e.g., partial sums) [claimed , generating partial results for integration, …] that become the output activations [claimed generating partial results for integration, and outputting a final value] after one or more iterations. When the output activations for a neural network layer have been computed and stored in an output activation buffer [claimed generating partial results for integration, and outputting a final value], the layer sequencer 215 may proceed to process a next layer by applying the output activations as input activations. Each PE 210 includes a multiplier array that accepts a vector of weights (weight vector) and a vector of input activations (activation vector), where each multiplier within the array is configured to generate a product from one input activation value in the activation vector and one weight in the weight vector. The weights and input activations in the vectors can all be multiplied by one another in the manner of a Cartesian product…)
and wherein processed neuron results from the PEs are output, along with a neuron index for each PE, to a parallel-to-serial first-in-first-out (FIFO) circuit whose neuron index and output neuron values are fed back and stored in the main memory for coming the next layer's result. (in 0083-0086: In one embodiment, the weight buffer 305 is a first-in first-out FIFO buffer (WFIFO) [and wherein processed neuron results from the PEs are output, along with a neuron index for each PE, to a parallel-to-serial first-in-first-out (FIFO) circuit whose neuron index and output neuron values are fed back and stored in the main memory for coming the next layer's result]. The weight buffer 305 should have enough storage capacity to hold all of the non-zero weights for one input channel within one tile (i.e., for the inner most nested "For" in TABLE 3). When possible, the weights and input activations are held in the weight buffer 305 and input activations buffer 310, respectively, and are never swapped out to DRAM. If the output activation volume of a neural network layer can serve as the input activation volume for the next neural network layer, then the output activations buffer 350 is logically swapped with the input activations buffer 310 between processing of the different neural network layers. Similarly, the indices buffer 355 is logically swapped with the buffer 320 between processing the different neural network layers [… to a parallel-to-serial first-in-first-out (FIFO) circuit whose neuron index and output neuron values are fed back and stored in the main memory for coming the next layer's result]… In one embodiment, the weight buffer 305 is a FIFO buffer that includes a tail pointer, a channel pointer, and a head pointer [and wherein processed neuron results from the PEs are output, along with a neuron index for each PE, to a parallel-to-serial first-in-first-out (FIFO) circuit whose neuron index and output neuron values are fed back and stored in the main memory for coming the next layer's result]. The layer sequencer 215 controls the "input" side of the weight buffer 305, pushing weight vectors into the weight buffer 305. The tail pointer is not allowed to advance over the channel pointer. A full condition is signaled when the tail pointer will advance past the channel pointer when another write vector is stored. The buffer 315 may be implemented in the same manner as weight buffer 305 and is configured to store the positions associated with each weight vector. In one embodiment, the weight buffer 305 outputs a weight vector of F weights {w[0] ... w[F-1]} and the buffer 315 outputs the associated positions {x[0] ... x[F-1]}. Each position specifies r, s, and k for a weight. The output channel k is encoded relative to the tile.)
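For illustration only, the following minimal Python sketch (not taken from Dal; the toy layer callables are illustrative stand-ins for the PE array) shows the layer-to-layer feedback pattern in which the output activations produced for one layer are logically swapped in as the input activations of the next layer.

```python
def run_layers(layers, input_acts):
    """Process layers in sequence: the outputs produced for one layer are
    fed back as the inputs of the next layer, in the spirit of the buffer
    swap described in the cited paragraphs."""
    acts = input_acts
    for compute_layer in layers:
        out_acts = compute_layer(acts)  # output side for the current layer
        acts = out_acts                 # logical swap: outputs become the next inputs
    return acts

# Two toy "layers" that just scale and shift their inputs.
layers = [lambda xs: [2 * x for x in xs], lambda xs: [x + 1 for x in xs]]
final = run_layers(layers, [1.0, 0.5, 0.0])  # -> [3.0, 2.0, 1.0]
```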
Additionally, Ven teaches the system comprising: a plurality of processing engines (PEs) coupled in parallel in a fully connected neural network (NN); (in [0134] In yet another embodiment, a non-transitory machine readable medium that stores code that when executed by a machine causes the machine to perform a method including receiving a neural network comprising a plurality of fully connected layers and a plurality of convolutional layers with a processing system, wherein the processing system comprises a plurality of fully connected layer chips coupled by an interconnect, a plurality of convolutional layer chips each coupled by an interconnect to a respective fully connected layer chip of the plurality of fully connected layer chips [the system comprising: a plurality of processing engines (PEs) coupled in parallel in a fully connected neural network (NN)], and each of the plurality of fully connected layer chips and the plurality of convolutional layer chips comprising an interconnect to couple each of a forward propagation compute intensive tile, a back propagation compute intensive tile, and a weight gradient compute intensive tile of a column of compute intensive tiles between a first memory intensive tile and a second memory intensive tile; and mapping the plurality of fully connected layers of the neural network to the plurality of fully connected layer chips and the plurality of convolution layers of the neural network to the plurality of convolutional layer chips. The method may include generating updated weight gradients for the neural network with the processing system. The method may include a convolution layer chip operating on a set of inputs in parallel to generate updated weight gradients and a respective fully connected layer chip operating on a set of outputs from the convolution layer chip. The method may include accumulating partial output features from the compute intensive tiles into a third memory intensive tile, and computing an activation function. The method may include each of the plurality of fully connected layer chips and the plurality of convolutional layer chips including a plurality of rows and columns of compute intensive tiles coupled to a plurality of rows and columns of memory intensive tiles, and the mapping including allocating columns for each layer of the neural network to the memory intensive tiles…)
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the teachings of the prior art for performing neural network operations using parallel processing elements/chips as disclosed by Ven with the teachings of the prior art for performing neural network operations using high performance circuit architecture and parallel processing techniques as disclosed by Dal.
One of ordinary skill in the art would have been motivated to combine the methods disclosed by Ven and Dal as noted above. Doing so has the advantage of reducing the data transferred through the parallel network and also lessening the memory bandwidth of each fully connected layer chip (Ven, 0105).
Regarding independent claim 9 limitations, Dal teaches: a system for computing where the limitations are similar to claim 1 and thus considered rejected under the same rationale. Additionally, Dal teaches wherein a number of neuron outputs are used in determining an intermediate neuron which is then used with the remaining neurons for outputting the final value, toward reducing local memory storage requirements; (As depicted in [0055]: These layers may take multiple input features and produce a (e.g., equal) plurality of output features. In FP, output features may be produced by down-sampling (e.g., max-pooling or averaging) the input features, e.g., reducing feature size [wherein a number of neuron outputs are used in determining an intermediate neuron which is then used with the remaining neurons for outputting the final value, toward reducing local memory storage requirements]… [0103] In certain embodiments, one of the key performance parameters at the node-level is for the substantially high memory bandwidth required by the fully connected layer chips to maintain a same throughput as the convolutional layer chips [wherein a number of neuron outputs are used in determining an intermediate neuron which is then used with the remaining neurons for outputting the final value, toward reducing local memory storage requirements]. One embodiment herein reduces the memory bandwidth by aggregating inputs to the fully connected layers, e.g., and execute them as a batch in the fully connected layer chip. This may allow for the layer parameters to be fetched only once per batch, e.g., reducing bandwidth proportional to the batch size [wherein a number of neuron outputs are used in determining an intermediate neuron which is then used with the remaining neurons for outputting the final value, toward reducing local memory storage requirements]. FIG. 15 illustrates an example of a computing system 1500 with a node architecture to couple a plurality of fully connected layer chips and convolutional layer chips according to embodiments of the disclosure. As shown in FIG. 15, a chip cluster is formed by connecting multiple (e.g., illustrated as four) convolutional layer chips and one fully connected layer chip as a wheel (e.g., via a wheel interconnect). The convolutional layer chips in each depicted cluster are located at the circumference and the fully connected layer chip is present at the center. The convolutional layer chips (e.g., of a cluster) may operate on different training/evaluation inputs in parallel, e.g., while the fully connected layer chip receives inputs from all the convolutional layer chips (e.g., of a cluster) and executes them in a batch….)
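For illustration only, the following minimal Python sketch (not taken from the cited references; the weight shape, batch size, and the fetch_weights stand-in are illustrative assumptions) shows the bandwidth-reduction idea of fetching the fully connected layer parameters once and reusing them across a batch of aggregated inputs.

```python
import numpy as np

def fully_connected_batched(fetch_weights, inputs_batch):
    """Fetch the layer weights a single time and reuse them for every
    input in the batch, rather than re-reading them per input;
    `fetch_weights` is an illustrative stand-in for a memory read."""
    W = fetch_weights()                   # one weight fetch per batch
    return [W @ x for x in inputs_batch]  # weights reused across the whole batch

W_stored = np.random.rand(512, 1024)
fetch_weights = lambda: W_stored
batch = [np.random.rand(1024) for _ in range(4)]  # inputs aggregated into a batch
outputs = fully_connected_batched(fetch_weights, batch)
```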
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the teachings of the prior art for performing neural network operations using parallel processing elements/chips as disclosed by Ven with the teachings of the prior art for performing neural network operations using high performance circuit architecture and parallel processing techniques as disclosed by Dal.
One of ordinary skill in the art would have been motivated to combine the methods disclosed by Ven and Dal as noted above. Doing so has the advantage of reducing the data transferred through the parallel network and also lessening the memory bandwidth of each fully connected layer chip (Ven, 0105).
Claims 2-8, 10-18, 21, and 27 are rejected under the same rationale noted above, as supported by the cited teachings in Dal.
Claims 22 and 28 are rejected under 35 U.S.C. 103 as being unpatentable over Dally et al. (US 2018/0046900, hereinafter ‘Dal’) in view of Ven, and further in view of Henry et al. (US 20180189633, hereinafter ‘Henry’).
Regarding claim 22, the rejection of claim 21 is incorporated. Dal further teaches the apparatus of claim 21, wherein the intermediate neuron is calculated together with remaining neurons to get a final output, thus reducing the number of neural outputs received at the input neuron. (in 0050-0051: … A CNN consists of a series of layers [intermediate neuron is calculated together with remaining neurons to get a final output, thus reducing the number of neural outputs received at the input neuron], which include convolutional layers, nonlinear scalar operator layers, and layers that downsample the intermediate data [intermediate neuron is calculated together with remaining neurons to get a final output thus reducing the number of neural outputs received at the input neuron], for example by pooling [thus reducing the number of neural outputs received at the input neuron]. The convolutional layers represent the core of the CNN computation and are characterized by a set of filters that are usually 1×1 or 3×3, and occasionally 5×5 or larger. The values of these filters are the weights that are trained using a training set for the network. Some deep neural networks (DNNs) also include fully-connected layers, typically toward the end of the DNN [intermediate neuron is calculated together with remaining neurons to get a final output, thus reducing the number of neural outputs received at the input neuron]. During classification, a new image (in the case of image recognition) is presented to the neural network, which classifies images into the training categories by computing in succession each of the layers in the neural network.)
While Dal teaches the use of processing elements for computing neurons in a series of layers in a neural network, one of ordinary skill in the art would understand that the neural network outputs of a previous layer serve as the input of the next layer of neurons, which can reduce the number of neurons.
Additionally, Henry discloses that the neural network outputs of a previous layer serve as the input of the next layer of neurons, which can reduce the number of neurons, in [0225]: As may be observed with respect to the embodiment of FIG. 23, the number of result words (neuron outputs) produced and written back to the data RAM 122 or weight RAM 124 is half the square root of the number of data inputs (connections) received and the written back row of results has holes, i.e., every other narrow word result is invalid, more specifically, the B narrow NPU results are not meaningful. Thus, the embodiment of FIG. 23 may be particularly efficient in neural networks having two successive layers in which, for example, the first layer has twice as many neurons as the second layer (e.g., the first layer has 1024 neurons fully connected to a second layer of 512 neurons) [intermediate neuron is calculated together with remaining neurons to get a final output, thus reducing the number of neural outputs received at the input neuron]. Furthermore, the other execution units 112 (e.g., media units, such as x86 AVX units) may perform pack operations on a disperse row of results (i.e., having holes) to make it compact (i.e., without holes), if necessary, for use in subsequent computations while the NNU 121 is performing other computations associated with other rows of the data RAM 122 and/or weight RAM 124.
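For illustration only, the following minimal Python sketch (not taken from Henry; the use of NaN to mark invalid entries is an illustrative convention) shows the cited 2:1 layer relationship, in which 1024 neuron outputs feed a fully connected layer of 512 neurons, followed by a simple pack of a result row containing holes.

```python
import numpy as np

# Illustrative fully connected step from a 1024-neuron layer to a
# 512-neuron layer (the 2:1 ratio mentioned in the cited passage).
W = np.random.rand(512, 1024)
x = np.random.rand(1024)
y = W @ x                                 # 512 outputs produced from 1024 inputs

# Illustrative "pack" of a result row containing holes: keep only the
# valid entries, analogous to compacting a disperse row of results.
row = np.array([3.0, np.nan, 1.5, np.nan, 0.2, np.nan])  # NaN marks a hole here
packed = row[~np.isnan(row)]              # -> array([3.0, 1.5, 0.2])
```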
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the teachings of the prior art for performing neural network operations using high performance circuit architecture and parallel processing techniques as collectively disclosed by Ven and Dal with the method for implementing neural network operators using hardware processing circuitry and neural network processing algorithms as disclosed by Henry.
One of ordinary skill in the art would have been motivated to combine the methods disclosed by Dal, Ven, and Henry in order to help improve the performance and efficiency of computations associated with ANNs (Henry, 0002) and to develop circuitry configured to perform the classic multiply-accumulate function in an artificial neural network (Henry, 0117). Doing so has the advantage of efficiently performing as an artificial neural network layer for a large number of different connection inputs (Henry, 0117).
Conclusion
THIS ACTION IS MADE FINAL. Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a).
A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action. In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any nonprovisional extension fee (37 CFR 1.17(a)) pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action. In no event, however, will the statutory period for reply expire later than SIX MONTHS from the mailing date of this final action.
The prior art made of record and not relied upon is considered pertinent to applicant's disclosure.
Nariyambut Murali et al. (US 20170206434): Teaches activation of neurons/nodes between convolution layers, feature maps resulting from a previous convolution layer may activate convolution neurons/nodes in a subsequent convolution layer.
Du et al. (NPL: An Accelerator for High Efficient Vision Processing): teaches that convolutional layers allow different input-output neuron pairs to share synaptic weights (i.e., with the kernel), and pooling layers do not have synaptic weights. In contrast, classifier layers are usually fully connected, and there is no sharing of synaptic weights among different input-output neuron pairs. As a result, classifier layers often consume the largest space in the SB (e.g., 97.28% for LeNet-5 [12]). We present the general scheduling of a classifier layer in Fig. 16. When processing a classifier layer, each PE works on a single output neuron, and will not move to another output neuron until the current one has been computed. In each cycle, Px × Py synaptic weights and a single input neuron for all Px × Py PEs are loaded to NFU. After that, each PE multiplies the synaptic weight and input neuron, and accumulates the result to the stored partial sum, for obtaining the dot product associated with an output neuron. The obtained result will be sent to the ALU for the computation of activation function.
Park (US 2016/0162782): teaches the number of outputs reduced in the sequence of convolutional layers.
Masgonty et al. (US 2003/0061184): teaches processing data into empty memory locations.
Han et al. (NPL: “ESE: Efficient Speech Recognition Engine with Sparse LSTM on FPGA”, hereinafter ‘Han’): teaches using a FIFO and sparse matrix remapping.
Li et al. (US 2018/0046914): teaches densifying a sparse neural network by identifying weights that are close to zero in value, for compressing the neural network.
Yan et al. (US 2018/0218518): teaches the compression of data associated with processing a sparse neural network.
Any inquiry concerning this communication or earlier communications from the examiner should be directed to OLUWATOSIN ALABI whose telephone number is (571)272-0516. The examiner can normally be reached Monday-Friday, 8:00am-5:00pm EST.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Michael Huntley can be reached on (303) 297-4307. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.
/OLUWATOSIN ALABI/ Primary Examiner, Art Unit 2129