DETAILED ACTION
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Priority
Receipt is acknowledged of certified copies of papers required by 37 CFR 1.55.
Reopening after Pre-Appeal Brief
The status of claims is as follows:
Claims 1-19 and 22-30 remain pending in the application.
Claims 1, 10, 22, 25, and 27 are amended.
Claims 20-21 are cancelled.
Response to Arguments
Applicant’s arguments with respect to rejections under 35 USC 103 have been fully considered but are moot in light of the new combination of references applied.
Examiner acknowledges Applicant’s argument, presented in the pre-appeal brief, that He only refers to multiple binarization thresholds producing multiple binary feature maps as a technique performed in previous works, and that He does not itself perform this technique. While Examiner notes that He had suggested doing so (“ABC-Net [10] … using binarization functions with various thresholds … which will be discussed in the main body”), such a generic, vague reference does not meet the high bar required for a rejection under 35 USC 102.
Therefore, a new Non-Final rejection under 35 USC 103 has been issued in which an additional reference, Lin, is applied, which explicitly recites this technique. Examiner also notes that the combination is obvious because the Lin technique “ABC-Net” is, in fact, explicitly noted as a useful technique by He, who provides a motivation for making such a combination on Page 534: “recent research efforts in [10], [16] has brought up multiple binarization method to compensate the information loss due to the aggressive quantization. Thus, further integrating such multiple binarization with in-memory computing technique will provide accurate deep neural network inference result with high throughput.”
Examiner notes that Lin was relied upon for several dependent claims in the most recent office action, but is now moved up into the combination to teach the independent claims.
Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.
Claims 1-6, 9, 10-15, 18-19, and 30 are rejected under 35 U.S.C. 103 as being unpatentable over He et al. (“Accelerating Low Bit-Width Deep Convolution Neural Network in MRAM”; hereinafter “He”) in view of Lin et al. (“Towards Accurate Binary Convolutional Neural Network”; hereinafter “Lin”).
As per Claim 1, He teaches a neuromorphic method, the method comprising:
generating a plurality of binary feature maps by multi-channel binarizing pixel values of an input feature map (He, Page 535 Figure 1:
[He, Figure 1 reproduced as an image]
He, Page 534 End of Section II: “Thus, further integrating such multiple binarization with in memory computing technique will provide accurate deep neural network inference result with high throughput.” He also discloses “image bank”, and images comprise pixels. He, Page 535: “Quantizer: This unit binarize the interlayer tensor or weight w.r.t Eq. (1), Eq. (2) and Eq. (3).”)
providing pixel values of each of the plurality of binary feature maps as input values to a crossbar array circuitry (He, as shown above in Figure 1, discloses inputting each of the binary feature maps to “sensory circuitry”. He, Page 537 Section C, discloses: “In this subsection, we describe a magnetic crossbar architecture consisting of perpendicularly coupled magnetic domain wall motion racetracks.”)
storing weight values of a machine model in respective synaptic circuits included in the crossbar array circuitry (He, Page 535 under Figure 1: “The general overview of the system architecture for performing low bit-width CNN is shown in Fig. 1.a [18]. This architecture mainly consists of Image Bank, Kernel Bank, bit-wise In-Memory Convolution Engine (IMCE), and Digital Processing Unit (DPU). Since linear layer can be visualized as convolution layer with 1×1 kernel size, the computation of linear layer could be implemented by the same convolution accelerator as well. Assuming the Input feature maps (I) and Kernels (W) are initially stored in the Image Banks and Kernel Banks of memory, respectively.” Here, He discloses that the weight values (“kernels”) are stored in respective circuits (“Kernels (W) … Kernel Banks of memory, respectively.”). He also explicitly discloses types of synaptic circuits on Page 537: “The device structure of the computational magnetic crossbar design is shown in Fig. 6a, which mainly constitutes of equally spaced ferromagnetic nanowires both longitudinally and latitudinally. Each nanowire could work individually as a normal domain wall racetrack memory to store the interlayer tensor and weight, where binary data are represented by the magnetization directions and stored in the form of domain wall pair train within the nanowires [23].”)
generating output values of the crossbar array circuitry for the plurality of binary feature maps by implementing multiplications respectively between each of a plurality of the input values and corresponding weight values stored in the synaptic circuits (He, Figure 1 shown above, discloses “Activation Function” and “Counter-Shifter-Partial Sum” comprising XNOR operations. He, Page 535 Bottom Left: “Activ. Function: the activation function module perform the element-wise computation, which normally takes ReLU as the activation function.” Examiner notes that an activation function multiplies input values and weight values to produce output activations. Furthermore, more detail is provided in Page 534 Top Right: “Therefore, the computation for one convolution layer or linear layer can be described as:
[equation reproduced from He as an image]
where xq and wq are the vectorized form of quantized interlayer tensor and weight. N is the vector size of xq. Since xq and wq only consist of -1 and +1, in order to map the computation of xq,i · wq,i into hardware, we use single bit of 0 and 1 to represent -1 and +1 respectively. For the converted form x’q,i and w’q,i, the computation of xq,i ·wq,i is equivalent to XNOR.” Thus, He teaches multiplying inputs by weight values using XNOR gates.)
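The XNOR mapping quoted above is a standard binary-network identity. For illustration only (this sketch is Examiner’s own, not reproduced from He, and all function names are illustrative), it can be checked numerically:

```python
# Illustrative sketch of He's quoted mapping: -1/+1 values are stored as
# 0/1 bits, and the product x_i * w_i is recovered through bitwise XNOR.

def to_bit(v):
    # Map -1 -> 0 and +1 -> 1, the single-bit representation quoted above.
    return (v + 1) // 2

def xnor(a, b):
    # Bitwise XNOR of two single bits: 1 when the bits agree.
    return 1 - (a ^ b)

def binary_dot(x, w):
    # Dot product of +/-1 vectors via XNOR and a popcount, using the
    # standard identity sum(x_i * w_i) = 2 * popcount(XNOR) - N.
    pop = sum(xnor(to_bit(xi), to_bit(wi)) for xi, wi in zip(x, w))
    return 2 * pop - len(x)

x = [1, -1, 1, 1]
w = [1, 1, -1, 1]
assert binary_dot(x, w) == sum(xi * wi for xi, wi in zip(x, w))
```

This is the equivalence that allows the crossbar hardware to replace multipliers with XNOR gates.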
generating an output feature map by generating pixel values of the output feature map by selectively merging the output values (He, Page 535 Top Right, discloses: “Then, the produced binary result is processed in parallel by partial sum (subtract) units.” Furthermore, Fig. 1(a) shows a circled “+” icon under the partial sum output values (“P-Sum”) indicating that they are merged into a feature map (“Output fmaps”). Examiner notes that the term “selectively” in the claim is not given any further detail, nor is “selectively” given any description in the Specification, and notes that in He each of the partial sums is selected and combined, and the BRI of “selectively merging” includes “selecting every element and adding all together.”)
However, He does not explicitly teach wherein the multi-channel binarizing is based on a plurality of different thresholds that are set to have respective different values, and wherein each of the plurality of different thresholds is applied to the input feature map to generate a corresponding one of the plurality of binary feature maps, wherein the application of the different thresholds to the input feature map includes applying a first threshold to values of first pixels of the input feature map to generate a first binary feature map of the plurality of binary feature maps, and applying a different second threshold to the values of the first pixels of the input feature map to generate a second binary feature map of the plurality of binary feature maps.
Lin teaches wherein the multi-channel binarizing is based on a plurality of different thresholds that are set to have respective different values, and wherein each of the plurality of different thresholds is applied to the input feature map to generate a corresponding one of the plurality of binary feature maps, wherein the application of the different thresholds to the input feature map includes applying a first threshold to values of first pixels of the input feature map to generate a first binary feature map of the plurality of binary feature maps, and applying a different second threshold to the values of the first pixels of the input feature map to generate a second binary feature map of the plurality of binary feature maps (Lin, Pages 4-5 Section 3.2, discloses: “As mentioned above, a convolution can be implemented without multiplications when weights are binarized. However, to utilize the bitwise operation, the activations must be binarized as well, as they are the inputs of convolutions. Similar to the activation binarization procedure in [Zhou et al., 2016], we binarize activations after passing it through a bounded activation function h, which ensures h(x) ∈ [0, 1]. We choose the bounded rectifier as h. Formally, it can be defined as: hv(x) = clip(x + v, 0, 1), (8) where v is a shift parameter.” Lin, Page 5 above Eq 11, discloses: “Secondly, we estimate the real-value activation R using the linear combination of N binary activations.”
Here, Examiner notes that Lin discloses “N binary activations”, each with a different threshold, as each threshold is adjusted by a “shift parameter”.)
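For illustration only (Examiner’s own sketch, not Lin’s code; the 0.5 cut on hv(x) and all names are assumptions), the way N shift parameters act as N different effective thresholds on the same input values can be shown as:

```python
# Illustrative sketch: one set of activation values binarized under
# N = 3 different shift parameters v_n, yielding N binary maps.

def hv(x, v):
    # Lin's bounded rectifier h_v(x) = clip(x + v, 0, 1).
    return max(0.0, min(1.0, x + v))

def binarize(x, v, cut=0.5):
    # One binary activation; the 0.5 cut on h_v(x) is an assumed
    # stand-in for Lin's binarization of the bounded activation.
    return 1 if hv(x, v) > cut else -1

activations = [0.2, 0.5, 0.9]
shifts = [-0.25, 0.1, 0.4]  # hypothetical shift parameters v_n
binary_maps = [[binarize(a, v) for a in activations] for v in shifts]
# Each v_n shifts the effective threshold, so the rows differ:
assert len({tuple(row) for row in binary_maps}) == 3
```

Each row of `binary_maps` is one binary feature map produced from the same input values under a different effective threshold.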
Lin is analogous art because it is in the field of endeavor of binary neural network acceleration. It would have been obvious before the effective filing date of the claimed invention to combine the teachings of He and Lin. One of ordinary skill in the art would have been motivated to do so in order to compensate for information loss from binary quantization (He, Page 534 End of Section II: “recent research efforts in [10], [16] has brought up multiple binarization method to compensate the information loss due to the aggressive quantization. Thus, further integrating such multiple binarization with in-memory computing technique will provide accurate deep neural network inference result with high throughput.” and Lin, Top of Page 3: “We relied on the idea of finding the best approximation of full-precision convolution using multiple binary operations, and employing multiple binary activations to allow more information passing through.”)
As per Claim 2, the combination of He and Lin teaches the method of Claim 1. Lin teaches wherein the generating of the plurality of binary feature maps comprises determining pixel values of a binary feature map by comparing each of the plurality of different thresholds with pixel values of the input feature map, and setting respective pixel values of the plurality of binary feature maps to binary values based on results of the comparing (Lin, end of Page 4: “We constrain the binary activations to either 1 or -1. In order to transform the real-valued activation R into binary activation, we use the following binarization function:
[Lin’s binarization function reproduced as an image]
”.)
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the teachings of Lin with He for at least the reasons recited in the rejection to Claim 1.
As per Claim 3, the combination of He and Lin teaches the method of Claim 1. Lin teaches wherein the generating of the plurality of binary feature maps further comprises: for each of the plurality of different thresholds, determining whether a pixel value of the input feature map is greater than a threshold, and when the determining of whether the pixel value of the input feature map is greater than the threshold indicates that the pixel value of the input feature map is greater than the threshold determining a corresponding pixel value of a binary feature map to be 1; and when the determining of whether the pixel value of the input feature map is greater than the threshold indicates that the pixel value of the input feature map is not greater than the threshold or when another performed determining of whether the pixel value is less than the threshold or another threshold indicates that the pixel value is respectively less than the threshold or the other threshold, determining the corresponding pixel value of the binary feature map to be 0 or -1. (Lin, as shown above in the rejection to Claim 2, assigns a value of 1 or -1 based on the threshold.)
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the teachings of Lin with He for at least the reasons recited in the rejection to Claim 1.
As per Claim 4, He teaches the method of Claim 1. Lin teaches wherein each of the pixel values of the output feature map are represented by multiple bits (Lin, Page 5: “Secondly, we estimate the real-value activation R using the linear combination of N binary activations … are expected to learn and utilize the statistical features of full-precision activations.” Examiner notes that here, the binary activations are linearly combined to produce a feature map at full precision, and therefore multiple bits.)
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the teachings of Lin with He for at least the reasons recited in the rejection to Claim 1.
As per Claim 5, He teaches the method of Claim 1. Lin teaches wherein a plural number of bits of a pixel value of the output feature map has a same plural number of bits as a pixel value of the input feature map (Lin, Page 5: “Secondly, we estimate the real-value activation R using the linear combination of N binary activations … are expected to learn and utilize the statistical features of full-precision activations.” Examiner notes that here, the binary activations are linearly combined to produce a feature map at full precision, and therefore multiple bits. Furthermore, the original full precision activations are converted to binary, and then back to full precision, therefore the output feature map has the same number of bits as the input feature map.)
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the teachings of Lin with He for at least the reasons recited in the rejection to Claim 1.
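The multi-bit character of the linearly combined output, as discussed in the rejections to Claims 4 and 5 above, can be illustrated as follows (Examiner’s own sketch; the β and v values are hypothetical, not Lin’s trained parameters):

```python
# Illustrative sketch: a real-valued activation approximated by a
# linear combination of N binary activations, R ~ sum_n beta_n * B_n.

def binarize(x, v):
    # Binary activation under hypothetical shift parameter v.
    return 1.0 if x + v > 0.5 else -1.0

def approx(x, betas, shifts):
    # Linear combination of N binary activations.
    return sum(b * binarize(x, v) for b, v in zip(betas, shifts))

betas = [0.5, 0.25, 0.125]   # hypothetical combination weights beta_n
shifts = [0.0, -0.25, 0.25]  # hypothetical shift parameters v_n
# With N binary maps the combination takes several distinct levels, so
# the output carries multiple bits even though each B_n is binary.
levels = {approx(i / 100.0, betas, shifts) for i in range(101)}
assert len(levels) > 2
```

A single binary activation can express only two levels; the combination expresses up to 2^N levels, i.e., a multi-bit value.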
As per Claim 6, He teaches the method of Claim 1. Lin teaches wherein the generating of the pixel values of the output feature map comprises generating pixel values of the output feature map by applying an activation function to the merged output values. (Lin, Page 5: “Secondly, we estimate the real-value activation R using the linear combination of N binary activations … Different from that of weights, the parameters βn’s and vn’s (n = 1, · · · ,N) here are both trainable, just like the scale and shift parameters in batch normalization. Without the explicit linear regression approach, βn’s and vn’s are tuned by the network itself during training and fixed in test-time. They are expected to learn and utilize the statistical features of full-precision activations.” Here, Lin discloses treating the linear combination of binarization outputs just like any other layer of a neural network, in which the weights of said combination are tuned by the network. One of ordinary skill in the art will appreciate that a layer of a neural network that is trained comprises a linear combination followed by an activation function. Lin discloses the linear combination, and Examiner further points out that “identity” or “linear”, in which the outcome of the linear combination is not changed, is also considered an activation function in the art (as evidenced by Deval Shah “Activation Functions”, provided herein: “Identity or linear activation function. You will get the exact same curve. Input maps to same output.”))
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the teachings of Lin with He for at least the reasons recited in the rejection to Claim 1.
As per Claim 9, the combination of He and Lin teaches the method of Claim 1. He teaches a computer-readable medium comprising instructions, which when executed by a processor, configure the processor to implement the method of claim 1. (He, Page 535 Top Left: “The general overview of the system architecture for performing low bit-width CNN is shown in Fig. 1.a [18]. This architecture mainly consists of Image Bank, Kernel Bank, bit-wise In-Memory Convolution Engine (IMCE), and Digital Processing Unit (DPU).”)
As per Claims 10-15, these are device claims corresponding to method Claims 1-6. The difference is that they recite an on-chip memory including a crossbar array circuitry and a processor. He, Page 535 under Figure 1: “The general overview of the system architecture for performing low bit-width CNN is shown in Fig. 1.a [18]. This architecture mainly consists of Image Bank, Kernel Bank, bit-wise In-Memory Convolution Engine (IMCE), and Digital Processing Unit (DPU).” He, Page 537 Section C, discloses: “In this subsection, we describe a magnetic crossbar architecture consisting of perpendicularly coupled magnetic domain wall motion racetracks.” Therefore, Claims 10-15 are rejected for similar reasons as Claims 1-6.
As per Claim 18, the combination of He and Lin teaches the device of Claim 10. He teaches wherein the machine model is a neural network (He, Page 534 Section III: “In this section, we will first provide an overview introduction for low bit-width CNN accelerator in block level.”)
wherein the processor is further configured to: generate a training input feature map of an n-th layer of the neural network by performing forward propagation from a first layer to an (n-1)-th layer of the neural network; (He, Page 534 Section II: “For each parameters optimization iteration during the training process, the full precision w will be updated first, then the wq will be calculated correspondingly. In this work, we mainly discuss the binary format weight and interlayer tensor. The mathematical formula for weight w and interlayer tensor x binarization firstly discussed in [6] can be described as:
[equations reproduced from He as an image]
where q is the input (w or x) while r is the output (wq or xq) to the binarization function. In the forward path (i.e. inference phase), w or x is binarized using Sign() function. Note that, the sign function owns zero derivatives almost everywhere, which makes it impossible to calculate the gradient using chain rule in backward path (i.e. training phase). Thus, the Straight-Through Estimator (STE) [6], [15] is applied to calculate gradient in this work. In the backward path, the input gradient of binarization activation function clones the gradient at output, if the input q is in the range from -1 to +1. Otherwise, the gradient is cancelled to preserve training performance … However, for the weight scaling factor computation, the current best solution is to iteratively compute based on the weight distribution during training, which can be written as:
[equation reproduced from He as an image]
where E(|Wl|) is the mean of the absolute value of full precision weights in lth layer. Note that, the backward path of Eq. (3) is identical to Eq. (2).”
Examiner notes that here, He discloses forward and backward propagation applied to training, and points out that one of ordinary skill in the art will appreciate that in order to train a layer, one must carry out forward propagation of each layer leading up to the layer to be trained, therefore generating an input feature map.)
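The STE behavior quoted above (gradient cloned when the input is in [-1, +1], cancelled otherwise) can be sketched as follows (Examiner’s own illustration, not He’s code):

```python
# Illustrative sketch of the Straight-Through Estimator described above.

def sign_forward(q):
    # Forward path (inference): binarize with Sign().
    return 1.0 if q >= 0 else -1.0

def ste_backward(q, grad_out):
    # Backward path (training): clone the output gradient when the
    # input q lies in [-1, +1]; otherwise cancel it.
    return grad_out if -1.0 <= q <= 1.0 else 0.0

assert ste_backward(0.3, 2.0) == 2.0   # inside the range: cloned
assert ste_backward(1.7, 2.0) == 0.0   # outside the range: cancelled
```

This is what permits gradient-based training through the otherwise non-differentiable Sign() function.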
generate a plurality of binary training feature maps by multi-channel binarizing pixel values of an input training feature map of the n-th layer based on a plurality of training different thresholds (He continues in the same section: “Since xq and wq only consist of -1 and +1, in order to map the computation of xq,i · wq,i into hardware, we use single bit of 0 and 1 to represent -1 and +1 respectively. For the converted form x′q,i and w′q,i, the computation of xq,i · wq,i is equivalent to XNOR(x′q,i, w′q,i)”. Also recall that, as noted above, He suggested “using binarization functions with various thresholds.”)
performing a back propagation from a last layer to the n-th layer of the neural network to train a plurality of kernels corresponding to the plurality of binary training feature maps of the n-th layer (See the passage from He above, as He discloses a subsequent backward propagation step.)
wherein the storing of the weight value includes obtaining the trained plurality of kernels and storing elements of at least one of the trained plurality of kernels as the weight values stored in the respective synaptic circuits included in the crossbar array circuitry (He, Page 535 under Figure 1: “The general overview of the system architecture for performing low bit-width CNN is shown in Fig. 1.a [18]. This architecture mainly consists of Image Bank, Kernel Bank, bit-wise In-Memory Convolution Engine (IMCE), and Digital Processing Unit (DPU). Since linear layer can be visualized as convolution layer with 1×1 kernel size, the computation of linear layer could be implemented by the same convolution accelerator as well. Assuming the Input feature maps (I) and Kernels (W) are initially stored in the Image Banks and Kernel Banks of memory, respectively.” Here, He discloses that the weight values (“kernels”) are stored in respective circuits: “Kernels (W) … Kernel Banks of memory, respectively.”)
As per Claim 19, He teaches the device of Claim 10. He teaches wherein the implementation of the convolution layer includes shifting a feature window across the input feature map. (He, Page 535, discloses: “Since linear layer can be visualized as convolution layer with 1 × 1 kernel size, the computation of linear layer could be implemented by the same convolution accelerator as well. Assuming the Input feature maps (I) and Kernels (W) are initially stored in the Image Banks and Kernel Banks of memory, respectively.” Here, He discloses “kernels” and one of ordinary skill in the art will appreciate that a convolution in a standard CNN is accomplished via a kernel, which is a window that is shifted across a feature map. A kernel is a matrix of values, and Examiner notes that the BRI of a “feature window” in this case includes a matrix of values that is slid across a feature map, as the term “feature window” is not given a strict definition in the claim or in the Specification. Examiner notes that a “feature window” is treated differently in the language of claims 22-29 (which recite a first and second feature window of an input feature map, and shifting from the first to the second window), and in those claims, another reference is incorporated to teach this matter. However, in this claim, the language of shifting a feature window is more broadly recited, and the BRI is taught by He.)
However, He does not teach wherein the device is a mobile device and the machine model is a neural network, wherein the processor is further configured to output a classification result by implementing a convolutional layer of the neural network, with respect to the input feature map, and to determine the classification result based on the generated pixel values of the output feature map
Lin teaches wherein the device is a mobile device and the machine model is a neural network, wherein the processor is further configured to output a classification result by implementing a convolutional layer of the neural network, with respect to the input feature map, and to determine the classification result based on the generated pixel values of the output feature map (Lin, Page 1: “Convolutional neural networks (CNNs) have achieved state-of-the-art results on real-world applications such as image classification [He et al., 2016] and object detection [Ren et al., 2015], with the best results obtained with large models and sufficient computation resources. Concurrent to these progresses, the deployment of CNNs on mobile devices for consumer applications is gaining more and more attention, due to the widespread commercial value and the exciting prospect. On mobile applications, it is typically assumed that training is performed on the server and test or inference is executed on the mobile devices.” Here, Lin discloses “image classification” and “mobile devices”)
Lin is analogous art because it is in the field of endeavor of binary neural network acceleration. It would have been obvious before the effective filing date of the claimed invention to combine the teachings of He and Lin. One of ordinary skill in the art would have been motivated to do so in order to compensate for information loss from binary quantization (He, Page 534 End of Section II: “recent research efforts in [10], [16] has brought up multiple binarization method to compensate the information loss due to the aggressive quantization” and Lin, Top of Page 3: “We relied on the idea of finding the best approximation of full-precision convolution using multiple binary operations, and employing multiple binary activations to allow more information passing through.”)
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the teachings of Lin with He for at least the reasons recited in the rejection to Claim 1.
As per Claim 30, He teaches the method of Claim 1. Lin teaches wherein a number of the plurality of binary feature maps is determined by a number of the plurality of different thresholds (Lin continues in Section 3.1.1: “we estimate the real-value weight filter W ∈ R^(w×h×cin×cout) using the linear combination of M binary filters … we fix Bi’s as follows
[equation reproduced from Lin as an image]
where W̄ = W − mean(W), and ui is a shift parameter. For example, one can choose ui’s to be ui = −1 + (i − 1)·2/(M − 1), i = 1, 2, · · · , M, to shift evenly over the range [−std(W), std(W)], or leave it to be trained by the network.” Here, Lin discloses calculating M thresholds based on a shift parameter that spans the statistical distribution of the weights, and then uses those M thresholds as the basis for creating a corresponding number of binary feature maps.)
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the teachings of Lin with He for at least the reasons recited in the rejection to Claim 1.
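Lin’s evenly spaced choice of shift parameters, as quoted above, makes the count explicit: choosing M values ui = −1 + (i − 1)·2/(M − 1) yields M thresholds and hence M binary maps. A short sketch (Examiner’s own, with illustrative names):

```python
# Illustrative sketch: M evenly spaced shift parameters spanning [-1, 1],
# one per binary map, per Lin's formula u_i = -1 + (i - 1) * 2 / (M - 1).

def shift_parameters(m):
    return [-1.0 + (i - 1) * 2.0 / (m - 1) for i in range(1, m + 1)]

us = shift_parameters(5)
assert us == [-1.0, -0.5, 0.0, 0.5, 1.0]  # endpoints at -1 and +1
assert len(us) == 5  # M thresholds -> M binary feature maps
```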
Claims 7-8 and 16-17 are rejected under 35 U.S.C. 103 as being unpatentable over the combination of He and Lin and further in view of Yan et al. (“iCELIA: A Full-Stack Framework for STT-MRAM-Based Deep Learning Acceleration”; hereinafter “Yan”).
As per Claim 7, He teaches the method of Claim 1 as well as a crossbar array circuit of the crossbar array circuitry (see He in rejection to Claim 1). However, He does not appear to explicitly teach providing the output feature map as a new input feature map for another layer of a neural network, as the machine model; generating a plurality of new binary feature maps by multi-channel binarizing pixel values of the new input feature map, wherein the multi-channel binarizing is based on a plurality of new different thresholds; and generating a new convolution result, using a new crossbar array circuit of the crossbar array circuitry or a new crossbar array circuitry, including providing pixel values of the plurality of new binary feature maps as input values of the new crossbar array circuit of the crossbar array circuitry or the new crossbar array circuitry.
Lin teaches providing the output feature map as a new input feature map for another layer of a neural network, as the machine model; generating a plurality of new binary feature maps by multi-channel binarizing pixel values of the new input feature map, wherein the multi-channel binarizing is based on a plurality of new different thresholds; and generating a new convolution result, [using a new crossbar array circuit of the crossbar array circuitry or a new crossbar array circuitry], including providing pixel values of the plurality of new binary feature maps as input values [of the new crossbar array circuit of the crossbar array circuitry or the new crossbar array circuitry] (Lin, Page 3 Section 3.1: “Consider a L-layer CNN architecture. Without loss of generality, we assume the weights of each convolutional layer are tensors of dimension (w, h, cin, cout), which represents filter width, filter height, input-channel and output-channel respectively. We propose two variations of binarization method for weights at each layer: 1) approximate weights as a whole and 2) approximate weights channel-wise.”
Examiner notes that here Lin discloses performing this binarization, which was shown above in the rejection to Claim 5 to include converting from “full precision” to binarized, then back to “full precision” for a given layer; thus the full-precision output feature map of one layer becomes the full-precision input of the next layer.
Lin continues in Section 3.1.1: “we estimate the real-value weight filter W ∈ R^(w×h×cin×cout) using the linear combination of M binary filters … we fix Bi’s as follows
[equation reproduced from Lin as an image]
where W̄ = W − mean(W), and ui is a shift parameter. For example, one can choose ui’s to be ui = −1 + (i − 1)·2/(M − 1), i = 1, 2, · · · , M, to shift evenly over the range [−std(W), std(W)], or leave it to be trained by the network.”
Here, Lin discloses that the thresholds for each binarization are adjusted by a “shift parameter” between each threshold, and this “shift parameter” is based on the statistical distribution of the weights of that current layer. Therefore, each layer will have new thresholds.
Lin, Pages 4-5 Section 3.2, discloses: “As mentioned above, a convolution can be implemented without multiplications when weights are binarized. However, to utilize the bitwise operation, the activations must be binarized as well, as they are the inputs of convolutions. Similar to the activation binarization procedure in [Zhou et al., 2016], we binarize activations after passing it through a bounded activation function h, which ensures h(x) ∈ [0, 1]. We choose the bounded rectifier as h. Formally, it can be defined as: hv(x) = clip(x + v, 0, 1), (8)
where v is a shift parameter.”
Examiner notes that here, Lin discloses performing the same process with activations (feature maps) as for weights. That is, adjusting the threshold by a shift parameter for each binarization, and the shift parameter is based on statistics of the current layer, and therefore the plurality of thresholds is new at each layer.)
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the teachings of Lin with He for at least the reasons recited in the rejection to Claim 1.
However, the combination of He and Lin does not explicitly teach generating a new convolution result, using a new crossbar array circuit of the crossbar array circuitry or a new crossbar array circuitry, including providing pixel values of the plurality of new binary feature maps as input values of the new crossbar array circuit of the crossbar array circuitry or the new crossbar array circuitry
Yan teaches generating a new convolution result, using a new crossbar array circuit of the crossbar array circuitry or a new crossbar array circuitry, including providing pixel values of the plurality of new binary feature maps as input values of the new crossbar array circuit of the crossbar array circuitry or the new crossbar array circuitry (Yan, Page 416 End of Section 7: “Before input data enter the CU pipeline, synaptic weights have already been programmed into crossbar arrays using our proposed nonuniform quantization with a largely reduced bit width (Sections 5) … The CUs are allocated to different layers in a CNN model based on the computation need of each layer and the total available CUs in the system.”)
Yan is analogous art because it is in the field of endeavor of convolutional neural network acceleration. It would have been obvious before the effective filing date of the claimed invention to combine the teachings of He and Lin with Yan. One of ordinary skill in the art would have been motivated to do so in order to optimize the use of computational resources (Yan, Page 416 End of Section 7: “The CUs are allocated to different layers in a CNN model based on the computation need of each layer and the total available CUs in the system.”)
As per Claim 8, the combination of He, Lin, and Yan teaches the method of Claim 7. Lin teaches wherein at least one of the plurality of different thresholds has a different value from each of the plurality of new different thresholds. (Lin, Pages 4-5 Section 3.2, discloses: “As mentioned above, a convolution can be implemented without multiplications when weights are binarized. However, to utilize the bitwise operation, the activations must be binarized as well, as they are the inputs of convolutions. Similar to the activation binarization procedure in [Zhou et al., 2016], we binarize activations after passing it through a bounded activation function h, which ensures h(x) ∈ [0, 1]. We choose the bounded rectifier as h. Formally, it can be defined as: h_v(x) = clip(x + v, 0, 1), (8) where v is a shift parameter.”
Examiner notes that here, Lin discloses performing the same process with activations (feature maps) as for weights. That is, adjusting the threshold by a shift parameter for each binarization, and the shift parameter is based on statistics of the current layer, and therefore the plurality of thresholds is new at each layer.)
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the teachings of Lin with He for at least the reasons recited in the rejection to Claim 1.
As per Claims 16-17, these are device claims corresponding to method Claims 7-8. The difference is that they recite an on-chip memory including a crossbar array circuitry and a processor. He, Page 535 under Figure 1: “The general overview of the system architecture for performing low bit-width CNN is shown in Fig. 1.a [18]. This architecture mainly consists of Image Bank, Kernel Bank, bit-wise In-Memory Convolution Engine (IMCE), and Digital Processing Unit (DPU).” He, Page 537 Section C, discloses: “In this subsection, we describe a magnetic crossbar architecture consisting of perpendicularly coupled magnetic domain wall motion racetracks.” Therefore, Claims 16-17 are rejected for similar reasons as Claims 7-8.
Claims 22-29 are rejected under 35 U.S.C. 103 as being unpatentable over He in view of Lin, further in view of Wu et al. (“Shift: A Zero FLOP, Zero Parameter Alternative to Spatial Convolutions”; hereinafter “Wu”)
As per Claim 22, He teaches A neuromorphic device, the device comprising: a processor configured to output a classification result by implementing a convolutional layer of a neural network with respect to an input feature map of the convolutional layer, to determine the classification result based on generated pixel values of an output feature map of the convolutional layer, wherein, for the implementation of the convolutional layer, the processor is configured to:
(He, Abstract: “Deep Convolution Neural Network (CNN) has achieved outstanding performance in image recognition over large scale dataset … In this work, we present different emerging nonvolatile Magnetic Random Access Memory (MRAM) designs that could be leveraged to implement ‘bit-wise in-memory convolution engine’, which could simultaneously store network parameters and compute low bit-width convolution.” Examiner notes that an image comprises pixels, and that a CNN comprises convolutional layers, and that “image recognition” is a form of classification.)
generate a plurality of binary feature maps by multi-channel binarizing pixel values of the input feature map (He, Page 535 Figure 1:
[Image: He, Figure 1 (media_image1.png)]
He, Page 534 End of Section II: “Thus, further integrating such multiple binarization with in memory computing technique will provide accurate deep neural network inference result with high throughput.” Also He discloses “image bank”, and images comprise pixels. He, Page 535: “Quantizer: This unit binarize the interlayer tensor or weight w.r.t Eq. (1), Eq. (2) and Eq. (3).”)
provide a first binary feature map, from among the generated plurality of binary feature maps, corresponding to a first feature window of the input feature map to a first set of synaptic circuits that are set with respect to a first kernel of the convolutional layer, and receive a respective output from the first set of synaptic circuits provided the first binary feature map (He, Figure 1 shown above, discloses “Activation Function” and “Counter-Shifter-Partial Sum” comprising XNOR operations. He, Page 535 Bottom Left: “Activ. Function: the activation function module perform the element-wise computation, which normally takes ReLU as the activation function.” Examiner notes that generating these output activations involves multiplying input values by weight values. Furthermore, more detail is provided in Page 534 Top Right: “Therefore, the computation for one convolution layer or linear layer can be described as:
[Image: He, equation from Page 534 (media_image2.png), expressing the layer output as a sum of products x_q,i · w_q,i]
where x_q and w_q are the vectorized form of quantized interlayer tensor and weight. N is the vector size of x_q. Since x_q and w_q only consist of -1 and +1, in order to map the computation of x_q,i · w_q,i into hardware, we use single bit of 0 and 1 to represent -1 and +1 respectively. For the converted form x′_q,i and w′_q,i, the computation of x_q,i · w_q,i is equivalent to XNOR.” Thus, He teaches multiplying inputs by weight values using XNOR gates.
He also explicitly discloses types of synaptic circuits on Page 537: “The device structure of the computational magnetic crossbar design is shown in Fig. 6a, which mainly constitutes of equally spaced ferromagnetic nanowires both longitudinally and latitudinally. Each nanowire could work individually as a normal domain wall racetrack memory to store the interlayer tensor and weight, where binary data are represented by the magnetization directions and stored in the form of domain wall pair train within the nanowires [23].” Examiner also points out that a “set” can comprise one element, and therefore each synaptic circuit alone can be a “different set” that is applied to a different binarization.)
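For illustration only (the bit mapping follows He’s description; the helper names are hypothetical), the equivalence between ±1 multiplication and XNOR on {0, 1} can be verified exhaustively:

```python
def xnor(a, b):
    """XNOR of single bits: 1 when the bits are equal, 0 otherwise."""
    return 1 - (a ^ b)

# Map -1 -> 0 and +1 -> 1, as in He's converted form x'_q,i and w'_q,i.
def to_bit(s):
    return (s + 1) // 2

for x in (-1, 1):
    for w in (-1, 1):
        product = x * w                        # true ±1 multiplication
        bitwise = xnor(to_bit(x), to_bit(w))   # hardware XNOR on mapped bits
        # product = +1 corresponds to bit 1, product = -1 to bit 0
        assert to_bit(product) == bitwise
```

The exhaustive check over all four ±1 input pairs confirms that a single XNOR gate reproduces the sign of each product term in the convolution sum.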
provide a second binary feature map from among the plurality of binary feature maps corresponding to the first feature window to a second set of synaptic circuits that are set with respect to a second kernel of the convolutional layer, and receive a respective output from the second set of synaptic circuits provided the second binary feature map (He, Figure 1 shown above, discloses N sub-arrays, and therefore N partial sums based on N kernels, each in its own synaptic circuit as described above.)
provide a third binary feature map from among the generated plurality of binary feature maps corresponding to the [second feature window] to a third set of synaptic circuits that are set with respect to a third kernel of the convolutional layer, and receive a respective output from the third set of synaptic circuits provided the third binary feature map (He, Figure 1 shown above, discloses N sub-arrays, and therefore N partial sums based on N kernels, each in its own synaptic circuit as described above.)
provide a fourth binary feature map from among the generated plurality of binary feature maps corresponding to the [second feature window] to a fourth set of synaptic circuits that are set with respect to a fourth kernel of the convolutional layer, and receive a respective output from the fourth set of synaptic circuits provided the fourth binary feature map (He, Figure 1 shown above, discloses N sub-arrays, and therefore N partial sums based on N kernels, each in its own synaptic circuit as described above.)
generate the output feature map of the convolutional layer through generation of the pixel values of the output feature map based on the respective outputs of the first set of synaptic circuits, the second set of synaptic circuits, the third set of synaptic circuits, and the fourth set of synaptic circuits (He, Page 535 Top Right, discloses: “Then, the produced binary result is processed in parallel by partial sum (subtract) units.” Furthermore, Fig. 1(a) shows a circled “+” icon under the partial sum output values (“P-Sum”) indicating them being merged into a feature map (“Output fmaps”). Examiner notes that the term “selectively” in the claim is not given any further detail, nor is “selectively” given any description in the Specification, and notes that in He each of the partial sums is selected and combined, and the BRI of “selectively merging” includes “selecting every element and adding all together.”)
However, He does not teach wherein the multi-channel binarizing is based on a plurality of different thresholds that are set to have respective different values, and wherein each of the plurality of different thresholds is applied to the input feature to generate a corresponding one of the plurality of binary feature maps, wherein the application of the different thresholds to the input feature map includes an application of a first threshold to values of first pixels of the input feature map to generate one binary feature map of the plurality of binary feature maps, and an application of a different second threshold to the values of the first pixels of the input feature map to generate another binary feature map of the plurality of binary feature maps; shift from the first feature window to a second feature window of the input feature map
Lin teaches wherein the multi-channel binarizing is based on a plurality of different thresholds that are set to have respective different values, and wherein each of the plurality of different thresholds is applied to the input feature to generate a corresponding one of the plurality of binary feature maps, wherein the application of the different thresholds to the input feature map includes an application of a first threshold to values of first pixels of the input feature map to generate one binary feature map of the plurality of binary feature maps, and an application of a different second threshold to the values of the first pixels of the input feature map to generate another binary feature map of the plurality of binary feature maps (Lin, Pages 4-5 Section 3.2, discloses: “As mentioned above, a convolution can be implemented without multiplications when weights are binarized. However, to utilize the bitwise operation, the activations must be binarized as well, as they are the inputs of convolutions. Similar to the activation binarization procedure in [Zhou et al., 2016], we binarize activations after passing it through a bounded activation function h, which ensures h(x) ∈ [0, 1]. We choose the bounded rectifier as h. Formally, it can be defined as: h_v(x) = clip(x + v, 0, 1), (8) where v is a shift parameter.” Lin, Page 5 above Eq 11, discloses: “Secondly, we estimate the real-value activation R using the linear combination of N binary activations.”
Here, Examiner notes that Lin discloses “N binary activations”, each with a different threshold, as each threshold is adjusted by a “shift parameter”.)
Lin is analogous art because it is in the field of endeavor of binary neural network acceleration. It would have been obvious before the effective filing date of the claimed invention to combine the teachings of He and Lin. One of ordinary skill in the art would have been motivated to do so in order to compensate for information loss from binary quantization (He, Page 534 End of Section II: “recent research efforts in [10], [16] has brought up multiple binarization method to compensate the information loss due to the aggressive quantization. Thus, further integrating such multiple binarization with in-memory computing technique will provide accurate deep neural network inference result with high throughput” and Lin, Top of Page 3: “We relied on the idea of finding the best approximation of full-precision convolution using multiple binary operations, and employing multiple binary activations to allow more information passing through.”)
However, the combination of He and Lin does not teach shift from the first feature window to a second feature window of the input feature map
Wu teaches shift from the first feature window to a second feature window of the input feature map (Wu, Page 2: “In this paper, we present the shift operation (Figure 1) as an alternative to spatial convolutions. The shift operation moves each channel of its input tensor in a different spatial direction. A shift-based module interleaves shift operations with point-wise convolutions, which further mixes spatial information across channels.” Wu, Page 3 Section 2.1: “For a shift operation with kernel size D_K, there exist D_K² possible shift matrices, each of them corresponding to a shift direction. If the channel size M is no smaller than D_K², we can construct a shift matrix that allows each output position (k, l) access to all values within a D_K × D_K window in the input. We can then apply another point-wise convolution per Eq. (3) to exchange information across channels.” Here, Wu discloses shifting a kernel or window across a feature map in such a way that “mixes spatial information across channels.”)
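For illustration only (not taken from Wu; the function name and shift directions are hypothetical), the per-channel spatial shift Wu describes may be sketched as follows:

```python
import numpy as np

def shift_channels(x, directions):
    """Shift each channel of an H x W x C tensor by its own (dy, dx)
    offset, zero-filling the positions shifted in from outside."""
    out = np.zeros_like(x)
    h, w, _ = x.shape
    for c, (dy, dx) in enumerate(directions):
        # Source region that remains in-bounds after shifting by (dy, dx)
        src = x[max(0, -dy):h - max(0, dy), max(0, -dx):w - max(0, dx), c]
        # Destination region offset by (dy, dx)
        out[max(0, dy):h - max(0, -dy), max(0, dx):w - max(0, -dx), c] = src
    return out

x = np.arange(18, dtype=float).reshape(3, 3, 2)
y = shift_channels(x, directions=[(0, 1), (1, 0)])  # ch. 0 right, ch. 1 down
```

Because each channel moves in its own direction, a subsequent point-wise convolution sees values drawn from a spatial window around each position, which is how Wu replaces spatial convolutions with zero-FLOP shifts.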
Wu is analogous art because it is in the field of endeavor of convolutional neural networks. It would have been obvious before the effective filing date of the claimed invention to combine the CNN acceleration of He and Lin with the shifting of feature windows of Wu. One of ordinary skill in the art would have been motivated to do so in order to increase efficiency in processing of the CNN (Wu, Page 2: “Unlike spatial convolutions, the shift operation itself requires zero FLOPs and zero parameters. As opposed to depth-wise convolutions, shift operations can be easily and efficiently implemented.”)
As per Claim 23, the combination of He, Lin, and Wu teaches the device of Claim 22. He teaches wherein the generating of the pixel values of the output feature map includes generating a pixel value of the output feature map by merging the respective outputs of the first set of synaptic circuits, the second set of synaptic circuits, the third set of synaptic circuits, and the fourth set of synaptic circuits (He, Page 535 Top Right, discloses: “Then, the produced binary result is processed in parallel by partial sum (subtract) units.” Furthermore, Fig. 1(a) shows a circled “+” icon under the partial sum output values (“P-Sum”) indicating them being merged into a feature map (“Output fmaps”). Examiner notes that the term “selectively” in the claim is not given any further detail, nor is “selectively” given any description in the Specification, and notes that in He each of the partial sums is selected and combined, and the BRI of “selectively merging” includes “selecting every element and adding all together.”)
As per Claim 24, the combination of He, Lin, and Wu teaches the device of Claim 22. He teaches wherein the device further comprises an on-chip memory including one or more crossbar array circuitries including the first set of synaptic circuits, the second set of synaptic circuits, the third set of synaptic circuits, and the fourth set of synaptic circuits, and wherein at least two of the first set of synaptic circuits, the second set of synaptic circuits, the third set of synaptic circuits, and the fourth set of synaptic circuits are different sets of synaptic circuits. (He also explicitly discloses types of synaptic circuits on Page 537: “The device structure of the computational magnetic crossbar design is shown in Fig. 6a, which mainly constitutes of equally spaced ferromagnetic nanowires both longitudinally and latitudinally. Each nanowire could work individually as a normal domain wall racetrack memory to store the interlayer tensor and weight, where binary data are represented by the magnetization directions and stored in the form of domain wall pair train within the nanowires [23].” Here, He discloses each synaptic circuit (“ferromagnetic nanowires”) for each binarization (“interlayer tensor and weight”). Examiner also points out that a “set” can comprise one element, and therefore each synaptic circuit alone can be a “different set” that is applied to a different binarization.)
As per Claim 25, the combination of He, Lin, and Wu teaches the device of Claim 24 as well as first and second feature window (see Wu in rejection to Claim 22). He teaches wherein, for the generation of the plurality of binary feature maps, the processor is configured to generate the first and second binary feature maps by multi-channel binarizing the first feature window of the input feature map (He, Page 533 Top Right: “The essential tricks extracted from the aforementioned works can be summarized and further optimized as (1) introducing scaling factor for both binarized interlayer tensor and weight, and (2) using binarization functions with various thresholds (e.g. vanilla binarization function taken 0 as default threshold) to avoid information loss, which will be discussed in the main body.”)
generate the third and fourth plural binary feature maps by multi-channel binarizing the second feature window of the input feature map (He, Page 533 Top Right: “The essential tricks extracted from the aforementioned works can be summarized and further optimized as (1) introducing scaling factor for both binarized interlayer tensor and weight, and (2) using binarization functions with various thresholds (e.g. vanilla binarization function taken 0 as default threshold) to avoid information loss, which will be discussed in the main body.”)
and wherein the processor is configured to: perform the provision of the first binary feature map, the provision of the second binary feature map, the provision of the third binary feature map, and the provision of the fourth binary feature map respectively by provision of pixel values of each of the first binary feature map, the second binary feature map, the third binary feature map, and the fourth binary feature map as respective input voltage values to the one or more crossbar array circuitries (He, Page 535: “As depicted in Fig. 1.a, inputs need to be constantly quantized before mapping into computational sub-arrays. However, quantized shared kernels can be utilized for different inputs. This step is performed using DPU’s Quantizer and then the results are mapped to IMCE’s sub-arrays (Fig. 1.b).” Here, He discloses the values are “mapped” to respective “sub-arrays”. He, Page 536: “The key idea to perform memory read and in-memory computing is to choose different thresholds when sensing the selected memory cell(s). As shown in Fig. 3a, for memory read operation, a single memory cell is addressed and routed in the memory read path to generate a sense voltage (Vsense), which will be compared with a reference voltage (Vref).” Here, He discloses storing binary values in the crossbar array circuitry as input voltage values (“sense voltage”)).
store weights of the first kernel in the first set of synaptic circuits, weights of the second kernel in the second set of synaptic circuits, weights of the third kernel in the third set of synaptic circuits, and weights of the fourth kernel in the fourth set of synaptic circuits (He, Page 535 under Figure 1: “The general overview of the system architecture for performing low bit-width CNN is shown in Fig. 1.a [18]. This architecture mainly consists of Image Bank, Kernel Bank, bit-wise In-Memory Convolution Engine (IMCE), and Digital Processing Unit (DPU). Since linear layer can be visualized as convolution layer with 1×1 kernel size, the computation of linear layer could be implemented by the same convolution accelerator as well. Assuming the Input feature maps (I) and Kernels (W) are initially stored in the Image Banks and Kernel Banks of memory, respectively.” Here, He discloses that the weight values (“kernels”) are stored in respective circuits (“Kernels (W) … Kernel Banks of memory, respectively.”))
obtain output values from the one or more crossbar array circuitries resulting from implemented multiplications respectively between the pixel values of each of the first binary feature map and the stored weights of the first kernel in the first set of synaptic circuits, the second binary feature map and the stored weights of the second kernel in the second set of synaptic circuits, the third binary feature map and the stored weights of the third kernel in the third set of synaptic circuits, and fourth binary feature map and the stored weights of the fourth kernel in the fourth set of synaptic circuits (He, Page 535 under Figure 1: “The general overview of the system architecture for performing low bit-width CNN is shown in Fig. 1.a [18]. This architecture mainly consists of Image Bank, Kernel Bank, bit-wise In-Memory Convolution Engine (IMCE), and Digital Processing Unit (DPU). Since linear layer can be visualized as convolution layer with 1×1 kernel size, the computation of linear layer could be implemented by the same convolution accelerator as well. Assuming the Input feature maps (I) and Kernels (W) are initially stored in the Image Banks and Kernel Banks of memory, respectively.” Here, He discloses that the weight values (“kernels”) are stored in respective circuits (“Kernels (W) … Kernel Banks of memory, respectively.”))
and generate the pixel values of the output feature map by selectively merging the obtained output values (He, Page 535 Top Right, discloses: “Then, the produced binary result is processed in parallel by partial sum (subtract) units.” Furthermore, Fig. 1(a) shows a circled “+” icon under the partial sum output values (“P-Sum”) indicating them being merged into a feature map (“Output fmaps”). Examiner notes that the term “selectively” in the claim is not given any further detail, nor is “selectively” given any description in the Specification, and notes that in He each of the partial sums is selected and combined, and the BRI of “selectively merging” includes “selecting every element and adding all together.”)
As per Claim 26, the combination of He, Lin, and Wu teaches the device of Claim 22. He teaches wherein the processor is further configured to obtain the first kernel corresponding to the first binary feature map, obtain the second kernel corresponding to the second binary feature map, obtain the third kernel corresponding to the third binary feature map, and obtain the fourth kernel corresponding to the fourth binary feature map (He, Figure 1, discloses “banks”, of input feature maps and corresponding kernels, and describes on Page 535: “Assuming the Input feature maps (I) and Kernels (W) are initially stored in the Image Banks and Kernel Banks of memory, respectively.” He, Figure 1, shows corresponding arrays of l1 and w1, and l2 and w2. Thus, each feature map corresponds to a kernel.)
As per Claim 27, the combination of He, Lin, and Wu teaches the device of Claim 22 as well as first and second feature window (see Wu in rejection to Claim 22). He teaches wherein the input feature map is a two-dimensional (2D) feature map, the first feature window of the input feature map is a 2D window of the input feature map, and wherein the second feature window of the input feature map is another 2D window of the input feature map. (He, Figure 1, discloses “Image Bank”, and one of ordinary skill in the art will appreciate that an “image” is a set of two-dimensional pixels, and each feature map is a two-dimensional window representing the pixels after a convolution. After each convolution, the two-dimensional pixel values change, and are therefore a new feature window of the two-dimensional feature map.)
As per Claim 28, the combination of He, Lin, and Wu teaches the device of Claim 22. He teaches wherein the first kernel, the second kernel, the third kernel, and the fourth kernel are respective channels of a same kernel. (He, Page 533 Top Right: “The essential tricks extracted from the aforementioned works can be summarized and further optimized as (1) introducing scaling factor for both binarized interlayer tensor and weight, and (2) using binarization functions with various thresholds (e.g. vanilla binarization function taken 0 as default threshold) to avoid information loss, which will be discussed in the main body.” Examiner notes that the binarization is performed multiple times therefore producing multiple kernel channels of the same kernel. Examiner notes that the multiple binarization is performed on both activations and weights (aka “kernels”) as shown on Page 535: “Quantizer: This unit binarize the interlayer tensor or weight.”)
As per Claim 29, the combination of He, Lin and Wu teaches the device of Claim 22. He teaches wherein the first kernel, the second kernel, the third kernel, and the fourth kernel are each binary feature maps differently representing respective same kernel elements of a channel of a same kernel. (He, Page 533 Top Right: “The essential tricks extracted from the aforementioned works can be summarized and further optimized as (1) introducing scaling factor for both binarized interlayer tensor and weight, and (2) using binarization functions with various thresholds (e.g. vanilla binarization function taken 0 as default threshold) to avoid information loss, which will be discussed in the main body.” Examiner notes that the binarization is performed multiple times therefore producing multiple channels of the same feature map. Examiner notes that the multiple binarization is performed on both activations (aka “binary feature maps”) and weights (aka “kernels”) as shown on Page 535: “Quantizer: This unit binarize the interlayer tensor or weight.”)
Conclusion
Any inquiry concerning this communication or earlier communications from the examiner should be directed to LEONARD A SIEGER whose telephone number is (571)272-9710. The examiner can normally be reached M-F 8:00 am - 5:00 pm.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, David Yi can be reached at (571) 270-7519. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.
/LEONARD A SIEGER/Examiner, Art Unit 2126