DETAILED ACTION
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Information Disclosure Statement
The information disclosure statement (IDS) was submitted on 06/13/2023. The submission is in compliance with the provisions of 37 CFR § 1.97. Accordingly, the information disclosure statement is being considered by the examiner.
Claim Rejections - 35 USC § 101
35 U.S.C. 101 reads as follows:
Whoever invents or discovers any new and useful process, machine, manufacture, or composition of matter, or any new and useful improvement thereof, may obtain a patent therefor, subject to the conditions and requirements of this title.
Claim 20 is rejected under 35 U.S.C. § 101 as directed to non-statutory subject matter.
Claim 20 is rejected under 35 U.S.C. 101 as not falling within one of the four statutory categories of invention because the broadest reasonable interpretation of the instant claim, in light of the specification, encompasses transitory signals. Transitory signals, however, do not fall within any of the four statutory categories (i.e., process, machine, manufacture, or composition of matter).
However, a claim directed to a non-transitory computer-readable medium may qualify as a manufacture, which is patent-eligible subject matter. Therefore, amending the claim to recite a “non-transitory computer-readable medium” would resolve this issue.
Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102 of this title, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.
Claims 1-8 and 12-20 are rejected under 35 U.S.C. 103 as being unpatentable over Seok et al. (US 20240169201, hereinafter Seok) in view of Sather et al. (US 12136039 B1, hereinafter Sather).
Regarding Claim 1, Seok discloses a method of controlling a precision of a neural processing apparatus comprising two in-memory compute (IMC) units ([0009], microcontroller unit that includes an accelerator, a data memory (DMEM), and a direct memory access (DMA) module configured to transfer weight data of a layer of a machine learning model from the DMEM to the IMC macro cluster. The accelerator includes an in-memory computing (IMC) macro cluster configured to accelerate at least one layer of a machine-learning model), wherein the IMC units include a first IMC unit and a second IMC unit, each designed to perform vector-matrix multiplication (VMM) operations, to produce analog output signals ([0011], the accelerator includes a first stage, a second stage, and a third stage such that the first stage prepares an input vector and feeds it to the second stage. The first stage includes a scratchpad which stores deep neural network (DNN) input/output data and an input ping-pong buffer that can fetch specific parts of data from the scratchpad based on the DNN layer parameters and send the data into the next stage. The second stage performs a vector-matrix multiplication (VMM) using the IMC macro cluster), the method comprising:
training an artificial neural network (ANN) model to learn its parameters in accordance with a dual objective, wherein the ANN model comprises two neural layers including a first neural layer and a second neural layer, the parameters including synaptic weight values ([0011], store deep neural network (DNN) input/output data and an input ping-pong buffer that can fetch specific parts of data from the scratchpad based on the DNN layer parameters and send the data into the next stage; [0018], producing a TensorFlow (TF) file by training a deep neural network (DNN) model … fusing a batch norm layer of the DNN model into a convolution layer); and
storing the synaptic weight values of the parameters learned in the two IMC units to respectively map the first neural layer and the second neural layer onto the first IMC unit and the second IMC unit ([0054], an in-memory computing (IMC) macro cluster to accelerate at least one layer of a machine-learning model that supports the computation flow in a fully pipelined manner, such as having a first stage and second stage with an IMC macro cluster, an adder tree, a latch, and a weight buffer performing VMM operations; [0055]-[0056], stores the DNN input/output data and an input ping-pong buffer that fetches specific parts of data from the scratchpad based on the DNN layer parameters, and the second stage performs VMM using the 4×4 IMC macro cluster. After performing multiplication with the inverted weights, the results of one column are fed into a compressor to produce the compressed results), wherein:
the second IMC unit is designed to perform VMM operations based on analog input signals generated from activation values produced by the first neural layer ([0005], employ analog-mixed-signal (AMS) IC macros, which use capacitors and resistors for computation and ADCs for analog-to-digital conversion; [0086], Tiny machine learning (TinyML) allows performing a deep neural network (DNN)-based inference on an edge device, which makes it paramount to create a neural microcontroller unit (MCU) and employ analog-mixed-signal (AMS) versions, exhibiting limited robustness over process, voltage, and temperature (PVT) variations), and
Seok does not explicitly disclose that the dual objective includes a primary optimization objective for training the ANN model and an auxiliary objective enforcing a target distribution property on activation values produced by the first neural layer.
Sather teaches that the dual objective includes a primary optimization objective for training the ANN model and an auxiliary objective enforcing a target distribution property on activation values produced by the first neural layer (Col. 3, ll. 24-38 (11): use a loss function that constrains the weight values to only the allowed quantized values, while accounting for a loss in accuracy of the network's output when using the quantized values instead of the optimized floating-point values, as well as the inter-relationship between corresponding weights of replica layers. The loss function includes (i) a first term that measures the difference between the actual output of the network and the expected output of the network, given a training input data set (i.e., a standard loss term) and (ii) a second term that constrains the weights to the sets of allowed values, accounting for the increase in loss when quantizing any individual weight as well as the relationship between corresponding weights of replica layers).
Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to incorporate the teachings of a primary optimization objective for training the ANN model and an auxiliary objective enforcing a target distribution, as taught by Sather (Col. 3, ll. 23-41), into the neural processing system of Seok in order to enable training the parameters of a machine-trained (MT) network by using multiple copies of certain layers of the network to increase the number of possible values for parameters of the layer (Sather, Col. 73, ll. 56-66).
Regarding Claim 2, Seok in view of Sather discloses the method according to claim 1. Sather discloses wherein enforcing the target distribution property causes an increase in the information entropy of the distribution of activation values across a range of values spanned by the activation values (Col. 17, ll. 43-53 (84): estimate the loss matrix terms separately from the weight training data by propagating a set of training inputs through the network (including the replica weights) and sampling from a predicted output probability distribution in place of ground truth outputs to compare to the actual output distribution (i.e., to compute the loss function), and back-propagation (e.g., the same back-propagation algorithm as used for actual network training) to determine the gradients of all of the weights (including the replica weights)). The same rationale and motivation for obviousness as applied to claim 1 above apply here.
Regarding Claim 3, Seok in view of Sather discloses the method according to claim 1. Sather discloses wherein the auxiliary objective enforces said target distribution property by enforcing a target distribution on the set of activation values (Col. 3, ll. 24-34 (11): use a loss function that constrains the weight values to only the allowed quantized values, while accounting for a loss in accuracy of the network's output when using the quantized values instead of the optimized floating-point values, as well as the inter-relationship between corresponding weights of replica layers. In some embodiments, the loss function includes (i) a first term that measures the difference between the actual output of the network and the expected output of the network, given a training input data set (i.e., a standard loss term) and (ii) a second term that constrains the weights to the sets of allowed values). The same rationale and motivation for obviousness as applied to claim 1 above apply here.
Regarding Claim 4, Seok in view of Sather discloses the method according to claim 3. Sather discloses wherein training the ANN model comprises minimizing, as per the auxiliary objective, a distance between the distribution of the activation values and the target distribution to an extent permitted by the primary optimization objective for training the ANN model, in accordance with said dual objective (Col. 5, ll. 28-36 (19): for typical (non-replica) layers, batch normalization training computes statistics (e.g., the mean and variance) for pre-activations (i.e., the dot product of weight values and inputs for a computation node, before the activation function is applied), and uses these statistics to apply shift and scale values to the pre-activations (e.g., shifting based on the mean and scaling based on the variance)). The same rationale and motivation for obviousness as applied to claim 1 above apply here.
Regarding Claim 5, Seok in view of Sather discloses the method according to claim 4. Sather discloses wherein training the ANN model comprises optimizing the ANN model against a loss function capturing the dual objective, wherein the loss function can be decomposed as a sum of two contributions, including: a first contribution reflecting the primary optimization objective; and a second contribution reflecting the auxiliary objective, the second contribution depending on the distance between the distribution of the set of activation values and the target distribution (Col. 3, ll. 24-34 (11): use a loss function that constrains the weight values to only the allowed quantized values, while accounting for a loss in accuracy of the network's output when using the quantized values instead of the optimized floating-point values, as well as the inter-relationship between corresponding weights of replica layers. In some embodiments, the loss function includes (i) a first term that measures the difference between the actual output of the network and the expected output of the network, given a training input data set (i.e., a standard loss term) and (ii) a second term that constrains the weights to the sets of allowed values). The same rationale and motivation for obviousness as applied to claim 1 above apply here.
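For purposes of illustration only, the claimed decomposition of the loss into a primary contribution and an auxiliary distribution-matching contribution can be sketched as follows. This sketch is the examiner's own, not the applicant's or Sather's implementation; every name in it (e.g., `dual_objective_loss`, `aux_weight`) is hypothetical, mean squared error stands in for the primary objective, and a simple squared-error distance stands in for whatever distance measure the claims intend.

```python
import numpy as np

def dual_objective_loss(predictions, targets, activations, target_dist,
                        aux_weight=0.1):
    """Hypothetical dual-objective loss: a primary task term plus an
    auxiliary term penalizing the distance between the empirical
    distribution of activation values and a target distribution."""
    # First contribution: the primary optimization objective
    # (mean squared error is used here purely as a stand-in).
    primary = np.mean((predictions - targets) ** 2)

    # Second contribution: distance between a histogram of the activation
    # values and the target distribution, over evenly spaced bins.
    hist, _ = np.histogram(activations, bins=len(target_dist))
    empirical = hist / hist.sum()
    auxiliary = np.sum((empirical - target_dist) ** 2)

    return primary + aux_weight * auxiliary
```

Minimizing the second contribution only to the extent permitted by the first is captured by the weighting factor: a small `aux_weight` keeps the primary objective dominant.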
Regarding Claim 6, Seok in view of Sather discloses the method according to claim 5. Sather discloses wherein, at training the ANN model, the second contribution is implemented by a regularizer added to the loss function (Col. 16, ll. 39-45 (79): the loss function is back-propagated through the network to adjust the weight values in order to minimize the error, and this is iteratively repeated over multiple epochs using stochastic gradient descent (SGD)). The same rationale and motivation for obviousness as applied to claim 1 above apply here.
Regarding Claim 7, Seok in view of Sather discloses the method according to claim 5. Sather discloses wherein the ANN model is iteratively trained, and wherein training the model comprises: approximating, after a forward pass, a current distribution of said set of activation values by a kernel density function, whereby said distance is measured as a distance between the kernel density function and the target distribution; and updating, during a backward pass, said parameters with a view to decreasing said loss function, this causing to decrease said distance (Col. 11, ll. 8-25 (54): the training process iteratively selects different input value sets with known output value sets. For each selected input value set, the training process typically (1) forward propagates the input value set through the network's nodes to produce a computed output value set and then (2) backpropagates a gradient (rate of change) of a loss function (output error) that quantifies in a particular way the difference between the input set's known output value set and the input set's computed output value set, in order to adjust the network's configurable parameters (e.g., the weight values). To quantize the network, some embodiments replicate certain layers of the network while assigning different weight scales to those layers, use the alternating direction method of multipliers (ADMM) to train the quantized weight values (which includes performing forward and backward propagation), and ensure that at least a threshold percentage of the weight values are set to zero). The same rationale and motivation for obviousness as applied to claim 1 above apply here.
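To illustrate why the distance recited in claims 7 and 8 admits gradients for backpropagation, the following sketch (examiner's own, under assumed details; the Gaussian kernel, bandwidth, grid, and function names are hypothetical, not the applicant's implementation) evaluates a kernel density estimate on a fixed grid of bins. The estimate is a smooth function of each activation value, so a bin-wise divergence to the target distribution is differentiable with respect to the network parameters.

```python
import numpy as np

def kde_on_bins(activations, bin_centers, bandwidth=0.05):
    """Gaussian kernel density estimate of the activation distribution,
    evaluated on a fixed grid of bin centers and normalized over the bins.
    Smooth (hence differentiable) in each activation value."""
    # Pairwise kernel evaluations: shape (n_bins, n_activations).
    diffs = bin_centers[:, None] - activations[None, :]
    kernels = np.exp(-0.5 * (diffs / bandwidth) ** 2)
    density = kernels.mean(axis=1)
    return density / density.sum()

def kl_divergence(p, q, eps=1e-12):
    """Kullback-Leibler divergence D_KL(p || q) summed over the bins."""
    return float(np.sum(p * np.log((p + eps) / (q + eps))))
```

The divergence is zero for identical distributions and grows as the estimated density departs from the target, so decreasing the loss during the backward pass decreases said distance.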
Regarding Claim 8, Seok in view of Sather discloses the method according to claim 7. Sather discloses wherein: the kernel density function is differentiable with respect to variables corresponding to said activation values, whereby the distance between the kernel density function and the target distribution is differentiable with respect to said parameters, and the ANN model is trained in accordance with a backpropagation algorithm using partial derivatives of the loss function with respect to said parameters (Col. 12, ll. 60-67 (62): each layer has a set of three {0, αk, −αk} associated possible weight values. These αk may be determined by randomly sampling a distribution (e.g., a Gaussian distribution), randomly assigning each weight value in the layer an initial value within a range and then selecting one of these as the associated αk for the layer, training the network with floating-point values and then assigning the αk values for each layer based on these trained floating-point values, or with other techniques). The same rationale and motivation for obviousness as applied to claim 1 above apply here.
Regarding Claim 12, Seok in view of Sather discloses the method according to claim 1. Seok discloses further comprising: executing the ANN model for inference purposes by: performing VMM operations involving the two IMC units to obtain respective sets of output analog signals ([0054], an in-memory computing (IMC) macro cluster to accelerate at least one layer of a machine-learning model that supports the computation flow in a fully pipelined manner, such as having a first stage and second stage with an IMC macro cluster, an adder tree, a latch, and a weight buffer performing VMM operations), converting the output analog signals into digital output values, and processing such digital output values to obtain activation values ([0005], employ analog-mixed-signal (AMS) IC macros, which use capacitors and resistors for computation and ADCs for analog-to-digital conversion).
Regarding Claim 13, Seok in view of Sather discloses the method according to claim 12. Seok discloses wherein the target distribution property is devised to cause an increase in the information entropy of said distribution of activation values across a range of values spanned by the activation values upon enforcing the target distribution property, so as to increase an information entropy of a distribution of analog output signals produced by the second IMC unit upon performing said VMM operations.
Regarding Claim 14, Seok in view of Sather discloses the method according to claim 13. Seok discloses wherein the target distribution property is devised to increase a signal-to-noise ratio of said analog output signals ([0005], employ analog-mixed-signal (AMS) IC macros, which use capacitors and resistors for computation and ADCs for analog-to-digital conversion; [0086], Tiny machine learning (TinyML) allows performing a deep neural network (DNN)-based inference on an edge device, which makes it paramount to create a neural microcontroller unit (MCU) and employ analog-mixed-signal (AMS) versions, exhibiting limited robustness over process, voltage, and temperature (PVT) variations).
Regarding Claim 15, Seok in view of Sather discloses the method according to claim 13. Sather discloses wherein the target distribution property is a uniform distribution (Col. 12, ll. 60-67 (62): each layer has a set of three {0, αk, −αk} associated possible weight values. These αk may be determined by randomly sampling a distribution (e.g., a Gaussian distribution), randomly assigning each weight value in the layer an initial value within a range and then selecting one of these as the associated αk for the layer, training the network with floating-point values and then assigning the αk values for each layer based on these trained floating-point values, or with other techniques). The same rationale and motivation for obviousness as applied to claim 1 above apply here.
Regarding Claim 16, Seok in view of Sather discloses the method according to claim 1. Seok discloses wherein: the neural processing apparatus comprises L IMC units, L>2, where the IMC units are cascaded, the ANN model comprises L neural layers, the synaptic weight values of the parameters learned are stored in the L IMC units ([0054], an in-memory computing (IMC) macro cluster to accelerate at least one layer of a machine-learning model that supports the computation flow in a fully pipelined manner, such as having a first stage and second stage with an IMC macro cluster, an adder tree, a latch, and a weight buffer), so as to effectively map the L neural layers onto the L IMC units, the auxiliary objective enforces target distribution properties on L−1 sets of activation values respectively produced by the first L−1 layers, so as for the last L−1 IMC units of the L IMC units to perform VMM operations based on L−1 sets of analog input signals generated from the L−1 sets of activation values obtained from the first L−1 neural layers, respectively ([0011], the accelerator includes a first stage, a second stage, and a third stage such that the first stage prepares an input vector and feeds it to the second stage. The first stage includes a scratchpad which stores deep neural network (DNN) input/output data and an input ping-pong buffer that can fetch specific parts of data from the scratchpad based on the DNN layer parameters and send the data into the next stage. The second stage performs a vector-matrix multiplication (VMM) using the IMC macro cluster).
Regarding Claims 17-19, system claims 17-19 recite limitations corresponding to those of method claims 1, 13, and 16, and are rejected for the same reasons as set forth above, which are incorporated herein.
Regarding Claim 20, computer program claim 20 recites limitations corresponding to those of method claim 1 and is rejected for the same reasons as set forth above, which are incorporated herein.
Claims 9-11 are rejected under 35 U.S.C. 103 as being unpatentable over Seok et al. (US 20240169201, hereinafter Seok) in view of Sather et al. (US 12136039 B1, hereinafter Sather) and Hautyunyan et al. (US 20230229675, hereinafter Hautyunyan).
Regarding Claim 9, Seok in view of Sather discloses the method according to claim 7, but does not explicitly disclose wherein: training the ANN model further comprises generating evenly spaced bins spanning said range of values, with a view to measuring said distance thanks to a distance function, and the distance is evaluated as a sum of values taken by the distance function over said bins.
Hautyunyan teaches that training the ANN model further comprises generating evenly spaced bins spanning said range of values, with a view to measuring said distance thanks to a distance function, and the distance is evaluated as a sum of values taken by the distance function over said bins ([0149], FIG. 37B, the Kullback-Leibler divergence, D_KL, defined by expressions 3740, provides a numerical measure of the difference between two probability distributions; [0150], the Jensen-Shannon divergence, JSD, is a symmetrical divergence related to the Kullback-Leibler divergence. Like the Kullback-Leibler divergence, the JSD has a value of 0 for two identical compared probability distributions, and the value of the JSD measure increases as the two compared probability distributions more greatly differ from one another).
Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to incorporate the teachings of measuring said distance, as taught by Hautyunyan ([0149]), into the neural processing system of Seok in order to provide secure, efficient, and cost-effective data-collection and data-analysis tools and subsystems to support monitoring, management, and administration of computing facilities, including cloud-computing facilities and other large-scale computer systems (Hautyunyan, [0002]).
Regarding Claim 10, Seok in view of Sather and Hautyunyan discloses the method according to claim 9. Hautyunyan discloses wherein said distance is measured as a Kullback-Leibler divergence between the kernel density function and the target distribution ([0149], FIG. 37B, the Kullback-Leibler divergence, D_KL, defined by expressions 3740, provides a numerical measure of the difference between two probability distributions). The same rationale and motivation for obviousness as applied to claim 9 above apply here.
Regarding Claim 11, Seok in view of Sather and Hautyunyan discloses the method according to claim 9. Hautyunyan discloses wherein said distance is measured as an information entropy of the kernel density function, the information entropy computed as an expected value of a logarithm of the kernel density function computed over said bins ([0149], “relative entropy” of the first of the two probability distributions with respect to the second of the two probability distributions). The same rationale and motivation for obviousness as applied to claim 9 above apply here.
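As an illustrative computation of the entropy measure recited in claim 11 (examiner's sketch only; the function name is hypothetical and not drawn from the applicant or the references): the information entropy of a binned, normalized density is the expected value of the negative logarithm of that density, and it is maximal for a uniform distribution.

```python
import numpy as np

def entropy_over_bins(density, eps=1e-12):
    """Information entropy of a normalized density over evenly spaced
    bins, computed as the expected value of -log(density)."""
    return float(-np.sum(density * np.log(density + eps)))
```

For n bins, a uniform density attains the maximum value log(n), while a peaked density yields a smaller value, which is consistent with enforcing a uniform target distribution increasing entropy as discussed for claims 13 and 15 above.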
Conclusion
Any inquiry concerning this communication or earlier communications from the examiner should be directed to Samuel D Fereja whose telephone number is (469)295-9243. The examiner can normally be reached 8AM-5PM.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, DAVID CZEKAJ can be reached at (571) 272-7327. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.
/SAMUEL D FEREJA/Primary Examiner, Art Unit 2487