DETAILED ACTION
Response to Amendment
The amendment filed on 17 February 2026 has been entered.
Claims 1-10 are pending.
Claims 1, 9-10 are amended.
Response to Arguments
Applicant's arguments filed on 17 February 2026 have been fully considered, but they are not persuasive. Applicant's remarks regarding the rejections of the claims under 35 U.S.C. 103 are summarized and addressed as follows.
Applicant respectfully submits that the cited references, taken individually or in combination, fail to disclose or suggest the features of amended independent Claim 1, particularly with respect to the features of:
wherein the inference controller determines whether the binarized input data read immediately before corresponds to a last element, and, when the binarized input data read immediately before corresponds to the last element, the inference controller notifies the learning processor that the last element has been read,
wherein the coefficient updating circuit updates the coefficient stored in the memory before inference processing of next binarized input data is started, based on output data generated by inference processing of immediately preceding binarized input data,
wherein, when the next binarized input data is supplied, the inference processor performs inference processing for the next binarized input data in parallel with the coefficient updating circuit updating the coefficient based on the output data generated by the inference processing of the immediately preceding binarized input data, and
wherein the inference processor and the learning processor are configured to perform inference processing and coefficient updating in parallel.
Based on the foregoing, the cited references, alone or in combination, fail to disclose or suggest each and every feature of amended independent Claim 1, which is believed to be in condition for allowance. Amended independent Claims 9 and 10, although differing in scope, recite subject matter similar to that discussed above with respect to amended independent Claim 1. The dependent claims depend from their respective base claims and add further limitations thereto.
Examiner notes that Applicant's arguments, as outlined above, are directed to newly amended claim limitations for which Examiner has not yet made a prima facie case, rendering Applicant's arguments moot.
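For context only, the overlap recited in the disputed limitations — updating coefficients from the output of the immediately preceding binarized input while inference on the next binarized input proceeds — may be pictured by the following minimal Python sketch. It is not drawn from Applicant's disclosure or any cited reference; the names run_pipeline, infer, and update_coefficients are hypothetical, and the synchronization shown is merely one conventional arrangement.

    from concurrent.futures import ThreadPoolExecutor

    def run_pipeline(inputs, infer, update_coefficients):
        # Hypothetical sketch: the update for output k runs in parallel
        # with inference on input k+1 and completes before inference on
        # input k+2 begins, i.e., before the next inference that would
        # depend on the updated coefficients.
        prev_output = None
        pending = None
        with ThreadPoolExecutor(max_workers=1) as learner:
            for x in inputs:
                if prev_output is not None:
                    # Launch the update for the immediately preceding
                    # output; it runs concurrently with infer(x) below.
                    pending = learner.submit(update_coefficients, prev_output)
                prev_output = infer(x)
                if pending is not None:
                    pending.result()  # update done before the next input
        return prev_output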
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.
The factual inquiries for establishing a background for determining obviousness under 35 U.S.C. 103 are summarized as follows:
1. Determining the scope and contents of the prior art.
2. Ascertaining the differences between the prior art and the claims at issue.
3. Resolving the level of ordinary skill in the pertinent art.
4. Considering objective evidence present in the application indicating obviousness or nonobviousness.
This application currently names joint inventors. In considering patentability of the claims the examiner presumes that the subject matter of the various claims was commonly owned as of the effective filing date of the claimed invention(s) absent any evidence to the contrary. Applicant is advised of the obligation under 37 CFR 1.56 to point out the inventor and effective filing dates of each claim that was not commonly owned as of the effective filing date of the later invention in order for the examiner to consider the applicability of 35 U.S.C. 102(b)(2)(C) for any potential 35 U.S.C. 102(a)(2) prior art against the later invention.
Claims 1-2, 4-5, and 9-10 are rejected under 35 U.S.C. 103 as being unpatentable over Nagy et al. (U.S. Pre-Grant Publication No. 2021/0173787, hereinafter 'Nagy'), in view of Chou et al. (U.S. Pre-Grant Publication No. 2020/0210759, hereinafter 'Chou') and Marukame et al. (U.S. Pre-Grant Publication No. 2021/0081771, hereinafter 'Marukame').
Regarding claim 1, Nagy teaches A data processing device comprising: an inference processor; and a learning processor, wherein the inference processor comprises ([0002] Even though the tasks or operations may be performed by a general-purpose processing unit, such as a CPU or cores of a processing device, dedicated hardware for accelerating such tasks or operations has been variously proposed. This dedicated hardware is typically referred to as an accelerator.; [0003] For example, an inference processor; and a learning processor various types of math co-processors have been proposed to accelerate mathematical operations, such as operations on floating point numbers and the like. As another example of an accelerator, video or graphics accelerators have been proposed to accelerate processing and rendering of graphics or video objects. Accelerators typically include dedicated hardware specialized on a required task or operation and respective data formats.):
wherein the coefficient address information includes one or more coefficient addresses arranged to be read in a predetermined order for performing product-sum operations ([0021] In a preferred embodiment, at least one (or each) processing core further includes a coefficient buffer, wherein the processing control unit is configured to control the plurality of processing elements to process input data stored in the data buffer using coefficient data stored in the coefficient buffer. The coefficient buffer may be controlled by the coefficient control unit and may store the retrieved coefficient data. The coefficient address information includes one or more coefficient addresses arranged to be read in a predetermined order for performing product-sum operations processing control unit may drive retrieval of the coefficient data and the input data from the coefficient buffer and data buffer, respectively, such that the coefficient data and the input data may be simultaneously read.; [0025] In at least one embodiment, at least one processing element of the processing core is configured to perform a multiply-accumulate operation. The multiply-accumulate operation may be implemented using a multiply-accumulate circuit (MAC).);
wherein the learning processor comprises: an output distribution calculate circuit analyzing the output data and calculating a correction value of the coefficient based on the analysis result ([0097] A output distribution calculate circuit analyzing the output data and calculating a correction value of the coefficient based on the analysis result determination and retrieval of input data required for performing the convolution operation on a respective processing element 418 may be achieved by adjusting the input window to sample on the same input data from the data buffer 431 while taking into consideration different coefficient values from coefficient buffer 429.),
wherein the analysis is performed based on an inference output distribution; a coefficient updating circuit updating the coefficient stored in the memory with the correction value of the coefficient calculated by the output distribution calculate circuit, wherein the coefficient updating circuit modifies stored coefficients in real time based on inference results ([0027] To resolve this issue, the result store may buffer multiple results of operations of the processing core and the processing core may analysis is performed based on an inference output distribution iteratively calculate different results for different input data using the same or different coefficients and/or parameters. coefficient updating circuit updating the coefficient stored in the memory with the correction value of the coefficient calculated by the output distribution calculate circuit Respective partial results are stored in the result store. After a number of clock cycles defining the latency of the MAC unit has lapsed, the processing core may return to processing of the initial input data to calculate a next partial result based on next or the same coefficients and/or parameters on the initial data that require the partial results stored in the result store. The result is coefficient updating circuit modifies stored coefficients in real time based on inference results accumulated with the required partial results from the result store.); and
wherein the inference controller determines whether the binarized input data read immediately before corresponds to a last element, and, when the binarized input data read immediately before corresponds to the last element, the inference controller notifies the learning processor that the last element has been read ([0116] The apparatus 401 may sequentially read input tiles at the same tile origin 601 across the input planes of the input plane tensor to execute the convolution operation, wherein the resulting output tiles may correspond to output tile across the one or more output planes of the output plane tensor having the same origin. Hence, the apparatus 401 may process the input plane tensor by proceeding over tile origins 601 and executing sequential calculations from each tile origin 601, across the one or more input planes.; [0159] wherein the inference controller determines whether the binarized input data read immediately before corresponds to a last element If, during iteration of one or more of the embodiments described above, the when the binarized input data read immediately before corresponds to the last element current input tile is of the last input plane and the current weight coefficient is at the last position of the filter kernel both horizontally and vertically, then the current results generated by sum 437 in each processing element 418 may be output to subsequent processing components of the convolution core 415, such as to the results scaling and quantization unit, the serializer unit, the results reorder unit, and/or the activation unit 443, and any combination thereof in any order. The the inference controller notifies the learning processor that the last element has been read convolution core 415 may finish a command after iterating over the last input plane tile. Thereafter, a next command may be scheduled and processed by the convolution core 415.),
wherein the coefficient updating circuit updates the coefficient stored in the memory before inference processing of next binarized input data is started, based on output data generated by inference processing of immediately preceding binarized input data ([0109] 1. The wherein the coefficient updating circuit updates the coefficient stored in the memory before inference processing of next binarized input data is started coefficient control unit 427 may continuously monitor a filling level or available space in the coefficient buffer 429 and may read coefficients of a next input plane via the interface if there is enough free space available in the coefficient buffer 429. When the coefficients are read, the coefficient control unit 427 may send a completion flag to the input control unit 417.; [0116] The apparatus 401 may based on output data generated by inference processing of immediately preceding binarized input data sequentially read input tiles at the same tile origin 601 across the input planes of the input plane tensor to execute the convolution operation, wherein the resulting output tiles may correspond to output tile across the one or more output planes of the output plane tensor having the same origin. Hence, the apparatus 401 may process the input plane tensor by proceeding over tile origins 601 and executing sequential calculations from each tile origin 601, across the one or more input planes.; [0155] 3. Step to the next output tile stored in result store 439 (out of R). If the maximum number of output tiles is reached, then step back to the first output tile. The processing may continue with item 3 and/or may initiate execution of at least one step of item 4 in parallel.),
wherein, when the next binarized input data is supplied, the inference processor performs inference processing for the next binarized input data in parallel with the coefficient updating circuit updating the coefficient based on the output data generated by the inference processing of the immediately preceding binarized input data ([0021] In a preferred embodiment, at least one (or each) processing core further wherein, when the next binarized input data is supplied, the inference processor performs inference processing for the next binarized input data with the coefficient updating circuit updating the coefficient includes a coefficient buffer, wherein the processing control unit is configured to control the plurality of processing elements to process input data stored in the data buffer using coefficient data stored in the coefficient buffer. The coefficient buffer may be controlled by the coefficient control unit and may store the retrieved coefficient data. The processing control unit may drive retrieval of the coefficient data and the input data from the coefficient buffer and data buffer, respectively, such that the coefficient data and the input data may be in parallel simultaneously read. Subsequently, the processing control unit may initiate processing of the (convolution or any other) operation by the plurality of processing elements, which may each calculate a fraction of the operation using respective input data and coefficient data.; [0116] The apparatus 401 may based on the output data generated by the inference processing of the immediately preceding binarized input data sequentially read input tiles at the same tile origin 601 across the input planes of the input plane tensor to execute the convolution operation, wherein the resulting output tiles may correspond to output tile across the one or more output planes of the output plane tensor having the same origin. Hence, the apparatus 401 may process the input plane tensor by proceeding over tile origins 601 and executing sequential calculations from each tile origin 601, across the one or more input planes; [0126] As the input control unit 417 reads the input tiles IT(c), the coefficient control unit 427 may read bias values B(n) for the respective output planes to be calculated.; [0128] Hence, with a single weight coefficient per output plane, the convolution core 415 may calculate partial results for a DW×DH area on M output tiles OT(n) simultaneously by element-wise multiplication of the single weight coefficient and the input data of the DW×DH area in M processing elements 418.; [0129] Referring back to FIG. 4, the convolution control unit 433 of the convolution core 415 may be configured to drive the coefficient buffer 429 and the data buffer 431 to simultaneously provide W(n,c) and a respective input area of size DW×DH of input data of the input tiles IT(c) as multiplication operands of the convolution equation 715 as input to the multiplier 435 in each of the plurality of processing elements 418.), and
Nagy fails to teach an input data determination circuit configured to determine whether binarized input data corresponds to a predetermined value, wherein the input data determination circuit evaluates the binarized input data using a threshold-based determination process; a memory storing a plurality of coefficients and coefficient address information specifying storage locations for the plurality of coefficients in the memory; an inference controller reading the coefficient address from the memory based on a determination result from the input data determination circuit and reading the coefficient from the memory based on the coefficient address, wherein the inference controller transmits the retrieved coefficient for performing arithmetic operations in an inference process; and an arithmetic circuit that performs a product-sum operation using the binarized input data and the coefficient acquired by the inference controller to generate an arithmetic operation result as an output data, wherein the arithmetic circuit executes cumulative addition operations based on binarized input data; a learning processor for controlling the updating of the coefficient, wherein the learning processor dynamically adjusts coefficients based on variations in inference accuracy due to environmental changes, wherein the inference processor and the learning processor are configured to perform inference processing and coefficient updating in parallel.
Chou teaches an input data determination circuit configured to determine whether binarized input data corresponds to a predetermined value, wherein the input data determination circuit evaluates the binarized input data using a threshold-based determination process ([0004] Turning to binarized neural networks (BNN), recent studies have identified that there is no need to employ full-precision weights and activations since CNN is highly fault-tolerant. One may preserve the accuracy of a neural network using quantized fixed-point values, which is called quantized neural network (QNN). An extreme case of QNN is a BNN, which adopts weights and activations with only two possible values (e.g., −1 and +1). Prior wherein the input data determination circuit evaluates the binarized input data using a threshold-based determination process binarization approach is called an input data determination circuit configured to determine whether binarized input data corresponds to a predetermined value deterministic binarization, as illustrated in Equation 2. This approach is suitable for hardware accelerations.);
a memory storing a plurality of coefficients and coefficient address information specifying storage locations for the plurality of coefficients in the memory ([0052] In one example, at the data loading stage, input 504 and a memory storing a plurality of coefficients and coefficient address information specifying storage locations for the plurality of coefficients in the memory weight data 506 and 508 may be read from off-chip memory and be stored into a data buffer 510 and/or weight memory banks (“WBank”) 512. In one embodiment, during the execution of a binarized convolution layer, the input data may be read from one data buffer and written to another equally sized buffer. In this example, the read or write mode of data buffer A and B in FIG. 5A may be switched for the computation of different layers so the input of each layer may not need to transfer back and forth between on-chip and off-chip memory.),
an inference controller reading the coefficient address from the memory based on a determination result from the input data determination circuit and reading the coefficient from the memory based on the coefficient address, wherein the inference controller transmits the retrieved coefficient for performing arithmetic operations in an inference process ([0054] In one example, the different input values may be executed by n different processing elements 518 simultaneously. Each of the processing elements 518 may be assigned with a reuse buffer 520, a Wbank 512, and an Output Activation bank (OAbank) 522. In one embodiment, the storage of the weight value may be partitioned in kernel or k dimensions, so that the ofmaps result will not interleave across different OAbanks 522. Once the current pixel (e.g., difference) has finished broadcasting, the inference controller transmits the retrieved coefficient for performing arithmetic operations in an inference process accumulation stage may begin. An address generator 524 and an accumulator 526, may inference controller reading the coefficient address from the memory based on a determination result from the input data determination circuit collect the results in the reuse buffer 520 and accumulate them into the corresponding position of OAbank 522.; [0055] In one embodiment, the address generator 524 may calculate an destination address for different intermediate result in the reuse buffer 520. For example, one of the OAbank 522 may use an accumulation controller to collect the result in the reuse buffer 520 and may reduce them into the correct positions in the OAbank 522 indicated by the address generator 524. The address of the ofmap (h0, w0, c0) (subscript0 denotes output) of the given input locating at (h, w, c) and weight locating at (r, s, c, k) may be calculated as (h−r, w−s, k) or (h−r +1, w−s+1, k) if padding mode is enabled.); and
an arithmetic circuit that performs a product-sum operation using the binarized input data and the coefficient acquired by the inference controller to generate an arithmetic operation result as an output data, wherein the arithmetic circuit executes cumulative addition operations based on binarized input data ([0031] Moreover, to realize a design that may efficiently accelerate BNN inference, an existing approach tend to optimize an objective called throughput which may be described by frame per second (FPS) as arithmetic circuit that performs a product-sum operation using the binarized input data and the coefficient acquired by the inference controller to generate an arithmetic operation result as an output dataEquation 3: where Utilization may indicate a ratio of time for multipliers doing arithmetic circuit executes cumulative addition operations based on binarized input data inference over the total runtime.),
a learning processor for controlling the updating of the coefficient, wherein the learning processor dynamically adjusts coefficients based on variations in inference accuracy due to environmental changes, wherein the coefficient updating circuit initiates coefficient updating based on completion of an inference process and/or a notification indicating the completion, and wherein the inference processor and the learning processor are configured to perform inference processing and coefficient updating in parallel ([0059] In one embodiment, it may be stored in the learning processor for controlling the updating of the coefficient checking engine 542 as a weight base which will be dynamically adjusts coefficients based on variations in inference accuracy due to environmental changes continuously updated during the checking process. In this example, the weight base may keep the latest version of the real weight value to recover the weight difference during the computation. For the rest of the computations, aspects of the invention may begin to use the weight reuse strategy for computation.).
Nagy and Chou are considered to be analogous to the claimed invention because they are in the same field of machine learning. In view of the teachings of Nagy, it would have been obvious for a person of ordinary skill in the art to apply the teachings of Chou to Nagy before the effective filing date of the claimed invention in order to reduce total power consumption, reduce on-chip power consumption, and increase processing speed (cf. Chou, [0009] Furthermore, aspects of the invention create two types of fast and energy-efficient architectures for BNN inference. Additional embodiments further identify analysis and insights to pick a better strategy of these two for different datasets and network models. Embodiments of the invention reuse the results from previous computations so that much cycles for data buffer access and computations may be skipped. Through identifying BNN similarity, embodiments of the invention may demonstrate that about 80% of the computation and about 40% of the buffer access may be skipped. Such embodiments of the invention may achieve about 17% reduction in total power consumption, about 54% reduction in on-chip power consumption, and about 2.4× maximum speedup, compared to the baseline without applying embodiments of the invention.).
Marukame teaches wherein the inference processor and the learning processor are configured to perform inference processing and coefficient updating in parallel ([0112] The second update unit 140 adjusts the plurality of coefficients set to the inference circuit 14 according to the second delay time period. More specifically, the second update unit 140 adjusts the plurality of coefficients set to the inference circuit 14 so that the second delay time period is short.; [0127] FIG. 10 is a flowchart illustrating the flow of processing during learning of the inference system 10. The inference system 10 executes learning processing by the flow illustrated in FIG. 10. The inference processor and the learning processor are configured to perform inference processing and coefficient updating in parallel inference system 10 may also execute learning processing prior to the inference processing or execute real-time learning processing in parallel with the inference processing.).
Nagy, Chou, and Marukame are considered to be analogous to the claimed invention because they are in the same field of machine learning. In view of the teachings of Nagy and Chou, it would have been obvious for a person of ordinary skill in the art to apply the teachings of Marukame to Nagy before the effective filing date of the claimed invention in order to enable high-speed learning by a simple algorithm (cf. Marukame, [0003] In recent years, reservoir computing, in which inference processing such as time-series data recognition and pattern recognition, and the like, is executed with respect to a signal obtained by transforming an input signal by a reservoir, has attracted attention. A reservoir is realized as a type of recurrent neural network. This kind of reservoir is capable of outputting a signal obtained by using nonlinear transformation to project an input signal onto a feature space accompanied by high-dimension and timewise information. Thus, reservoir computing enables high-speed learning by a simple algorithm.; [0004] Furthermore, the characteristics of a reservoir vary according to the task to be handled, the type of data, and the like. Hence, it is desirable to suitably adjust a reservoir in accordance with the task to be handled, the type of data, and the like, to enable inference to be performed accurately.).
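For background on the binarized arithmetic discussed above, the following minimal Python sketch shows a conventional deterministic (threshold/sign) binarization and the equivalent product-sum over {-1, +1} operands. Chou's Equation 2 is not reproduced in the record, so the sign-threshold rule below is an assumption standing in for it; the XNOR/popcount identity is a well-known BNN property, not a quotation from any cited reference.

    import numpy as np

    def binarize(x, threshold=0.0):
        # Deterministic binarization to {-1, +1}; the threshold-based
        # rule here is assumed, in place of Chou's Equation 2.
        return np.where(x >= threshold, 1, -1).astype(np.int8)

    def binary_mac(x_bin, w_bin):
        # Product-sum over binarized operands. With {-1, +1} values the
        # per-element product is +1 on sign agreement and -1 otherwise,
        # so the dot product equals 2*(#agreements) - N, which hardware
        # often realizes as XNOR followed by a popcount.
        agreements = int(np.count_nonzero(x_bin == w_bin))
        return 2 * agreements - x_bin.size

As a quick check, binarize(np.array([0.3, -1.2, 0.7])) yields [1, -1, 1], and binary_mac of that vector with itself returns 3, the vector length, as expected for identical operands.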
Regarding claim 2, Nagy, as modified by Chou and Marukame, teaches The data processing device of claim 1.
Nagy teaches wherein the output distribution calculate circuit performs analysis using a plurality of output data corresponding to each of a plurality of elements ([0012] Each processing module includes a corresponding to each of a plurality of elements processing core with a plurality of processing elements that may process multiple data values of the input data to generate corresponding (intermediate) results. Each processing module further includes an input control unit that controls whether the input data to be processed by the processing core is retrieved via the interface or from the cache module. Provision of the results of the convolution operation is further controlled by the output distribution calculate circuit performs analysis using a plurality of output data output control unit, which controls whether the output data is either provided via the interface (to an external storage) or stored in the cache module as intermediate data that is further used as a subsequent input by at least one of the plurality of processing modules. This enables a flexible configuration of the apparatus to accelerate operations for various tasks and operations, which efficiently exploits the available resources and decreases the amount of data communicated via the interface to an external host. Accordingly, each processing element may access the cache module on individual write and read interfaces to gain simultaneous access to the cache module.).
Nagy, Chou, and Marukame are combinable for the same rationale as set forth above with respect to claim 1.
Regarding claim 4, Nagy, as modified by Chou and Marukame, teaches The data processing device of claim 1.
Nagy teaches wherein, in a case where the inference processor performs an operation using the binarized input data corresponding to a last element, the inference processing is terminated ([0192] In item 1409, the data elements of the at least one of the plurality of adjacent input tiles stored in the data buffer may be accessed by positioning an input window over the data elements stored in the data buffer, to generate a plurality of input areas, wherein the input window is adjustable according to a set of parameters. The set of parameters may include a stride value and a dilation value, which may define a configuration and positioning of the input window on an input area with regard to a position of coefficient values, and their contribution to output data elements. Examples of configurations of input windows have been described with regard to embodiments of FIGS. 8 to 13.; [0193] In item 1411, at least one of the plurality of input areas may be sequentially processed to at least partially process the at least one of the plurality of adjacent input tiles by the accelerator apparatus.; [0194] The method 1401 may inference processor performs an operation using the binarized input data corresponding to a last element repeat by storing another one of the plurality of adjacent input tiles in the data buffer in item 1407, and accessing data of the other one of the plurality of adjacent input tiles using the input window in item 1409. This may be repeated sequentially or in parallel until the inference processing is terminated computation on the entire input data completes.).
Nagy, Chou, and Marukame are combinable for the same rationale as set forth above with respect to claim 1.
Regarding claim 5, Nagy, as modified by Chou and Marukame, teaches The data processing device of claim 1.
Nagy teaches wherein the arithmetic circuit performs an operation using the binarized input data corresponding to one element, wherein the output distribution calculate circuit analyzes the output data corresponding to the one element ([0099] The arithmetic circuit performs an operation using the binarized input data corresponding to one element convolution operation may be implemented by each of the plurality of processing elements 418 by providing a multiply-accumulate circuit (MAC), which may include a multiplier 435, a sum 437 and a result store 439 configured to buffer results. output distribution calculate circuit analyzes the output data corresponding to the one element Each processing element of the plurality of processing elements 418 may further include a multiplexer 441 to determine the input of the sum 437 to either initialize the sum with a bias value or to accumulate the results of the multiplier 435 with the partial results stored in the result store 439.),
wherein the arithmetic circuit performs an operation using the binarized input data corresponding to a second or subsequent one element ([0102] The apparatus 401 may be used as an accelerator for convolutional neural networks (CNNs). For example, the arithmetic circuit performs an operation using the binarized input data input data retrieved via the interconnect 403 from memory 405 may represent an input layer of the CNN, and each convolution module 409 may process at least a part of a layer of the CNN. Moreover, data stored in the cache module 413 may represent at least a part of a corresponding to a second or subsequent one element next layer of the CNN to be processed by the same or another convolution module of the plurality of convolution modules 409.),
and wherein the output distribution calculate circuit performs analysis using the output data and output data up to a previous time, performs analysis using all output data, and calculates the correction value of the coefficient based on the analysis result using all output data ([0097] The data buffer 431 may be configured as a two-dimensional data buffer, which may store the input data retrieved by the input control unit 417 as a two-dimensional array. Accordingly, the data in the data buffer 431 may be accessed using two indices. In preferred embodiments, data in the data buffer 431 may be retrieved via an adjustable input window, as described, for example, in FIG. 14. The output distribution calculate circuit performs analysis using the output data and output data up to a previous time data buffer 431 may enable access to the input data stored in the data buffer 431 to all processing elements of the plurality of processing elements 418. A determination and retrieval of input data required for performing the convolution operation on a respective processing element 418 may be achieved by adjusting the input window to sample on the same input data from the data buffer 431 while taking into consideration different coefficient values from coefficient buffer 429.; [0192] In item 1409, the data elements of the at least one of the plurality of adjacent input tiles stored in the data buffer may be accessed by positioning an input window over the data elements stored in the data buffer, to generate a plurality of input areas, wherein the input window is adjustable according to a set of parameters. The set of parameters may include a stride value and a dilation value, which may define a configuration and positioning of the input window on an input area with regard to a position of performs analysis using all output data coefficient values, and their contribution to output data elements. Examples of configurations of input windows have been described with regard to embodiments of FIGS. 8 to 13.; [0129] The bias scaling and rounding element may add an offset, together with an up-scaled bias value, proportional to the magnitude range of the result to the processed coefficients. The values applied by the bias scaling and rounding element, such as bias value and bias scaling value, may be specified in a definition of a neural network, such as inside of a command word. This may calculates the correction value of the coefficient based on the analysis result using all output data reduce the effect of introducing asymmetrical errors in quantized values due to truncation of high word-length accumulation results.).
Nagy, Chou, and Marukame are combinable for the same rationale as set forth above with respect to claim 1.
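Claims 2 and 5 concern analysis over output data accumulated across elements, and claim 3 (addressed below) concerns an average and a deviation. As a point of reference only, a running mean-and-deviation analysis of an output distribution, with a simple correction value derived from it, can be sketched as follows. Welford's online algorithm is a standard technique; the target_mean/gain correction rule is purely an assumption and is not the claimed or cited method.

    class OutputDistributionAnalyzer:
        # Running analysis over all output data seen so far (Welford's
        # online algorithm). How a coefficient correction value is
        # derived is not specified in the record, so the pull-toward-
        # target rule in correction() is an assumption for illustration.
        def __init__(self, target_mean=0.0, gain=0.01):
            self.n = 0
            self.mean = 0.0
            self.m2 = 0.0
            self.target_mean = target_mean
            self.gain = gain

        def add(self, y):
            # Fold one output-data element into the running statistics,
            # combining the new output with outputs up to a previous time.
            self.n += 1
            delta = y - self.mean
            self.mean += delta / self.n
            self.m2 += delta * (y - self.mean)

        def correction(self):
            # Return (correction value, deviation) over all outputs.
            deviation = (self.m2 / self.n) ** 0.5 if self.n else 0.0
            return self.gain * (self.target_mean - self.mean), deviation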
Regarding claim 9 and analogous claim 10, Nagy teaches A method of data processing, comprising the steps of ([0054] According to another aspect of the present disclosure, one or more machine readable media are defined storing instructions thereon that, when executed on a computing device or an apparatus, configure the computing device or apparatus to execute a method according to any one embodiment of the present disclosure.; [0038] In one aspect of the present disclosure, a method for accessing and processing data by an accelerator apparatus is defined, the method comprising retrieving at least a part of input data to be processed by the accelerator apparatus, segmenting the input data into a plurality of adjacent input tiles, the input tiles having a pre-determined size, storing at least one of the plurality of adjacent input tiles in a data buffer of the accelerator apparatus, accessing data elements of the at least one of the plurality of adjacent input tiles stored in the data buffer by positioning an input window over the data elements stored in the data buffer, to generate a plurality of input areas, wherein the input window is adjustable according to a set of parameters, and sequentially processing at least one of the plurality of input areas to at least partially process the at least one of the plurality of adjacent input tiles by the accelerator apparatus.):
wherein the coefficient address information includes one or more coefficient addresses arranged to be read in a predetermined order for performing product-sum operations ([0021] In a preferred embodiment, at least one (or each) processing core further includes a coefficient buffer, wherein the processing control unit is configured to control the plurality of processing elements to process input data stored in the data buffer using coefficient data stored in the coefficient buffer. The coefficient buffer may be controlled by the coefficient control unit and may store the retrieved coefficient data. The coefficient address information includes one or more coefficient addresses arranged to be read in a predetermined order for performing product-sum operations processing control unit may drive retrieval of the coefficient data and the input data from the coefficient buffer and data buffer, respectively, such that the coefficient data and the input data may be simultaneously read.; [0025] In at least one embodiment, at least one processing element of the processing core is configured to perform a multiply-accumulate operation. The multiply-accumulate operation may be implemented using a multiply-accumulate circuit (MAC).);
(d) analyzing, by an output distribution calculation circuit, the output data to calculate a correction value of the coefficient based on an inference output distribution ([0097] A analyzing, by an output distribution calculation circuit, the output data to calculate a correction value of the coefficient based on an inference output distribution determination and retrieval of input data required for performing the convolution operation on a respective processing element 418 may be achieved by adjusting the input window to sample on the same input data from the data buffer 431 while taking into consideration different coefficient values from coefficient buffer 429.);
(e) updating, by a coefficient updating circuit, the coefficient stored in the memory with the correction value of the coefficient calculated by the output distribution calculate circuit ([0027] To resolve this issue, the result store may buffer multiple results of operations of the processing core and the processing core may iteratively calculate different results for different input data using the same or different coefficients and/or parameters. updating, by a coefficient updating circuit, the coefficient stored in the memory with the correction value of the coefficient calculated by the output distribution calculate circuit Respective partial results are stored in the result store. After a number of clock cycles defining the latency of the MAC unit has lapsed, the processing core may return to processing of the initial input data to calculate a next partial result based on next or the same coefficients and/or parameters on the initial data that require the partial results stored in the result store. The result is accumulated with the required partial results from the result store.),
(b1) reading, by an inference controller, a coefficient address from the coefficient address information based on a determination result of the input data determination circuit, and reading, by the inference controller, a coefficient from the memory based on the coefficient address ([0021] In a preferred embodiment, at least one (or each) processing core further reading, by an inference controller, a coefficient address from the coefficient address information based on a determination result of the input data determination circuit includes a coefficient buffer, wherein the processing control unit is configured to control the plurality of processing elements to process input data stored in the data buffer using coefficient data stored in the coefficient buffer. The coefficient buffer may be controlled by the coefficient control unit and may store the retrieved coefficient data. The processing control unit may drive retrieval of the coefficient data and the input data from the coefficient buffer and data buffer, respectively, such that the coefficient data and the input data may be simultaneously read. Subsequently, the processing control unit may initiate processing of the (convolution or any other) operation by the plurality of processing elements, which may each calculate a fraction of the operation using respective input data and coefficient data.; [0107] The control module 411 may schedule a command as an instruction to a convolution module 409. This may be a single instruction word including one or more of an input plane read base address, a plane address offset, a plane count, type and parameters of arithmetical operation, reading, by the inference controller, a coefficient from the memory based on the coefficient address a coefficient read base address, an output plane write base address, a plane offset, and the like, in any combination. Each stage of the convolution module 409 may receive the instruction in parallel and may start execution individually using relevant parts of the instruction word.);
(f) controlling, by a learning controller, the updating of the coefficient, including initiating the updating of the coefficient based on completion of inference processing or based on a notification indicating such completion ([0096] The controlling, by a learning controller, the updating of the coefficient convolution control unit 433 may apply a scheduling and control scheme to ensure that each processing element 418 is including initiating the updating of the coefficient provided with correct input data from the coefficient buffer 429 and data buffer 431 to perform the operation. The convolution control unit 433 may further based on completion of inference processing or based on a notification indicating such completion monitor processing of the processing elements 418 in order to avoid process-stall between window positions or input planes.; [0109] 1. The coefficient control unit 427 may continuously monitor a filling level or available space in the coefficient buffer 429 and may read coefficients of a next input plane via the interface if there is enough free space available in the coefficient buffer 429. When the coefficients are read, the coefficient control unit 427 may send a completion flag to the input control unit 417.; [0116] The apparatus 401 may sequentially read input tiles at the same tile origin 601 across the input planes of the input plane tensor to execute the convolution operation, wherein the resulting output tiles may correspond to output tile across the one or more output planes of the output plane tensor having the same origin. Hence, the apparatus 401 may process the input plane tensor by proceeding over tile origins 601 and executing sequential calculations from each tile origin 601, across the one or more input planes.); and
Nagy fails to teach (a) storing, in a memory, a plurality of coefficients and coefficient address information specifying storage locations for the plurality of coefficients in the memory; (b) determining, by an input data determination circuit, whether binarized input data corresponds to a predetermined value, wherein the input data determination circuit evaluates the binarized input data using a threshold-based determination process; (c) performing, by an arithmetic circuit, a product-sum operation using the binarized input data and a coefficient acquired based on the coefficient address information, and generating an arithmetic operation result as output data, wherein the arithmetic circuit executes cumulative addition operations based on binarized input data; and (g) executing the inference processing and the coefficient updating in parallel.
Chou teaches (a) storing, in a memory, a plurality of coefficients and coefficient address information specifying storage locations for the plurality of coefficients in the memory ([0052] In one example, at the data loading stage, input 504 and storing, in a memory, a plurality of coefficients and coefficient address information specifying storage locations for the plurality of coefficients in the memory weight data 506 and 508 may be read from off-chip memory and be stored into a data buffer 510 and/or weight memory banks (“WBank”) 512. In one embodiment, during the execution of a binarized convolution layer, the input data may be read from one data buffer and written to another equally sized buffer. In this example, the read or write mode of data buffer A and B in FIG. 5A may be switched for the computation of different layers so the input of each layer may not need to transfer back and forth between on-chip and off-chip memory.),
(b) determining, by an input data determination circuit, whether binarized input data corresponds to a predetermined value, wherein the input data determination circuit evaluates the binarized input data using a threshold-based determination process ([0004] Turning to binarized neural networks (BNN), recent studies have identified that there is no need to employ full-precision weights and activations since CNN is highly fault-tolerant. One may preserve the accuracy of a neural network using quantized fixed-point values, which is called quantized neural network (QNN). An extreme case of QNN is a BNN, which adopts weights and activations with only two possible values (e.g., −1 and +1). Prior wherein the input data determination circuit evaluates the binarized input data using a threshold-based determination process binarization approach is called an input data determination circuit configured to determine whether binarized input data corresponds to a predetermined value deterministic binarization, as illustrated in Equation 2. This approach is suitable for hardware accelerations.);
(c) performing, by an arithmetic circuit, a product-sum operation using the binarized input data and a coefficient acquired based on the coefficient address information, and generating an arithmetic operation result as output data, wherein the arithmetic circuit executes cumulative addition operations based on binarized input data ([0031] Moreover, to realize a design that may efficiently accelerate BNN inference, an existing approach tend to optimize an objective called throughput which may be described by frame per second (FPS) as performing, by an arithmetic circuit, a product-sum operation using the binarized input data and a coefficient acquired based on the coefficient address information Equation 3: where Utilization may indicate a ratio of time for multipliers doing generating an arithmetic operation result as output data, wherein the arithmetic circuit executes cumulative addition operations based on binarized input data inference over the total runtime.);
Nagy and Chou are combinable for the same rationale as set forth above with respect to claim 1.
Marukame teaches (g) executing the inference processing and the coefficient updating in parallel ([0112] The second update unit 140 adjusts the plurality of coefficients set to the inference circuit 14 according to the second delay time period. More specifically, the second update unit 140 adjusts the plurality of coefficients set to the inference circuit 14 so that the second delay time period is short.; [0127] FIG. 10 is a flowchart illustrating the flow of processing during learning of the inference system 10. The inference system 10 executes learning processing by the flow illustrated in FIG. 10. The executing the inference processing and the coefficient updating in parallel inference system 10 may also execute learning processing prior to the inference processing or execute real-time learning processing in parallel with the inference processing.).
Nagy, Chou, and Marukame are combinable for the same rationale as set forth above with respect to claim 1.
Claim 3 is rejected under 35 U.S.C. 103 as being unpatentable over Nagy in view of Chou and Marukame, and further in view of Kottke et al. (U.S. Pre-Grant Publication No. 2019/0289296, hereinafter 'Kottke').
Regarding claim 3, Nagy, as modified by Chou and Marukame, teaches The data processing device of claim 1.
Nagy, as modified by Chou and Marukame, fails to teach wherein the analysis result includes an average and a deviation.
Kottke teaches wherein the analysis result includes an average and a deviation ([0024] Some of the computer methods, systems, and program products iterate through the following (1)-(4) until a predicted MOS value is close to the desired MOS value. (1) The computer methods, systems, and program products generate an initial encoding of a source video and a decoding of the initial encoding. (2) The computer methods, systems, and program products next compute a first metric and a second metric on the initial encoding, the first metric being based on a analysis result includes an average and a deviation video-average gradient magnitude similarity deviation (GMSD) and the second metric being based on a log-normalized mean GMSD.).
Nagy, Chou, Marukame, and Kottke are considered to be analogous to the claimed invention because they are in the same field of machine learning. In view of the teachings of Nagy, Chou, and Marukame, it would have been obvious for a person of ordinary skill in the art to apply the teachings of Kottke to Nagy before the effective filing date of the claimed invention in order to determine target bitrate to encode and decode the source video based on a relationship between the predicted MOS value and a desired MOS value (cf. Kottke, [0024] Some of the computer methods, systems, and program products iterate through the following (1)-(4) until a predicted MOS value is close to the desired MOS value. (1) The computer methods, systems, and program products generate an initial encoding of a source video and a decoding of the initial encoding. (2) The computer methods, systems, and program products next compute a first metric and a second metric on the initial encoding, the first metric being based on a video-average gradient magnitude similarity deviation (GMSD) and the second metric being based on a log-normalized mean GMSD. (3) The computer methods, systems, and program products apply a previously-derived model that predicts a mean opinion score (MOS) value for the initial encoding as a function of measurements of the computed first metric and second metric. (4) The computer methods, systems, and program products determine a target bitrate to encode and decode the source video based on a relationship between the predicted MOS value and a desired MOS value.).
Claims 6-8 are rejected under 35 U.S.C. 103 as being unpatentable over Nagy in view of Chou and Marukame, and further in view of Liu et al. (U.S. Pre-Grant Publication No. 2020/0394522, hereinafter 'Liu').
Regarding claim 6, Nagy, as modified by Chou and Marukame, teaches The data processing device of claim 1.
Nagy, as modified by Chou and Marukame, fails to teach wherein the inference processor adjusts a quantization coefficient used in quantization of input data prior to being binarized.
Liu teaches wherein the inference processor adjusts a quantization coefficient used in quantization of input data prior to being binarized ([0100] The determination process of six types of quantization parameters are described in detail above, and are merely exemplary descriptions. The types of quantization parameters can be different from the above description in different examples. According to the formula (1) to the formula (13), both the point position parameter and the scaling coefficients are related to the data bit width. adjusts a quantization coefficient used in quantization of input data prior to being binarized Different data bit width may lead to different point position parameters and scaling coefficients, which may affect the quantization precision.).
Nagy, Chou, Marukame, and Liu are considered to be analogous to the claimed invention because they are in the same field of machine learning. In view of the teachings of Nagy, Chou, and Marukame, it would have been obvious for a person of ordinary skill in the art to apply the teachings of Liu to Nagy before the effective filing date of the claimed invention in order to convert high-precision data into low-precision fixed-point data, which may reduce storage space of data involved in the process of neural network operation and reduce memory access data in the artificial intelligence processor chip and improve computing performance (cf. Liu, [0014] In the process of neural network operation, a quantization parameter is determined during quantization by using technical schemes in the present disclosure. The quantization parameter is used by an artificial intelligence processor to quantize data involved in the process of neural network operation and convert high-precision data into low-precision fixed-point data, which may reduce storage space of data involved in the process of neural network operation. For example, a conversion of float32 to fix8 may reduce a model parameter by four times. Smaller data storage space enables neural network deployment to occupy smaller space, thus on-chip memory of an artificial intelligence processor chip may store more data, which may reduce memory access data in the artificial intelligence processor chip and improve computing performance.).
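To make the role of a bit-width-dependent quantization coefficient concrete, the sketch below performs symmetric fixed-point quantization in which the scaling coefficient follows from the maximum absolute value of the data to be quantized and the data bit width, in the general spirit of Liu [0068] and [0100]. Liu's formulas (1) through (13) are not reproduced in the record, so the exact formula below is an assumption for illustration only.

    import numpy as np

    def quantize_fixed_point(x, n_bits=8):
        # Symmetric fixed-point quantization: the scaling coefficient
        # (quantization coefficient) is derived from the maximum absolute
        # value of the data and the data bit width n_bits.
        max_abs = float(np.max(np.abs(x)))
        qmax = 2 ** (n_bits - 1) - 1
        scale = (max_abs / qmax) if max_abs > 0 else 1.0
        q = np.clip(np.round(x / scale), -qmax - 1, qmax).astype(np.int32)
        return q, scale  # dequantize as q * scale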
Regarding claim 7, Nagy, as modified by Chou, Marukame, and Liu, teaches The data processing device of claim 6.
Liu teaches wherein the inference processor monitors a previous layer's output data and adjusts the quantization coefficient so that the previous layer's output data distribution falls within a predetermined range ([0067] No matter what a neural network structure it is, in the process of training or fine-tuning a neural network, the data to be quantized includes at least one type of neurons, weights, gradients, and biases of the neural network. In the inference process, the data to be quantized includes at least one type of neurons, weights, and biases of the neural network. If the data to be quantized are the weights, the data to be quantized may be all or part of the weights of a certain layer in the neural network. If the inference processor monitors a previous layer's output data certain layer is a convolution layer, the data to be quantized may be all or part of the weights with a channel as a unit in the convolution layer, in which the channel refers to all or part of the channels of the convolution layer. It should be noted that only the convolution layer has a concept of channels. In the convolution layer, only the weights are quantized layer by layer in a channel manner.; [0068] The following example is that the data to be quantized are the neurons and the weights of a target layer in the neural network, and the technical scheme is described in detail below. In this step, the neurons and the weights of each layer in the target layer are analyzed respectively to obtain a maximum value and a minimum value of each type of the data to be quantized, and a maximum absolute value of each type of the data to be quantized may also be obtained. The target layer, as a layer needed to be quantized in the neural network, may be one layer or multiple layers. Taking one layer as a unit, the maximum absolute value of the data to be quantized may be determined by the maximum value and the minimum value of each type of the data to be quantized. The maximum absolute value of each type of the data to be quantized may be further obtained by calculating the absolute value of each type of the data to be quantized to obtain results and then traversing the results.; [0110] According to FIG. 5a and FIG. 5b , in a same epoch, the variation range of the weight in each iteration is large in an initial stage of training, while in middle and later stages of training, the variation range of the weight in each iteration is not large. In such case, in the middle and later stages of training, since the variation range of the weight is not large before and after each iteration, the weight of corresponding layers in each iteration have similarity within a certain iteration interval, and the data involved in the neural network training process in each layer can be quantized by using the data bit width used in the quantization of the corresponding layer in the previous iteration. However, in the initial stage of training, because of the large variation range of the weight before and after each iteration, in order to achieve the precision of the floating-point computation required for quantization, in each iteration in the initial stage of training, the weight of the corresponding layer in the current iteration is quantized by using the data bit width used in the quantization of the corresponding layer in the previous iteration, or the weight of the current layer is quantized based on the preset data bit width n of the current layer to obtain quantized fixed-point numbers. 
According to the quantized weight and the corresponding pre-quantized weight, the quantization error diffbit is determined. According to the comparison result of the quantization error diffbit and the threshold, the data bit width n used in the previous layer's output data distribution falls within a predetermined range quantization of the corresponding layer in the previous iteration or the adjusts the quantization coefficient preset data bit width n of the current layer is adjusted, and the adjusted data bit width is applied to the quantization of the weight of the corresponding layer in the current iteration. Furthermore, in the process of training or fine-tuning, the weights between each layer in the neural network are independent of each other and have no similarity, which makes neurons between each layer independent of each other and have no similarity. Therefore, in the process of neural network training or fine-tuning, the data bit width of each layer in each iteration of the neural network is only suitable to be used in the corresponding neural network layer.).
Nagy, Chou, Marukame, and Liu are combinable for the same rationale as set forth above with respect to claim 6.
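Claim 7's monitoring-and-adjustment loop can be pictured with the sketch below, which reuses the quantize_fixed_point sketch from the claim 6 discussion: quantize the previous layer's output data at the current bit width, measure the quantization error, and widen the bit width until the error (and hence the dequantized distribution) falls within a predetermined range. The relative-error metric and unit step are assumptions and are not Liu's diffbit formula.

    import numpy as np

    def adjust_bit_width(prev_layer_output, n_bits, err_threshold=0.01,
                         n_max=16):
        # Widen the data bit width until the quantization error of the
        # previous layer's output data falls under the threshold; the
        # error metric is a simple relative L1 error, assumed in place
        # of Liu's diffbit comparison.
        while n_bits < n_max:
            q, scale = quantize_fixed_point(prev_layer_output, n_bits)
            err = (np.mean(np.abs(q * scale - prev_layer_output))
                   / (np.mean(np.abs(prev_layer_output)) + 1e-12))
            if err <= err_threshold:
                break
            n_bits += 1
        return n_bits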
Regarding claim 8, Nagy, as modified by Chou and Marukame, teaches The data processing device of claim 1.
Nagy, as modified by Chou and Marukame, fails to teach wherein the inference processor performs an operation using the binarized input data to generate the output data, quantizes the output data, and generates input data of the following layer.
Liu teaches wherein the inference processor performs an operation using the binarized input data to generate the output data, quantizes the output data, and generates input data of the following layer ([0112] In the inference process of a neural network, the weights between each layer in the neural network are independent of each other and have no similarity, which makes neurons between each layer independent of each other and have no similarity. Therefore, in the inference process of the neural network, the data bit width of each layer in the neural network is applied to the corresponding layer. In practical applications, in the inference process, the input neuron of each layer may not be the same or similar. Moreover, since the weights between each layer in the neural network are independent of each other, the input neurons of each of the hidden layers in the neural network are different. During quantization, it may be not suitable for the data bit width used by the input neuron of the upper layer to be applied to the input neuron of the current layer. Therefore, in order to achieve the precision of floating-point computation required for quantization, in the reference process, the input neuron of the current layer is quantizes the output data quantized by performs an operation using the binarized input data to generate the output data using the data bit width used in the quantization of the upper layer, or the generates input data of the following layer input neuron of the current layer is quantized based on the preset data bit width n of the current layer to obtain quantized fixed-point numbers.).
Nagy, Chou, Marukame, and Liu are combinable for the same rationale as set forth above with respect to claim 6.
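Finally, claim 8's arrangement — operating on binarized input data, quantizing the resulting output data, and generating the input of the following layer — can be pictured by chaining the earlier sketches (binarize, binary_mac, and quantize_fixed_point). The layer shapes and the re-binarization step are assumptions for illustration only, not the claimed or cited implementation.

    import numpy as np

    def layer_forward(x_bin, weight_layers, n_bits=8):
        # Each entry of weight_layers is a 2-D array of binarized weight
        # rows; the row count of layer L sets the input size of layer L+1.
        for w_layer in weight_layers:
            y = np.array([binary_mac(x_bin, w_row) for w_row in w_layer],
                         dtype=np.float64)          # output data
            q, scale = quantize_fixed_point(y, n_bits)  # quantize output
            x_bin = binarize(q * scale)             # next layer's input
        return x_bin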
Conclusion
The prior art made of record and not relied upon is considered pertinent to applicant's disclosure.
Yan et al. (NPL: “Parallel Inference for Latent Dirichlet Allocation on Graphics Processing Units”) teaches a novel data partitioning scheme that effectively reduces the memory cost, balancing the computational cost on each multiprocessor to easily avoid memory access conflicts.
Applicant's amendment necessitated the new ground(s) of rejection presented in this Office action. Accordingly, THIS ACTION IS MADE FINAL. See MPEP § 706.07(a). Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a).
A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action. In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any extension fee pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action. In no event, however, will the statutory period for reply expire later than SIX MONTHS from the date of this final action.
Any inquiry concerning this communication or earlier communications from the examiner should be directed to MAGGIE MAIDO whose telephone number is (703) 756-1953. The examiner can normally be reached M-Th: 6am - 4pm.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Michael Huntley can be reached on (303) 297-4307. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.
/MM/Examiner, Art Unit 2129
/MICHAEL J HUNTLEY/Supervisory Patent Examiner, Art Unit 2129