DETAILED ACTION
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Claims 1, 3-12, and 14-20 are pending in this application.
Response to Arguments
Applicant’s arguments regarding the rejections of claims 1-20 under 35 U.S.C. 112(b) have been fully considered and are persuasive. The rejections have been withdrawn. However, new 35 U.S.C. 112(b) rejections are applied to claims 1, 3-12, and 14-20 based on the amendments.
Applicant's arguments regarding the 35 U.S.C. 101 rejections of claims 1-20 have been fully considered and they are persuasive.
Applicant's arguments regarding the 35 U.S.C. 103 rejections of claims 1-20 have been fully considered but they are not persuasive.
Regarding the 35 U.S.C. 103 rejection, the applicant argues the following in the remarks:
The combination of references does not teach or suggest the combined pooling of memory data during load to be delivered to multiple processors to apply respective kernels and then written together as output in the same clock cycle that new data is being read, as recited in the amended claims.
Examiner has thoroughly considered Applicant's arguments, but respectfully finds them unpersuasive for at least the following reasons:
The examiner respectfully disagrees. The claims do not recite that the combined pooling of memory data during load is delivered to multiple processors to apply respective kernels. The claims recite that inflight-pooling is performed on the first subset of data and that there are multiple processors, which process the first subset of data with respective kernels. The claims do not recite that there are multiple processors, which process the inflight-pooled first subset of data with respective kernels. The cited references do teach what is recited in the claims. Zhu teaches writing in the same clock cycle that new data is being read since Zhu recites in [0057] “4) carry out convolution computation and data loading. The two operations are completely independent of each other and are synchronously carried out in parallel”, [0052] “Step 3, execute convolution computation, put a result into the output feature map cache, and simultaneously storing a next group of 64 convolution kernels into another convolution kernel cache”, and the third block in Figure 2 recites: execute convolution computation, put a result into an output feature map cache, and store a next group of 64 convolution kernels into another convolution kernel cache at the same time. Since convolution output is stored at the same time that a next group of data is loaded, they are performed in the same clock cycle. Lovell teaches an inflight-pooling operation in [0061] “The pooling process in FIG. 17 may be described as “in-flight” pooling, i.e., pooling occurs when the original data is read out from source memory 1702 before the pooled data are written into compute cache 1704”.
Claim Rejections - 35 USC § 112
The following is a quotation of 35 U.S.C. 112(b):
(b) CONCLUSION.—The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the inventor or a joint inventor regards as the invention.
The following is a quotation of 35 U.S.C. 112 (pre-AIA), second paragraph:
The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the applicant regards as his invention.
Claims 1, 3-12, and 14-20 are rejected under 35 U.S.C. 112(b) or 35 U.S.C. 112 (pre-AIA), second paragraph, as being indefinite for failing to particularly point out and distinctly claim the subject matter which the inventor or a joint inventor (or for applications subject to pre-AIA 35 U.S.C. 112, the applicant), regards as the invention.
As per claims 1, 12, and 20 (line numbers refer to claim 1):
Line 14 recites “process the first subset of data with respective kernels,” but it is unclear what portion of the first subset of data each kernel is related to. “Respective” means relating separately to each of two or more things, so it is unclear which pieces of data in the first subset of data relate to each of the kernels.
As per claim 12:
Line 5 recites “the one or more processors” but it is unclear what this is referring to.
Claims 3-11 and 14-19 depend from claims 1 and 12, respectively, and fail to resolve the deficiencies of those claims, so they are rejected for the same reasons.
Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.
Claims 1, 3-12, and 14-20 are rejected under 35 U.S.C. 103 as being unpatentable over Zhu et al. (US 20220414423 A1, hereinafter Zhu), in view of Chao et al. (US 20200065251 A1, hereinafter Chao), and further in view of Lovell et al. (US 20200110604 A1, hereinafter Lovell).
Zhu, Chao, and Lovell were cited in a prior office action.
As per claim 1, Zhu teaches the invention substantially as claimed including a computer-implemented method for increasing hardware accelerator performance in neural network applications ([0037] improving efficiency and real-time performance of the neural network accelerator), the method comprising:
in response to a pre-processor accessing a first set of locations in a source memory coupled to a hardware accelerator, obtaining a first subset of data from a set of input data ([0054] 1) load an input feature map instruction parameter latch and a convolution kernel instruction parameter latch. When a neural network accelerator is started, instructions are sequentially read from an instruction module, and if a current instruction is an input feature map loading instruction, an off-chip input feature map storage address is latched, and an input feature map length is loaded; and if the current instruction is a convolution kernel loading instruction, the number of currently loaded convolution kernels, lengths of the loaded convolution kernels, a convolution kernel cache starting address, an off-chip convolution kernel storage address, etc. are required to be latched; [0032] A parallel device for convolution computation and data loading of a neural network accelerator includes: a convolution computation array, and an input feature map cache, an output feature map cache and a convolution kernel cache), wherein the hardware accelerator performs a pooling operation ([0005] Generally speaking, the neural network accelerator mainly includes a data input/output cache unit, a convolution computation unit, a quantization unit, a vector computation unit, a pooling unit,);
at a resource optimizer, determine a first set of target locations in a set of memory elements for storing a first output ([0052] Step 3, execute convolution computation, put a result into the output feature map cache; [0037] the parallel method and device for convolution computation and data loading of a neural network accelerator provided in the present disclosure make convolution computation and data loading completely parallel, and hides part of the convolution kernels and loading time of the next frame of input feature map, thereby greatly reducing total reasoning time of the neural network accelerator, and improving efficiency and real-time performance of the neural network accelerator; the parallel method may remarkably reduce space of the on-chip convolution kernel cache, thereby reducing an area of a chip); and
using a read/write synchronizer to schedule reading of a second subset of data from a second set of locations in the source memory to occur in a same clock cycle as writing the first output to the first set of target locations, the read/write synchronizer instructing multiple processors, which process the first subset of data with respective kernels to generate the first output, to use the first set of target locations to write the first output to the set of memory elements (Figs. 2 and 5; The third block in Figure 2 recites: execute convolution computation, put a result into an output feature map cache, and store a next group of 64 convolution kernels into another convolution kernel cache at the same time. [0057] 4) carry out convolution computation and data loading. The two operations are completely independent of each other and are synchronously carried out in parallel; [0052] Step 3, execute convolution computation, put a result into the output feature map cache, and simultaneously storing a next group of 64 convolution kernels into another convolution kernel cache; [0054] if the current instruction is a convolution kernel loading instruction, the number of currently loaded convolution kernels, lengths of the loaded convolution kernels, a convolution kernel cache starting address, an off-chip convolution kernel storage address, etc. 
are required to be latched; Abstract Disclosed are a parallel method and device for convolution computation and data loading of a neural network accelerator; [0037] the parallel method and device for convolution computation and data loading of a neural network accelerator provided in the present disclosure make convolution computation and data loading completely parallel, and hides part of the convolution kernels and loading time of the next frame of input feature map, thereby greatly reducing total reasoning time of the neural network accelerator, and improving efficiency and real-time performance of the neural network accelerator; [0009] S1, storing a frame of input feature map into an input feature map cache, and dispersedly storing the input feature maps into input feature map sub-caches according to channels of the input feature maps; [0010] S2, sequentially loading a group of convolution kernels into corresponding convolution kernel cache sub-blocks in a first convolution kernel cache; [0011] S3, loading the input feature map cache and the first convolution kernel cache to execute convolution computation, putting a result into an output feature map cache; [0059] The input feature map cache and the output feature map cache are consistent in structure, stores feature maps according to channels, and store 64 channels into different cache sub-blocks. The first convolution kernel weight cache and the second convolution kernel weight cache are consistent in structure, and sequentially store required convolution kernel weights into the convolution kernel cache sub-blocks in order. 
In one embodiment of the present disclosure, the convolution computation array is composed of a 64×64 two-dimensional array, the weights are transmitted down, and the input feature maps are transmitted to the right; [0032] A parallel device for convolution computation and data loading of a neural network accelerator includes: a convolution computation array, and an input feature map cache, an output feature map cache and a convolution kernel cache).
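For illustration only, the compute-while-loading scheme Zhu describes (convolve with the kernels in one cache while the next group of kernels is loaded into the other cache, then swap the caches) can be sketched as a simple ping-pong-buffer model. This is a hypothetical sketch of the cited behavior, not code from the reference; the function and parameter names are assumptions:

```python
def run_layers(feature_map, kernel_groups, convolve, load):
    """Model of double-buffered (ping-pong) kernel caches.

    In each step, the convolution over the active cache and the load of
    the next kernel group into the inactive cache occur together,
    mirroring Zhu's parallel compute-and-load (Fig. 2, [0052], [0057]).
    """
    caches = [load(kernel_groups[0]), None]  # cache 0 pre-loaded
    active = 0
    outputs = []
    for i in range(len(kernel_groups)):
        # Load the next group into the *other* cache while computing.
        if i + 1 < len(kernel_groups):
            caches[1 - active] = load(kernel_groups[i + 1])
        outputs.append(convolve(feature_map, caches[active]))
        active = 1 - active  # interchange the two kernel caches
    return outputs
```

In this model the load of group i+1 is hidden behind the computation on group i, which is the source of the latency reduction Zhu attributes to the parallel method.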
Zhu fails to teach performs an inflight-pooling operation on the first subset of data; at a resource optimizer, using a storage availability and one or more network parameters to determine a first set of target locations in a set of memory elements for storing a first output according to one or more memory access metrics.
However, Chao teaches at a resource optimizer, using a storage availability and one or more network parameters to determine a first set of target locations in a set of memory elements for storing a first output according to one or more memory access metrics (Fig. 10; [0033] When the total output size (M×So) is smaller than the cache free space size of the feature map cache (CFS), all the output channels of the output feature map tile 300 are stored in the feature map cache; claim 3 wherein the cache free space size is greater than a sum of the number of the input channels of the input feature maps multiplied by the input feature map tile size and the number of the output channels of the output feature maps multiplied by the output feature map tile size; [0004] how to optimize DRAM access to reduce power consumption is a crucial problem for edge Al computing. Therefore, a memory-adaptive processing method for the convolutional neural network and a system thereof which are capable of reducing DRAM access are commercially desirable; [0039] Therefore, the memory-adaptive processing method 100a of the present disclosure can reduce DRAM access so as to reduce power consumption; [0010] FIG. 10 shows a schematic view of read and write access of the DRAM and the feature map cache in the memory-adaptive processing method 100 a of FIG. 6…w represent read access and write access of the feature map cache).
It would have been obvious to one having ordinary skill in the art before the effective filing date of the claimed invention to have combined Zhu with the teachings of Chao to reduce power consumption (see Chao [0039] Therefore, the memory-adaptive processing method 100a of the present disclosure can reduce DRAM access so as to reduce power consumption).
Zhu and Chao fail to teach performs an inflight-pooling operation on the first subset of data.
However, Lovell teaches performs an inflight-pooling operation on the first subset of data ([0061] The pooling process in FIG. 17 may be described as “in-flight” pooling, i.e., pooling occurs when the original data is read out from source memory 1702 before the pooled data are written into compute cache 1704).
It would have been obvious to one having ordinary skill in the art before the effective filing date of the claimed invention to have combined Zhu and Chao with the teachings of Lovell to reduce storage usage (see Lovell [0061] The pooling process in FIG. 17 may be described as “in-flight” pooling, i.e., pooling occurs when the original data is read out from source memory 1702 before the pooled data are written into compute cache 1704. For example, in embodiments, data in a pooling window may be max-pooled by streaming the data element-by-element and replacing one element value with the value of a subsequent element in the data stream if the subsequent value is a greater (and otherwise ignoring it). In this manner, the pooling operation does not rely on intermediate storage or caching steps.).
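The streaming max-pool that Lovell's [0061] describes (replace a running value with each subsequent element only if the new element is greater, with no intermediate buffer) can be sketched as follows. This is an illustrative sketch under the examiner's reading of the cited passage, not code from the reference; the function name and window parameter are assumptions:

```python
def inflight_max_pool(stream, window):
    """Max-pool a data stream element-by-element, "in flight".

    Keeps only a single running value per pooling window: each incoming
    element replaces the running value if it is greater and is otherwise
    ignored, so no intermediate storage of the window is needed.
    """
    pooled = []
    current = None
    for i, value in enumerate(stream):
        if current is None or value > current:
            current = value  # subsequent value is greater: keep it
        if (i + 1) % window == 0:  # window complete: emit pooled value
            pooled.append(current)
            current = None
    return pooled
```

For example, streaming the six elements [1, 3, 2, 5, 4, 6] through windows of size 2 yields [3, 5, 6], the per-window maxima, without ever caching a full window.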
As per claim 3, Zhu, Chao, and Lovell teach the computer-implemented method according to claim 1. Zhu teaches wherein the resource optimizer synchronizes read and write operations to enable the multiple processors to perform a number of processing steps in parallel (Fig. 2; The third block in Figure 2 recites: execute convolution computation, put a result into an output feature map cache, and store a next group of 64 convolution kernels into another convolution kernel cache at the same time. [0057] 4) carry out convolution computation and data loading. The two operations are completely independent of each other and are synchronously carried out in parallel; [0013] computing convolution and synchronously loading convolution kernels; [0052] Step 3, execute convolution computation, put a result into the output feature map cache, and simultaneously storing a next group of 64 convolution kernels into another convolution kernel cache; [0037] the parallel method and device for convolution computation and data loading of a neural network accelerator provided in the present disclosure make convolution computation and data loading completely parallel, and hides part of the convolution kernels and loading time of the next frame of input feature map, thereby greatly reducing total reasoning time of the neural network accelerator, and improving efficiency and real-time performance of the neural network accelerator; the parallel method may remarkably reduce space of the on-chip convolution kernel cache, thereby reducing an area of a chip; [0035] the convolution computation array is composed of a two-dimensional array, the input feature map cache and the first convolution kernel cache are loaded to execute convolution computation, a result is put into the output feature map cache, and moreover, a next group of convolution kernels are stored into corresponding convolution kernel cache sub-blocks in the second convolution kernel cache; [0032] A parallel device for convolution 
computation and data loading of a neural network accelerator includes: a convolution computation array,).
As per claim 4, Zhu, Chao, and Lovell teach the computer-implemented method according to claim 3. Zhu teaches wherein synchronizing comprises evaluating potential savings associated with at least one of the read operations or the write operations ([0037] the parallel method and device for convolution computation and data loading of a neural network accelerator provided in the present disclosure make convolution computation and data loading completely parallel, and hides part of the convolution kernels and loading time of the next frame of input feature map, thereby greatly reducing total reasoning time of the neural network accelerator, and improving efficiency and real-time performance of the neural network accelerator; the parallel method may remarkably reduce space of the on-chip convolution kernel cache, thereby reducing an area of a chip, and reducing manufacturing cost of the chip; [0013] computing convolution and synchronously loading convolution kernels).
Additionally, Chao teaches potential power savings associated with at least one of the read operations or the write operations ([0039] In FIG. 10, there are 8R and 4W with the DRAM in the memory-adaptive processing method 100a of the present disclosure. Therefore, the memory-adaptive processing method 100a of the present disclosure can reduce DRAM access so as to reduce power consumption.).
As per claim 5, Zhu, Chao, and Lovell teach the computer-implemented method according to claim 3. Zhu teaches wherein the resource optimizer uses a correlation between the read and write operations to instruct the hardware accelerator to select one execution path over another or to change an order of processing ([0052] Step 3, execute convolution computation, put a result into the output feature map cache, and simultaneously storing a next group of 64 convolution kernels into another convolution kernel cache; [0037] the parallel method and device for convolution computation and data loading of a neural network accelerator provided in the present disclosure make convolution computation and data loading completely parallel, and hides part of the convolution kernels and loading time of the next frame of input feature map, thereby greatly reducing total reasoning time of the neural network accelerator, and improving efficiency and real-time performance of the neural network accelerator; the parallel method may remarkably reduce space of the on-chip convolution kernel cache, thereby reducing an area of a chip, and reducing manufacturing cost of the chip; [0005] The prior art generally prepares the input feature map and convolution kernel weight in advance, computes convolution according to the convolution instruction to output the result, and then continues preparing the input feature map and the convolution kernel weight to compute convolution, etc. The prior art has simple, direct and easily understood principles, but does not fully consider parallelism between various operations, which makes the reasoning time of large-scale neural networks too long.).
As per claim 6, Zhu, Chao, and Lovell teach the computer-implemented method according to claim 1. Chao teaches wherein at least one of the one or more network parameters or the one or more memory access metrics is used to determine a power consumption ([0039] FIG. 10 shows a schematic view of read and write access of the DRAM and the feature map cache in the memory-adaptive processing method 100 a of FIG. 6. In FIGS. 8, 9 and 10, the capacity of the feature map cache is three feature map tiles, and there are four input channels (Input Map 0-3) and four output channels (Output Map 0-3) to process (i.e., N=4, M=4). R and W represent read access and write access of the DRAM, respectively. and w represent read access and write access of the feature map cache, respectively. In FIG. 8, there are 13R and 4W with the DRAM in the first conventional processing method. In FIG. 9, there are 10R and 4W with the DRAM in the first conventional processing method. In FIG. 10, there are 8R and 4W with the DRAM in the memory-adaptive processing method 100a of the present disclosure. Therefore, the memory-adaptive processing method 100a of the present disclosure can reduce DRAM access so as to reduce power consumption; [0046] The output feature map tile 300 has a total output size (M×So). The feature map cache 420 has a cache free space size. A size relation among a number N of a plurality of input channels of the input feature maps, an input feature map tile size, a number M of a plurality of output channels of the output feature maps, an output feature map tile size and the cache free space size of the feature map cache 420 is calculated by the processing loop controller 434. The memory-adaptive processing technique includes a transformation of a processing loop structure for the convolutional layer operation according to the size relation. Hence, the memory-adaptive processing system 400 of the present disclosure can reduce DRAM access so as to reduce power consumption.).
As per claim 7, Zhu, Chao, and Lovell teach the computer-implemented method according to claim 1. Zhu teaches wherein the hardware accelerator is a convolutional neural network accelerator that performs at least one of one-dimensional or two-dimensional convolution operations to generate output data by applying one or more of the network parameters to a neural network (Abstract Disclosed are a parallel method and device for convolution computation and data loading of a neural network accelerator; [0035] the convolution computation array is composed of a two-dimensional array, the input feature map cache and the first convolution kernel cache are loaded to execute convolution computation, a result is put into the output feature map cache; [0061] use another convolution kernel cache as a weight input to continue convolution computation.).
As per claim 8, Zhu, Chao, and Lovell teach the computer-implemented method according to claim 1. Zhu teaches wherein the first output is generated by a first neural network layer and serves as an input to a second neural network layer (claim 1 after convolution computation of the layer is completed, interchanging the input feature map cache and the output feature map cache; [0061] Step 4, after convolution computation of the layer is completed, interchange the input feature map cache and the output feature map cache, and use another convolution kernel cache as a weight input to continue convolution computation; [0031] Further, a next frame of input feature map is loaded, and/or convolution kernels in the next layer are loaded after all convolution computation is completed in S5.).
As per claim 9, Zhu, Chao, and Lovell teach the computer-implemented method according to claim 8. Zhu teaches wherein the resource optimizer uses the set of parameters to determine a second set of target locations in the set of memory elements for storing a second output associated with the second neural network layer ([0035] the convolution computation array is composed of a two-dimensional array, the input feature map cache and the first convolution kernel cache are loaded to execute convolution computation, a result is put into the output feature map cache, and moreover, a next group of convolution kernels are stored into corresponding convolution kernel cache sub-blocks in the second convolution kernel cache; and after convolution computation of the layer is completed, the input feature map cache and the output feature map cache are interchanged; [0059] The input feature map cache and the output feature map cache are consistent in structure, stores feature maps according to channels, and store 64 channels into different cache sub-blocks; [0037] the parallel method and device for convolution computation and data loading of a neural network accelerator provided in the present disclosure make convolution computation and data loading completely parallel, and hides part of the convolution kernels and loading time of the next frame of input feature map, thereby greatly reducing total reasoning time of the neural network accelerator, and improving efficiency and real-time performance of the neural network accelerator; the parallel method may remarkably reduce space of the on-chip convolution kernel cache).
As per claim 10, Zhu, Chao, and Lovell teach the computer-implemented method according to claim 1. Chao teaches wherein the one or more memory access metrics are derived by using at least one of a formula, an average, a region, an address range, a value range, or a memory access estimation model ([0033] When the total output size (M×So) is smaller than the cache free space size of the feature map cache (CFS), all the output channels of the output feature map tile 300 are stored in the feature map cache; [0039] In FIG. 10, there are 8R and 4W with the DRAM in the memory-adaptive processing method 100a of the present disclosure).
As per claim 11, Zhu, Chao, and Lovell teach the computer-implemented method according to claim 1. Zhu teaches wherein the read/write synchronizer comprises a partition controller that assigns storage locations associated with one or more output channels ([0059] The input feature map cache and the output feature map cache are consistent in structure, stores feature maps according to channels, and store 64 channels into different cache sub-blocks; [0032] A parallel device for convolution computation and data loading of a neural network accelerator includes: a convolution computation array, and an input feature map cache, an output feature map cache).
As per claim 12, it is a resource optimizer claim of claim 1, so it is rejected for similar reasons. Additionally, Zhu teaches multiple processors; and a non-transitory computer-readable medium or media comprising one or more sets of instructions which, when executed by at least one of the one or more processors, causes steps to be performed ([0054] When a neural network accelerator is started, instructions are sequentially read from an instruction module; [0063] Step 5, load a next frame of input feature map according to setting of an instruction register; [0059] the convolution computation array is composed of a 64×64 two-dimensional array; [0035] the convolution computation array is composed of a two-dimensional array, the input feature map cache and the first convolution kernel cache are loaded to execute convolution computation, a result is put into the output feature map cache; [0032] A parallel device for convolution computation and data loading of a neural network accelerator includes: a convolution computation array,).
As per claim 14, it is a resource optimizer claim of claim 9, so it is rejected for similar reasons.
As per claim 15, it is a resource optimizer claim of claim 6, so it is rejected for similar reasons.
As per claim 16, it is a resource optimizer claim of claim 3, so it is rejected for similar reasons.
As per claim 17, it is a resource optimizer claim of claim 4, so it is rejected for similar reasons.
As per claim 18, it is a resource optimizer claim of claim 5, so it is rejected for similar reasons.
As per claim 19, Zhu, Chao, and Lovell teach the resource optimizer according to claim 12. Chao teaches wherein the set of input data comprises at least one of audio data or image data ([0003] For example, deep learning-based image and voice recognition may be implemented through a trained CNN. The convolutional neural network is widely used in various applications, especially in image and video applications).
As per claim 20, it is a non-transitory computer-readable medium or media claim of claim 1, so it is rejected for similar reasons. Additionally, Zhu teaches a non-transitory computer-readable medium or media comprising one or more sequences of instructions which, when executed by multiple processors, causes steps for increasing hardware accelerator performance in neural network applications ([0054] When a neural network accelerator is started, instructions are sequentially read from an instruction module; [0063] Step 5, load a next frame of input feature map according to setting of an instruction register; [0037] improving efficiency and real-time performance of the neural network accelerator; [0059] the convolution computation array is composed of a 64×64 two-dimensional array; [0035] the convolution computation array is composed of a two-dimensional array, the input feature map cache and the first convolution kernel cache are loaded to execute convolution computation, a result is put into the output feature map cache).
Conclusion
Applicant's amendment necessitated the new ground(s) of rejection presented in this Office action. Accordingly, THIS ACTION IS MADE FINAL. See MPEP § 706.07(a). Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a).
A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action. In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any nonprovisional extension fee (37 CFR 1.17(a)) pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action. In no event, however, will the statutory period for reply expire later than SIX MONTHS from the mailing date of this final action.
Any inquiry concerning this communication or earlier communications from the examiner should be directed to HSING CHUN LIN whose telephone number is (571)272-8522. The examiner can normally be reached Mon - Fri 9AM-5PM.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Aimee Li can be reached at (571) 272-4169. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.
/H.L./Examiner, Art Unit 2195
/Aimee Li/Supervisory Patent Examiner, Art Unit 2195