Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Notice to Applicants
This communication is in response to the Application filed on 03/20/2024.
Claims 1-20 are pending.
Claim Rejections - 35 USC § 103
In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis (i.e., changing from AIA to pre-AIA) for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
This application currently names joint inventors. In considering patentability of the claims the examiner presumes that the subject matter of the various claims was commonly owned as of the effective filing date of the claimed invention(s) absent any evidence to the contrary. Applicant is advised of the obligation under 37 CFR 1.56 to point out the inventor and effective filing dates of each claim that was not commonly owned as of the effective filing date of the later invention in order for the examiner to consider the applicability of 35 U.S.C. 102(b)(2)(C) for any potential 35 U.S.C. 102(a)(2) prior art against the later invention.
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.
Claims 1-4, 6-13, and 15-20 are rejected under 35 U.S.C. 103 as being unpatentable over Venkatesh (U.S. Publication No. 2021/0019633) in view of Selvam et al. (NPL - FuSeConv: Fully Separable Convolutions for Fast Inference on Systolic Arrays) (hereafter, "Selvam") and further in view of KIM et al. (U.S. Publication No. 2023/0267313) (hereafter, "KIM").
Venkatesh teaches A method for processing an input image using a hardware integrated circuit configured to implement a convolutional neural network ([0042] the AI accelerator 108 can provide hardware acceleration for artificial intelligence applications, including artificial neural networks, machine vision and machine learning ... the AI accelerator 108 can include ... application-specific integrated circuit (ASIC)) comprising a plurality of neural network layers, the plurality of neural network layers ([0029] A neural network 114 of the AI accelerator 108 can include any type of neural network including, for example, a convolution neural network (CNN); [0030] The convolution neural network can include one or more convolution cells (or pooling layers) and kernels) comprising a group convolution layer, the method comprising ([0057] performing a convolution on channels of a plurality of channels, e.g., that are in a corresponding partition among a plurality of partitions; [0058] Group convolution can enable power savings by limiting or reducing the amount of processing to subsets of weight and/or activation information among multiple channels): identifying a control parameter that defines a plurality of partitions along a channel dimension of an input feature map ([0004] the circuitry may be configured to circularly shift the plurality of channels arranged in the first order to the second order by a determined number of channels; [0008] the plurality of channels arranged in the first order may be circularly shifted to the second order by a determined number of channels; [0069] the accelerator 230 may circularly shift the first plurality of channels 231 arranged in the first order to a different order by more than one channel (e.g., channel position or channel size); [0062] an AI accelerator (or an “accelerator”) can receive an (M×K×D) input data 211 … where D is the number of channels); determining a mapping of the plurality of partitions to a plurality of multiply 
accumulate cells (MACs) ([0071] for each of the partitions 37, 38, 39, the MAC unit 239 of the accelerator 230 may perform a convolution on channels of the second plurality of channels 232 that are in a corresponding second partition, for a second layer of the neural network. For example, the MAC unit 239 may perform a convolution on the entirety of channels C9, C1, C2 in the partition 37, perform a convolution on the entirety of channels C3, C4, C5 in the partition 38, and perform a convolution on the entirety of channels C6, C7, C8 in the partition 39) in a computational unit of the integrated circuit ([0044] In a neural network 114 (e.g., artificial neural network) implemented in the AI accelerator 108, neurons can take various forms and can be referred to as processing elements (PEs); [0049] a PE 120 can include one or more multiply-accumulate (MAC) units or circuits 140); applying, for the group convolution layer, a group convolution to the input feature map ([0047] Referring again to FIG. 1B, the input x to a PE 120 can be part of an input stream 132 that is read or accessed from a storage device 126 (e.g., SRAM). An input stream 132 can be directed to one row (horizontal bank or group) of PEs, and can be shared across one or more of the PEs, or partitioned into data portions (overlapping or non-overlapping data portions) as inputs for respective PEs. 
Weights 134 (or weight information) in a weight stream (e.g., read from the storage device 126) can be directed or provided to a column (vertical bank or group) of PEs), comprising, for each of the plurality of partitions ([0071] for each of the partitions 37, 38, 39, the MAC unit 239 of the accelerator 230 may perform a convolution on channels of the second plurality of channels) … providing, via an input bus of the integrated circuit, a respective input of the input feature map to each MAC in the subset; and ([0047] An input stream 132 can be directed to one row (horizontal bank or group) of PEs, and can be shared across one or more of the PEs, or partitioned into data portions (overlapping or non-overlapping data portions) as inputs for respective PEs ... The input and/or weight for each target PE can be directly routed (e.g., from the storage device 126) to the target PE (e.g., without passing through other PE(s)), or can be routed through one or more PEs (e.g., along a row or column of PEs) to the target PE; [0049] a PE 120 can include one or more multiply-accumulate (MAC) units or circuits 140; [0062] an AI accelerator (or an “accelerator”) can receive an (M×K×D) input data 211).
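For illustration only, the per-partition convolution mechanism quoted from Venkatesh [0057]-[0058], [0071] can be sketched as follows. This is a minimal, hypothetical Python sketch (the function name, shapes, and use of 1x1 kernels are the undersigned's illustrative assumptions, not Venkatesh's implementation): the channel dimension of the input feature map is split into partitions per a control parameter, and each partition is convolved only with its own weight slice.

```python
import numpy as np

def group_convolution(x, weights, num_partitions):
    """Illustrative group convolution: split the channel dimension of
    input feature map x (H, W, C) into num_partitions partitions, then
    convolve each partition only with its own weight slice, as in the
    per-partition MAC convolutions quoted from Venkatesh [0071].
    1x1 kernels keep the sketch short: each output value is a
    multiply-accumulate over one partition's channels only."""
    h, w, c = x.shape
    group = c // num_partitions              # channels per partition
    c_out = weights.shape[2]                 # output channels per partition
    out = np.zeros((h, w, num_partitions * c_out))
    for g in range(num_partitions):
        xs = x[:, :, g * group:(g + 1) * group]        # one channel partition
        # a 1x1 convolution is a per-pixel matmul over the partition's channels
        out[:, :, g * c_out:(g + 1) * c_out] = xs @ weights[g]
    return out
```

As the quoted [0058] notes, restricting each convolution to a channel subset reduces the weight and activation data processed per output, which is the asserted power-savings mechanism.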
Venkatesh does not expressly teach based on the determined mapping, providing weights for the group convolution layer to a subset of the plurality of MACs … computing, at each MAC in the subset, a product using the respective input and a corresponding weight for the group convolution layer; and generating an output feature map for the group convolution layer based on an accumulation of products.
However, Selvam teaches based on the determined mapping (Page 1, right column, lines 18-19, FuSeConv generalizes the decomposition fully to separable 1D convolutions along all spatial and depth dimensions; Page 4, left column, the last para. & right column, lines 1-3, The independent 1D convolutions of FuSeConv can be mapped into individual rows of a 2D systolic array. However, this mapping requires a slightly modified dataflow: Each row of a systolic array should have a broadcast link that sends weight values to all PEs in that row ... This modified dataflow can co-exist with the standard systolic flow from top to bottom. For this, the PE can be configured (as shown in Fig. 5) to either read data from the top systolic link or the row broadcast link), providing weights for the group convolution layer to a subset of the plurality of MACs (Page 4, left column, the last para. & right column, lines 1-3, The independent 1D convolutions of FuSeConv can be mapped into individual rows of a 2D systolic array. However, this mapping requires a slightly modified dataflow: Each row of a systolic array should have a broadcast link that sends weight values to all PEs in that row ... This modified dataflow can co-exist with the standard systolic flow from top to bottom. For this, the PE can be configured (as shown in Fig. 5) to either read data from the top systolic link or the row broadcast link; Page 4, right column, lines 16-17, weights are sliced across channels into C/2 1D filters denoted K1 through KC/2; Fig. 6).
It would have been obvious before the effective filing date of the claimed invention to one having ordinary skill in the art to modify the device and method of Venkatesh to incorporate the step/system of broadcasting sliced weights to a subset of PEs based on a specific row-based mapping for the separable convolution taught by Selvam.
The suggestion/motivation for doing so would have been to improve the efficiency in neural network inference on hardware while maintaining accuracy (Page 6, left column, the last para. & right column, the first para., We showed that FuSeConv is efficiently executed on systolic arrays with a modified dataflow. The Full variants of FuSeConv match the accuracy of efficient networks such as MobileNet and MNasNet, but with high speed-up of about 4X … our work motivates automated Network Operator Search (NOS) in complement to ongoing studies on NAS). Further, one skilled in the art could have combined the elements as described above by known method with no change in their respective functions, and the combination would have yielded nothing more than predicted results.
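For illustration only, the modified dataflow quoted from Selvam (a broadcast link per row of the systolic array sending weight values to all PEs in that row) can be sketched as follows. This is a hypothetical Python sketch, not Selvam's implementation; the function name and array shapes are the undersigned's illustrative assumptions.

```python
import numpy as np

def row_broadcast_mac(inputs, row_filters):
    """Sketch of Selvam's modified systolic dataflow: each row of the
    PE array has a broadcast link sending that row's 1D filter to every
    PE in the row; each PE then performs a multiply-accumulate of the
    shared filter against the input slice it holds locally."""
    rows, cols, k = inputs.shape             # one length-k input slice per PE
    out = np.zeros((rows, cols))
    for r in range(rows):
        w = row_filters[r]                   # broadcast once per row
        for c in range(cols):
            out[r, c] = inputs[r, c] @ w     # MAC inside each PE
    return out
```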
The combination of Venkatesh and Selvam does not expressly teach computing, at each MAC in the subset, a product using the respective input and a corresponding weight for the group convolution layer; and generating an output feature map for the group convolution layer based on an accumulation of products.
However, KIM teaches computing, at each MAC in the subset, a product using the respective input and a corresponding weight for the group convolution layer; and ([0079] an N-th type MAC unit 342 (e.g., the first type MAC unit) included in the computation unit 340 may receive a weight value group and a feature value group as an input and then multiply an element in the weight value group by a corresponding element in the feature value group for each channel) generating an output feature map for the group convolution layer based on an accumulation of products ([0081] The computation data processor 350 may generate a transformed output feature map by using MAC partial sums calculated by the plurality of MAC units; [0079] The computation unit 340 may cumulatively add results of performing element-wise multiplication operations channel by channel to thereby output a partial sum that is a result of a MAC operation between the weight value group and the feature value group).
It would have been obvious before the effective filing date of the claimed invention to one having ordinary skill in the art to modify the device and method of the combination of Venkatesh and Selvam to incorporate the step/system of multiplying an element in the weight value group by a corresponding element in the feature value group for each channel and generating an output feature map based on an accumulation of element-wise multiplications taught by KIM.
The suggestion/motivation for doing so would have been to improve the hardware and computational efficiency ([0061] the output feature map may be obtained with a reduced number of multiplication operations compared to when performing a convolution operation in a spatial domain while achieving the same result as in the convolution operation in the spatial domain; [0076] in order to reduce the area of hardware performing computations, a bit length of a transformed feature value or a transformed weight value may need to be reduced by using various types of MAC units). Further, one skilled in the art could have combined the elements as described above by known method with no change in their respective functions, and the combination would have yielded nothing more than predicted results. Therefore, it would have been obvious to combine Venkatesh and Selvam with KIM to obtain the invention as specified in claim 1.
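For illustration only, the MAC operation quoted from KIM [0079] and [0081] (element-wise multiplication channel by channel, cumulative addition into a partial sum, and assembly of the output feature map from partial sums) can be sketched as follows. This is a hypothetical Python sketch; the function names and shapes are the undersigned's illustrative assumptions, not KIM's implementation.

```python
import numpy as np

def mac_partial_sum(weight_group, feature_group):
    """Sketch of the MAC operation quoted from KIM [0079]: multiply
    each element of the weight value group by the corresponding element
    of the feature value group, channel by channel, then cumulatively
    add the products into a single partial sum."""
    products = weight_group * feature_group   # element-wise, per channel
    return products.sum()                     # accumulation of products

def output_feature_map(weight_groups, feature_groups, shape):
    """Assemble an output feature map from per-position partial sums,
    loosely mirroring KIM's computation data processor [0081]."""
    sums = [mac_partial_sum(w, f) for w, f in zip(weight_groups, feature_groups)]
    return np.array(sums).reshape(shape)
```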
Regarding claim 2, the combination of Venkatesh and Selvam with KIM teaches all the limitations of claim 1 above. Venkatesh teaches wherein determining a mapping of the plurality of partitions to the plurality of multiply accumulate cells comprises ([0071] for each of the partitions 37, 38, 39, the MAC unit 239 of the accelerator 230 may perform a convolution on channels of the second plurality of channels 232 that are in a corresponding second partition, for a second layer of the neural network. For example, the MAC unit 239 may perform a convolution on the entirety of channels C9, C1, C2 in the partition 37, perform a convolution on the entirety of channels C3, C4, C5 in the partition 38, and perform a convolution on the entirety of channels C6, C7, C8 in the partition 39): determining the mapping based on a number of channels in each of the plurality of partitions ([0062] an AI accelerator (or an “accelerator”) can receive an (M×K×D) input data 211 and a (K×N×D) kernel matrix 212 as kernels, where D is the number of channels. The input data 211 may include D number of channel data (e.g., single channel data 218), corresponding to the number of channels ... Examples of the D number of channels can be found in red-green-blue (RGB) data, in which the number of channels is three).
Regarding claim 3, the combination of Venkatesh and Selvam with KIM teaches all the limitations of claim 2 above. Selvam teaches wherein: each partition of the plurality of partitions comprises a respective quantity of input channels that correspond to a respective size of the partition (Page 4, right column, lines 6-16, Fig. 6 illustrates how a FuSeConv is mapped onto a systolic array of size S x S with the modified dataflow. We choose the half variant (i.e., D = 2), and only show the mapping of row 1D filters. The mapping of column filters and the full variant follows similarly and the 1x1 pointwise convolution of FuSeConv is mapped to the standard systolic dataflow. The input is sliced into its W rows, denoted A1 through AW, and each slice is allocated to one row of the systolic array. Every row is further sliced across the channels into C/2 channel slices denoted A1,1 through A1,C/2).
Regarding claim 4, the combination of Venkatesh and Selvam with KIM teaches all the limitations of claim 3 above. Selvam teaches wherein generating the output feature map comprises: generating the output feature map based on the respective size of each partition (Page 4, right column, lines 12-21, The input is sliced into its W rows, denoted A1 through AW, and each slice is allocated to one row of the systolic array. Every row is further sliced across the channels into C/2 channel slices denoted A1,1 through A1,C/2 ... The computation follows multiple folds wherein at every fold one weight channel slice operates over one input-channel slice generating S output feature map slices. After all folds are computed, the output slices are concatenated to form the output feature map).
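For illustration only, the fold schedule quoted from Selvam (one weight slice operating over one input channel slice per fold, with the per-fold output slices concatenated to form the output feature map) can be sketched as follows. This is a hypothetical Python sketch; the function name and use of 1D "valid" convolution are the undersigned's illustrative assumptions.

```python
import numpy as np

def fold_and_concat(channel_slices, filters):
    """Sketch of Selvam's fold schedule: each fold applies one 1D
    weight filter to one input channel slice, and after all folds the
    output slices are concatenated to form the output feature map, so
    the partition (slice) sizes directly determine the output layout."""
    out_slices = []
    for a, k in zip(channel_slices, filters):     # one fold per channel slice
        out_slices.append(np.convolve(a, k, mode="valid"))
    return np.stack(out_slices)                   # concatenated output slices
```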
Regarding claim 6, the combination of Venkatesh and Selvam with KIM teaches all the limitations of claim 1 above. Venkatesh teaches wherein the input bus ([0047] An input stream 132) includes a broadcast function and the method further comprises: broadcasting, via the input bus and for each partition, multiple inputs of the input feature map to the computational unit of the integrated circuit ([0047] FIG. 1B, the input x to a PE 120 can be part of an input stream 132 … An input stream 132 can be directed to one row (horizontal bank or group) of PEs, and can be shared across one or more of the PEs, or partitioned into data portions (overlapping or non-overlapping data portions) as inputs for respective PEs ... The input and/or weight for each target PE can be directly routed (e.g., from the storage device 126) to the target PE; [0049] a PE 120 can include one or more multiply-accumulate (MAC) units or circuits 140; [0062] The input data 211 may include D number of channel data (e.g., single channel data 218)).
Regarding claim 7, the combination of Venkatesh and Selvam with KIM teaches all the limitations of claim 6 above. Venkatesh teaches further comprising: broadcasting, via the input bus and for a first partition of the input feature map, first inputs of the first partition to each MAC in the subset ([0047] FIG. 1B, the input x to a PE 120 can be part of an input stream 132 … An input stream 132 can be directed to one row (horizontal bank or group) of PEs, and can be shared across one or more of the PEs, or partitioned into data portions (overlapping or non-overlapping data portions) as inputs for respective PEs ... The input and/or weight for each target PE can be directly routed (e.g., from the storage device 126) to the target PE; [0049] a PE 120 can include one or more multiply-accumulate (MAC) units or circuits 140; [0062] The input data 211 may include D number of channel data (e.g., single channel data 218)); wherein the first inputs that are broadcast are reused during computations for the group convolution layer ([0061] the accelerator can be configured to use a channel identifier (ID) to output an address in the memory using an address mapping function; [0071] the accelerator 230 may partition the second plurality of channels 232 into a second plurality of partitions including three partitions 37, 38, 39 … the number of the plurality of second partitions (e.g., three partitions 37, 38, 39 in FIG. 2C) may be the same as that of the plurality of first partitions (e.g., three partitions 27, 28, 29 in FIG. 2C). Each of the plurality of second partitions may have at least one channel common with a corresponding one of the plurality of first partitions, and can have at least one channel that is different from channels in the corresponding one of the plurality of first partitions. 
For example, the partition 37 can have two channels (e.g., channels C1 and C2) common with the partition 27, and one channel that is different … the MAC unit 239 of the accelerator 230 may perform a convolution on channels of the second plurality of channels 232 that are in a corresponding second partition).
Regarding claim 8, the combination of Venkatesh and Selvam with KIM teaches all the limitations of claim 7 above. Venkatesh teaches wherein: the first partition of the input feature map corresponds to a first partition of the output feature map ([0071] the accelerator 230 may partition the second plurality of channels 232 into a second plurality of partitions including three partitions 37, 38, 39 … the number of the plurality of second partitions (e.g., three partitions 37, 38, 39 in FIG. 2C) may be the same as that of the plurality of first partitions (e.g., three partitions 27, 28, 29 in FIG. 2C). Each of the plurality of second partitions may have at least one channel common with a corresponding one of the plurality of first partitions, and can have at least one channel that is different from channels in the corresponding one of the plurality of first partitions. For example, the partition 37 can have two channels (e.g., channels C1 and C2) common with the partition 27, and one channel that is different; [0039]); and the first inputs have reuse over outputs of the first partition of the output feature map ([0071] after performing the convolution on the second plurality of channels 232 in respective partitions of the second plurality of partitions, the MAC unit 239 may overwrite or update the second plurality of channels with a result of the convolution corresponding to respective partitions … after performing the convolution on the second plurality of channels 232 in respective partitions of the second plurality of partitions, each partition 37, 38, 39 may include a result of the convolution on a corresponding partition of channels in data of the second layer of the neural network … the MAC unit 239 can store the second plurality of channels C9, C1, C2, C3, . . . 
, C8 in a same continuous range of addresses in the storage device 237 as the data of the second layer of the neural network; [0061] the accelerator can be configured to use a channel identifier (ID) to output an address in the memory using an address mapping function).
Regarding claim 9, the combination of Venkatesh and Selvam with KIM teaches all the limitations of claim 1 above. KIM teaches wherein generating the output feature map comprises ([0081] The computation data processor 350 may generate a transformed output feature map by using MAC partial sums calculated by the plurality of MAC units; [0079] The computation unit 340 may cumulatively add results of performing element-wise multiplication operations channel by channel to thereby output a partial sum that is a result of a MAC operation between the weight value group and the feature value group): computing a plurality of products using the subset of the plurality of MACs ([0078] referring to FIG. 3E, the computation unit 340 may include a first type MAC unit, a second type MAC unit, and a third type MAC unit; [0136] Referring to FIG. 7A, the first type MAC unit may include a plurality of multiplier units 710 and an accumulator 750 that accumulates and adds outputs respectively from the plurality of multiplier units); and generating the accumulation of products from the plurality of products ([0164] The first type MAC unit may output a partial sum by performing a MAC operation between the weight value group and the feature value group, i.e., by respectively obtaining multiplication operation results (multiplication operation result A to multiplication operation result N) from the plurality of multiplier units and accumulating and adding the plurality of multiplication operation results using the accumulator 750; [0136] the first type MAC unit may include a plurality of multiplier units 710 and an accumulator 750 that accumulates and adds outputs respectively from the plurality of multiplier units).
With respect to claim 10, arguments analogous to those presented for claim 1 are applicable.
With respect to claim 11, arguments analogous to those presented for claim 2 are applicable.
With respect to claim 12, arguments analogous to those presented for claim 3 are applicable.
With respect to claim 13, arguments analogous to those presented for claim 4 are applicable.
With respect to claim 15, arguments analogous to those presented for claim 6 are applicable.
With respect to claim 16, arguments analogous to those presented for claim 7 are applicable.
With respect to claim 17, arguments analogous to those presented for claim 8 are applicable.
With respect to claim 18, arguments analogous to those presented for claim 9 are applicable.
With respect to claim 19, arguments analogous to those presented for claim 1 are applicable.
With respect to claim 20, arguments analogous to those presented for claim 3 are applicable.
Claims 5 and 14 are rejected under 35 U.S.C. 103 as being unpatentable over Venkatesh (U.S. Publication No. 2021/0019633) in view of Selvam et al. (NPL - FuSeConv: Fully Separable Convolutions for Fast Inference on Systolic Arrays) (hereafter, "Selvam") and further in view of KIM et al. (U.S. Publication No. 2023/0267313) (hereafter, "KIM") and Raha et al. (U.S. Publication No. 2021/0397414) (hereafter, "Raha").
Regarding claim 5, the combination of Venkatesh and Selvam with KIM teaches all the limitations of claim 3 above. The combination of Venkatesh and Selvam with KIM does not expressly teach further comprising: accessing information describing a hardware configuration of the computational unit; and determining the respective size of each partition based on the hardware configuration of the computational unit.
However, Raha teaches further comprising: accessing information describing a hardware configuration of the computational unit ([0038] data is implemented and fed to the multi-precision MAC unit based on sparsity ... “MultiMAC” is an area-efficient, multi-precision multiply and accumulate unit-based processing element in DNN accelerators; [0073] The same logic can be kept intact and easily be applied for MultiMAC by introducing the concept of block sparsity where each bit in bitmap can either represent 1, 2, 4, or 8 ICs based on whether UINT8/INT8, UINT4/INT4, UINT2/INT2, or binary mode (BIN), respectively, are active); and determining the respective size of each partition based on the hardware configuration of the computational unit ([0039] a multi-precision MAC processor 40 that includes a plurality of arithmetic blocks 42 (42a-42n, e.g., arithmetic logic units/ALUs), wherein the plurality of arithmetic blocks 42 share a single multiplier size 44 that is uniform across the plurality of arithmetic blocks 42 ... each of the arithmetic blocks 42 includes a set of multipliers 46 and each of the multipliers 46 operates on the same number of bits. Additionally, the single multiplier size 44 is less than a maximum precision size supported by the plurality of arithmetic blocks. In one example, the maximum precision size is eight bits … and the single multiplier size 44 is five bits (5b) ... the multi-precision MAC processor 40 includes logic (e.g., logic instructions, configurable hardware, fixed-functionality hardware, etc., or any combination thereof) to arrange sparsity information for activations and weights in accordance with a bitmap format that is common to multiple precisions).
It would have been obvious before the effective filing date of the claimed invention to one having ordinary skill in the art to modify the device and method of the combination of Venkatesh and Selvam with KIM to incorporate the step/system of identifying the available hardware configuration and determining how to partition the input data (size) based on the precision defined by the hardware configuration taught by Raha.
The suggestion/motivation for doing so would have been to improve the performance of MAC operations ([0001] embodiments relate to area and energy efficient multi-precision MAC (“MultiMAC”) unit-based processors; [0037] The introduction of a multi-precision MAC basically leads to performance improvements that can significantly improve two measurable metrics for DNN accelerators). Further, one skilled in the art could have combined the elements as described above by known method with no change in their respective functions, and the combination would have yielded nothing more than predicted results. Therefore, it would have been obvious to combine Venkatesh, Selvam and KIM with Raha to obtain the invention as specified in claim 5.
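For illustration only, the precision-to-partition relationship quoted from Raha [0073] (each bitmap bit representing 1, 2, 4, or 8 input channels depending on whether 8-bit, 4-bit, 2-bit, or binary mode is active) can be sketched as follows. This is a hypothetical Python sketch; the function name and dictionary encoding are the undersigned's illustrative assumptions, not Raha's implementation.

```python
def channels_per_bitmap_bit(precision_bits):
    """Sketch of Raha's block-sparsity point [0073]: the number of
    input channels each bitmap bit represents (a natural partition
    granularity) follows from the active precision mode of the MAC
    hardware: 8-bit -> 1 channel, 4-bit -> 2, 2-bit -> 4,
    binary (1-bit) -> 8."""
    mapping = {8: 1, 4: 2, 2: 4, 1: 8}
    if precision_bits not in mapping:
        raise ValueError(f"unsupported precision: {precision_bits}")
    return mapping[precision_bits]
```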
With respect to claim 14, arguments analogous to those presented for claim 5 are applicable.
Conclusion
Any inquiry concerning this communication or earlier communications from the examiner should be directed to DANIEL C. CHANG whose telephone number is (571)270-1277. The examiner can normally be reached Monday-Thursday and Alternate Fridays 8:00-5:00.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Chan S. Park can be reached at (571) 272-7409. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.
/DANIEL C CHANG/Examiner, Art Unit 2669 /CHAN S PARK/Supervisory Patent Examiner, Art Unit 2669