DETAILED ACTION
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Claim Rejections - 35 USC § 102
In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis (i.e., changing from AIA to pre-AIA) for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
The following is a quotation of the appropriate paragraphs of 35 U.S.C. 102 that form the basis for the rejections under this section made in this Office action:
A person shall be entitled to a patent unless –
(a)(1) the claimed invention was patented, described in a printed publication, or in public use, on sale, or otherwise available to the public before the effective filing date of the claimed invention.
Claims 1-20 are rejected under 35 U.S.C. 102(a)(1) as being anticipated by Nair et al. (US 2021/0256363 A1, hereinafter Nair).
Regarding claim 1:
Nair shows:
“A data stream-based computation unit, comprising a plurality of computation circuits, each computation circuit comprising a first input terminal and a second input terminal, wherein M first input terminals of M computation circuits in the plurality of computation circuits are configured to receive M pieces of first data required for performing a computation task on a one-to-one basis, where M≥2 and M is a positive integer; M second input terminals of the M computation circuits are configured to receive M pieces of second data distinct from each other required for performing the computation task on a one-to-one basis;” (Paragraph [0022]: “A processor system for performing efficient convolution operations using a channel convolution processor is disclosed. Using the disclosed techniques, the throughput and power efficiency for computing convolution operations and in particular depthwise convolutions is significantly increased particularly for input activation data with small width and height dimensions. In some embodiments, the processor system includes a channel convolution processor unit capable of performing convolution operations using two input matrices by applying different weight matrices to different channels of portions of a data convolution matrix. The channel convolution processor unit includes a plurality of calculation units such as vector units used to process input vectors of the input matrices. In various embodiments, a calculation unit includes at least a vector multiply unit and a vector adder unit. The vector multiply unit is capable of performing multiply operations using corresponding elements of two input vectors, data elements from the same channel and weight input elements from a weight matrix. In some embodiments, the vector adder unit is used to sum the vector of multiplication results computed using a vector multiply unit. 
For example, the vector adder unit can be used to compute the dot product result of two vectors using the vector multiplication results of vector elements from corresponding input vectors. In some embodiments, the vector adder unit is an adder tree. For example, an adder tree computes the sum of the multiplication results by summing multiplication results and subsequent partial sums in parallel.” And in paragraph [0033]: “multiple instances of processing element 101 can operate in parallel to process different portions of an activation data input matrix. For example, each processing element can retrieve its assigned data elements of the activation data input matrix and corresponding weight matrices from memory 161. In some embodiments, different processing elements share weight matrices and the data elements of the shared weight matrices can be broadcasted to the appropriate processing elements to improve memory efficiency. Each processing element performs depthwise convolution operations on the assigned portions of the activation data input matrix using its own channel convolution processor unit. The results of each processing element can be combined, for example, by writing the results to a shared memory location such as memory 161. In some embodiments, channel convolution processor unit 107 includes the functionality of data input unit 103, weight input unit 105, and/or output unit 151.”)
“the M computation circuits are configured to perform the computation task in parallel on the basis of the M pieces of first data and the M pieces of second data, wherein each computation circuit of the M computation circuits is configured to perform the computation task on the basis of one piece of first data and one piece of second data.” (Paragraph [0022]: “A processor system for performing efficient convolution operations using a channel convolution processor is disclosed. Using the disclosed techniques, the throughput and power efficiency for computing convolution operations and in particular depthwise convolutions is significantly increased particularly for input activation data with small width and height dimensions. In some embodiments, the processor system includes a channel convolution processor unit capable of performing convolution operations using two input matrices by applying different weight matrices to different channels of portions of a data convolution matrix. The channel convolution processor unit includes a plurality of calculation units such as vector units used to process input vectors of the input matrices. In various embodiments, a calculation unit includes at least a vector multiply unit and a vector adder unit. The vector multiply unit is capable of performing multiply operations using corresponding elements of two input vectors, data elements from the same channel and weight input elements from a weight matrix. In some embodiments, the vector adder unit is used to sum the vector of multiplication results computed using a vector multiply unit. For example, the vector adder unit can be used to compute the dot product result of two vectors using the vector multiplication results of vector elements from corresponding input vectors. In some embodiments, the vector adder unit is an adder tree. 
For example, an adder tree computes the sum of the multiplication results by summing multiplication results and subsequent partial sums in parallel.” And in paragraph [0033]: “multiple instances of processing element 101 can operate in parallel to process different portions of an activation data input matrix. For example, each processing element can retrieve its assigned data elements of the activation data input matrix and corresponding weight matrices from memory 161. In some embodiments, different processing elements share weight matrices and the data elements of the shared weight matrices can be broadcasted to the appropriate processing elements to improve memory efficiency. Each processing element performs depthwise convolution operations on the assigned portions of the activation data input matrix using its own channel convolution processor unit. The results of each processing element can be combined, for example, by writing the results to a shared memory location such as memory 161. In some embodiments, channel convolution processor unit 107 includes the functionality of data input unit 103, weight input unit 105, and/or output unit 151.”)
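For illustration only, the parallel arrangement recited in claim 1 and described in Nair at paragraph [0033] might be sketched in Python as follows. The names computation_circuit and computation_unit are hypothetical, a dot product is assumed as the computation task, and a thread pool merely stands in for parallel hardware; none of these details is taken from Nair or from the application.

```python
from concurrent.futures import ThreadPoolExecutor


def computation_circuit(first_data, second_data):
    """One circuit: consumes one piece of first data and one piece of second data.

    Assumption: the computation task is a dot product of the two inputs.
    """
    return sum(a * b for a, b in zip(first_data, second_data))


def computation_unit(first_inputs, second_inputs):
    """Feed M (first, second) pairs to M circuits on a one-to-one basis."""
    m = len(first_inputs)
    if m < 2 or m != len(second_inputs):
        raise ValueError("need M >= 2 matching pieces of first and second data")
    # The thread pool is only a software stand-in for M circuits running in parallel.
    with ThreadPoolExecutor(max_workers=m) as pool:
        return list(pool.map(computation_circuit, first_inputs, second_inputs))


# Two circuits (M = 2), each receiving its own pair of inputs.
first = [[1, 2, 3], [4, 5, 6]]
second = [[1, 0, 1], [0, 1, 0]]
results = computation_unit(first, second)
```

In this sketch the i-th result depends only on the i-th piece of first data and the i-th piece of second data, which is the one-to-one pairing the claim language recites.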
Regarding claim 2:
Nair shows the data stream-based computation unit of claim 1.
And Nair shows “wherein the computation task is a computation in a neural network model.” (Paragraph [0044]: “FIG. 4 is a block diagram illustrating an embodiment of a channel convolution processor unit for solving artificial intelligence problems using a neural network. In the example shown, channel convolution processor unit 400 includes multiple vector units including vector units 401, 411, 421, and 451. The three dots between vector units 421 and 451 indicate optional additional vector units (not shown). In various embodiments, a channel convolution processor unit may include more or fewer vector units. The number of vector units corresponds to the number of channels and associated weight matrices that can be processed in parallel. For example, a channel convolution processor unit may include 32 vector units, each capable of processing a channel of a portion of an activation data input matrix with an associated weight matrix.”)
Regarding claim 3:
Nair shows the data stream-based computation unit of claim 2.
And Nair shows “wherein the computation task is a convolution, and the M pieces of first data are identical to each other; each piece of first data comprises feature map data corresponding to M pieces of convolution kernel data in a feature map, and each piece of second data comprises one of the M pieces of convolution kernel data.” (Paragraph [0044]: “FIG. 4 is a block diagram illustrating an embodiment of a channel convolution processor unit for solving artificial intelligence problems using a neural network. In the example shown, channel convolution processor unit 400 includes multiple vector units including vector units 401, 411, 421, and 451. The three dots between vector units 421 and 451 indicate optional additional vector units (not shown). In various embodiments, a channel convolution processor unit may include more or fewer vector units. The number of vector units corresponds to the number of channels and associated weight matrices that can be processed in parallel. For example, a channel convolution processor unit may include 32 vector units, each capable of processing a channel of a portion of an activation data input matrix with an associated weight matrix.” And in paragraph [0046]: “channel convolution processor unit 400 includes multiple vector units that each include a vector multiply and a vector adder unit. Each vector multiply unit, such as vector multiply units 403, 413, 423, or 453, is configured to multiply corresponding elements received via a data input unit (not shown) and a weight input unit (not shown)”)
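For illustration only, the arrangement recited in claim 3, in which every computation circuit receives the same feature map data but a different piece of convolution kernel data, might be sketched in Python as follows. The function names and the sample 3×3 values are hypothetical and appear in neither the claims nor Nair.

```python
def convolve_window(window, kernel):
    """Multiply corresponding elements of two equally sized matrices and sum them."""
    return sum(
        w * k
        for window_row, kernel_row in zip(window, kernel)
        for w, k in zip(window_row, kernel_row)
    )


def broadcast_convolution(window, kernels):
    """Apply M distinct kernels to one shared feature map window.

    Each list entry models one computation circuit: the first data (window)
    is identical for all circuits, the second data (kernel) differs.
    """
    return [convolve_window(window, kernel) for kernel in kernels]


window = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
kernels = [
    [[1, 1, 1], [1, 1, 1], [1, 1, 1]],  # sums every element of the window
    [[0, 0, 0], [0, 1, 0], [0, 0, 0]],  # picks out the centre element
]
outputs = broadcast_convolution(window, kernels)
```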
Regarding claim 4:
Nair shows the data stream-based computation unit of claim 3.
And Nair shows: “wherein the feature map data comprises feature map sub-data of N channels, each piece of convolution kernel data comprises weight data of N channels, either the feature map sub-data of each channel or the weight data of each channel is an m×n matrix, where N≥2, n≥1, m≥1, and N, m and n are all positive integers; at least one computation circuit of the M computation circuits comprises: a plurality of multipliers, each multiplier of N multipliers in the plurality of multipliers being configured to multiply an element in an i-th row and a j-th column of the feature map sub-data of a corresponding channel by an element in an i-th row and a j-th column of the weight data of the corresponding channel to obtain a plurality of first computation results, wherein the N multipliers correspond to the feature map sub-data of N channels on a one-to-one basis and correspond to the weight data of N channels on a one-to-one basis, where 1≤i≤m, 1≤j≤n, and i and j are both positive integers; and an accumulator configured to perform an accumulation operation to obtain a result of the convolution, the accumulation operation comprising performing a first accumulation operation on the plurality of first computation results of each of the N multipliers.” (Paragraph [0044]: “FIG. 4 is a block diagram illustrating an embodiment of a channel convolution processor unit for solving artificial intelligence problems using a neural network. In the example shown, channel convolution processor unit 400 includes multiple vector units including vector units 401, 411, 421, and 451. The three dots between vector units 421 and 451 indicate optional additional vector units (not shown). In various embodiments, a channel convolution processor unit may include more or fewer vector units. The number of vector units corresponds to the number of channels and associated weight matrices that can be processed in parallel. 
For example, a channel convolution processor unit may include 32 vector units, each capable of processing a channel of a portion of an activation data input matrix with an associated weight matrix. In some embodiments, each vector unit includes a vector multiply unit and a vector adder unit. In the example shown, vector unit 401 includes vector multiply unit 403 and vector adder unit 405. Similarly, vector unit 411 includes vector multiply unit 413 and vector adder unit 415, vector unit 421 includes vector multiply unit 423 and vector adder unit 425, and vector unit 451 includes vector multiply unit 453 and vector adder unit 455. In various embodiments, channel convolution processor unit 400 is channel convolution processor unit 107 of FIG. 1 and vector units 401, 411, 421, and 451 are vector units 111, 121, 131, and 141 of FIG. 1, respectively.”)
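For illustration only, the multiplier and accumulator structure recited in claim 4 might be sketched in Python as follows, assuming N = 2 channels and 2×2 matrices. The function names and sample values are hypothetical and are not taken from Nair or from the application.

```python
def multiplier_outputs(feature_channel, weight_channel):
    """One multiplier: element (i, j) of a channel's feature map sub-data
    times element (i, j) of the same channel's weight data."""
    return [
        f * w
        for feature_row, weight_row in zip(feature_channel, weight_channel)
        for f, w in zip(feature_row, weight_row)
    ]


def convolution_result(feature_map, kernel):
    """N multipliers (one per channel) followed by an accumulator."""
    if len(feature_map) != len(kernel):
        raise ValueError("feature map and kernel need the same channel count")
    first_results = [
        multiplier_outputs(f_ch, w_ch) for f_ch, w_ch in zip(feature_map, kernel)
    ]
    # The accumulator sums the first computation results of every multiplier.
    return sum(sum(channel_products) for channel_products in first_results)


feature_map = [[[1, 2], [3, 4]], [[5, 6], [7, 8]]]  # N = 2 channels, each 2x2
kernel = [[[1, 0], [0, 1]], [[0, 1], [1, 0]]]       # matching weight data
result = convolution_result(feature_map, kernel)
```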
Regarding claim 5:
Nair shows the data stream-based computation unit of claim 4.
And Nair shows “wherein the accumulation operation further comprises a second accumulation operation on a result of the first accumulation operation and a piece of bias data to obtain the result of the convolution.” (Paragraph [0044]: “FIG. 4 is a block diagram illustrating an embodiment of a channel convolution processor unit for solving artificial intelligence problems using a neural network. In the example shown, channel convolution processor unit 400 includes multiple vector units including vector units 401, 411, 421, and 451. The three dots between vector units 421 and 451 indicate optional additional vector units (not shown). In various embodiments, a channel convolution processor unit may include more or fewer vector units. The number of vector units corresponds to the number of channels and associated weight matrices that can be processed in parallel. For example, a channel convolution processor unit may include 32 vector units, each capable of processing a channel of a portion of an activation data input matrix with an associated weight matrix. In some embodiments, each vector unit includes a vector multiply unit and a vector adder unit. In the example shown, vector unit 401 includes vector multiply unit 403 and vector adder unit 405. Similarly, vector unit 411 includes vector multiply unit 413 and vector adder unit 415, vector unit 421 includes vector multiply unit 423 and vector adder unit 425, and vector unit 451 includes vector multiply unit 453 and vector adder unit 455. In various embodiments, channel convolution processor unit 400 is channel convolution processor unit 107 of FIG. 1 and vector units 401, 411, 421, and 451 are vector units 111, 121, 131, and 141 of FIG. 1, respectively.” And in paragraph [0046]: “channel convolution processor unit 400 includes multiple vector units that each include a vector multiply and a vector adder unit. 
Each vector multiply unit, such as vector multiply units 403, 413, 423, or 453, is configured to multiply corresponding elements received via a data input unit (not shown) and a weight input unit (not shown). In some embodiments, the result is a vector of multiplication results. For example, for two 9-byte input vectors corresponding to two 3×3 matrices, the result of a vector multiply unit is a vector of 9 multiplication results. The first element from a data input vector is multiplied with the first element of a weight input vector. Similarly, the second element from a data input vector is multiplied with the second element of a weight input vector. In various embodiments, corresponding elements from a data input vector and a weight input vector are multiplied in parallel. In various embodiments, the vector of multiplication results is passed to a vector adder unit of the vector unit. For example, vector multiply unit 403 passes its multiplication results to vector adder unit 405, vector multiply unit 413 passes its multiplication results to vector adder unit 415, vector multiply unit 423 passes its multiplication results to vector adder unit 425, and vector multiply unit 453 passes its multiplication results to vector adder unit 455.”)
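For illustration only, the element-wise multiplication described in Nair at paragraph [0046] (two 9-element input vectors yielding 9 multiplication results) followed by the bias accumulation recited in claim 5 might be sketched in Python as follows. The function names, the sample vectors, and the bias value are hypothetical.

```python
def vector_multiply(data_vector, weight_vector):
    """Multiply corresponding elements of a data vector and a weight vector."""
    if len(data_vector) != len(weight_vector):
        raise ValueError("input vectors must have the same length")
    return [d * w for d, w in zip(data_vector, weight_vector)]


def first_accumulation(products):
    """Sum the multiplication results (the role of a vector adder unit)."""
    return sum(products)


def second_accumulation(partial_sum, bias):
    """Add a piece of bias data to the result of the first accumulation."""
    return partial_sum + bias


data_vector = [1, 2, 3, 4, 5, 6, 7, 8, 9]    # a flattened 3x3 data sub-matrix
weight_vector = [1, 1, 1, 1, 1, 1, 1, 1, 1]  # a flattened 3x3 weight matrix
products = vector_multiply(data_vector, weight_vector)
biased_result = second_accumulation(first_accumulation(products), bias=3)
```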
Regarding claim 6:
Nair shows the data stream-based computation unit of claim 4.
And Nair shows “wherein the accumulator comprises: a first accumulator configured to perform N third accumulation operations to obtain N second computation results, wherein performing each third accumulation operation comprises accumulating the plurality of first computation results of one multiplier of the N multipliers to obtain the second computation result; and a second accumulator configured to accumulate the N second computation results to obtain the result of the convolution.” (Paragraph [0044]: “FIG. 4 is a block diagram illustrating an embodiment of a channel convolution processor unit for solving artificial intelligence problems using a neural network. In the example shown, channel convolution processor unit 400 includes multiple vector units including vector units 401, 411, 421, and 451. The three dots between vector units 421 and 451 indicate optional additional vector units (not shown). In various embodiments, a channel convolution processor unit may include more or fewer vector units. The number of vector units corresponds to the number of channels and associated weight matrices that can be processed in parallel. For example, a channel convolution processor unit may include 32 vector units, each capable of processing a channel of a portion of an activation data input matrix with an associated weight matrix. In some embodiments, each vector unit includes a vector multiply unit and a vector adder unit. In the example shown, vector unit 401 includes vector multiply unit 403 and vector adder unit 405. Similarly, vector unit 411 includes vector multiply unit 413 and vector adder unit 415, vector unit 421 includes vector multiply unit 423 and vector adder unit 425, and vector unit 451 includes vector multiply unit 453 and vector adder unit 455. In various embodiments, channel convolution processor unit 400 is channel convolution processor unit 107 of FIG. 1 and vector units 401, 411, 421, and 451 are vector units 111, 121, 131, and 141 of FIG. 1, respectively.” And in paragraph [0046]: “channel convolution processor unit 400 includes multiple vector units that each include a vector multiply and a vector adder unit. Each vector multiply unit, such as vector multiply units 403, 413, 423, or 453, is configured to multiply corresponding elements received via a data input unit (not shown) and a weight input unit (not shown). In some embodiments, the result is a vector of multiplication results. For example, for two 9-byte input vectors corresponding to two 3×3 matrices, the result of a vector multiply unit is a vector of 9 multiplication results. The first element from a data input vector is multiplied with the first element of a weight input vector. Similarly, the second element from a data input vector is multiplied with the second element of a weight input vector. In various embodiments, corresponding elements from a data input vector and a weight input vector are multiplied in parallel. In various embodiments, the vector of multiplication results is passed to a vector adder unit of the vector unit. For example, vector multiply unit 403 passes its multiplication results to vector adder unit 405, vector multiply unit 413 passes its multiplication results to vector adder unit 415, vector multiply unit 423 passes its multiplication results to vector adder unit 425, and vector multiply unit 453 passes its multiplication results to vector adder unit 455.”)
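For illustration only, the two-stage accumulation recited in claim 6 might be sketched in Python as follows. The second stage is written as a pairwise adder tree, a structure Nair mentions at paragraph [0022]; applying it to the second accumulator is an assumption made for this sketch, and the function names and sample values are hypothetical.

```python
def first_accumulator(first_results_per_multiplier):
    """Perform N accumulations, one per multiplier, giving N second results."""
    return [sum(results) for results in first_results_per_multiplier]


def second_accumulator(second_results):
    """Combine the N second results with a pairwise adder tree."""
    values = list(second_results)
    if not values:
        return 0
    while len(values) > 1:
        # Add neighbouring pairs; an unpaired last value is carried forward.
        paired = [values[i] + values[i + 1] for i in range(0, len(values) - 1, 2)]
        if len(values) % 2:
            paired.append(values[-1])
        values = paired
    return values[0]


# First computation results of N = 3 multipliers (sample values).
per_multiplier = [[1, 4], [6, 7], [2, 0]]
second_results = first_accumulator(per_multiplier)
total = second_accumulator(second_results)
```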
Regarding claim 7:
Nair shows the data stream-based computation unit of claim 4.
And Nair shows “wherein a number of the plurality of multipliers is P, where 16≤P≤256, and P is a positive integer.” (Paragraph [0030]: “Although 32 channels are processed using 3×3 matrices for each iteration in the example above, the size of the elements and matrices processed by system 100 can be configured as appropriate. For example, elements may be 4-bits, 8-bits, 2-byte, 4-bytes, or another appropriate size. Similarly, the sub-matrices of the activation data input matrix and weight matrices can be 3×3, 5×5, or another appropriate size.” And in paragraph [0046]: “channel convolution processor unit 400 includes multiple vector units that each include a vector multiply and a vector adder unit. Each vector multiply unit, such as vector multiply units 403, 413, 423, or 453, is configured to multiply corresponding elements received via a data input unit (not shown) and a weight input unit (not shown). In some embodiments, the result is a vector of multiplication results. For example, for two 9-byte input vectors corresponding to two 3×3 matrices, the result of a vector multiply unit is a vector of 9 multiplication results. The first element from a data input vector is multiplied with the first element of a weight input vector. Similarly, the second element from a data input vector is multiplied with the second element of a weight input vector. In various embodiments, corresponding elements from a data input vector and a weight input vector are multiplied in parallel. In various embodiments, the vector of multiplication results is passed to a vector adder unit of the vector unit. For example, vector multiply unit 403 passes its multiplication results to vector adder unit 405, vector multiply unit 413 passes its multiplication results to vector adder unit 415, vector multiply unit 423 passes its multiplication results to vector adder unit 425, and vector multiply unit 453 passes its multiplication results to vector adder unit 455.”)
Regarding claim 8:
Nair shows the data stream-based computation unit of claim 1.
And Nair shows “An artificial intelligence chip, comprising: a data stream-based computation unit according to claim 1; and a data buffer connected to the first input terminal and the second input terminal of the plurality of computation circuits and configured to transmit the M pieces of first data to the M computation circuits on a one-to-one basis and transmit the M pieces of second data to the M computation circuits on a one-to-one basis in response to a drive signal corresponding to the computation task.” (Paragraph [0029]: “FIG. 1 is a block diagram illustrating an embodiment of a system for solving artificial intelligence problems using a neural network. In the example shown, system 100 includes processing element 101 and memory 161. Processing element 101 includes data input unit 103, weight input unit 105, channel convolution processor unit 107, and output unit 151. In some embodiments, processing element 101 is a hardware integrated circuit, for example, an application specific integrated circuit (ASIC) and includes hardware components data input unit 103, weight input unit 105, channel convolution processor unit 107, and output unit 151. As compared to a general purpose processor, processing element 101 is designed and implemented using a specialized hardware integrated circuit to more efficiently perform one or more specific computing tasks related to performing convolution operations and/or solving artificial intelligence problems using a neural network. The specialized hardware results in significant performance improvements and resource efficiencies gained over using a general purpose processor. In the example shown, channel convolution processor unit 107 includes multiple vector calculation units including at least vector units 111, 121, 131, and 141. In various embodiments, channel convolution processor unit 107 receives data input vectors (not shown) from data input unit 103 and weight input vectors (not shown) from weight input unit 105. 
For example, in some embodiments, data input vectors are generated by data input unit 103 that correspond to 2D sub-matrices of a 3D activation data input matrix, where each 2D sub-matrix corresponds to a different channel of the 3D activation data input matrix. Weight input vectors are generated by weight input unit 105 and correspond to different weight matrices. In various embodiments, the 2D sub-matrices of the 3D activation data input matrix and the weight matrices may be 3×3 matrices or another appropriate size. The data elements of the activation data input matrix and the weight input matrices may be stored and retrieved from memory 161.”)
Regarding claim 9:
Nair shows the data stream-based computation unit of claim 2.
And Nair shows “An artificial intelligence chip, comprising: a data stream-based computation unit according to claim 2; and a data buffer connected to the first input terminal and the second input terminal of the plurality of computation circuits and configured to transmit the M pieces of first data to the M computation circuits on a one-to-one basis and transmit the M pieces of second data to the M computation circuits on a one-to-one basis in response to a drive signal corresponding to the computation task.” (Paragraph [0029]: “FIG. 1 is a block diagram illustrating an embodiment of a system for solving artificial intelligence problems using a neural network. In the example shown, system 100 includes processing element 101 and memory 161. Processing element 101 includes data input unit 103, weight input unit 105, channel convolution processor unit 107, and output unit 151. In some embodiments, processing element 101 is a hardware integrated circuit, for example, an application specific integrated circuit (ASIC) and includes hardware components data input unit 103, weight input unit 105, channel convolution processor unit 107, and output unit 151. As compared to a general purpose processor, processing element 101 is designed and implemented using a specialized hardware integrated circuit to more efficiently perform one or more specific computing tasks related to performing convolution operations and/or solving artificial intelligence problems using a neural network. The specialized hardware results in significant performance improvements and resource efficiencies gained over using a general purpose processor. In the example shown, channel convolution processor unit 107 includes multiple vector calculation units including at least vector units 111, 121, 131, and 141. In various embodiments, channel convolution processor unit 107 receives data input vectors (not shown) from data input unit 103 and weight input vectors (not shown) from weight input unit 105. 
For example, in some embodiments, data input vectors are generated by data input unit 103 that correspond to 2D sub-matrices of a 3D activation data input matrix, where each 2D sub-matrix corresponds to a different channel of the 3D activation data input matrix. Weight input vectors are generated by weight input unit 105 and correspond to different weight matrices. In various embodiments, the 2D sub-matrices of the 3D activation data input matrix and the weight matrices may be 3×3 matrices or another appropriate size. The data elements of the activation data input matrix and the weight input matrices may be stored and retrieved from memory 161.”)
Regarding claim 10:
Nair shows the data stream-based computation unit of claim 3.
And Nair shows “An artificial intelligence chip, comprising: a data stream-based computation unit according to claim 3; and a data buffer connected to the first input terminal and the second input terminal of the plurality of computation circuits and configured to transmit the M pieces of first data to the M computation circuits on a one-to-one basis and transmit the M pieces of second data to the M computation circuits on a one-to-one basis in response to a drive signal corresponding to the computation task.” (Paragraph [0029]: “FIG. 1 is a block diagram illustrating an embodiment of a system for solving artificial intelligence problems using a neural network. In the example shown, system 100 includes processing element 101 and memory 161. Processing element 101 includes data input unit 103, weight input unit 105, channel convolution processor unit 107, and output unit 151. In some embodiments, processing element 101 is a hardware integrated circuit, for example, an application specific integrated circuit (ASIC) and includes hardware components data input unit 103, weight input unit 105, channel convolution processor unit 107, and output unit 151. As compared to a general purpose processor, processing element 101 is designed and implemented using a specialized hardware integrated circuit to more efficiently perform one or more specific computing tasks related to performing convolution operations and/or solving artificial intelligence problems using a neural network. The specialized hardware results in significant performance improvements and resource efficiencies gained over using a general purpose processor. In the example shown, channel convolution processor unit 107 includes multiple vector calculation units including at least vector units 111, 121, 131, and 141. In various embodiments, channel convolution processor unit 107 receives data input vectors (not shown) from data input unit 103 and weight input vectors (not shown) from weight input unit 105. 
For example, in some embodiments, data input vectors are generated by data input unit 103 that correspond to 2D sub-matrices of a 3D activation data input matrix, where each 2D sub-matrix corresponds to a different channel of the 3D activation data input matrix. Weight input vectors are generated by weight input unit 105 and correspond to different weight matrices. In various embodiments, the 2D sub-matrices of the 3D activation data input matrix and the weight matrices may be 3×3 matrices or another appropriate size. The data elements of the activation data input matrix and the weight input matrices may be stored and retrieved from memory 161.”)
Regarding claim 11:
Nair shows the artificial intelligence chip of claim 8.
And Nair shows “wherein the data buffer is connected to the first input terminals of the plurality of computation circuits on a one-to-one basis via a first set of data paths, and is connected to the second input terminals of the plurality of computation circuits on a one-to-one basis via a second set of data paths; the artificial intelligence chip further comprises: a switching circuit configured to control, in response to a control signal corresponding to the computation task, M data paths of at least one set of data paths of the first set of data paths and the second set of data paths that are connected to the M computation circuits to be conductive, other data paths being not conductive.” (Paragraph [0029]: “FIG. 1 is a block diagram illustrating an embodiment of a system for solving artificial intelligence problems using a neural network. In the example shown, system 100 includes processing element 101 and memory 161. Processing element 101 includes data input unit 103, weight input unit 105, channel convolution processor unit 107, and output unit 151. In some embodiments, processing element 101 is a hardware integrated circuit, for example, an application specific integrated circuit (ASIC) and includes hardware components data input unit 103, weight input unit 105, channel convolution processor unit 107, and output unit 151. As compared to a general purpose processor, processing element 101 is designed and implemented using a specialized hardware integrated circuit to more efficiently perform one or more specific computing tasks related to performing convolution operations and/or solving artificial intelligence problems using a neural network. The specialized hardware results in significant performance improvements and resource efficiencies gained over using a general purpose processor. In the example shown, channel convolution processor unit 107 includes multiple vector calculation units including at least vector units 111, 121, 131, and 141. 
In various embodiments, channel convolution processor unit 107 receives data input vectors (not shown) from data input unit 103 and weight input vectors (not shown) from weight input unit 105. For example, in some embodiments, data input vectors are generated by data input unit 103 that correspond to 2D sub-matrices of a 3D activation data input matrix, where each 2D sub-matrix corresponds to a different channel of the 3D activation data input matrix. Weight input vectors are generated by weight input unit 105 and correspond to different weight matrices. In various embodiments, the 2D sub-matrices of the 3D activation data input matrix and the weight matrices may be 3×3 matrices or another appropriate size. The data elements of the activation data input matrix and the weight input matrices may be stored and retrieved from memory 161.”)
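For illustration only, the switching behaviour recited in claim 11, in which M data paths are made conductive and the remaining paths are not, might be sketched in Python as follows. This sketch follows the claim wording alone; the function names, the use of None for a non-conductive path, and the sample values are hypothetical.

```python
def switching_circuit(num_paths, active_indices):
    """Return one on/off state per data path, given the paths to make conductive."""
    active = set(active_indices)
    if any(i < 0 or i >= num_paths for i in active):
        raise ValueError("active index outside the set of data paths")
    return [i in active for i in range(num_paths)]


def transmit(buffer_values, conductive):
    """Deliver buffered data only over conductive paths.

    None models a computation circuit that receives nothing because its
    data path is not conductive.
    """
    return [value if on else None for value, on in zip(buffer_values, conductive)]


# Four data paths, of which M = 2 are selected by the control signal.
conductive = switching_circuit(4, [0, 2])
delivered = transmit([10, 20, 30, 40], conductive)
```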
Regarding claim 12:
Nair shows the computation based unit of claim 1.
And Nair shows “wherein the switching circuit comprises a plurality of switches disposed on the at least one set of data paths on a one-to-one basis.” (Paragraph [0029]: “FIG. 1 is a block diagram illustrating an embodiment of a system for solving artificial intelligence problems using a neural network. In the example shown, system 100 includes processing element 101 and memory 161. Processing element 101 includes data input unit 103, weight input unit 105, channel convolution processor unit 107, and output unit 151. In some embodiments, processing element 101 is a hardware integrated circuit, for example, an application specific integrated circuit (ASIC) and includes hardware components data input unit 103, weight input unit 105, channel convolution processor unit 107, and output unit 151. As compared to a general purpose processor, processing element 101 is designed and implemented using a specialized hardware integrated circuit to more efficiently perform one or more specific computing tasks related to performing convolution operations and/or solving artificial intelligence problems using a neural network. The specialized hardware results in significant performance improvements and resource efficiencies gained over using a general purpose processor. In the example shown, channel convolution processor unit 107 includes multiple vector calculation units including at least vector units 111, 121, 131, and 141. In various embodiments, channel convolution processor unit 107 receives data input vectors (not shown) from data input unit 103 and weight input vectors (not shown) from weight input unit 105. For example, in some embodiments, data input vectors are generated by data input unit 103 that correspond to 2D sub-matrices of a 3D activation data input matrix, where each 2D sub-matrix corresponds to a different channel of the 3D activation data input matrix. Weight input vectors are generated by weight input unit 105 and correspond to different weight matrices. 
In various embodiments, the 2D sub-matrices of the 3D activation data input matrix and the weight matrices may be 3×3 matrices or another appropriate size. The data elements of the activation data input matrix and the weight input matrices may be stored and retrieved from memory 161.”)
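For illustration only, the switching limitation quoted above (one switch per data path, with a control signal making M paths conductive and the remaining paths non-conductive) can be sketched as follows. This is a minimal sketch under stated assumptions and is not taken from Nair: the names `decode_control_signal` and `drive_data_paths` are hypothetical, the control signal is assumed to select the first M paths, and a non-conductive path is modeled as delivering `None`.

```python
from typing import List, Optional, Sequence


def decode_control_signal(num_paths: int, m: int) -> List[bool]:
    """Hypothetical decoding: close the switches on the first m paths only."""
    if not 2 <= m <= num_paths:
        raise ValueError("m must satisfy 2 <= m <= num_paths")
    return [index < m for index in range(num_paths)]


def drive_data_paths(
    buffer_outputs: Sequence[int], switch_closed: Sequence[bool]
) -> List[Optional[int]]:
    """Pass a value only where the switch on that data path is closed."""
    if len(buffer_outputs) != len(switch_closed):
        raise ValueError("one switch is required per data path")
    return [
        value if closed else None
        for value, closed in zip(buffer_outputs, switch_closed)
    ]


# Four data paths, of which M = 2 are made conductive.
switches = decode_control_signal(num_paths=4, m=2)
delivered = drive_data_paths([10, 20, 30, 40], switches)
```

In this sketch only the computation circuits on the two conductive paths receive data; the other two receive nothing.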
Regarding claim 13:
Nair shows the artificial intelligence chip of claim 8 as claimed and specified above.
And Nair shows “An accelerator, comprising: the artificial intelligence chip according to claim 8.” (Paragraph [0029]: “FIG. 1 is a block diagram illustrating an embodiment of a system for solving artificial intelligence problems using a neural network. In the example shown, system 100 includes processing element 101 and memory 161. Processing element 101 includes data input unit 103, weight input unit 105, channel convolution processor unit 107, and output unit 151. In some embodiments, processing element 101 is a hardware integrated circuit, for example, an application specific integrated circuit (ASIC) and includes hardware components data input unit 103, weight input unit 105, channel convolution processor unit 107, and output unit 151. As compared to a general purpose processor, processing element 101 is designed and implemented using a specialized hardware integrated circuit to more efficiently perform one or more specific computing tasks related to performing convolution operations and/or solving artificial intelligence problems using a neural network. The specialized hardware results in significant performance improvements and resource efficiencies gained over using a general purpose processor. In the example shown, channel convolution processor unit 107 includes multiple vector calculation units including at least vector units 111, 121, 131, and 141. In various embodiments, channel convolution processor unit 107 receives data input vectors (not shown) from data input unit 103 and weight input vectors (not shown) from weight input unit 105. For example, in some embodiments, data input vectors are generated by data input unit 103 that correspond to 2D sub-matrices of a 3D activation data input matrix, where each 2D sub-matrix corresponds to a different channel of the 3D activation data input matrix. Weight input vectors are generated by weight input unit 105 and correspond to different weight matrices. 
In various embodiments, the 2D sub-matrices of the 3D activation data input matrix and the weight matrices may be 3×3 matrices or another appropriate size. The data elements of the activation data input matrix and the weight input matrices may be stored and retrieved from memory 161.”)
Regarding claim 14:
Nair shows:
“A data stream-based computation method, comprising: receiving M pieces of first data required for performing a computation task on a one-to-one basis by M first input terminals of M computation circuits in a plurality of computation circuits, where M≥2 and M is a positive integer;” (Paragraph [0022]: “A processor system for performing efficient convolution operations using a channel convolution processor is disclosed. Using the disclosed techniques, the throughput and power efficiency for computing convolution operations and in particular depthwise convolutions is significantly increased particularly for input activation data with small width and height dimensions. In some embodiments, the processor system includes a channel convolution processor unit capable of performing convolution operations using two input matrices by applying different weight matrices to different channels of portions of a data convolution matrix. The channel convolution processor unit includes a plurality of calculation units such as vector units used to process input vectors of the input matrices. In various embodiments, a calculation unit includes at least a vector multiply unit and a vector adder unit. The vector multiply unit is capable of performing multiply operations using corresponding elements of two input vectors, data elements from the same channel and weight input elements from a weight matrix. In some embodiments, the vector adder unit is used to sum the vector of multiplication results computed using a vector multiply unit. For example, the vector adder unit can be used to compute the dot product result of two vectors using the vector multiplication results of vector elements from corresponding input vectors. In some embodiments, the vector adder unit is an adder tree. 
For example, an adder tree computes the sum of the multiplication results by summing multiplication results and subsequent partial sums in parallel.” And in paragraph [0033]: “multiple instances of processing element 101 can operate in parallel to process different portions of an activation data input matrix. For example, each processing element can retrieve its assigned data elements of the activation data input matrix and corresponding weight matrices from memory 161. In some embodiments, different processing elements share weight matrices and the data elements of the shared weight matrices can be broadcasted to the appropriate processing elements to improve memory efficiency. Each processing element performs depthwise convolution operations on the assigned portions of the activation data input matrix using its own channel convolution processor unit. The results of each processing element can be combined, for example, by writing the results to a shared memory location such as memory 161. In some embodiments, channel convolution processor unit 107 includes the functionality of data input unit 103, weight input unit 105, and/or output unit 151.”)
“receiving M pieces of second data distinct from each other required for performing the computation task on a one-to-one basis by M second input terminals of the M computation circuits; and performing the computation task in parallel on the basis of the M pieces of first data and the M pieces of second data by the M computation circuits, wherein each of the M computation circuits performs the computation tasks on the basis of one piece of first data and one piece of second data.” (Paragraph [0022]: “A processor system for performing efficient convolution operations using a channel convolution processor is disclosed. Using the disclosed techniques, the throughput and power efficiency for computing convolution operations and in particular depthwise convolutions is significantly increased particularly for input activation data with small width and height dimensions. In some embodiments, the processor system includes a channel convolution processor unit capable of performing convolution operations using two input matrices by applying different weight matrices to different channels of portions of a data convolution matrix. The channel convolution processor unit includes a plurality of calculation units such as vector units used to process input vectors of the input matrices. In various embodiments, a calculation unit includes at least a vector multiply unit and a vector adder unit. The vector multiply unit is capable of performing multiply operations using corresponding elements of two input vectors, data elements from the same channel and weight input elements from a weight matrix. In some embodiments, the vector adder unit is used to sum the vector of multiplication results computed using a vector multiply unit. For example, the vector adder unit can be used to compute the dot product result of two vectors using the vector multiplication results of vector elements from corresponding input vectors. In some embodiments, the vector adder unit is an adder tree. 
For example, an adder tree computes the sum of the multiplication results by summing multiplication results and subsequent partial sums in parallel.” And in paragraph [0033]: “multiple instances of processing element 101 can operate in parallel to process different portions of an activation data input matrix. For example, each processing element can retrieve its assigned data elements of the activation data input matrix and corresponding weight matrices from memory 161. In some embodiments, different processing elements share weight matrices and the data elements of the shared weight matrices can be broadcasted to the appropriate processing elements to improve memory efficiency. Each processing element performs depthwise convolution operations on the assigned portions of the activation data input matrix using its own channel convolution processor unit. The results of each processing element can be combined, for example, by writing the results to a shared memory location such as memory 161. In some embodiments, channel convolution processor unit 107 includes the functionality of data input unit 103, weight input unit 105, and/or output unit 151.”)
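For illustration only, the operation described in the quoted paragraph [0022] (a vector multiply unit that multiplies corresponding elements, followed by an adder tree that sums the results) can be sketched as follows. This is a minimal sketch under stated assumptions, not Nair's implementation: the function names are hypothetical, integers stand in for the data elements, and ordinary loops model sequentially what the hardware performs in parallel.

```python
from typing import List, Sequence


def adder_tree_sum(values: Sequence[int]) -> int:
    """Sum values level by level, pairing neighbours as an adder tree would."""
    level = list(values)
    if not level:
        return 0
    while len(level) > 1:
        next_level = [level[i] + level[i + 1] for i in range(0, len(level) - 1, 2)]
        if len(level) % 2:
            # An unpaired value is carried to the next level unchanged.
            next_level.append(level[-1])
        level = next_level
    return level[0]


def vector_unit(data_vec: Sequence[int], weight_vec: Sequence[int]) -> int:
    """One calculation unit: element-wise multiply, then adder-tree sum."""
    if len(data_vec) != len(weight_vec):
        raise ValueError("input vectors must have equal length")
    products = [d * w for d, w in zip(data_vec, weight_vec)]
    return adder_tree_sum(products)


def run_vector_units(
    data_vecs: Sequence[Sequence[int]], weight_vecs: Sequence[Sequence[int]]
) -> List[int]:
    """M units, each fed one data vector and one distinct weight vector."""
    if len(data_vecs) != len(weight_vecs):
        raise ValueError("one data vector and one weight vector per unit")
    return [vector_unit(d, w) for d, w in zip(data_vecs, weight_vecs)]


results = run_vector_units([[1, 2], [3, 4]], [[1, 1], [2, 2]])
```

Each entry of `results` is the dot product computed by one unit from its own pair of inputs.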
Regarding claim 15:
Nair shows the method of claim 14 as claimed and specified above.
And Nair shows “wherein the computation task is a computation in a neural network model.” (Paragraph [0044]: “FIG. 4 is a block diagram illustrating an embodiment of a channel convolution processor unit for solving artificial intelligence problems using a neural network. In the example shown, channel convolution processor unit 400 includes multiple vector units including vector units 401, 411, 421, and 451. The three dots between vector units 421 and 451 indicate optional additional vector units (not shown). In various embodiments, a channel convolution processor unit may include more or fewer vector units. The number of vector units corresponds to the number of channels and associated weight matrices that can be processed in parallel. For example, a channel convolution processor unit may include 32 vector units, each capable of processing a channel of a portion of an activation data input matrix with an associated weight matrix.”)
Regarding claim 16:
Nair shows the method of claim 15 as claimed and specified above.
And Nair shows “wherein the computation task is a convolution, and the M pieces of first data are identical to each other; each piece of first data comprises feature map data corresponding to M pieces of convolution kernel data in a feature map, and each piece of second data comprises one of the M pieces of convolution kernel data.” (Paragraph [0044]: “FIG. 4 is a block diagram illustrating an embodiment of a channel convolution processor unit for solving artificial intelligence problems using a neural network. In the example shown, channel convolution processor unit 400 includes multiple vector units including vector units 401, 411, 421, and 451. The three dots between vector units 421 and 451 indicate optional additional vector units (not shown). In various embodiments, a channel convolution processor unit may include more or fewer vector units. The number of vector units corresponds to the number of channels and associated weight matrices that can be processed in parallel. For example, a channel convolution processor unit may include 32 vector units, each capable of processing a channel of a portion of an activation data input matrix with an associated weight matrix.” And in paragraph [0046]: “channel convolution processor unit 400 includes multiple vector units that each include a vector multiply and a vector adder unit. Each vector multiply unit, such as vector multiply units 403, 413, 423, or 453, is configured to multiply corresponding elements received via a data input unit (not shown) and a weight input unit (not shown)”)
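For illustration only, the arrangement recited in the quoted limitation (M identical pieces of first data, each paired with a different one of M pieces of convolution kernel data) can be sketched as follows. This is a minimal sketch under stated assumptions: the names `dot` and `convolve_shared_patch` are hypothetical, and the feature map data and each kernel are assumed to be flattened into vectors of equal length.

```python
from typing import List, Sequence


def dot(a: Sequence[int], b: Sequence[int]) -> int:
    """Multiply corresponding elements and sum the products."""
    if len(a) != len(b):
        raise ValueError("vectors must have equal length")
    return sum(x * y for x, y in zip(a, b))


def convolve_shared_patch(
    patch: Sequence[int], kernels: Sequence[Sequence[int]]
) -> List[int]:
    """Feed the same patch to every circuit; each circuit gets its own kernel."""
    return [dot(patch, kernel) for kernel in kernels]


# One flattened patch shared by M = 3 circuits with three distinct kernels.
outputs = convolve_shared_patch([1, 2, 3], [[1, 0, 0], [0, 1, 0], [1, 1, 1]])
```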
Regarding claim 17:
Nair shows the method of claim 16 as claimed and specified above.
And Nair shows: “wherein the feature map data comprises feature map sub-data of N channels, each piece of convolution kernel data comprises weight data of N channels, either the feature map sub-data of each channel or the weight data of each channel is an m×n matrix, where N≥2, n≥1, m≥1, and N, m and n are all positive integers; at least one computation circuit of the M computation circuits comprises a plurality of multipliers and an accumulator, and the at least one computation circuit performs the convolution in such a manner that each multiplier of N multipliers in the plurality of multipliers multiplies an element in an i-th row and a j-th column of the feature map sub-data of a corresponding channel by an element in an i-th row and a j-th column of the weight data of the corresponding channel to obtain a plurality of first computation results, wherein the N multipliers correspond to the feature map sub-data of N channels on a one-to-one basis and correspond to the weight data of N channels on a one-to-one basis, where 1≤i≤m, 1≤j≤n, and i and j are both positive integers; and an accumulator performs an accumulation operation to obtain a result of the convolution, the accumulation operation comprising performing a first accumulation operation on the plurality of first computation results of each of the N multipliers.” (Paragraph [0044]: “FIG. 4 is a block diagram illustrating an embodiment of a channel convolution processor unit for solving artificial intelligence problems using a neural network. In the example shown, channel convolution processor unit 400 includes multiple vector units including vector units 401, 411, 421, and 451. The three dots between vector units 421 and 451 indicate optional additional vector units (not shown). In various embodiments, a channel convolution processor unit may include more or fewer vector units. The number of vector units corresponds to the number of channels and associated weight matrices that can be processed in parallel. 
For example, a channel convolution processor unit may include 32 vector units, each capable of processing a channel of a portion of an activation data input matrix with an associated weight matrix. In some embodiments, each vector unit includes a vector multiply unit and a vector adder unit. In the example shown, vector unit 401 includes vector multiply unit 403 and vector adder unit 405. Similarly, vector unit 411 includes vector multiply unit 413 and vector adder unit 415, vector unit 421 includes vector multiply unit 423 and vector adder unit 425, and vector unit 451 includes vector multiply unit 453 and vector adder unit 455. In various embodiments, channel convolution processor unit 400 is channel convolution processor unit 107 of FIG. 1 and vector units 401, 411, 421, and 451 are vector units 111, 121, 131, and 141 of FIG. 1, respectively.”)
Regarding claim 18:
Nair shows the method of claim 17 as claimed and specified above.
And Nair shows “wherein the accumulation operation further comprises a second accumulation operation on a result of the first accumulation operation and a piece of bias data to obtain the result of the convolution.” (Paragraph [0044]: “FIG. 4 is a block diagram illustrating an embodiment of a channel convolution processor unit for solving artificial intelligence problems using a neural network. In the example shown, channel convolution processor unit 400 includes multiple vector units including vector units 401, 411, 421, and 451. The three dots between vector units 421 and 451 indicate optional additional vector units (not shown). In various embodiments, a channel convolution processor unit may include more or fewer vector units. The number of vector units corresponds to the number of channels and associated weight matrices that can be processed in parallel. For example, a channel convolution processor unit may include 32 vector units, each capable of processing a channel of a portion of an activation data input matrix with an associated weight matrix. In some embodiments, each vector unit includes a vector multiply unit and a vector adder unit. In the example shown, vector unit 401 includes vector multiply unit 403 and vector adder unit 405. Similarly, vector unit 411 includes vector multiply unit 413 and vector adder unit 415, vector unit 421 includes vector multiply unit 423 and vector adder unit 425, and vector unit 451 includes vector multiply unit 453 and vector adder unit 455. In various embodiments, channel convolution processor unit 400 is channel convolution processor unit 107 of FIG. 1 and vector units 401, 411, 421, and 451 are vector units 111, 121, 131, and 141 of FIG. 1, respectively.” And in paragraph [0046]: “channel convolution processor unit 400 includes multiple vector units that each include a vector multiply and a vector adder unit. 
Each vector multiply unit, such as vector multiply units 403, 413, 423, or 453, is configured to multiply corresponding elements received via a data input unit (not shown) and a weight input unit (not shown). In some embodiments, the result is a vector of multiplication results. For example, for two 9-byte input vectors corresponding to two 3×3 matrices, the result of a vector multiply unit is a vector of 9 multiplication results. The first element from a data input vector is multiplied with the first element of a weight input vector. Similarly, the second element from a data input vector is multiplied with the second element of a weight input vector. In various embodiments, corresponding elements from a data input vector and a weight input vector are multiplied in parallel. In various embodiments, the vector of multiplication results is passed to a vector adder unit of the vector unit. For example, vector multiply unit 403 passes its multiplication results to vector adder unit 405, vector multiply unit 413 passes its multiplication results to vector adder unit 415, vector multiply unit 423 passes its multiplication results to vector adder unit 425, and vector multiply unit 453 passes its multiplication results to vector adder unit 455.”)
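For illustration only, the sequence recited in the quoted limitation (a first accumulation of the multiplication results, then a second accumulation with a piece of bias data) can be sketched using the 9-element vectors of the 3×3 example in the quoted paragraph [0046]. This is a minimal sketch under stated assumptions: the function names and the numeric values are hypothetical, and the bias step reflects the claim language rather than any passage quoted from Nair.

```python
from typing import Sequence


def first_accumulation(products: Sequence[int]) -> int:
    """Sum the vector of multiplication results."""
    return sum(products)


def second_accumulation(partial_sum: int, bias: int) -> int:
    """Add the bias data to the result of the first accumulation."""
    return partial_sum + bias


def convolve_with_bias(
    data_vec: Sequence[int], weight_vec: Sequence[int], bias: int
) -> int:
    """Element-wise multiply, accumulate, then accumulate the bias."""
    if len(data_vec) != len(weight_vec):
        raise ValueError("input vectors must have equal length")
    products = [d * w for d, w in zip(data_vec, weight_vec)]
    return second_accumulation(first_accumulation(products), bias)


# Two 9-element vectors standing in for two flattened 3x3 matrices.
result = convolve_with_bias(list(range(1, 10)), [1] * 9, bias=5)
```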
Regarding claim 19:
Nair shows the method of claim 17 as claimed and specified above.
And Nair shows “wherein the accumulator comprises a first accumulator and a second accumulator, the accumulator performing an accumulation operation to obtain a result of the convolution comprises: performing N third accumulation operations to obtain N second computation results by the first accumulator, wherein performing each third accumulation operation comprises accumulating the plurality of first computation results of one multiplier of the N multipliers to obtain the second computation result; and accumulating the N second computation results to obtain the result of the convolution by the second accumulator.” (Paragraph [0044]: “FIG. 4 is a block diagram illustrating an embodiment of a channel convolution processor unit for solving artificial intelligence problems using a neural network. In the example shown, channel convolution processor unit 400 includes multiple vector units including vector units 401, 411, 421, and 451. The three dots between vector units 421 and 451 indicate optional additional vector units (not shown). In various embodiments, a channel convolution processor unit may include more or fewer vector units. The number of vector units corresponds to the number of channels and associated weight matrices that can be processed in parallel. For example, a channel convolution processor unit may include 32 vector units, each capable of processing a channel of a portion of an activation data input matrix with an associated weight matrix. In some embodiments, each vector unit includes a vector multiply unit and a vector adder unit. In the example shown, vector unit 401 includes vector multiply unit 403 and vector adder unit 405. Similarly, vector unit 411 includes vector multiply unit 413 and vector adder unit 415, vector unit 421 includes vector multiply unit 423 and vector adder unit 425, and vector unit 451 includes vector multiply unit 453 and vector adder unit 455. 
In various embodiments, channel convolution processor unit 400 is channel convolution processor unit 107 of FIG. 1 and vector units 401, 411, 421, and 451 are vector units 111, 121, 131, and 141 of FIG. 1, respectively.” And in paragraph [0046]: “channel convolution processor unit 400 includes multiple vector units that each include a vector multiply and a vector adder unit. Each vector multiply unit, such as vector multiply units 403, 413, 423, or 453, is configured to multiply corresponding elements received via a data input unit (not shown) and a weight input unit (not shown). In some embodiments, the result is a vector of multiplication results. For example, for two 9-byte input vectors corresponding to two 3×3 matrices, the result of a vector multiply unit is a vector of 9 multiplication results. The first element from a data input vector is multiplied with the first element of a weight input vector. Similarly, the second element from a data input vector is multiplied with the second element of a weight input vector. In various embodiments, corresponding elements from a data input vector and a weight input vector are multiplied in parallel. In various embodiments, the vector of multiplication results is passed to a vector adder unit of the vector unit. For example, vector multiply unit 403 passes its multiplication results to vector adder unit 405, vector multiply unit 413 passes its multiplication results to vector adder unit 415, vector multiply unit 423 passes its multiplication results to vector adder unit 425, and vector multiply unit 453 passes its multiplication results to vector adder unit 455.”)
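For illustration only, the two-stage accumulation recited in the quoted limitation (a first accumulator producing one second computation result per multiplier, and a second accumulator summing the N second computation results) can be sketched as follows. This is a minimal sketch under stated assumptions: the function names are hypothetical, and each channel's m×n feature map sub-data and weight data are assumed to be flattened into vectors.

```python
from typing import List, Sequence


def first_accumulator(per_multiplier_products: Sequence[Sequence[int]]) -> List[int]:
    """N accumulations: sum the first computation results of each multiplier."""
    return [sum(products) for products in per_multiplier_products]


def second_accumulator(second_results: Sequence[int]) -> int:
    """Sum the N second computation results into the convolution result."""
    return sum(second_results)


def convolve_n_channels(
    feature: Sequence[Sequence[int]], weight: Sequence[Sequence[int]]
) -> int:
    """Multiplier k handles channel k; results are accumulated in two stages."""
    if len(feature) != len(weight):
        raise ValueError("feature and weight must have the same channel count")
    per_multiplier_products = []
    for feature_channel, weight_channel in zip(feature, weight):
        if len(feature_channel) != len(weight_channel):
            raise ValueError("channel data must have equal length")
        per_multiplier_products.append(
            [f * w for f, w in zip(feature_channel, weight_channel)]
        )
    return second_accumulator(first_accumulator(per_multiplier_products))


# N = 2 channels, each flattened to two elements.
total = convolve_n_channels([[1, 2], [3, 4]], [[1, 1], [2, 2]])
```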
Regarding claim 20:
Nair shows the method of claim 17 as claimed and specified above.
And Nair shows “wherein a number of the plurality of multipliers is P, where 16≤P≤256, and P is a positive integer.” (Paragraph [0030]: “Although 32 channels are processed using 3×3 matrices for each iteration in the example above, the size of the elements and matrices processed by system 100 can be configured as appropriate. For example, elements may be 4-bits, 8-bits, 2-byte, 4-bytes, or another appropriate size. Similarly, the sub-matrices of the activation data input matrix and weight matrices can be 3×3, 5×5, or another appropriate size.” And in paragraph [0046]: “channel convolution processor unit 400 includes multiple vector units that each include a vector multiply and a vector adder unit. Each vector multiply unit, such as vector multiply units 403, 413, 423, or 453, is configured to multiply corresponding elements received via a data input unit (not shown) and a weight input unit (not shown). In some embodiments, the result is a vector of multiplication results. For example, for two 9-byte input vectors corresponding to two 3×3 matrices, the result of a vector multiply unit is a vector of 9 multiplication results. The first element from a data input vector is multiplied with the first element of a weight input vector. Similarly, the second element from a data input vector is multiplied with the second element of a weight input vector. In various embodiments, corresponding elements from a data input vector and a weight input vector are multiplied in parallel. In various embodiments, the vector of multiplication results is passed to a vector adder unit of the vector unit. For example, vector multiply unit 403 passes its multiplication results to vector adder unit 405, vector multiply unit 413 passes its multiplication results to vector adder unit 415, vector multiply unit 423 passes its multiplication results to vector adder unit 425, and vector multiply unit 453 passes its multiplication results to vector adder unit 455.”)
Conclusion
The prior art made of record and not relied upon is considered pertinent to applicant's disclosure:
Yu et al. (US 2022/0261249 A1) teaches, in paragraph [0059], the computation task of claims 1 and 14, in which data is received on a one-to-one basis and processed in parallel, by using a pipeline that reads data in parallel for a depthwise convolution.
Komuravelli et al. (US 2021/0319076 A1) teaches, in paragraph [0068], the computation task of claims 1 and 14, in which data is received on a one-to-one basis and processed in parallel, by loading input vectors into a convolution unit and processing them by channel in parallel.
Any inquiry concerning this communication or earlier communications from the examiner should be directed to SHANE D WOOLWINE, whose telephone number is (571) 272-4138. The examiner can normally be reached Monday through Friday, 9:30 AM to 6:00 PM.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, MIRANDA HUANG, can be reached at (571) 270-7092. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.
SHANE D. WOOLWINE
Primary Examiner
Art Unit 2124
/SHANE D WOOLWINE/Primary Examiner, Art Unit 2124