DETAILED ACTION
Examiner Remarks
Examiner notes that the instant application was previously examined by a different examiner. As such, Examiner proceeds with prosecution giving full faith and credit to the search and action of the previous examiner per MPEP § 704.01.
Response to Arguments
Applicant’s arguments with respect to the independent claim(s) have been considered but are moot because the new ground of rejection does not rely on references applied in the prior rejection of record for any teaching or matter specifically challenged by Applicant’s arguments submitted in the Remarks of 12/09/2025.
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Continued Examination Under 37 CFR 1.114
A request for continued examination under 37 CFR 1.114, including the fee set forth in 37 CFR 1.17(e), was filed in this application after final rejection. Since this application is eligible for continued examination under 37 CFR 1.114, and the fee set forth in 37 CFR 1.17(e) has been timely paid, the finality of the previous Office action has been withdrawn pursuant to 37 CFR 1.114. Applicant's submission filed on 12/09/2025 has been entered.
Information Disclosure Statement
The information disclosure statement (IDS) submitted on 12/09/2025 is in compliance with the provisions of 37 CFR 1.97. Accordingly, the information disclosure statement is being considered by the examiner.
Claim Objections
Claim 21 is objected to because of the following informalities: Claim 21 recites that the one dimension for the flattened matrix data to align with is "raw-wise" of corresponding matrices. The claim term "raw-wise" appears to be a typographical error and should be replaced by "row-wise." Appropriate correction is required.
Claim Rejections - 35 USC § 103
In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis (i.e., changing from AIA to pre-AIA) for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.
The factual inquiries for establishing a background for determining obviousness under 35 U.S.C. 103 are summarized as follows:
1. Determining the scope and contents of the prior art.
2. Ascertaining the differences between the prior art and the claims at issue.
3. Resolving the level of ordinary skill in the pertinent art.
4. Considering objective evidence present in the application indicating obviousness or nonobviousness.
Claims 1-3, 6-10, 13-17, and 20-22 are rejected under 35 U.S.C. 103 as being unpatentable over Chen et al., US 2021/0182077 A1 ("Chen"), in view of Austin et al., "Parallel tensor compression for large-scale scientific data," 2016 IEEE International Parallel and Distributed Processing Symposium (IPDPS), May 2016 ("Austin"), and further in view of Kayaaslan et al., "Semi-two-dimensional partitioning for parallel sparse matrix-vector multiplication," 2015 IEEE International Parallel and Distributed Processing Symposium Workshop, IEEE, 2015 ("Kayaaslan").
Claim 1
Chen discloses:
a plurality of matrix processing units (MPUs), wherein each MPU is to perform matrix multiplication operations; a memory to store tensor data including matrix data; at least one processor, coupled to the plurality of MPUs, to: cause the MPUs to operate on the partitioned matrix data to generate output data (Chen, 0066, “The terminal device includes: a storage device, a processor, and a computer program that is stored in the storage device and can run on the processor. The processor executes the computer program[a memory to store tensor data including matrix data; at least one processor, coupled to the plurality of MPUs, to]” & Chen, 0007-0008, 0027; ‘The primary processing circuit is configured to pre-process the input data, and transfer data and operation instructions to the plurality of secondary processing circuits. (Pre-arithmetic operations)’ with ‘The plurality of secondary processing circuits are configured to perform intermediate operations in parallel according to the data and the operation instructions transferred by the primary processing circuit to obtain a plurality of intermediate results, and transfer the plurality of intermediate results to the primary processing circuit[cause the MPUs to operate on the partitioned matrix data to generate output data]. (Post-arithmetic operations)’ and ‘In some possible examples, the operation instruction includes at least one of: a matrix-multiply-vector instruction, a vector-multiply-matrix instruction, a matrix-multiply-scalar instruction[a plurality of matrix processing units (MPUs), wherein each MPU is to perform matrix multiplication operations], a tensor operation instruction, a matrix addition instruction, a matrix subtraction instruction, a matrix retrieving instruction, a matrix loading instruction, a matrix saving instruction, and a matrix moving instruction.’ (This maps to ‘matrix-wide operation.’));
and cause storage of the output data(Chen, 0084; Optionally, the storage unit is further configured to store the matrix operation result[and cause storage of the output data].)
to implement one or more deep neural networks(Chen, 0869; an operation instruction configured to complete an arithmetic operation of the neural network, including a matrix operation instruction, a vector operation instruction, a scalar operation instruction, a convolution neural network operation instruction, a fully connected neural network operation instruction, a pooling neural network operation instruction, a RBM (Restricted Boltzmann Machine) neural network operation instruction, a LRN (Local Response Normalization) neural network operation instruction, a LCN (Local Contrast Normalization) neural network operation instruction, a LSTM (Long Short-Term Memory) neural network operation instruction, a RNN (Recurrent Neural Network) operation instruction, a RELU (Rectified Linear Unit) neural network operation instruction, a PRELU (Parametric Rectified Linear Unit) neural network operation instruction, a SIGMOID (S-shaped growth curve) neural network operation instruction, a TAN H (hyperbolic function) neural network operation instruction, and a MAXOUT (maximum output) neural network operation instruction;)
Chen does not disclose: cause the matrix data of the tensor data to be partitioned into a plurality of partitions, wherein the matrix data is partitioned into a number of partitions based on a number of processing elements of the apparatus.
However, Austin teaches:
cause the matrix data of tensor data to be partitioned into a plurality of partitions, wherein the matrix data is partitioned into a number of partitions based on a number of processing elements of the apparatus(Austin, pgs. 914-915, see also fig. 3(a), "For N-way tensors, we assume a logical N-way processor grid. Let P1 × P2 × ⋯ × PN be the size of the processor grid[processing elements of the apparatus]…[l]et J1 × J2 × ⋯ × JN be the size of a generic tensor[matrix data of tensor data]… [w]e impose a Cartesian parallel distribution of the tensor across processors, which we refer to as a block distribution. Each processor owns a distinct subtensor of size J1/P1 × ⋯ × JN/PN with J/P entries[cause the matrix data of tensor data to be partitioned into a plurality of partitions, wherein the matrix data is partitioned into a number of partitions based on a number of processing elements of the apparatus].").
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the teachings of Chen with the teachings of Austin; the motivation to do so would be to provide better compression rates for large-scale data used in simulations to better enable efficient data storage, retrieval, transfer, and analysis(Austin, pg. 912, "Today's high-performance parallel computers enable large-scale, high-fidelity simulations of natural phenomena across scientific domains. As the speed and quality of simulations increase, the amount of data produced is growing at a rate that is creating bottlenecks in the scientific process. A posteriori analysis of the data requires dedicated storage devices and parallel clusters even for simple computations. A primary goal of this work is to provide a compression technique for large-scale simulation data which enables much more efficient data storage, transfer, and analysis, thereby facilitating bigger and better science.").
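For illustration only, the block (Cartesian) distribution quoted from Austin above can be sketched as follows. This is a minimal sketch with hypothetical sizes, not code from any cited reference: a J1 × J2 array is divided over a P1 × P2 processor grid so that the number of partitions equals the number of processing elements.

```python
import numpy as np

# Hypothetical sizes: a J1 x J2 array distributed over a P1 x P2 processor grid.
J1, J2 = 8, 6
P1, P2 = 4, 2
tensor = np.arange(J1 * J2).reshape(J1, J2)

# Block distribution: each of the P1*P2 processors owns a distinct
# (J1/P1) x (J2/P2) subtensor, so the partition count equals the
# processing-element count.
blocks = [
    tensor[i * (J1 // P1):(i + 1) * (J1 // P1),
           j * (J2 // P2):(j + 1) * (J2 // P2)]
    for i in range(P1) for j in range(P2)
]

assert len(blocks) == P1 * P2  # one partition per processing element
assert all(b.shape == (J1 // P1, J2 // P2) for b in blocks)
```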
Chen in view of Austin does not disclose: wherein the matrix data of the tensor data in two dimensions are flattened along with one dimension to be operated by the MPUs; based on an instruction causing the matrix multiplication operations on the partitioned matrix data.
However, Kayaaslan teaches:
wherein the matrix data of the tensor data in two dimensions are flattened along with one dimension to be operated by the MPUs(Kayaaslan, pgs. 2-3, see also fig. 1, "In s2D partitioning, the matrix A and the vectors x and y are partitioned into k parts[wherein the matrix data of the tensor data in two dimensions are flattened]... [f]orm vector ŷ_k^(l) which contains only those entries...corresponding to the nonzero rows in A_lk^(k)... [f]orm vector x̂_l^(k) which contains only those entries...corresponding to the nonzero columns in A_lk^(k)...[s]end vector [x̂_l^(k), ŷ_k^(l)] to processor P_l[along with one dimension to be operated by the MPUs]");
based on an instruction causing the matrix multiplication operations on the partitioned matrix data(Kayaaslan, pgs. 2-3, see also fig. 1, "3) (Compute) Compute the output-subvector y^(k)[based on an instruction] as y^(k) ← Σ_{l≠k} A_kl^(k) x̂_k^(l) + A_kk x^(k) + Σ_{l≠k} ŷ_l^(k)[causing the matrix multiplication operations on the partitioned matrix data]").
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the teachings of Chen in view of Austin with the teachings of Kayaaslan; the motivation to do so would be to improve the traditional sparse matrix-vector multiplication process by decreasing the communication costs associated with processor inter-communication(Kayaaslan, pg. 3, "We propose semi-two-dimensional (s2D) data partitioning, which restricts each nonzero...to be together with either x_j or y_i. By this restriction, we guarantee that the computational group...is empty, and thus, the two communication phases can be fused into a single one.").
Claim 2
Chen in view of Austin and Kayaaslan discloses wherein the memory comprises a memory resource block to be shared by two or more MPUs in the plurality of MPUs(Chen, paras. [2484-2489], see also figs. 37F, 4A, 5, and 2A, “Fig. 37 is a flowchart of a neural network processing method...where the computation device contains a plurality of ALUs...mapping, by an on-chip address index module, to a correct storage address”)
Claim 3
Chen in view of Austin and Kayaaslan discloses wherein the output data includes a tensor value (Chen, 0027; In some possible examples, the operation instruction includes at least one of: a matrix-multiply-vector instruction, a vector-multiply-matrix instruction, a matrix-multiply-scalar instruction, a tensor operation instruction[wherein the output data includes a tensor value])
Claim 6
Chen in view of Austin and Kayaaslan discloses further including a control processor to manage the plurality of MPUs (Chen, paras. [1841-1844], see also fig. 1F, “The control unit 504 is configured to control the matrix computation unit[a control processor to manage the plurality of MPUs]...”).
Claim 7
Chen in view of Austin and Kayaaslan discloses wherein the tensor data includes a single exponent value for values in the tensor data(Chen, 2043; The pre-processing includes but is not limited to data format conversion, such as the conversion between continuous data and discrete data as described in the present disclosure, power conversion which is to convert non-power weight data in input data of a neural network to power weight data, statistics of floating-point data which is to count the bits of exponent bias and exponent bits required for storing different types of data during a forward operation of the artificial neural network, and floating-point data conversion for a short-bit floating-point data type and a long-bit floating-point data type, which is not restricted in the present disclosure.)
Claim 8
Chen discloses a non-transitory computer readable medium comprising instructions that, when executed, cause a processor of a system(Chen, 0066, “The terminal device includes: a storage device, a processor, and a computer program that is stored in the storage device and can run on the processor. The processor executes the computer program” )
to at least:
cause a plurality of the MPUs to operate on the partitioned matrix data to generate output data, wherein each MPU is to perform matrix multiplication operations (Chen, 0007-0008, 0027; ‘The primary processing circuit is configured to pre-process the input data, and transfer data and operation instructions to the plurality of secondary processing circuits. (Pre-arithmetic operations)’ with ‘The plurality of secondary processing circuits are configured to perform intermediate operations in parallel according to the data and the operation instructions transferred by the primary processing circuit to obtain a plurality of intermediate results, and transfer the plurality of intermediate results to the primary processing circuit. (Post-arithmetic operations)’ and ‘In some possible examples, the operation instruction includes at least one of: a matrix-multiply-vector instruction, a vector-multiply-matrix instruction, a matrix-multiply-scalar instruction[cause a plurality of the MPUs to operate on the partitioned matrix data to generate output data, wherein each MPU is to perform matrix multiplication operations], a tensor operation instruction, a matrix addition instruction, a matrix subtraction instruction, a matrix retrieving instruction, a matrix loading instruction, a matrix saving instruction, and a matrix moving instruction.’ (This maps to ‘matrix-wide operation.’));
and cause storage of the output data(Chen, 0084; Optionally, the storage unit is further configured to store the matrix operation result[and store the output data].);
to implement one or more deep neural networks(Chen, 0869; an operation instruction configured to complete an arithmetic operation of the neural network, including a matrix operation instruction, a vector operation instruction, a scalar operation instruction, a convolution neural network operation instruction, a fully connected neural network operation instruction, a pooling neural network operation instruction, a RBM (Restricted Boltzmann Machine) neural network operation instruction, a LRN (Local Response Normalization) neural network operation instruction, a LCN (Local Contrast Normalization) neural network operation instruction, a LSTM (Long Short-Term Memory) neural network operation instruction, a RNN (Recurrent Neural Network) operation instruction, a RELU (Rectified Linear Unit) neural network operation instruction, a PRELU (Parametric Rectified Linear Unit) neural network operation instruction, a SIGMOID (S-shaped growth curve) neural network operation instruction, a TAN H (hyperbolic function) neural network operation instruction, and a MAXOUT (maximum output) neural network operation instruction;)
Chen does not teach: cause matrix data of tensor data to be partitioned into a plurality of partitions, wherein the matrix data is partitioned into a number of partitions based on a number of processing elements of the system including a plurality of matrix processing units (MPUs).
However, Austin teaches:
cause matrix data of tensor data to be partitioned into a plurality of partitions, wherein the matrix data is partitioned into a number of partitions based on a number of processing elements of the system including a plurality of matrix processing units (MPUs). (Austin, pgs. 914-915, see also fig. 3(a), "For N-way tensors, we assume a logical N-way processor grid. Let P1 × P2 × ⋯ × PN be the size of the processor grid[matrix processing units (MPUs)]…[l]et J1 × J2 × ⋯ × JN be the size of a generic tensor[matrix data of tensor data]… [w]e impose a Cartesian parallel distribution of the tensor across processors, which we refer to as a block distribution. Each processor owns a distinct subtensor of size J1/P1 × ⋯ × JN/PN with J/P entries[cause matrix data of tensor data to be partitioned into a plurality of partitions, wherein the matrix data is partitioned into a number of partitions based on a number of processing elements of a system including a plurality of matrix processing units (MPUs)].").
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the teachings of Chen with the teachings of Austin; the motivation to do so would be to provide better compression rates for large-scale data used in simulations to better enable efficient data storage, retrieval, transfer, and analysis(Austin, pg. 912, "Today's high-performance parallel computers enable large-scale, high-fidelity simulations of natural phenomena across scientific domains. As the speed and quality of simulations increase, the amount of data produced is growing at a rate that is creating bottlenecks in the scientific process. A posteriori analysis of the data requires dedicated storage devices and parallel clusters even for simple computations. A primary goal of this work is to provide a compression technique for large-scale simulation data which enables much more efficient data storage, transfer, and analysis, thereby facilitating bigger and better science.").
Chen in view of Austin does not disclose: wherein the matrix data of the tensor data in two dimensions are flattened along with one dimension to be operated by the plurality of MPUs; based on an instruction causing the matrix multiplication operations on the partitioned matrix data.
However, Kayaaslan teaches:
wherein the matrix data of the tensor data in two dimensions are flattened along with one dimension to be operated by the plurality of MPUs(Kayaaslan, pgs. 2-3, see also fig. 1, "In s2D partitioning, the matrix A and the vectors x and y are partitioned into k parts[wherein the matrix data of the tensor data in two dimensions are flattened]... [f]orm vector ŷ_k^(l) which contains only those entries...corresponding to the nonzero rows in A_lk^(k)... [f]orm vector x̂_l^(k) which contains only those entries...corresponding to the nonzero columns in A_lk^(k)...[s]end vector [x̂_l^(k), ŷ_k^(l)] to processor P_l[along with one dimension to be operated by the MPUs]");
based on an instruction causing the matrix multiplication operations on the partitioned matrix data(Kayaaslan, pgs. 2-3, see also fig. 1, "3) (Compute) Compute the output-subvector y^(k)[based on an instruction] as y^(k) ← Σ_{l≠k} A_kl^(k) x̂_k^(l) + A_kk x^(k) + Σ_{l≠k} ŷ_l^(k)[causing the matrix multiplication operations on the partitioned matrix data]").
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the teachings of Chen in view of Austin with the teachings of Kayaaslan; the motivation to do so would be to improve the traditional sparse matrix-vector multiplication process by decreasing the communication costs associated with processor inter-communication(Kayaaslan, pg. 3, "We propose semi-two-dimensional (s2D) data partitioning, which restricts each nonzero...to be together with either x_j or y_i. By this restriction, we guarantee that the computational group...is empty, and thus, the two communication phases can be fused into a single one.").
Claim 9
Chen in view of Austin and Kayaaslan discloses the non-transitory computer readable medium of claim 8, wherein two or more MPUs in the plurality of MPUs share a memory resource block (Chen, paras. [2484-2489], see also figs. 37F, 4A, 5, and 2A, “Fig. 37 is a flowchart of a neural network processing method...where the computation device contains a plurality of ALUs...mapping, by an on-chip address index module, to a correct storage address”).
Claim 10
Chen in view of Austin and Kayaaslan discloses wherein the output data includes a tensor value(Chen, 0027; In some possible examples, the operation instruction includes at least one of: a matrix-multiply-vector instruction, a vector-multiply-matrix instruction, a matrix-multiply-scalar instruction, a tensor operation instruction[wherein the output data includes a tensor value]).
Claim 13
Chen in view of Austin and Kayaaslan discloses wherein the tensor data includes a single exponent value for values in the tensor data(Chen, 2043; The pre-processing includes but is not limited to data format conversion, such as the conversion between continuous data and discrete data as described in the present disclosure, power conversion which is to convert non-power weight data in input data of a neural network to power weight data, statistics of floating-point data which is to count the bits of exponent bias and exponent bits[wherein the tensor data includes a single exponent value for values in the tensor data] required for storing different types of data during a forward operation of the artificial neural network, and floating-point data conversion for a short-bit floating-point data type and a long-bit floating-point data type, which is not restricted in the present disclosure.)
Claim 14
Chen in view of Austin and Kayaaslan discloses wherein the processor is a component in a cloud computing system (Chen, 1981; The electronic device may include a data processing device, a robot, a computer, a printer, a scanner, a tablet, a smart terminal, a mobile phone, a traffic recorder, a navigator, a sensor, a webcam, a server, a cloud server[wherein the processor is a component in a cloud computing system], a camera, a video camera, a projector, a watch, a headphone, a mobile storage, a wearable device, a vehicle, a household appliance, and/or medical equipment.)
Claim 15
Chen teaches a method comprising: by a processor of a system(Chen, 0066, “The terminal device includes: a storage device, a processor, and a computer program that is stored in the storage device and can run on the processor. The processor executes the computer program”)
causing, by the processor of the system, a plurality of the MPUs to operate on the partitioned matrix data to generate output data, wherein each MPU is to perform matrix multiplication operations (Chen, 0007-0008, 0027; ‘The primary processing circuit is configured to pre-process the input data, and transfer data and operation instructions to the plurality of secondary processing circuits. (Pre-arithmetic operations)’ with ‘The plurality of secondary processing circuits are configured to perform intermediate operations in parallel according to the data and the operation instructions transferred by the primary processing circuit to obtain a plurality of intermediate results, and transfer the plurality of intermediate results to the primary processing circuit. (Post-arithmetic operations)’ and ‘In some possible examples, the operation instruction includes at least one of: a matrix-multiply-vector instruction, a vector-multiply-matrix instruction, a matrix-multiply-scalar instruction[causing a plurality of the MPUs to operate on the partitioned matrix data to generate output data, wherein each MPU is to perform matrix multiplication operations], a tensor operation instruction, a matrix addition instruction, a matrix subtraction instruction, a matrix retrieving instruction, a matrix loading instruction, a matrix saving instruction, and a matrix moving instruction.’ (This maps to ‘matrix-wide operation.’ & Chen, 0066, “The terminal device includes: a storage device, a processor, and a computer program that is stored in the storage device and can run on the processor. The processor executes the computer program[by the processor of the system,]”));
and causing, by the processor of the system, storage of the output data(Chen, 0084; Optionally, the storage unit is further configured to store the matrix operation result[and causing storage of the output data] & Chen, 0066, "The terminal device includes: a storage device, a processor, and a computer program that is stored in the storage device and can run on the processor. The processor executes the computer program[by the processor of the system]"))
to implement one or more deep neural networks(Chen, 0869; an operation instruction configured to complete an arithmetic operation of the neural network, including a matrix operation instruction, a vector operation instruction, a scalar operation instruction, a convolution neural network operation instruction, a fully connected neural network operation instruction, a pooling neural network operation instruction, a RBM (Restricted Boltzmann Machine) neural network operation instruction, a LRN (Local Response Normalization) neural network operation instruction, a LCN (Local Contrast Normalization) neural network operation instruction, a LSTM (Long Short-Term Memory) neural network operation instruction, a RNN (Recurrent Neural Network) operation instruction, a RELU (Rectified Linear Unit) neural network operation instruction, a PRELU (Parametric Rectified Linear Unit) neural network operation instruction, a SIGMOID (S-shaped growth curve) neural network operation instruction, a TAN H (hyperbolic function) neural network operation instruction, and a MAXOUT (maximum output) neural network operation instruction;)
While Chen does teach a processor of a system, Chen does not teach:
causing matrix data of tensor data to be partitioned into a plurality of partitions, wherein the matrix data is partitioned into a number of partitions based on a number of processing elements of a system including a plurality of matrix processing units (MPUs).
However, Austin teaches:
causing matrix data of tensor data to be partitioned into a plurality of partitions, wherein the matrix data is partitioned into a number of partitions based on a number of processing elements of a system including a plurality of matrix processing units (MPUs)(Austin, pgs. 914-915, see also fig. 3(a), "For N-way tensors, we assume a logical N-way processor grid. Let P1 × P2 × ⋯ × PN be the size of the processor grid[matrix processing units (MPUs)]…[l]et J1 × J2 × ⋯ × JN be the size of a generic tensor[matrix data of tensor data]… [w]e impose a Cartesian parallel distribution of the tensor across processors, which we refer to as a block distribution. Each processor owns a distinct subtensor of size J1/P1 × ⋯ × JN/PN with J/P entries[causing matrix data of tensor data to be partitioned into a plurality of partitions, wherein the matrix data is partitioned into a number of partitions based on a number of processing elements of a system including a plurality of matrix processing units (MPUs)].").
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the teachings of Chen with the teachings of Austin; the motivation to do so would be to provide better compression rates for large-scale data used in simulations to better enable efficient data storage, retrieval, transfer, and analysis(Austin, pg. 912, "Today's high-performance parallel computers enable large-scale, high-fidelity simulations of natural phenomena across scientific domains. As the speed and quality of simulations increase, the amount of data produced is growing at a rate that is creating bottlenecks in the scientific process. A posteriori analysis of the data requires dedicated storage devices and parallel clusters even for simple computations. A primary goal of this work is to provide a compression technique for large-scale simulation data which enables much more efficient data storage, transfer, and analysis, thereby facilitating bigger and better science.").
Chen in view of Austin does not disclose: wherein the matrix data of the tensor data in two dimensions are flattened along with one dimension to be operated by the plurality of MPUs; based on an instruction causing the matrix multiplication operations on the partitioned matrix data.
However, Kayaaslan teaches:
wherein the matrix data of the tensor data in two dimensions are flattened along with one dimension to be operated by the plurality of MPUs(Kayaaslan, pgs. 2-3, see also fig. 1, "In s2D partitioning, the matrix A and the vectors x and y are partitioned into k parts[wherein the matrix data of the tensor data in two dimensions are flattened]... [f]orm vector ŷ_k^(l) which contains only those entries...corresponding to the nonzero rows in A_lk^(k)... [f]orm vector x̂_l^(k) which contains only those entries...corresponding to the nonzero columns in A_lk^(k)...[s]end vector [x̂_l^(k), ŷ_k^(l)] to processor P_l[along with one dimension to be operated by the MPUs]");
based on an instruction causing the matrix multiplication operations on the partitioned matrix data(Kayaaslan, pgs. 2-3, see also fig. 1, "3) (Compute) Compute the output-subvector y^(k)[based on an instruction] as y^(k) ← Σ_{l≠k} A_kl^(k) x̂_k^(l) + A_kk x^(k) + Σ_{l≠k} ŷ_l^(k)[causing the matrix multiplication operations on the partitioned matrix data]").
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the teachings of Chen in view of Austin with the teachings of Kayaaslan; the motivation to do so would be to improve the traditional sparse matrix-vector multiplication process by decreasing the communication costs associated with processor inter-communication(Kayaaslan, pg. 3, "We propose semi-two-dimensional (s2D) data partitioning, which restricts each nonzero...to be together with either x_j or y_i. By this restriction, we guarantee that the computational group...is empty, and thus, the two communication phases can be fused into a single one.").
Claim 16
Chen in view of Austin and Kayaaslan disclose wherein two or more MPUs in the plurality of MPUs share a memory resource block (Chen, paras. [2484-2489], see also figs. 37F, 4A, 5, and 2A, “Fig. 37 is a flowchart of a neural network processing method...where the computation device contains a plurality of ALUs...mapping, by an on-chip address index module, to a correct storage address”).
Claim 17
Chen in view of Austin and Kayaaslan discloses wherein the output data includes a tensor value(Chen, 0027; In some possible examples, the operation instruction includes at least one of: a matrix-multiply-vector instruction, a vector-multiply-matrix instruction, a matrix-multiply-scalar instruction, a tensor operation instruction[wherein the output data includes a tensor value])
Claim 20
Chen in view of Austin and Kayaaslan discloses wherein the tensor data includes a single exponent value for values in the tensor data(Chen, 2043; The pre-processing includes but is not limited to data format conversion, such as the conversion between continuous data and discrete data as described in the present disclosure, power conversion which is to convert non-power weight data in input data of a neural network to power weight data, statistics of floating-point data which is to count the bits of exponent bias and exponent bits[wherein the tensor data includes a single exponent value for values in the tensor data]required for storing different types of data during a forward operation of the artificial neural network, and floating-point data conversion for a short-bit floating-point data type and a long-bit floating-point data type, which is not restricted in the present disclosure.)
Claim 21
Chen in view of Austin and Kayaaslan discloses the apparatus of claim 1, wherein the one dimension for the flattened matrix data to align with is raw-wise of corresponding matrices(Kayaaslan, pgs. 2-3, see also fig. 1, "In s2D partitioning, the matrix A and the vectors x and y are partitioned into k parts...[f]orm vector ŷ_k^(l) which contains only those entries of y_k^(l) corresponding to the nonzero rows in A_lk^(k)[wherein the one dimension for the flattened matrix data to align with is raw-wise of corresponding matrices].").
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the teachings of Chen in view of Austin with the above teachings of Kayaaslan for the same rationale stated at Claim 1.
Claim 22
Chen in view of Austin and Kayaaslan discloses the apparatus of claim 1, wherein the one dimension for the flattened matrix data to align with is column-wise of corresponding matrices(Kayaaslan, pgs. 2-3, see also fig. 1, "In s2D partitioning, the matrix A and the vectors x and y are partitioned into k parts... [f]orm vector x̂_l^(k) which contains only those entries of x^(k) corresponding to the nonzero columns in A_lk^(k)[wherein the one dimension for the flattened matrix data to align with is column-wise of corresponding matrices].").
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the teachings of Chen in view of Austin with the above teachings of Kayaaslan for the same rationale stated at Claim 1.
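For illustration only, the row-wise versus column-wise alignment of flattened matrix data addressed in Claims 21 and 22 can be sketched as follows. This is a minimal sketch with a hypothetical 2 × 3 matrix, not code from any cited reference: flattening in C order aligns the data row-wise, while Fortran order aligns it column-wise.

```python
import numpy as np

# Hypothetical 2-D matrix data flattened along one dimension:
# order='C' aligns the flattened data row-wise of the matrix,
# order='F' aligns it column-wise.
m = np.array([[1, 2, 3],
              [4, 5, 6]])
row_wise = m.flatten(order='C')  # [1, 2, 3, 4, 5, 6]
col_wise = m.flatten(order='F')  # [1, 4, 2, 5, 3, 6]
```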
Conclusion
Any inquiry concerning this communication or earlier communications from the examiner should be directed to ADAM C STANDKE whose telephone number is (571)270-1806. The examiner can generally be reached M-F, 9AM-9PM EST.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Michael J Huntley can be reached at (303) 297-4307. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.
/Adam C Standke/
Primary Examiner
Art Unit 2129