Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Claim Rejections - 35 USC § 102
The following is a quotation of the appropriate paragraphs of 35 U.S.C. 102 that form the basis for the rejections under this section made in this Office action:
A person shall be entitled to a patent unless –
(a)(1) the claimed invention was patented, described in a printed publication, or in public use, on sale or otherwise available to the public before the effective filing date of the claimed invention.
Claims 1-18 are rejected under 35 U.S.C. 102(a)(1) as being anticipated by Lacy et al. (U.S. Patent No. 10,261,786), hereinafter referred to as Lacy.
Referring to claim 1, Lacy discloses an intelligence processing unit (IPU) coupled to an external memory storing a first tensor and a second tensor (System 100 is an example data processing system for performing tensor or vectorized computations associated with inference workloads for multi-layer DNNs. External memory 106 stores tensors, and system 100 includes a VPU lane 102 (i.e., an IPU) coupled to it, Fig. 1; col. 5, lines 45–50; col. 9, lines 40–52), the IPU comprising:
a memory (Each VPU lane includes vector memory 204 for storing data locally, referred to as vmem 204, Fig. 2; col. 10, lines 1–7);
a direct memory access (DMA) circuit coupled to the external memory and the memory and configured to perform following steps (DMA transfers from external memory 106 to vmem 204 are initiated by control instructions or by the host processor; DMA logic is implied in the movement of data between memory tiers, Fig. 1; col. 9, lines 40–67; col. 10, lines 15–30):
reading a first part of the first tensor from the external memory (External memory 106 provides vector element data (tensor parts) to vmem 204, supporting partial loads of tensors, col. 10, lines 1–7);
storing the first part of the first tensor in the memory (The first part is stored in vector memory 204 for local access, col. 10, lines 1–15);
reading a second part of the second tensor from the external memory (External memory is used to retrieve and load additional vector/tensor data into memory 204, col. 10, lines 1–15); and
storing the second part of the second tensor in the memory (Multiple banks in vmem 204 store the retrieved tensor segments for access by sublane processors, col. 10, lines 1–15; Fig. 2); and
a vector accelerator that comprises a register circuit, is coupled to the memory, and is configured to perform following steps (Each processing unit 202 acts as a vector accelerator and includes register file 206 coupled to vmem 204, Fig. 2; col. 11, lines 1–20):
storing P bytes of the first part of the first tensor in a target row of the register circuit, P being a positive integer (Data from vmem is loaded into specific registers (e.g., V0–V31) in register file 206; these registers store vector data from tensor segments, col. 11, lines 20–35);
storing Q bytes of the second part of the second tensor in the target row of the register circuit, Q being a positive integer (Registers store additional values (Q bytes) from the second tensor in the same row or register block, col. 11, lines 20–45); and
writing data of the target row into the memory (After arithmetic operations, results are written back to vmem 204 via crossbar 212b, which enables transfer from register files to memory, col. 11, lines 45–60; Fig. 2).
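The claim 1 dataflow mapped above (the DMA circuit staging parts of two tensors into local memory, and the vector accelerator then storing P bytes of the first part and Q bytes of the second part in the same target row before writing the row back) can be sketched as a minimal Python model. All names and sizes here (ROW_BYTES, the offsets, P = Q = 4) are illustrative assumptions for explanation only, not details taken from Lacy.

```python
# Minimal sketch of the claim 1 dataflow, under assumed sizes.
ROW_BYTES = 32  # one register row, e.g. 8 x 32-bit lanes (assumption)

def dma_read(external_memory: bytes, offset: int, length: int) -> bytes:
    """DMA step: read a part of a tensor from external memory."""
    return external_memory[offset:offset + length]

def pack_row(part_a: bytes, p: int, part_b: bytes, q: int) -> bytes:
    """Accelerator step: store P bytes of the first part and Q bytes of
    the second part in the same target row, padding the remainder."""
    assert p + q <= ROW_BYTES
    row = part_a[:p] + part_b[:q]
    return row.ljust(ROW_BYTES, b"\x00")

# Usage: two small tensors flattened in "external memory"
tensor_a = bytes(range(16))
tensor_b = bytes(range(100, 116))
local_a = dma_read(tensor_a, 0, 8)   # first part of first tensor
local_b = dma_read(tensor_b, 0, 8)   # second part of second tensor
row = pack_row(local_a, 4, local_b, 4)
# 'row' would then be written back into the local (vector) memory
```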
As to claim 2, Lacy discloses the IPU of claim 1, wherein the target row is a first target row, and the vector accelerator is further configured to perform following steps:
storing R bytes of the first part of the first tensor in a second target row of the register circuit, R being a positive integer; wherein the second target row is different from the first target row, and P is equal to R (Lacy discloses multiple vector registers (V0–V31) within register file 206 that receive data segments from the same tensor (first part), and can be accessed and written separately. Equal sizes (P = R) are typical due to fixed register widths (e.g., 32 bits), Fig. 2; col. 11, lines 20–45).
As to claim 3, Lacy discloses the IPU of claim 2, wherein the second target row is next to the first target row (Registers V0–V31 are indexed consecutively in register file 206, implying adjacency of register rows; thus, storing data in adjacent rows is inherent in Lacy’s architecture, col. 11, lines 20–35).
As to claim 4, Lacy discloses the IPU of claim 2, wherein storing of the P bytes in the first target row of the register circuit and storing of the R bytes in the second target row of the register circuit are completed in a same one write operation of the vector accelerator (Lacy describes a single load or write instruction from vmem 204 to register file 206 that loads multiple vector elements simultaneously, which can span multiple registers (e.g., V0, V1), constituting a single write operation, col. 11, lines 30–45).
As to claim 5, Lacy discloses the IPU of claim 2, wherein the vector accelerator is further configured to perform following steps:
storing S bytes of the second part of the second tensor (vector register V1 loaded after second part read from memory, col. 11, lines 35–45) in the second target row of the register circuit, S being equal to Q (second tensor data also stored in 8×32-bit wide registers like first tensor, col. 11, lines 40–45); and writing the first target row and the second target row into the memory simultaneously (crossbar 212b supports c1oncurrent routing of multiple register outputs to memory, Fig. 2; col. 12, lines 5–20); wherein the vector accelerator writes at most W bytes in one write operation to the memory (wide vector architecture allows full register row (256 bits) to be written per operation, col. 11, lines 50–60), and W is greater than or equal to a sum of P, Q, R, and S (Register 206 stores data from both tensors, and crossbar 212b supports simultaneous writeback of multiple registers to memory. Lacy supports wide vector instructions and data paths (e.g., 8×32-bit words), enabling single-cycle writes of aggregate sizes W ≥ sum of P, Q, R, S, Fig. 2; col. 11, lines 45–60; col. 12, lines 5–20; In column 11, lines 45–60, Lacy describes how the first and second load sequences complete with vector registers V0 and V1 filled, followed by computation and writeback to memory. Each vector register stores 8×32-bit elements (i.e., 256 bits or 32 bytes), and Lacy's architecture supports parallel or burst writeback of these registers through a wide crossbar (e.g., 212b) to memory. When both V0 and V1 are written to memory—either simultaneously or in a tightly coupled sequence—the total data written reaches 64 bytes or more. This supports the claimed limitation that the vector accelerator writes at most W bytes in one write operation to memory, where W is greater than or equal to the sum of P, Q, R, and S).
As to claim 6, Lacy discloses the IPU of claim 1, wherein the innermost dimension of the first tensor is P (first tensor segments are processed in 32-bit-wide vector elements, so P = 32 bits, col. 11, lines 20–30), and the innermost dimension of the second tensor is Q (the innermost dimensions (P, Q) correspond to the width of the vector elements. Lacy's tensors are stored and operated on in vectors of fixed widths (e.g., 32-bit elements per vector register), col. 9, lines 56–60; col. 11, lines 20–35. The tensors are explicitly processed using vector registers that operate on fixed-width elements, particularly as described in column 11, lines 20–30, where the system loads data into vector registers V0 and V1, each comprising eight 32-bit elements. This matches the vector data type uint32x8_t, which consists of 8 elements, each 32 bits wide. The innermost dimension of a tensor, by standard definition, refers to the smallest stride or element-wise width along the last axis, which in Lacy is precisely the width of the vector elements being loaded, processed, and written. As further supported by column 9, lines 56–60, which references the structured use of 32-bit vector elements in tensor computation, it is reasonable to interpret P and Q as 32 bits, since that is the actual granularity of data movement and computation in the hardware. Therefore, Lacy discloses the claimed "innermost dimension" of the tensors as being 32-bit-wide elements, satisfying the requirement that P = Q = 32 bits).
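The notion of "innermost dimension" relied on above can be illustrated with a small row-major tensor whose last axis matches the eight 32-bit lanes of the uint32x8_t vector type discussed; the 4×8 tensor shape is an assumption for illustration only.

```python
# Illustration of "innermost dimension" for a row-major tensor. The
# 4 x 8 shape is assumed; the 8 x 32-bit lane layout mirrors the
# uint32x8_t vector type discussed above.
LANES = 8                     # elements per vector register (8 x 32-bit)

# 4 x 8 tensor of 32-bit values; the last (innermost) axis is the
# contiguously stored one, so each innermost run fills one register row.
tensor = [[r * LANES + c for c in range(LANES)] for r in range(4)]

innermost = len(tensor[0])        # 8 elements along the last axis
innermost_bits = innermost * 32   # 256 bits: exactly one register row
```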
As to claim 7, Lacy discloses the IPU of claim 1, wherein in the memory, a ratio of the first part (e.g., 128 elements of first tensor loaded to vector register V0, col. 11, lines 20–45) of the first tensor to the second part (128 elements of second tensor loaded to V1, col. 11, lines 20–45) of the second tensor is P/Q (Fixed-width register file and memory banks imply consistent ratios in how vectorized tensor parts are stored. When both parts are loaded into a common vector row or sequence, the memory layout inherently reflects the P/Q proportion, col. 11, lines 20–45).
As to claim 8, Lacy discloses the IPU of claim 1, wherein whenever the DMA circuit reads a part of the first tensor P times (read operations fill vector registers based on element counts P, e.g., 32-bit elements × 8 lanes, col. 12, lines 1–15) from the external memory, the DMA circuit reads a part of the second tensor Q (same number of fetches for Q elements of second tensor, using vectorized access logic, col. 12, lines 1–15) times from the external memory (Lacy supports parallel or balanced fetching of different tensor parts depending on the register fill logic and vector instruction structure. Equal and proportionate memory accesses are implied by consistent vector lengths and use of broadcast/fetch logic, col. 10, lines 1–15; col. 12, lines 1–15).
As to claim 9, Lacy discloses the IPU of claim 1, wherein the DMA circuit further performs following steps: writing an effective data of the target row into the external memory; wherein an amount of the effective data is greater than or equal to a sum of P and Q (Lacy describes writeback from register file 206 to vmem and back to external memory 106 via DMA operations, including full vector row contents (i.e., 8×32 bits = 256 bits). The amount written equals or exceeds P + Q when multiple operands are involved, col. 11, lines 45–60; col. 12, lines 15–30. When data from multiple tensors (or operands) are loaded into vector registers (e.g., V0, V1) and written back to memory together, the combined write size includes both tensor parts. Since each part contributes at least P or Q bytes, the total written data from the register (e.g., 256 bits or more) meets or surpasses the sum of P and Q, especially when both parts are loaded into the same or adjacent vector rows and written back in a single or simultaneous operation).
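The claim 9 bound mapped above (a row holding both tensor parts is written back, so the effective payload is at least P + Q) can be sketched as follows; the 32-byte row and the P = Q = 8 byte counts are illustrative assumptions, not figures from Lacy.

```python
# Sketch of the claim 9 writeback bound: a register row carrying P bytes
# of one tensor part and Q bytes of another is written back, and the
# effective (non-padding) payload is at least P + Q. Sizes are assumed.
ROW_BYTES = 32

def write_back(part_a: bytes, part_b: bytes) -> tuple[bytes, int]:
    """Pack both parts into one row, pad it, and report effective bytes."""
    effective = len(part_a) + len(part_b)
    assert effective <= ROW_BYTES
    row = (part_a + part_b).ljust(ROW_BYTES, b"\x00")
    return row, effective

row, effective = write_back(b"\x01" * 8, b"\x02" * 8)  # P = Q = 8
assert effective >= 8 + 8   # amount of effective data >= P + Q
```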
Claims 10-18 recite limitations corresponding to those of claims 1-9 and are therefore rejected for the same reasons as set forth above.
Response to Arguments
Applicant's arguments filed 1/31/2024 have been fully considered but they are not persuasive.
Applicant’s Argument:
The Applicant argues that Lacy fails to disclose the limitation of claim 1 wherein “P bytes of the first part of the first tensor” and “Q bytes of the second part of the second tensor” are stored in the same “target row” of the register circuit, and further fails to disclose the step of “writing data of the target row into the memory.” See Applicant’s remarks filed 8/14/2025, at pages 2-3.
Examiner’s Response:
With respect to the first argument, Lacy teaches a data processing system comprising a vector processing architecture in which vector registers V0 and V1 are part of a unified register file used for SIMD operations (see Lacy, e.g., col. 11, lines 20–45; Fig. 2). In such SIMD architectures, registers V0 and V1 reside in the same register file (206) and participate in concurrent operations across multiple lanes, functionally operating as rows of a single register array. The claim's language of a "target row" does not require a literal single register; rather, it is a functional designation for data temporarily stored in preparation for memory operations. Accordingly, the distinction drawn by the Applicant, namely that V0 and V1 are not "in the same row," is a semantic distinction that does not distinguish over the register structure and operations disclosed in Lacy.
Further, with respect to the second argument, Lacy explicitly discloses that vector data from the register file is written back to memory following computation. Specifically, col. 14, lines 20–25 states:
Results of vector computations are written back to vector memory 204 via crossbar 212b.
This disclosure satisfies the claimed step of “writing data of the target row into the memory,” since the contents of the vector registers, which collectively form the claimed target row, are written back after completion of operations. Therefore, Lacy expressly teaches the write-back of data from the vector register(s) to memory.
Examiner’s Response to Argument Regarding Claim 6
The Applicant argues that Lacy fails to disclose “manipulating the innermost dimensions of the first and second tensors,” and more particularly fails to disclose that “the innermost dimension of the tensor under operation is exactly that fixed width (e.g., 32 bits).” Applicant therefore contends that Lacy does not disclose all the technical features recited in claim 6 and requests withdrawal of the rejection.
Lacy discloses a vector processing architecture in which tensor data is processed in vector registers with a fixed element width, such as 32-bit floating-point elements (see Lacy, col. 11, lines 15–45). In particular, Lacy states:
Each vector register contains 128 8-bit values, corresponding to 32 single-precision (32-bit) floating point values.
This explicit description confirms that the architecture performs element-wise operations over 32-bit values per register lane. Such a configuration inherently defines the innermost dimension of the tensor (i.e., the contiguous elements along a single row or vector lane) as the vector width (32 bits). Lacy further discloses that data is loaded, operated on, and stored using this structure, indicating that operations are applied directly to fixed-width segments of the tensor, that is, to its innermost dimension.
Therefore, to the extent that claim 6 requires manipulating the innermost dimension of the tensor as being a fixed width (e.g., 32 bits), Lacy teaches or inherently discloses this feature through its use of fixed-width vector register operations over tensor data.
Accordingly, Applicant’s assertion that the innermost dimension of the tensor is not disclosed as the fixed width is not supported. No additional structural or operational distinction has been identified in the claim that would distinguish it over Lacy.
For the reasons given above, the examiner respectfully maintains the previous rejection.
Conclusion
THIS ACTION IS MADE FINAL. Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a).
A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action. In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any nonprovisional extension fee (37 CFR 1.17(a)) pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action. In no event, however, will the statutory period for reply expire later than SIX MONTHS from the mailing date of this final action.
Contact Information
Any inquiry concerning this communication or earlier communications from the examiner should be directed to JUANITO C BORROMEO whose telephone number is (571) 270-1720. The examiner can normally be reached Monday through Friday, 9:00-5:00.
If attempts to reach the examiner by telephone are unsuccessful, the examiner's supervisor, Henry Tsai, can be reached at 571-272-4176. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of an application may be obtained from the Patent Application Information Retrieval (PAIR) system. Status information for published applications may be obtained from either Private PAIR or Public PAIR. Status information for unpublished applications is available through Private PAIR only. For more information about the PAIR system, see http://pair-direct.uspto.gov. Should you have questions on access to the Private PAIR system, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative or access to the automated information system, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.
/J.C.B/ Assistant Examiner, Art Unit 2184
/HENRY TSAI/ Supervisory Patent Examiner, Art Unit 2184