DETAILED ACTION
The instant application, Application No. 18/987,158, has a total of 20 claims pending, all of which are ready for examination by the examiner.
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Priority
Acknowledgement is made of applicant’s claim for foreign priority based on an application filed in REPUBLIC OF KOREA on 5/7/2024. Receipt is acknowledged of certified copies of papers required by 37 CFR 1.55.
Information Disclosure Statement
The information disclosure statements (IDS) submitted on 12/19/2024 and 7/29/2025 are being considered by the examiner.
Claim Objections
Claim 13 is objected to because of the following informalities:
With respect to line 13 of claim 13, the examiner recommends amending the term ‘an outside’ to ‘the outside’ or ‘an external device’ in accordance with paragraph 64 of the specification.
Claim Rejections - 35 USC § 103
In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.
Claims 1-4 and 9-12 are rejected under 35 U.S.C. 103 as being unpatentable over Dhakal et al. (US 20250321890 A1) in view of Kwon et al. (Kwon, Woosuk, et al. "Efficient memory management for large language model serving with PagedAttention." Proceedings of the 29th Symposium on Operating Systems Principles, 2023), further in view of Yoshida et al. (US 20200133879 A1), and further in view of Park et al. (US 20220300157 A1).
As per claim 1,
1. A transformer acceleration device comprising: a memory device … [Dhakal teaches an SSD storing key and value vectors (KV cache) associated with tokens generated through accelerators associated with a transformer architecture (para. 18-20, 24-25, 30, 35-38, 10-13; figs. 1-3 and associated paragraphs)] a memory striding circuit configured to access the first and second memory blocks in response to a first striding request provided from an external device, the memory striding circuit including a memory block address management circuit configured to store a first memory block base address for the first memory block, and store a second memory block base address for the second memory block; a target address generation circuit configured to calculate, [Dhakal teaches a network controller (NIC) (memory striding circuit and components therein) that may receive a KV-cache-transfer request from a compute node and access a storage location to obtain the KV cache from the SSD for transfer to a GPU (para. 29-30, 40), where the NIC can queue a data-transfer command to the SSD for the transfer and transfer, to the GPU, an initial portion of the KV cache corresponding to the first layers to allow start of inference operations (para. 45-46; fig. 4 and associated paragraphs)]
Dhakal does not explicitly disclose, but Kwon discloses:
… including a first memory block configured to store a first plurality of cache vectors for a first plurality of tokens, and a second memory block configured to store a second plurality of cache vectors for a second plurality of tokens; and … the first and second memory blocks [Kwon teaches storing key and value vectors for tokens to memory, wherein the key and value vectors are stored sequentially in a plurality of blocks as the associated tokens are generated, and where first vectors generated according to tokens for an initial prompt and first decoding step may be stored in multiple blocks (section 4.1, para. 1; section 4.3, para. 1-3; figs. 5-6)]
Dhakal and Kwon are analogous to the claimed invention because they are in the same field of endeavor involving data storage.
It would have been obvious for one of ordinary skill in the art before the effective filing date of the claimed invention, having knowledge of Dhakal and Kwon, to modify the disclosures by Dhakal to include disclosures by Kwon since they both teach data storage and memory access, wherein Kwon is directed towards more flexible paged memory management in LLM serving (Kwon: section 4.1). Therefore, it would be applying a known technique (sequentially storing key and value vectors, including those of tokens of an initial prompt and first decoding step, across multiple memory blocks) to a known device (system for storing key and value vectors in an SSD which transfers an initial portion of the key and value vectors from the SSD according to a request) ready for improvement to yield predictable results (system for sequentially storing key and value vectors across multiple memory blocks in an SSD, and, responsive to a request, transferring initial portion of the key and value vectors from the corresponding memory blocks; doing so would provide for more flexible storage of and access to key and value vectors). MPEP 2143
Dhakal in view of Kwon does not explicitly disclose, but Yoshida discloses:
store a first memory block base address for the first memory block, and store a second memory block base address for the second memory block; in response to the first striding request, a first target address included in the first memory block based on the first memory block base address and a first subblock offset, and calculate, in response to the first striding request, a second target address included in the second memory block based on the second memory block base address and the first subblock offset; and [Dhakal in view of Kwon as shown above teaches a NIC receiving transfer requests and transmitting initial key and value vectors in memory blocks as shown above (Dhakal: para. 45-46); Dhakal in view of Kwon does not explicitly disclose, but Yoshida discloses, a controller receiving, from a host, a read request corresponding to a plurality of blocks, the command comprising a plurality of block numbers, offsets, and lengths of data to be read (para. 187-188, 299-311; figs. 24-25 and associated paragraphs); Yoshida teaches determining the physical storage location to be read based on the read command (para. 187-188, 299-311; figs. 24-25 and associated paragraphs)] a command issue circuit configured to issue a first plurality of memory access commands for a first target subblock located in the first target address, and issue a second plurality of memory access commands for a second target subblock located in the second target address. [Yoshida teaches reading the data in units of pages (see para. 307, showing an entire page being read for extracting data smaller than the page when the read length is smaller than a page); while Yoshida does not explicitly provide an example of a read length exceeding a page size for a read command, where Yoshida as shown above teaches a read length of data to be read and performing reads in page units, it would have been obvious to one of ordinary skill in the art that multiple reads corresponding to multiple pages may be performed in the event of a read length exceeding a page size in order to provide for an efficient read process capable of handling read sizes exceeding a page]
Dhakal, Kwon, and Yoshida are analogous to the claimed invention because they are in the same field of endeavor involving data storage.
It would have been obvious for one of ordinary skill in the art before the effective filing date of the claimed invention, having knowledge of Dhakal in view of Kwon and Yoshida, to modify the disclosures by Dhakal in view of Kwon to include disclosures by Yoshida since they both teach data storage and memory access, wherein Yoshida is directed towards improved interfacing between host and storage (para. 5). Therefore, it would be applying a known technique (read command comprising memory block number, offset, and read length for a plurality of memory blocks) to a known device (transmitting vectors in memory blocks according to a transfer command) ready for improvement to yield predictable results (transmitting vectors in memory blocks according to a command indicating the blocks, offsets, and length of data to be read in order to provide for improved host control over data to be read from storage). MPEP 2143
Dhakal in view of Kwon in view of Yoshida does not explicitly disclose, but Park discloses:
the first subblock offset [Dhakal in view of Kwon in view of Yoshida as shown above teaches reading a plurality of blocks according to a read command; while it does not explicitly disclose using the same offset for the first and second blocks, Park provides for performing reads across a plurality of partial blocks, contained in respective memory blocks, as partial super blocks (para. 49), where the partial blocks so read in the plurality of blocks may have the same relative positions within their blocks (e.g., pages 1-8) (para. 50-54); it would have been obvious to one of ordinary skill in the art, provided with the disclosures by Dhakal in view of Kwon in view of Yoshida, directed towards a read command providing for reading a plurality of blocks through respective block numbers, offsets, and read lengths, and the disclosures by Park, directed towards reading together partial blocks having the same relative positions within their blocks, to provide for a combination where the read command may direct a read of data in the same relative locations of each of the first and second blocks by providing the same offset and read length for the blocks, as doing so would provide for greater predictability in read performance by providing for greater uniformity in access location and size]
Dhakal, Kwon, Yoshida, and Park are analogous to the claimed invention because they are in the same field of endeavor involving data storage.
It would have been obvious for one of ordinary skill in the art before the effective filing date of the claimed invention, having knowledge of Dhakal in view of Kwon in view of Yoshida and Park, to modify the disclosures by Dhakal in view of Kwon in view of Yoshida to include disclosures by Park since they both teach data storage and memory access, wherein Park is directed towards improved storage device and operation thereof (para. 2). Therefore, it would be applying a known technique (reading together partial blocks having the same relative positions within their blocks) to a known device (a command indicating respective blocks, offsets, and read data size length for reading/transferring data) ready for improvement to yield predictable results (a command indicating respective blocks, offsets, and read data lengths for reading/transferring data, wherein the command may indicate, for the respective blocks, the data in same relative positions by using the same offsets and read data lengths to provide for improved predictability in performing the read operations). MPEP 2143
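For illustration only (not drawn from the claims or any cited reference; all names are hypothetical), the address arithmetic discussed above for claim 1, in which each target address is derived from a memory block base address plus a subblock offset shared across blocks, can be sketched as:

```python
# Illustrative sketch only; hypothetical names, not from any cited reference.
# Each target address = that memory block's base address + the common
# subblock offset, per the striding arrangement recited in claim 1.
def target_addresses(block_bases, subblock_offset):
    """Return one target address per memory block."""
    return [base + subblock_offset for base in block_bases]

# Example: two memory blocks sharing a 0x40 subblock offset.
bases = [0x1000, 0x2000]
addrs = target_addresses(bases, 0x40)
# addrs == [0x1040, 0x2040]
```

The point of the sketch is that a single striding request carrying per-block base addresses and one shared offset suffices to locate corresponding subblocks in every block.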
As per claim 2, Dhakal in view of Kwon in view of Yoshida in view of Park teaches claim 1 as shown above and further teaches:
2. The transformer acceleration device of claim 1, wherein: a size of each of the first and second target subblocks is a first reading size. [Dhakal in view of Kwon in view of Yoshida in view of Park as shown above teaches a read command comprising a length of data to be read for the blocks (see claim 1 above; Yoshida: para. 187-188, 299-311; Park: para. 49-54)]
Dhakal, Kwon, Yoshida, and Park are analogous to the claimed invention because they are in the same field of endeavor involving data storage.
It would have been obvious for one of ordinary skill in the art before the effective filing date of the claimed invention, having knowledge of Dhakal in view of Kwon in view of Yoshida and Park, to modify the disclosures by Dhakal in view of Kwon in view of Yoshida to include disclosures by Park since they both teach data storage and memory access, wherein Park is directed towards improved storage device and operation thereof (para. 2). Therefore, it would be applying a known technique (reading together partial blocks having the same relative positions within their blocks) to a known device (a command indicating respective blocks, offsets, and read data size length for reading/transferring data) ready for improvement to yield predictable results (a command indicating respective blocks, offsets, and read data lengths for reading/transferring data, wherein the command may indicate, for the respective blocks, the data in same relative positions by using the same offsets and read data lengths to provide for improved predictability in performing the read operations). MPEP 2143
As per claim 3, Dhakal in view of Kwon in view of Yoshida in view of Park teaches claim 2 as shown above and further teaches:
3. The transformer acceleration device of claim 2, wherein: the first striding request comprises the first memory block base address, the second memory block base address, the first subblock offset, and the first reading size. [Dhakal in view of Kwon in view of Yoshida as shown above teaches a read request comprising a plurality of block numbers as well as an offset and a length of data to be read (see claims 1-2 above; Yoshida: para. 187-188, 299-311; Park: para. 49-54)]
Dhakal, Kwon, Yoshida, and Park are analogous to the claimed invention because they are in the same field of endeavor involving data storage.
It would have been obvious for one of ordinary skill in the art before the effective filing date of the claimed invention, having knowledge of Dhakal in view of Kwon in view of Yoshida and Park, to modify the disclosures by Dhakal in view of Kwon in view of Yoshida to include disclosures by Park since they both teach data storage and memory access, wherein Park is directed towards improved storage device and operation thereof (para. 2). Therefore, it would be applying a known technique (reading together partial blocks having the same relative positions within their blocks) to a known device (a command indicating respective blocks, offsets, and read data size length for reading/transferring data) ready for improvement to yield predictable results (a command indicating respective blocks, offsets, and read data lengths for reading/transferring data, wherein the command may indicate, for the respective blocks, the data in same relative positions by using the same offsets and read data lengths to provide for improved predictability in performing the read operations). MPEP 2143
As per claim 4, Dhakal in view of Kwon in view of Yoshida in view of Park teaches claim 2 as shown above and further teaches:
4. The transformer acceleration device of claim 2, wherein: the target address generation circuit is configured to, in response to the first striding request, calculate a third target address included in the first memory block based on the first memory block base address and a second subblock offset; and calculate a fourth target address included in the second memory block based on the second memory block base address and the second subblock offset, and the command issue circuit is further configured to, issue a third plurality of memory access commands for a third target subblock located in the third target address, and issue a fourth plurality of memory access commands for a fourth target subblock located in the fourth target address. [Where Dhakal in view of Kwon in view of Yoshida in view of Park as shown above teaches the command directed towards reading the respective partial blocks of the blocks being read (see claims 1-2 above; Yoshida: para. 187-188, 299-311; Park: para. 49-54); Park additionally provides for each block comprising a plurality of partial blocks (see Park: para. 50, providing for a first partial block comprising pages 1-8 and a second partial block comprising pages 9-16 belonging to respective partial super blocks), where the partial blocks having the same relative positions in their blocks may be assigned to respective partial super blocks and accessed in order based on their partial super block grouping (para. 50-54; para. 122); it would have been obvious to one of ordinary skill in the art, provided with the disclosures by Dhakal in view of Kwon in view of Yoshida in view of Park, directed towards the command comprising memory block numbers and an offset for reading data having the same relative positions in their respective blocks, and the additional disclosures by Park, directed towards also reading a second set of partial blocks in the blocks having the same relative positions as each other, to provide for a combination where a command may further comprise a second offset for also reading the second set of partial blocks in the blocks.]
Dhakal, Kwon, Yoshida, and Park are analogous to the claimed invention because they are in the same field of endeavor involving data storage.
It would have been obvious for one of ordinary skill in the art before the effective filing date of the claimed invention, having knowledge of Dhakal in view of Kwon in view of Yoshida and Park, to modify the disclosures by Dhakal in view of Kwon in view of Yoshida in view of Park to include additional disclosures by Park since they both teach data storage and memory access, wherein Park is directed towards improved storage device and operation thereof (para. 2). Therefore, it would be applying a known technique (blocks comprising multiple partial blocks grouped based on their relative positions within the blocks and accessed together in a group) to a known device (a command indicating, for the respective blocks, the data in same relative positions by using the same offsets) ready for improvement to yield predictable results (a command indicating, for respective blocks, the data in same relative positions in the blocks by using the same offsets and read data lengths, wherein the command may include a plurality of offsets for reading a plurality of sets of data each having same relative position as each other in order to provide for improved throughput). MPEP 2143
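For illustration only (hypothetical names, not from the claims or any cited reference), the claim 4 arrangement extends the single-offset calculation to multiple subblock offsets, producing one target address per (offset, block) pair:

```python
# Illustrative sketch only; hypothetical names, not from any cited reference.
# With several subblock offsets, each offset is combined with every block
# base address, yielding third and fourth target addresses as in claim 4.
def target_addresses(block_bases, subblock_offsets):
    """Return base + offset for every (offset, block) pair."""
    return [base + off for off in subblock_offsets for base in block_bases]

addrs = target_addresses([0x1000, 0x2000], [0x40, 0x80])
# addrs == [0x1040, 0x2040, 0x1080, 0x2080]
```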
As per claim 9, Dhakal in view of Kwon in view of Yoshida in view of Park teaches claim 1 as shown above and further teaches:
The transformer acceleration device of claim 1, wherein the memory device is configured to store cache vectors included in the first target subblock among the plurality of first cache vectors in addresses adjacent to each other, and store cache vectors included in the second target subblock among the plurality of second cache vectors in addresses adjacent to each other. [Dhakal in view of Kwon in view of Yoshida in view of Park teaches key and value vectors stored sequentially according to their tokens (Kwon: section 4.1, para. 1; section 4.3, para. 1-3; figs. 5-6) and accessed according to the command addressing a contiguous area within the blocks (Yoshida: para. 187-188, 299-311; Park: para. 50-54)]
Dhakal and Kwon are analogous to the claimed invention because they are in the same field of endeavor involving data storage.
It would have been obvious for one of ordinary skill in the art before the effective filing date of the claimed invention, having knowledge of Dhakal and Kwon, to modify the disclosures by Dhakal to include disclosures by Kwon since they both teach data storage and memory access, wherein Kwon is directed towards more flexible paged memory management in LLM serving (Kwon: section 4.1). Therefore, it would be applying a known technique (sequentially storing key and value vectors, including those of tokens of an initial prompt, across multiple memory blocks) to a known device (system for storing key and value vectors in an SSD which transfers an initial portion of the key and value vectors from the SSD according to a request) ready for improvement to yield predictable results (system for sequentially storing key and value vectors across multiple memory blocks in an SSD, and, responsive to a request, transferring initial portion of the key and value vectors from the corresponding memory blocks; doing so would provide for more flexible storage of and access to key and value vectors). MPEP 2143
Dhakal, Kwon, and Yoshida are analogous to the claimed invention because they are in the same field of endeavor involving data storage.
It would have been obvious for one of ordinary skill in the art before the effective filing date of the claimed invention, having knowledge of Dhakal in view of Kwon and Yoshida, to modify the disclosures by Dhakal in view of Kwon to include disclosures by Yoshida since they both teach data storage and memory access, wherein Yoshida is directed towards improved interfacing between host and storage (para. 5). Therefore, it would be applying a known technique (read command comprising memory block number, offset, and read length for a plurality of memory blocks) to a known device (transmitting vectors in memory blocks according to a transfer command) ready for improvement to yield predictable results (transmitting vectors in memory blocks according to a command indicating the blocks, offsets, and length of data to be read in order to provide for improved host control over data to be read from storage). MPEP 2143
Dhakal, Kwon, Yoshida, and Park are analogous to the claimed invention because they are in the same field of endeavor involving data storage.
It would have been obvious for one of ordinary skill in the art before the effective filing date of the claimed invention, having knowledge of Dhakal in view of Kwon in view of Yoshida and Park, to modify the disclosures by Dhakal in view of Kwon in view of Yoshida to include disclosures by Park since they both teach data storage and memory access, wherein Park is directed towards improved storage device and operation thereof (para. 2). Therefore, it would be applying a known technique (reading together partial blocks having the same relative positions within their blocks) to a known device (a command indicating respective blocks, offsets, and read data size length for reading/transferring data) ready for improvement to yield predictable results (a command indicating respective blocks, offsets, and read data lengths for reading/transferring data, wherein the command may indicate, for the respective blocks, the data in same relative positions by using the same offsets and read data lengths to provide for improved predictability in performing the read operations). MPEP 2143
As per claim 10, Dhakal in view of Kwon in view of Yoshida in view of Park teaches claim 1 as shown above and further teaches:
The transformer acceleration device of claim 1, wherein: the first plurality of tokens and the second plurality of tokens are included in a first token sequence, and the first plurality of tokens and the second plurality of tokens are continuous in the first token sequence. [Kwon teaches sequentially storing vectors for tokens into the blocks as they are generated and provides an example of two blocks comprising initially generated tokens sequentially followed by a token generated in the next step (see claim 1 above; section 4.3, para. 1-3; fig. 6, showing physical blocks 1 and 7 comprising key and value vectors for an initial set of tokens and a subsequent token generated in the following step)]
Dhakal and Kwon are analogous to the claimed invention because they are in the same field of endeavor involving data storage.
It would have been obvious for one of ordinary skill in the art before the effective filing date of the claimed invention, having knowledge of Dhakal and Kwon, to modify the disclosures by Dhakal to include disclosures by Kwon since they both teach data storage and memory access, wherein Kwon is directed towards more flexible paged memory management in LLM serving (Kwon: section 4.1). Therefore, it would be applying a known technique (sequentially storing key and value vectors, including those of tokens of an initial prompt, across multiple memory blocks) to a known device (system for storing key and value vectors in an SSD which transfers an initial portion of the key and value vectors from the SSD according to a request) ready for improvement to yield predictable results (system for sequentially storing key and value vectors across multiple memory blocks in an SSD, and, responsive to a request, transferring initial portion of the key and value vectors from the corresponding memory blocks; doing so would provide for more flexible storage of and access to key and value vectors). MPEP 2143
As per claim 11, Dhakal in view of Kwon in view of Yoshida in view of Park teaches claim 10 as shown above and further teaches:
11. The transformer acceleration device of claim 10, further comprising: a calculation circuit configured to perform an attention calculation based on cache vectors included in the first target subblock among the first plurality of cache vectors and cache vectors included in the second target subblock among the second plurality of cache vectors. [Dhakal in view of Kwon in view of Yoshida in view of Park as shown above teaches that the data from the SSD are being retrieved to start inference on a query by the GPU (Dhakal: para. 24-25, 30, 45-46)]
As per claim 12, Dhakal in view of Kwon in view of Yoshida in view of Park teaches claim 1 as shown above and further teaches:
12. The transformer acceleration device of claim 1, wherein: the first plurality of memory access commands include a first plurality of active commands and a first plurality of read commands, and the second plurality of memory access commands include a second plurality of active commands and a second plurality of read commands. [Dhakal in view of Kwon in view of Yoshida in view of Park as shown above teaches issuing a plurality of read commands corresponding to each page being read according to the read data length (see claim 1 above; Yoshida: para. 307, para. 187-188, 299-311), wherein a read command may necessarily comprise an active command at least by virtue of comprising an action.]
Dhakal, Kwon, and Yoshida are analogous to the claimed invention because they are in the same field of endeavor involving data storage.
It would have been obvious for one of ordinary skill in the art before the effective filing date of the claimed invention, having knowledge of Dhakal in view of Kwon and Yoshida, to modify the disclosures by Dhakal in view of Kwon to include disclosures by Yoshida since they both teach data storage and memory access, wherein Yoshida is directed towards improved interfacing between host and storage (para. 5). Therefore, it would be applying a known technique (read command comprising memory block number, offset, and read length for a plurality of memory blocks) to a known device (transmitting vectors in memory blocks according to a transfer command) ready for improvement to yield predictable results (transmitting vectors in memory blocks according to a command indicating the blocks, offsets, and length of data to be read in order to provide for improved host control over data to be read from storage). MPEP 2143
Claims 5 and 8 are rejected under 35 U.S.C. 103 as being unpatentable over Dhakal et al. (US 20250321890 A1) in view of Kwon et al. (Kwon, Woosuk, et al. "Efficient memory management for large language model serving with PagedAttention." Proceedings of the 29th Symposium on Operating Systems Principles, 2023), further in view of Yoshida et al. (US 20200133879 A1), further in view of Park et al. (US 20220300157 A1), and further in view of Xu et al. (US 11550736 B1).
As per claim 5, Dhakal in view of Kwon in view of Yoshida in view of Park teaches claim 4 as shown above. It does not explicitly disclose, but Xu discloses:
5. The transformer acceleration device of claim 4, wherein the first striding request includes the first memory block base address, the second memory block base address, a head address interval, a layer address interval, and the first reading size. [Dhakal in view of Kwon in view of Yoshida in view of Park as shown above teaches a command comprising block numbers, offset, and read data length (Yoshida: para. 187-188, 299-311); it does not explicitly disclose, but Xu discloses, performing memory access according to a data transfer size, address, and stride, where the stride may be a multi-dimensional stride comprising a memory row offset and a memory column offset used for performing successive memory accesses (col. 8, lines 26-42; col. 8, line 66 – col. 9, line 16; col. 9, line 55 – col. 10, line 5; figs. 4-6 and associated paragraphs); it would have been obvious to one of ordinary skill in the art, provided with the disclosures by Dhakal in view of Kwon in view of Yoshida in view of Park, directed towards a command comprising parameters including block number, offset, and read data length, and the disclosures by Xu, directed towards performing successive accesses according to address, data transfer size, and a multi-dimensional stride, to provide for a combination where the command may also comprise the multi-dimensional stride to provide support for performing successive memory operations]
Dhakal, Kwon, Yoshida, Park, and Xu are analogous to the claimed invention because they are in the same field of endeavor involving data storage.
It would have been obvious for one of ordinary skill in the art before the effective filing date of the claimed invention, having knowledge of Dhakal in view of Kwon in view of Yoshida in view of Park and Xu, to modify the disclosures by Dhakal in view of Kwon in view of Yoshida in view of Park to include disclosures by Xu since they both teach data storage and memory access, wherein Xu is directed towards improved memory access efficiency (col. 1, lines 41-57). Therefore, it would be applying a known technique (performing successive memory access using an address, read data size, and a multi-dimensional stride) to a known device (a command indicating respective blocks, offsets, and read data size length for reading/transferring data) ready for improvement to yield predictable results (a command indicating respective blocks, offsets, and read data size length for reading/transferring data, where the command may further comprise a multi-dimensional stride usable for performing successive memory accesses; doing so would provide for greater variety of options in targeting desired storage locations). MPEP 2143
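For illustration only (hypothetical names; this is not the Xu implementation), the multi-dimensional stride discussed above, in which successive access addresses are generated from a start address plus a row offset and a column offset, can be sketched as:

```python
# Illustrative sketch only; hypothetical names, not the Xu implementation.
# Successive access addresses are generated from a start address and a
# two-dimensional stride (a row offset and a column offset).
def strided_addresses(start, row_stride, col_stride, rows, cols):
    """Step by col_stride within a row and by row_stride between rows."""
    return [start + r * row_stride + c * col_stride
            for r in range(rows) for c in range(cols)]

addrs = strided_addresses(start=0x0, row_stride=0x100, col_stride=0x10,
                          rows=2, cols=3)
# addrs == [0x0, 0x10, 0x20, 0x100, 0x110, 0x120]
```

The two stride dimensions could correspond, for example, to the head and layer address intervals recited in claim 5, though that mapping is the examiner's combination rather than an express teaching of any single reference.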
As per claim 8, Dhakal in view of Kwon in view of Yoshida in view of Park in view of Xu teaches claim 5 as shown above and further teaches:
8. The transformer acceleration device of claim 5, wherein the command issue circuit is configured to: read the first and second plurality of memory access commands during a first time period; and read the third and fourth plurality of memory access commands during a second time period after the first time period. [Dhakal in view of Kwon in view of Yoshida in view of Park as shown above teaches the command having block numbers and offsets corresponding to respective partial blocks of the blocks, where the partial blocks may be accessed together as partial super blocks and accessed in order based on their partial super block grouping (see claim 4 above; Park: para. 50-54, 122)]
Dhakal, Kwon, Yoshida, and Park are analogous to the claimed invention because they are in the same field of endeavor involving data storage.
It would have been obvious for one of ordinary skill in the art before the effective filing date of the claimed invention, having knowledge of Dhakal in view of Kwon in view of Yoshida and Park, to modify the disclosures by Dhakal in view of Kwon in view of Yoshida in view of Park to include additional disclosures by Park since they both teach data storage and memory access, wherein Park is directed towards improved storage device and operation thereof (para. 2). Therefore, it would be applying a known technique (blocks comprising multiple partial blocks grouped based on their relative positions within the blocks and accessed together in a group) to a known device (a command indicating, for the respective blocks, the data in same relative positions by using the same offsets) ready for improvement to yield predictable results (a command indicating, for respective blocks, the data in same relative positions in the blocks by using the same offsets and read data lengths, wherein the command may include a plurality of offsets for reading a plurality of sets of data each having same relative position as each other in order to provide for improved throughput). MPEP 2143
Claim 19 is rejected under 35 U.S.C. 103 as being unpatentable over Dhakal et al. (US 20250321890 A1) in view of Kwon et al. (Kwon, Woosuk, et al. "Efficient memory management for large language model serving with pagedattention." Proceedings of the 29th symposium on operating systems principles. 2023. (Year: 2023)).
As per claim 19,
A transformer acceleration device comprising: a memory device including a plurality of memory blocks including a plurality of subblocks [Dhakal teaches an SSD storing key and value vectors (KV cache) associated with tokens generated through accelerators associated with a transformer architecture (para. 18-20, 24-25, 30, 35-38, 10-13; figs. 1-3 and associated paragraphs)] a memory striding circuit configured to sequentially access the plurality of subblocks in response to a first striding request provided from an external device; and [Dhakal teaches a network controller (NIC) (memory striding circuit and components therein) that may receive a KV-cache-transfer request from a compute node and access a storage location to obtain the KV cache from the SSD for transfer to a GPU (para. 29-30, 40), where the NIC can queue a data-transfer command to the SSD for the transfer and transfer, to the GPU, an initial portion of the KV cache corresponding to the first layers to allow start of inference operations (para. 45-46; fig. 4 and associated paragraphs)] a calculation circuit configured to perform a first attention calculation based on a first plurality of subblocks accessed by the memory striding circuit during a first time period, and perform a second attention calculation based on a second plurality of subblocks accessed by the memory striding circuit during a second time period after the first time period, [In addition to transferring the initial portion of KV cache to the GPU for performing inference as shown above (para. 29-30, 40, 45-46), Dhakal teaches, subsequent to the transfer of the initial portion, the NIC may perform further transfers of KV cache through prefetching (para. 33, 46-47; see fig. 2 showing #232 for initial transfer and #238 for the prefetch, both directed to LLM inference #234.)]
Dhakal does not explicitly disclose, but Kwon discloses:
a memory device including a plurality of memory blocks including a plurality of subblocks; the plurality of subblocks; a first plurality of subblocks; a second plurality of subblocks; the plurality of subblocks include the first plurality of subblocks and the second plurality of subblocks. [Dhakal as shown above teaches accessing the SSD for accessing the KV cache needed for the first layers to start inference operations and subsequently accessing additional KV cache as shown above (Dhakal: para. 29-30, 33, 45-47); Dhakal does not explicitly provide for the key and value vectors being stored in, and accessed from, a plurality of blocks, but Kwon discloses memory blocks used to sequentially store key and value vectors associated with generated tokens, where the key and value vectors for each token may correspond to a subblock, and Kwon shows key and value vectors corresponding to different steps may be stored to the blocks in a staggered fashion (section 4.1, para. 1; section 4.3, para. 1-3; figs. 5-6; see fig. 5 and section 4.3 providing blocks 1 and 7 initially storing vectors for initial tokens and tokens for a first step)]
Dhakal and Kwon are analogous to the claimed invention because they are in the same field of endeavor involving data storage.
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention, having knowledge of Dhakal and Kwon, to modify the disclosures of Dhakal to include the disclosures of Kwon since they both teach data storage and memory access, wherein Kwon is directed towards more flexible paged memory management in LLM serving (Kwon: section 4.1). Therefore, doing so would be applying a known technique (sequentially storing key and value vectors, including those of tokens of an initial prompt and first decoding step, across multiple memory blocks) to a known device (system for storing key and value vectors in an SSD which transfers portions of the key and value vectors from the SSD at different time periods) ready for improvement to yield predictable results (system for sequentially storing key and value vectors across multiple memory blocks in an SSD, and transferring, at different time periods, portions of the key and value vectors from the corresponding memory blocks; doing so would provide for more flexible storage of and access to key and value vectors). MPEP 2143.
Allowable Subject Matter
Claim 6 is objected to as being dependent upon a rejected base claim, but would be allowable if rewritten in independent form including all of the limitations of the base claim and any intervening claims.
With respect to claim 6, “… wherein the target address generation circuit is further configured to calculate the first and second subblock offsets based on the head address interval and the layer address interval.” in conjunction with the other limitations of the claim and the limitations of the base claim and the intervening claims, are not disclosed by the prior art of record.
The closest prior art of record are Dhakal et al. (US 20250321890 A1), Kwon et al. (Kwon, Woosuk, et al. "Efficient memory management for large language model serving with pagedattention." Proceedings of the 29th symposium on operating systems principles. 2023. (Year: 2023)), Yoshida et al. (US 20200133879 A1), Park et al. (US 20220300157 A1), Xu et al. (US 11550736 B1), Seo et al. (US 20110087821 A1), and Minato et al. (US 20210294529 A1).
Dhakal teaches performing fetch and prefetch for KV cache. Kwon teaches storing key and value vectors through multiple blocks. Yoshida teaches a host read command comprising read data size, block number, and offset. Park teaches accessing multiple partial blocks of blocks as partial super blocks. Xu teaches use of strides for successive memory accesses in a DRAM storing tensors. Seo teaches a stride register having stride values for successive row and column directions. Minato teaches a command comprising page and column addresses.
However, the prior art of record, neither individually nor in combination, teaches, in association with a striding command comprising a first memory block base address, a second memory block base address, a head address interval, a layer address interval, and a first reading size, a striding circuit receiving the command and calculating a first subblock offset and a second subblock offset based on the head address interval and the layer address interval, wherein the first subblock offset corresponds to the relative location of both a first and a second subblock respectively located within a first memory block having the first memory block base address and a second memory block having the second memory block base address, wherein the second subblock offset corresponds to the relative location of both a third and a fourth subblock also respectively located within the first memory block and the second memory block, wherein the striding circuit is configured to calculate the addresses of the respective subblocks based on the memory block base addresses for the first and second memory blocks as well as the first and second subblock offsets and issue a respective plurality of memory access commands for each of the respective subblocks.
Therefore, the prior art of record, neither individually nor in combination, discloses, in conjunction with the other limitations of the claim and the limitations of the base claim and the intervening claims, the claim as a whole.
Claim 7 is objected to as being dependent upon an objected claim, but would be allowable if claim 6 were rewritten in independent form including all of the limitations of the base claim and any intervening claims.
Claim 13 is objected to pursuant to a minor informality as shown above, but would be allowable if rewritten to overcome the claim objection.
With respect to claim 13, “…execute a plurality of decoder layers including multi head attention calculations respectively performed based on a plurality of heads, the transformer acceleration device comprising: a first memory block including a first subblock configured to store a first plurality of cache vectors generated for a first plurality of tokens based on a first head and a first decoder layer, the first head being one of the plurality of heads and the first decoder layer being one of the plurality of decoder layers; a second memory block including a second subblock configured to store a second plurality of cache vectors generated for a second plurality of tokens based on the first head and the first decoder layer; a memory striding circuit configured to read the first plurality of cache vectors and the second plurality of cache vectors based on sequentially accessing the first subblock and the second subblock in response to a first striding request provided from an outside; and a calculation circuit configured to perform a first attention calculation for the first head and the first decoder layer based on the first plurality of cache vectors and the second plurality of cache vectors.” in conjunction with the other limitations of the claim, are not disclosed by the prior art of record.
The closest prior art of record are Dhakal et al. (US 20250321890 A1), Kwon et al. (Kwon, Woosuk, et al. "Efficient memory management for large language model serving with pagedattention." Proceedings of the 29th symposium on operating systems principles. 2023. (Year: 2023)), Xu et al. (US 11550736 B1), Foo et al. (US 20250147905 A1), Yu et al. (US 11442775 B1), and Hirisave Chandra Shekhara et al. (US 20240176663 A1).
Dhakal teaches performing fetch and prefetch for KV cache. Kwon teaches storing key and value vectors through multiple blocks. Xu teaches use of strides for successive memory accesses in a DRAM storing tensors. Foo teaches a ring buffer configured to reconfigure storage location of its cached vectors as vectors are added. Yu teaches a plurality of memory banks arranged to store respective key and value vectors. Hirisave Chandra Shekhara teaches a tensor memory access unit for generating queries, keys, and values.
However, the prior art of record, neither individually nor in combination, teaches, in association with a transformer acceleration device executing a plurality of decoder layers based on a plurality of heads, a first subblock of a first memory block and a second subblock of a second memory block respectively storing a first plurality of cache vectors and a second plurality of cache vectors, wherein the first plurality of cache vectors are generated for a first plurality of tokens and the second plurality of cache vectors are generated for a second plurality of tokens, where the first and the second plurality of cache vectors are both generated based on a first head and a first decoder layer among the plurality of heads and the plurality of decoder layers, wherein a memory striding circuit is configured to read the first and second plurality of cache vectors responsive to a striding request, and a calculation circuit is configured to perform a first attention calculation for the first head and the first decoder layer based on the first and second plurality of cache vectors.
Therefore, the prior art of record, neither individually nor in combination, discloses, in conjunction with the other limitations of the claim, the claim as a whole.
Claims 14-18 are objected to as being dependent upon an objected claim, but would be allowable if claim 13 were rewritten to overcome the claim objection.
Claim 20 is objected to as being dependent upon a rejected base claim, but would be allowable if rewritten in independent form including all of the limitations of the base claim and any intervening claims.
With respect to claim 20, “… wherein: each of the plurality of memory blocks include one of the first plurality of subblocks, and each of the plurality of memory blocks include one of the second plurality of subblocks.” in conjunction with the other limitations of the claim and the limitations of the base claim, are not disclosed by the prior art of record.
The closest prior art of record are Dhakal et al. (US 20250321890 A1), Kwon et al. (Kwon, Woosuk, et al. "Efficient memory management for large language model serving with pagedattention." Proceedings of the 29th symposium on operating systems principles. 2023. (Year: 2023)), Xu et al. (US 11550736 B1), Foo et al. (US 20250147905 A1), Yu et al. (US 11442775 B1), and Hirisave Chandra Shekhara et al. (US 20240176663 A1).
Dhakal teaches performing fetch and prefetch for KV cache. Kwon teaches storing key and value vectors through multiple blocks. Xu teaches use of strides for successive memory accesses in a DRAM storing tensors. Foo teaches a ring buffer configured to reconfigure storage location of its cached vectors as vectors are added. Yu teaches a plurality of memory banks arranged to store respective key and value vectors. Hirisave Chandra Shekhara teaches a tensor memory access unit for generating queries, keys, and values.
However, the prior art of record, neither individually nor in combination, teaches, in association with a calculation circuit configured to perform a first attention calculation based on a first plurality of subblocks accessed by a memory striding circuit during a first time period and a second attention calculation based on a second plurality of subblocks accessed by the memory striding circuit during a second time period after the first time period, the memory striding circuit being configured to access a plurality of memory blocks comprising the first and second plurality of subblocks responsive to a striding request from an external device, wherein each of the plurality of memory blocks as accessed includes one of the first plurality of subblocks used during the first attention calculation and each of the plurality of memory blocks also includes one of the second plurality of subblocks used during the second attention calculation.
Therefore, the prior art of record, neither individually nor in combination, discloses, in conjunction with the other limitations of the claim and the limitations of the base claim, the claim as a whole.
Relevant Prior Art
The prior art made of record and not relied upon is considered pertinent to applicant’s disclosure.
Arunkumar et al. (US 12182028 B1) teaches use of cached key and value data from QKV projection layer in attention layers.
Conclusion
Any inquiry concerning this communication or earlier communications from the examiner should be directed to ELIAS KIM whose telephone number is (571)272-8093. The examiner can normally be reached Monday - Friday: 7:30-5:30.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, JARED RUTZ can be reached at 571-272-5535. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.
/E.Y.K./Examiner, Art Unit 2135 /JARED I RUTZ/Supervisory Patent Examiner, Art Unit 2135