Prosecution Insights
Last updated: May 29, 2026
Application No. 18/958,887

KEY-VALUE CACHE MANAGEMENT, MODEL REASONING, AND DATA PROCESSING METHODS AND APPARATUSES FOR LARGE LANGUAGE MODELS

Final Rejection §103
Filed: Nov 25, 2024
Priority: Jul 09, 2024 (CN 202410915392.8)
Examiner: RIGOL, YAIMA
Art Unit: 2135
Tech Center: 2100 (Computer Architecture & Software)
Assignee: Alipay (Hangzhou) Information Technology Co., Ltd.
OA Round: 2 (Final)
Grant Probability: 75% (Favorable)
OA Rounds: 3-4
Est. Remaining: 1y 9m
With Interview: 93%

Examiner Intelligence

Career Allowance Rate: 75%, above average (469 granted / 624 resolved; +20.2% vs TC avg)
Interview Lift: +17.5% (resolved cases with interview)
Avg Prosecution: 3y 3m (13 currently pending)
Total Applications: 640 (across all art units)

Statute-Specific Performance

§101: 1.6% (-38.4% vs TC avg)
§103: 85.8% (+45.8% vs TC avg)
§102: 2.4% (-37.6% vs TC avg)
§112: 3.3% (-36.7% vs TC avg)
Based on career data from 624 resolved cases

Office Action

§103
Notice of Pre-AIA or AIA Status

The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

DETAILED ACTION

As per the instant application having Application No. 18/958,887, the amendment filed on 3/23/2026 is herein acknowledged. Claims 1, 4, 6-7, 10, 12-13, 16 and 18 have been amended and claims 19-20 have been added. Claims 1-20 are pending.

In the response to this Office action, the Examiner respectfully requests that support be shown for language added to any original claims on amendment and any new claims. That is, indicate support for newly added claim language by specifically pointing to page(s) and line numbers in the specification and/or drawing figure(s). This will assist the Examiner in prosecuting this application.

Examiner cites particular columns and line numbers in the references as applied to the claims below for the convenience of the applicant. Although the specified citations are representative of the teachings in the art and are applied to the specific limitations within the individual claim, other passages and figures may apply as well. It is respectfully requested that, in preparing responses, the applicant fully consider the references in entirety as potentially teaching all or part of the claimed invention, as well as the context of the passage as taught by the prior art or disclosed by the examiner.

OBJECTIONS

Claim Objections

Claims 1, 7, 13 and 20 are objected to because of the following informalities: The limitations “… maximum sequence length multiplies a virtual memory size…” appear to refer to a/the maximum sequence length “multiplied by” a virtual memory size. Appropriate correction is required.

Dependent claims 2-6, 8-12 and 14-20 are objected to for encompassing the deficiencies found in the independent claims upon which they depend.
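
Read with the correction the Examiner suggests, the sizing limitation of claim 1 reduces to a product, and the "equally dividing" limitation fixes the total. A sketch of that reading follows; the symbols are our own shorthand and do not appear in the claims or the Office action.

```latex
\[
  \text{slot size} = L_{\max} \times s_{\text{tok}},
  \qquad
  \text{virtual address space} = N_{\text{slots}} \times \text{slot size}
\]
```

Here $L_{\max}$ stands for the maximum sequence length, $s_{\text{tok}}$ for the virtual memory size occupied by a single token, and $N_{\text{slots}}$ for the quantity of virtual address slots, which claim 1 sets equal to the maximum amount of batch request processing of the large language model.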
REJECTIONS BASED ON PRIOR ART

Claim Rejections - 35 USC § 103

In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.

The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:

A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102 of this title, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

Claims 1, 4-7, 10-13 and 16-20 are rejected under 35 U.S.C. 103 as being unpatentable over Kwon et al. “Efficient Memory Management for Large Language Model Serving with PagedAttention”, ACM ISBN 979-8-4007-0229-7/23/10. https://doi.org/10.1145/3600006.3613165, 23 October 2023, 16 pages in view of Kumar, Sachin (hereinafter “Kumar”) “vAttention: Dynamic KV-cache Memory Management for Serving LLMs without PagedAttention”, https://medium.com/@techsachin/vattention-dynamic-kv-cache-memory-management-for-serving-llms-without-pagedattention-d2b5610dd536, 8 May 2024, 26 pages and Phanishayee et al. (US 12,602,270).

1.
A method of key-value cache management, comprising: allocating a virtual memory block in a virtual address slot to store data for a newly-added key-value token for a model reasoning request, wherein the virtual address slot comprises one or more virtual memory blocks, … wherein the virtual address slot is formed by equally dividing a virtual address space of a key-value cache, and [Kwon teaches “PagedAttention partitions the KV cache of each sequence into blocks. Each block contains the key and value vectors for a fixed number of tokens, which we denote as KV block size (B)….” (Section 4.1, first paragraph, page 615) “The key idea behind vLLM’s memory manager is analogous to the virtual memory in operating systems. OS partitions memory into fixed-sized pages and maps user programs’ logical pages to physical pages… physical memory space needs not to be fully reserved in advance, enabling the OS to dynamically allocate physical pages as needed. vLLM uses the ideas behind virtual memory to manage the KV cache in an LLM service… we organize the KV cache as fixed-size KV blocks, like pages in virtual memory.” (Section 4.2, first paragraph, page 615) “A request’s KV cache is represented as a series of logical KV blocks…” (Section 4.2, second paragraph, page 615)] but Kwon does not expressly disclose a size of the virtual address slot is based on a maximum sequence length multiplies a virtual memory size occupied by a single token, the size of the virtual memory block is the virtual size occupied by the single token, a quantity of virtual address slots is equal to a maximum amount of batch request processing of a large language model; [Kwon teaches “vLLM does not require reserving the memory for the maximum possible generated sequence length initially. 
Instead, it reserves only the necessary KV blocks to accommodate the KV cache generated during prompt computation” (Section 4.3, first paragraph, page 616) “for each decoding iteration, vLLM first selects a set of candidate sequences for batching… and allocates the physical blocks for the newly required logical blocks…” (Section 4.3, second paragraph, page 616) (which may be interpreted as accommodating a maximum amount of batch requests) “vLLM dynamically assigns new physical blocks to logical blocks as more tokens and their KV cache are generated… a new physical block is only allocated when all previous blocks are full, vLLM limits all the memory wastes for a request within one block, so it can effectively utilize all the memory… This allows more requests to fit into memory for batching – hence improving the throughput. Once a request finishes its generation, its KV blocks can be freed to store the KV cache of other requests”; see Table 1 on page 619 depicting the memory organization for different GPUs. 
Note that the maximum number of requests in a virtual KV cache configuration may be dynamic as it may depend on system configuration and specific workload] but Kwon does not expressly refer the quantity of virtual slots is equal to the maximum amount of batch request processing of a large language model in response to determining that a scheduling result of the model reasoning request indicates the model reasoning request is scheduled for execution, maintaining a mapping relationship between an occupied virtual address slot and a physical graphics memory block allocated to the model reasoning request, wherein the physical graphics memory block is formed by equally dividing a maximum available physical graphics memory capacity of model reasoning; and [Kwon teaches “vLLM adopts a centralized scheduler to coordinate the execution of distributed GPU workers.” (Section 4, first paragraph, page 615) “we organize the KV cache as fixed-size KV blocks, like pages in virtual memory” (Section 4.3, first paragraph, page 615) “A request’s KV cache is represented as a series of logical KV blocks… On GPU workers, a block engine allocates a contiguous chunk of GPU DRAM and divides it into physical KV blocks… The KV block manager also maintains block tables (corresponding to a mapping relationship) -the mapping between logical and physical KV blocks of each request. Each block table entry records the corresponding physical blocks of a logical block and the number of filled positions. Separating logical and physical KV blocks allows vLLM to dynamically grow the KV cache memory without reserving it for all positions in advance…” (Section 4.2, second paragraph, page 615) (see Table 1, page 619 depicting model sizes and server configuration including the number of cache blocks) where a fixed block size is discussed in section 7.2, page 622 where “Again, vLLM dynamically assigns new physical blocks to logical blocks as more tokens and their KV cache are generated. 
As all the blocks are filled from left to right and a new physical block is only allocated when all previous blocks are full, vLLM limits all the memory wastes for a request within one block, so it can effectively utilize all the memory” (section 4.3, third paragraph, page 616)] copying the data for the newly-added key-value token to the physical graphics memory block, [Kwon teaches “During LLM’s computation, vLLM uses the PagedAttention kernel to access the previous KV cache stored in the form of logical KV blocks and saves the newly generated KV cache into the physical KV blocks. Storing multiple tokens within a KV block (block size > 1) enables the PagedAttention kernel to process the KV cache across more positions in parallel, thus increasing the hardware utilization and reducing latency” (Section 4.3, second paragraph, page 616)] wherein the scheduling result is determined based on the allocated virtual memory block and a capacity of an available physical graphics memory block, each model reasoning request occupies a virtual address slot, and slot indication information of the occupied virtual address slot is recorded in a valid virtual address slot table [Kwon teaches “In the first autoregressive decoding step, vLLM generates the new token with the PagedAttention algorithm on physical blocks 7 and 1. Since one slot remains available in the last logical block, the newly generated KV cache is stored there, and the block table’s #filled record is updated. ○3 At the second decoding step, as the last logical block is full, vLLM stores the newly generated KV cache in a new logical block; vLLM allocates a new physical block (physical block 3) for it and stores this mapping in the block table. 
Globally, for each decoding iteration, vLLM first selects a set of candidate sequences for batching (more in §4.5), and allocates the physical blocks for the newly required logical blocks… Again, vLLM dynamically assigns new physical blocks to logical blocks as more tokens and their KV cache are generated. As all the blocks are filled from left to right and a new physical block is only allocated when all previous blocks are full” (Section 4.3, first-third paragraphs, page 616)]. Regarding the limitations a quantity of virtual address slots is equal to a maximum amount of batch request processing of a large language model; Kumar teaches [advantages of using vLLM KV virtual cache configuration include “a) Preserving virtual memory - virtual memory space allocated is large enough to hold the KV-cache of the maximum batch size (configurable) that needs to be supported.” (vAttention: System Design – i) Design Overview – a) Pre-reserving virtual memory”, page 7)], where the amount of virtual memory space would correspond to the logical/virtual blocks taught by Kwon a size of the virtual address slot is based on a maximum sequence length multiplies a virtual memory size occupied by a single token, the size of the virtual memory block is the virtual size occupied by the single token, Kumar teaches [“vAttention: System Design - i) Design Overview - vAttention builds on the ability to allocate virtual memory and physical memory separately by allocating a large contiguous buffer for the KV-cache in virtual memory ahead-of-time (similar to reservation-based allocators) while deferring the allocation of physical memory to runtime.” (page 7) where “a) Pre-reserving virtual memory - virtual memory space allocated is large enough to hold the KV-cache of the maximum batch size (configurable) that needs to be supported. 
-Number of virtual memory buffers… -Size of a virtual memory buffer: The maximum size of a buffer is 𝐵𝑆 = 𝐵 × 𝐿 × 𝑆 where B is the maximum batch size, L is the maximum context length supported by the model and 𝑆 is the size of a single token’s per-layer K-cache” (pages 7-8)], where one of ordinary skill in the art would recognize that 𝐵 × 𝐿 or the maximum batch size x the maximum context length represents the maximum sequence length as claimed; however, Kumar does not expressly refer to the maximum sequence length being defined as B x L.

Kwon and Kumar are analogous art because they are from the same field of endeavor of memory access and control. Before the effective filing date of the claimed inventions, it would have been obvious to a person of ordinary skill in the art to modify Kwon to expressly have the quantity of virtual address slots equal to a maximum amount of batch request processing of a large language model as taught by Kumar since doing so would provide the benefits of avoiding memory contention and providing higher throughput (Kwon, Abstract, page 611) and facilitating efficient dynamic memory allocation (Kumar, pages 3, 7-8).

The combination of Kwon and Kumar does not expressly disclose the maximum sequence length defined as B × 𝐿 or the maximum batch size x the maximum context length; however, regarding these limitations, Phanishayee teaches [“FIGS. 6A and 6B show an example cache layout for a compute cache 800. In various implementations, the compute cache 800 corresponds to a KV cache of a generative transformer model. In reality this KV cache is a 5D tensor. The compute cache 800 in this example is a token cache for a token pipeline. The compute cache includes a predetermined number of columns which is equal to or greater than a maximum sequence length for the model. 
The maximum sequence length corresponds to a sum of the maximum prompt length (i.e., maximum tokens per prompt) and the maximum number of tokens that can be generated per prompt by the model. In this example, each column is used to store the computed values for one token. The inference request includes a prompt 802 having four words, or tokens, so the first four columns in the cache 800 are used to hold the computed values from the prompt cache generated by the prompt pipeline. In this example, the prompt 802 is processed by a generative transformer model having two layers: Layer 0 and Layer 1. The output of Layer 0 for the first token in the prompt therefore occupies the first three rows in the first column, and the output of Layer 1 for the first token occupies the next three rows in the first column. The output of Layer 0 and Layer 1 for the rest of the tokens in the prompt occupy corresponding positions in the next three columns. Once the prompt pipeline has completed processing the prompt 802, the output of the prompt pipeline for the prompt is stored in the compute cache 800 as shown in FIG. 6A.” (col. 14, lines 19-46)]. Kwon and Kumar are analogous art because they are from the same field of endeavor of memory access and control. Before the effective filing date of the claimed inventions, it would have been obvious to a person of ordinary skill in the art to modify the combination of Kwon and Kumar to have the maximum sequence length defined as B × 𝐿 or the maximum batch size x the maximum context length as taught by Phanishayee since doing so is well known in the art and would facilitate computations and accommodating a maximum sequence in the KV cache. Therefore, it would have been obvious to combine Kwon and Kumar and Phanishayee for the benefit of creating a storage system/method to obtain the invention as specified in claim 1. 4. 
The method according to claim 1, wherein maintaining the mapping relationship comprises: for a model reasoning request of uncompleted model reasoning processing: determining whether a remaining graphics memory capacity of a currently mapped physical graphics memory block is sufficient to store the data for the newly-added key-value token; in response to determining that the remaining graphics memory capacity of the currently mapped physical graphics memory is sufficient to store the data for the newly-added key-value token, keeping the mapping relationship of the physical graphics memory block unchanged, and updating a use capacity of the physical graphics memory block; or in response to determining that the remaining graphics memory capacity of the currently mapped physical graphics memory block is insufficient to store the data for the newly-added key-value token, mapping a sufficient quantity of physical graphics memory blocks in idle physical graphics memory blocks to the occupied virtual address slot, and updating a mapping quantity and a use capacity of physical graphics memory blocks [Kwon teaches “The KV block manager also maintains block tables - the mapping between logical and physical KV blocks of each request. Each block table entry records the corresponding physical blocks of a logical block and the number of filled positions.” (Section 4.2, second paragraph, page 616) “The prompt has 7 tokens, so vLLM maps the first 2 logical KV blocks (0 and 1) to 2 physical KV blocks (7 and 1, respectively). In the prefill step, vLLM generates the KV cache of the prompts and the first output token with a conventional self-attention algorithm (e.g., [13]). vLLM then stores the KV cache of the first 4 tokens in logical block 0 and the following 3 tokens in logical block 1. The remaining slot is reserved for the subsequent autoregressive generation phase. 
○2 In the first autoregressive decoding step, vLLM generates the new token with the PagedAttention algorithm on physical blocks 7 and 1. Since one slot remains available in the last logical block, the newly generated KV cache is stored there, and the block table’s #filled record is updated. ○3 At the second decoding step, as the last logical block is full, vLLM stores the newly generated KV cache in a new logical block; vLLM allocates a new physical block (physical block 3) for it and stores this mapping in the block table” (Section 4.3, first paragraph, page 616) “Swapping. This is the classic technique used by most virtual memory implementations which copy the evicted pages to a swap space on the disk. In our case, we copy evicted blocks to the CPU memory. As shown in Fig. 4, besides the GPU block allocator, vLLM includes a CPU block allocator to manage the physical blocks swapped to CPU RAM. When vLLM exhausts free physical blocks for new tokens, it selects a set of sequences to evict and transfer their KV cache to the CPU. Once it preempts a sequence and evicts its blocks, vLLM stops accepting new requests until all preempted sequences are completed. Once a request completes, its blocks are freed from memory, and the blocks of a preempted sequence are brought back in to continue the processing of that sequence. Note that with this design, the number of blocks swapped to the CPU RAM never exceeds the number of total physical blocks in the GPU RAM, so the swap space on the CPU RAM is bounded by the GPU memory allocated for the KV cache.” (Section 4.5 – Swapping, page 618) where table 1 depicts the capacity and number of slots for different model sizes and server configurations, page 619]. 5. 
The method according to claim 1, wherein maintaining the mapping relationship comprises: for a new model reasoning request, mapping a sufficient quantity of physical graphics memory blocks in idle physical graphics memory blocks to the occupied virtual address slot, and updating a mapping quantity and a use capacity of the physical graphics memory block [Kwon teaches “The KV block manager also maintains block tables—the mapping between logical and physical KV blocks of each request. Each block table entry records the corresponding physical blocks of a logical block and the number of filled positions.” (Section 4.2, second paragraph, page 616) “The prompt has 7 tokens, so vLLM maps the first 2 logical KV blocks (0 and 1) to 2 physical KV blocks (7 and 1, respectively). In the prefill step, vLLM generates the KV cache of the prompts and the first output token with a conventional self-attention algorithm (e.g., [13]). vLLM then stores the KV cache of the first 4 tokens in logical block 0 and the following 3 tokens in logical block 1. The remaining slot is reserved for the subsequent autoregressive generation phase. ○2 In the first autoregressive decoding step, vLLM generates the new token with the PagedAttention algorithm on physical blocks 7 and 1. Since one slot remains available in the last logical block, the newly generated KV cache is stored there, and the block table’s #filled record is updated. ○3 At the second decoding step, as the last logical block is full, vLLM stores the newly generated KV cache in a new logical block; vLLM allocates a new physical block (physical block 3) for it and stores this mapping in the block table” (Section 4.3, first paragraph, page 616) “Swapping. This is the classic technique used by most virtual memory implementations which copy the evicted pages to a swap space on the disk. In our case, we copy evicted blocks to the CPU memory. As shown in Fig. 
4, besides the GPU block allocator, vLLM includes a CPU block allocator to manage the physical blocks swapped to CPU RAM. When vLLM exhausts free physical blocks for new tokens, it selects a set of sequences to evict and transfer their KV cache to the CPU. Once it preempts a sequence and evicts its blocks, vLLM stops accepting new requests until all preempted sequences are completed. Once a request completes, its blocks are freed from memory, and the blocks of a preempted sequence are brought back in to continue the processing of that sequence. Note that with this design, the number of blocks swapped to the CPU RAM never exceeds the number of total physical blocks in the GPU RAM, so the swap space on the CPU RAM is bounded by the GPU memory allocated for the KV cache.” (Section 4.5 – Swapping, page 618) where table 1 depicts the capacity and number of slots for different model sizes and server configurations, page 619]. 6. The method according to claim 1, wherein the virtual address space is determined based on a maximum quantity of batch processing requests of the large language model, the maximum sequence length, a quantity of hidden layers of the large language model, and a data type size of stored data [Kwon teaches “vLLM does not require reserving the memory for the maximum possible generated sequence length initially. Instead, it reserves only the necessary KV blocks to accommodate the KV cache generated during prompt computation.” (Section 4.3, first paragraph, page 616) “vLLM is an end-to-end serving system with a FastAPI [15] frontend and a GPU-based inference engine. The frontend extends the OpenAI API [34] interface, allowing users to customize sampling parameters for each request, such as the maximum sequence length and the beam width 𝑘.” (Section 5, first paragraph, page 619) “In each step, the scheduler first prepares the message with input token IDs for each request in the batch, as well as the block table for each request. 
Next, the scheduler broadcasts this control message to the GPU workers. Then, the GPU workers start to execute the model with the input token IDs. In the attention layers, the GPU workers read the KV cache according to the block table in the control message. During execution, the GPU workers synchronize the intermediate results with the all-reduce communication primitive without the coordination of the scheduler, as in [47]. In the end, the GPU workers send the sampled tokens of this iteration back to the scheduler. In summary, GPU workers do not need to synchronize on memory management as they only need to receive all the memory management information at the beginning of each decoding iteration along with the step inputs.” (Section 4.6, third paragraph, page 619), thus taking in account the sequence length to allocate virtual memory. “Specifically, we set a maximum batch size 𝐵 as large as possible for each experiment, according to the GPU memory capacity. The scheduler takes up to 𝐵 number of earliest arrived requests and sends the batch to FasterTransformer for processing” (Section 6.1, third paragraph, page 620). Kumar teaches “- virtual memory space allocated is large enough to hold the KV-cache of the maximum batch size (corresponding to the maximum quantity of batch requests) (configurable) that needs to be supported… - Number of virtual memory buffers: Each layer in an LLM maintains its own K and V tensors: called as K-cache and V-cache respectively. For a single GPU job, this requires pre-reserving 2 × 𝑁 buffers where 𝑁 is the number of layers in the model (corresponding to the number of layers). In a multi-GPU job, each worker reserves 2 × 𝑁′ buffers where 𝑁′ is the number of layers managed by that worker. 
- Size of a virtual memory buffer: The maximum size of a buffer is 𝐵𝑆 = 𝐵 × 𝐿 × 𝑆 where B is the maximum batch size, L is the maximum context length (corresponding to the sequence length) supported by the model and 𝑆 is the size of a single token (corresponding to the data type size)’s perlayer K-ca” (vAttention – a) Pre-reserving virtual memory, page 7)]. 7. An apparatus comprising: at least one processor; and one or more memories coupled to the at least one processor and storing programming instructions for execution by the at least one processor to perform operations comprising: allocating a virtual memory block in a virtual address slot to store data for a newly-added key-value token for a model reasoning request, wherein the virtual address slot comprises one or more virtual memory blocks, a size of the virtual address slot is based on a maximum length multiplies a virtual memory size occupied by a single token, the size of the virtual memory block is the virtual memory size occupied by the single token, wherein the virtual address slot is formed by equally dividing a virtual address space of a key-value cache, and a quantity of virtual address slots is equal to a maximum amount of batch request processing of a large language model; in response to determining that a scheduling result of the model reasoning request indicates the model reasoning request is scheduled for execution, maintaining a mapping relationship between an occupied virtual address slot and a physical graphics memory block allocated to the model reasoning request, wherein the physical graphics memory block is formed by equally dividing a maximum available physical graphics memory capacity of model reasoning; and copying the data for the newly-added key-value token to the physical graphics memory block, wherein the scheduling result is determined based on the allocated virtual memory block and a capacity of an available physical graphics memory block, each model reasoning request occupies a 
virtual address slot, and slot indication information of the occupied virtual address slot is recorded in a valid virtual address slot table [The rationale in the rejection of claim 1 is herein incorporated]. 10. The apparatus according to claim 7, wherein maintaining the mapping relationship comprises: for a model reasoning request of uncompleted model reasoning processing: determining whether a remaining graphics memory capacity of a currently mapped physical graphics memory block is sufficient to store the data for the newly-added key-value token; in response to determining that the remaining graphics memory capacity of the currently mapped physical graphics memory is sufficient to store data for the newly-added key-value token, keeping the mapping relationship of the physical graphics memory block unchanged, and updating a use capacity of the physical graphics memory block; or in response to determining that the a remaining graphics memory capacity of the currently mapped physical graphics memory block is insufficient to store the data for the newly-added key-value token, mapping a sufficient quantity of physical graphics memory blocks in idle physical graphics memory blocks to the occupied virtual address slot, and updating a mapping quantity and a use capacity of physical graphics memory blocks [The rationale in the rejection of claim 4 is herein incorporated]. 11. The apparatus according to claim 7, wherein maintaining the mapping relationship comprises: for a new model reasoning request, mapping a sufficient quantity of physical graphics memory blocks in idle physical graphics memory blocks to the occupied virtual address slot, and updating a mapping quantity and a use capacity of the physical graphics memory block [The rationale in the rejection of claim 5 is herein incorporated]. 12. 
The apparatus according to claim 7, wherein the virtual address space is determined based on a maximum quantity of batch processing requests of the large language model, the maximum sequence length, a quantity of hidden layers of the large language model, and a data type size of stored data [The rationale in the rejection of claim 6 is herein incorporated].

13. A non-transitory, computer-readable medium storing one or more instructions executable by at least one processor to perform operations comprising: allocating a virtual memory block in a virtual address slot to store data for a newly-added key-value token for a model reasoning request, wherein the virtual address slot comprises one or more virtual memory blocks, a size of the virtual address slot is based on a maximum sequence length multiplies a virtual memory size occupied by a single token, the size of the virtual memory block is the virtual memory size occupied by a single token, wherein the virtual address slot is formed by equally dividing a virtual address space of a key-value cache, and a quantity of virtual address slots is equal to a maximum amount of batch request processing of a large language model; in response to determining that a scheduling result of the model reasoning request indicates the model reasoning request is scheduled for execution, maintaining a mapping relationship between an occupied virtual address slot and a physical graphics memory block allocated to the model reasoning request, wherein the physical graphics memory block is formed by equally dividing a maximum available physical graphics memory capacity of model reasoning; and copying the data for the newly-added key-value token data to the physical graphics memory block, wherein the scheduling result is determined based on the allocated virtual memory block and a capacity of an available physical graphics memory block, each model reasoning request occupies a virtual address slot, and slot indication information of the occupied 
virtual address slot is recorded in a valid virtual address slot table [The rationale in the rejection of claim 1 is herein incorporated]. 16. The non-transitory, computer-readable medium according to claim 13, wherein maintaining the mapping relationship comprises: for a model reasoning request of uncompleted model reasoning processing: determining whether a remaining graphics memory capacity of a currently mapped physical graphics memory block is sufficient to store the data for the newly-added key-value token; in response to determining that the remaining graphics memory capacity of the currently mapped physical graphics memory block is sufficient to store the data for the newly-added key-value token, keeping the mapping relationship of the physical graphics memory block unchanged, and updating a use capacity of the physical graphics memory block; or in response to determining that the a remaining graphics memory capacity of the currently mapped physical graphics memory block is insufficient to store the data for the newly-added key-value token, mapping a sufficient quantity of physical graphics memory blocks in idle physical graphics memory blocks to the occupied virtual address slot, and updating a mapping quantity and a use capacity of physical graphics memory blocks [The rationale in the rejection of claim 4 is herein incorporated]. 17. The non-transitory, computer-readable medium according to claim 13, wherein maintaining the mapping relationship comprises: for a new model reasoning request, mapping a sufficient quantity of physical graphics memory blocks in idle physical graphics memory blocks to the occupied virtual address slot, and updating a mapping quantity and a use capacity of the physical graphics memory block [The rationale in the rejection of claim 5 is herein incorporated]. 18. 
The non-transitory, computer-readable medium according to claim 13, wherein the virtual address space is determined based on a maximum quantity of batch processing requests of the large language model, the maximum sequence length, a quantity of hidden layers of the large language model, and a data type size of stored data [The rationale in the rejection of claim 6 is herein incorporated]. 19. (New) The method of claim 1, wherein the physical graphics memory block is represented using a physical handle, and a plurality of physical handles for a plurality of physical graphics memory blocks are stored in a physical graphics memory pool for use in a batch model reasoning process [Kwon teaches “a block engine allocates a contiguous chunk of GPU DRAM and divides it into physical KV blocks… The KV block manager also maintains block tables-the mapping between logical and physical KV blocks of each request. Each block table entry records the corresponding physical blocks of a logical block and the number of filled positions” (Section 4.2 KV Cache Manager, pages 5-6; see Fig. 6 and related text). Kumar teaches virtual to physical mapping, see fig. showing physical memory including identifiers or handles of physical memory (b) On-demand physical memory allocation section, page 8)]. 20. 
(New) The apparatus of claim 7, wherein the size of the virtual address slot is equal to the maximum sequence length multiplies the virtual memory size occupied by a single token [Kumar teaches “vAttention: System Design - i) Design Overview - vAttention builds on the ability to allocate virtual memory and physical memory separately by allocating a large contiguous buffer for the KV-cache in virtual memory ahead-of-time (similar to reservation-based allocators) while deferring the allocation of physical memory to runtime.” (page 7) where “a) Pre-reserving virtual memory - virtual memory space allocated is large enough to hold the KV-cache of the maximum batch size (configurable) that needs to be supported. -Number of virtual memory buffers… -Size of a virtual memory buffer: The maximum size of a buffer is 𝐵𝑆 = 𝐵 × 𝐿 × 𝑆 where B is the maximum batch size, L is the maximum context length supported by the model and 𝑆 is the size of a single token’s per-layer K-cache” (pages 7-8), where one of ordinary skill in the art would recognize that 𝐵 × 𝐿 or the maximum batch size x the maximum context length represents the maximum sequence length as claimed; however, Kumar does not expressly refer to the maximum sequence length being defined as B x L; however, regarding these limitations, Phanishayee teaches “ … The maximum sequence length corresponds to a sum of the maximum prompt length (i.e., maximum tokens per prompt) and the maximum number of tokens that can be generated per prompt by the model. ” (col. 14, lines 19-46)]. Before the effective filing date of the claimed inventions, it would have been obvious to a person of ordinary skill in the art to modify the combination of Kwon and Kumar to have the maximum sequence length defined as B × 𝐿 or the maximum batch size x the maximum context length as taught by Phanishayee since doing so is well known in the art and would facilitate computations and accommodating a maximum sequence in the KV cache. 
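The sizing relationships quoted above can be illustrated with a short worked sketch. It is only an illustration under stated assumptions: the class and function names (`KVCacheConfig`, `max_sequence_length`, `slot_size`, `address_space_size`, `slot_base_offsets`) and all numeric values are hypothetical and do not appear in the application, Kwon, Kumar, or Phanishayee. The sketch follows the quoted definitions: a maximum sequence length taken as the sum of the maximum prompt length and the maximum number of generated tokens, a per-request slot of L × S bytes, and a reserved buffer of B × L × S bytes divided equally into B slots.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class KVCacheConfig:
    """Hypothetical sizing parameters; the values used below are illustrative only."""
    max_batch_size: int    # B: maximum number of requests processed in one batch
    max_seq_len: int       # L: maximum tokens per request
    bytes_per_token: int   # S: virtual memory occupied by a single token


def max_sequence_length(max_prompt_tokens: int, max_generated_tokens: int) -> int:
    """Maximum sequence length as the sum of prompt and generated token limits."""
    return max_prompt_tokens + max_generated_tokens


def slot_size(cfg: KVCacheConfig) -> int:
    """Size of one virtual address slot: L x S bytes."""
    return cfg.max_seq_len * cfg.bytes_per_token


def address_space_size(cfg: KVCacheConfig) -> int:
    """Size of the whole reserved virtual buffer: B x L x S bytes."""
    return cfg.max_batch_size * slot_size(cfg)


def slot_base_offsets(cfg: KVCacheConfig) -> list[int]:
    """Byte offsets of the B equally sized slots inside the reserved buffer."""
    return [i * slot_size(cfg) for i in range(cfg.max_batch_size)]


if __name__ == "__main__":
    # Assumed example values: 1536 prompt tokens + 512 generated tokens, 256 bytes per token.
    seq_len = max_sequence_length(1536, 512)
    cfg = KVCacheConfig(max_batch_size=4, max_seq_len=seq_len, bytes_per_token=256)
    print(slot_size(cfg))           # 524288
    print(address_space_size(cfg))  # 2097152
    print(slot_base_offsets(cfg))   # [0, 524288, 1048576, 1572864]
```

In this sketch the batch size B determines only how many slots exist, while each slot is sized by L × S alone; whether the cited references disclose that same split is the question addressed in the rejection and is not settled by the example.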
Claims 2-3, 8-9 and 14-15 are rejected under 35 U.S.C. 103 as being unpatentable over Kwon et al. “Efficient Memory Management for Large Language Model Serving with PagedAttention”, ACM ISBN 979-8-4007-0229-7/23/10. https://doi.org/10.1145/3600006.3613165, 23 October 2023, 16 pages in view of Kumar, Sachin (hereinafter “Kumar”) “vAttention: Dynamic KV-cache Memory Management for Serving LLMs without PagedAttention”, https://medium.com/@techsachin/vattention-dynamic-kv-cache-memory-management-for-serving-llms-without-pagedattention-d2b5610dd536 , 8 May 2024, 26 pages and Phanishayee et al. (US 12,602,270) as applied in the rejection of claims 1, 7 and 13 above, and further in view of Durrant (US 7,234,038). 2. The combination of Kwon, Kumar and Phanishayee teaches The method according to claim 1, wherein before allocating the virtual memory block and after a previous model reasoning process, the method further comprises: for a model reasoning request of completed model reasoning processing, releasing an occupied virtual address slot, terminating the mapping relationship between the physical graphics memory block and the occupied virtual address slot, and deleting the slot indication information of the occupied virtual address slot from the valid virtual address slot table [Kwon teaches “Once a request finishes its generation, its KV blocks can be freed to store the KV cache of other requests. In Fig. 7, we show an example of vLLM managing the memory for two sequences. The logical blocks of the two sequences are mapped to different physical blocks within the space reserved by the block engine in GPU workers. The neighboring logical blocks of both sequences do not need to be contiguous in physical GPU memory and the space of physical blocks can be effectively utilized by both sequences.” (Section 4.3, third paragraph, page 616) “Once it preempts a sequence and evicts its blocks, vLLM stops accepting new requests until all preempted sequences are completed. 
Once a request completes, its blocks are freed from memory, and the blocks of a preempted sequence are brought back in to continue the processing of that sequence” (section 4.5 – Swapping, page 618) where “The KV block manager also maintains block tables—the mapping between logical and physical KV blocks of each request. Each block table entry records the corresponding physical blocks of a logical block and the number of filled positions. Separating logical and physical KV blocks allows vLLM to dynamically grow the KV cache memory without reserving it for all positions in advance, which eliminates most memory waste in existing systems” (Section 4.2, second paragraph, page 616). Kumar teaches “The framework notifies vAttention of a request’s completion with free_reqid. so that it can unmap the pages of a completed request or defer them to be freed later.” (iii) Serving LLMs with vAttention – d) request completion, page 12); as blocks are freed, the virtual to physical block entries would be updated, note the block table records entries of filled positions and unmapping freed blocks as taught by Kumar includes deleting map entries]; but the combination of Kwon, Kumar and Phanishayee does not expressly discuss deleting the slot of unmapped or freed virtual address entries; however, regarding these limitations, Durrant teaches [“FIG. 6 shows the flow of demapping a virtual memory page in accordance with one or more embodiments of the invention. In one embodiment, the OS (104) maps a new virtual memory page into memory in response to a page fault caused by a virtual address access by Processor A (100). The OS (104) frees up a physical memory page for the new virtual memory page because all physical memory pages are in use. The OS (104), using a known algorithm for determining which virtual memory page to remove from physical memory, selects the physical memory page represented in the mapping of TTE 1(128). 
The OS (104) then demaps and swaps out the virtual memory page currently in the selected physical memory page and updates the associated page structure (126). The OS (104) also removes TTE 1 (128) from the TSB (130) and from TLB A (112) and, in one embodiment, any entries in cache A (116) that are tagged with the page mapping cookie value in TTE 1 (128).” (col. 6, lines 42-58)]. Kwon, Kumar, Phanishayee and Durrant are analogous art because they are from the same field of endeavor of memory access and control. Before the effective filing date of the claimed inventions, it would have been obvious to a person of ordinary skill in the art to modify the combination of Kwon, Kumar and Phanishayee to expressly delete the slot of unmapped or freed virtual address entries as taught by Durrant since doing so would facilitate virtual cache management and reuse of freed entries. Therefore, it would have been obvious to combine Kwon, Kumar and Phanishayee with Durrant for the benefit of creating a storage system/method to obtain the invention as specified in claim 2. 3. The method according to claim 2, wherein the model reasoning request comprises a new model reasoning request and a model reasoning request of uncompleted model reasoning processing after the previous model reasoning process, and scheduling processing of the model reasoning request of uncompleted model reasoning processing is completed before scheduling processing of the new model reasoning request [Kwon teaches “Swapping. This is the classic technique used by most virtual memory implementations which copy the evicted pages to a swap space on the disk. In our case, we copy evicted blocks to the CPU memory. As shown in Fig. 4, besides the GPU block allocator, vLLM includes a CPU block allocator to manage the physical blocks swapped to CPU RAM. When vLLM exhausts free physical blocks for new tokens, it selects a set of sequences to evict and transfer their KV cache to the CPU. 
Once it preempts a sequence and evicts its blocks, vLLM stops accepting new requests until all preempted sequences are completed. Once a request completes, its blocks are freed from memory, and the blocks of a preempted sequence are brought back in to continue the processing of that sequence.” (Section 4.5 – Swapping, page 618)]. 8. The apparatus according to claim 7, wherein before allocating the virtual memory block and after a previous model reasoning process, the operations further comprise: for a model reasoning request of completed model reasoning processing, releasing an occupied virtual address slot, terminating the mapping relationship between the physical graphics memory block and the occupied virtual address slot, and deleting the slot indication information of the occupied virtual address slot from the valid virtual address slot table [The rationale in the rejection of claim 2 is herein incorporated]. 9. The apparatus according to claim 8, wherein the model reasoning request comprises a new model reasoning request and a model reasoning request of uncompleted model reasoning processing after the previous model reasoning process, and scheduling processing of the model reasoning request of uncompleted model reasoning processing is completed before scheduling processing of the new model reasoning request [The rationale in the rejection of claim 3 is herein incorporated]. 14. 
The non-transitory, computer-readable medium according to claim 13, wherein before allocating the virtual memory block and after a previous model reasoning process, the operations further comprise: for a model reasoning request of completed model reasoning processing, releasing an occupied virtual address slot, terminating the mapping relationship between the physical graphics memory block and the occupied virtual address slot, and deleting the slot indication information of the occupied virtual address slot from the valid virtual address slot table [The rationale in the rejection of claim 2 is herein incorporated]. 15. The non-transitory, computer-readable medium according to claim 14, wherein the model reasoning request comprises a new model reasoning request and a model reasoning request of uncompleted model reasoning processing after the previous model reasoning process, and scheduling processing of the model reasoning request of uncompleted model reasoning processing is completed before scheduling processing of the new model reasoning request [The rationale in the rejection of claim 3 is herein incorporated]. ACKNOWLEDGEMENT OF ISSUES RAISED BY APPLICANT Response to Amendment Applicant's arguments filed on 3/23/2026 with respect to the 35 USC rejections in the non-final office action mailed on 12/29/2025 have been fully considered but are moot in view of the new ground(s) of rejection; however, some of Applicant’s arguments are not deemed persuasive. As required by M.P.E.P. § 707.07(f), a response to these arguments appears below. ARGUMENTS CONCERNING PRIOR ART REJECTIONS Claims must be given the broadest reasonable interpretation during examination and limitations appearing in the specification but not recited in the claim are not read into the claim (See M.P.E.P. 2111 [R-1]). 
Applicant argues “because Kwon’s “vLLM does not require reserving the memory for the maximum possible generated sequence length,” and “reserves only the necessary KV blocks to accommodate the KV cache generated during prompt computations,” the cited portion of Kwon does not teach, if not teaches away that “a size of the virtual address slot is based on a maximum sequence length multiplies a virtual memory size occupied by a single token… Kumar have not been shown to disclose or suggest that which Kwon is lacking”. In response, these arguments have been fully considered but are not deemed persuasive. In response to applicant's arguments against the references individually, one cannot show nonobviousness by attacking references individually where the rejections are based on combinations of references. See In re Keller, 642 F.2d 413, 208 USPQ 871 (CCPA 1981); In re Merck & Co., 800 F.2d 1091, 231 USPQ 375 (Fed. Cir. 1986). In this case, the claims have been rejected over the combination of references and what this combination would have suggested to one of ordinary skill in the art, and not over Kwon individually. The Examiner would also like to point out that the reference to Kwon does not teach away from the possibility of combining Kwon with Kumar to obtain the claimed invention, as Kwon’s disclosure does not criticize, discredit, or otherwise discourage the solution claimed. See In re Fulton, 391 F.3d 1195, 1201, 73 USPQ2d 1141, 1146 (Fed. Cir. 2004). See also MPEP 2123. 
In this instance, the combination of Kwon, Kumar and newly added Phanishayee teaches the claimed limitations of “a size of the virtual address slot is based on a maximum sequence length multiplies a virtual memory size occupied by a single token” as Kumar teaches “vAttention: System Design - i) Design Overview - vAttention builds on the ability to allocate virtual memory and physical memory separately by allocating a large contiguous buffer for the KV-cache in virtual memory ahead-of-time (similar to reservation-based allocators) while deferring the allocation of physical memory to runtime.” (page 7) where “a) Pre-reserving virtual memory - virtual memory space allocated is large enough to hold the KV-cache of the maximum batch size (configurable) that needs to be supported. -Number of virtual memory buffers… -Size of a virtual memory buffer: The maximum size of a buffer is 𝐵𝑆 = 𝐵 × 𝐿 × 𝑆 where B is the maximum batch size, L is the maximum context length supported by the model and 𝑆 is the size of a single token’s per-layer K-cache” (pages 7-8), where one of ordinary skill in the art would recognize that 𝐵 × 𝐿 or the maximum batch size x the maximum context length represents the maximum sequence length as claimed; however, Kumar does not expressly refer to the maximum sequence length being defined as B x L; however, regarding these limitations, Phanishayee teaches “ … The maximum sequence length corresponds to a sum of the maximum prompt length (i.e., maximum tokens per prompt) and the maximum number of tokens that can be generated per prompt by the model. ” (col. 14, lines 19-46). 
Before the effective filing date of the claimed inventions, it would have been obvious to a person of ordinary skill in the art to modify the combination of Kwon and Kumar to have the maximum sequence length defined as B × 𝐿 or the maximum batch size x the maximum context length as taught by Phanishayee since doing so is well known in the art and would facilitate computations and accommodating a maximum sequence in the KV cache. Further, note that, contrary to Applicant’s arguments, the virtual address slot or buffer size as taught by the combination of Kwon and Kumar accommodates the maximum sequence length as Kumar explains [advantages of using vLLM KV virtual cache configuration include “a) Pre-reserving virtual memory - virtual memory space allocated is large enough to hold the KV-cache of the maximum batch size (configurable) that needs to be supported.” (vAttention: System Design – i) Design Overview – a) Pre-reserving virtual memory, page 7)], where the amount of virtual memory space would correspond to the logical/virtual blocks taught by Kwon. Modifying Kwon as taught by Kumar would thus allow the system/method to have the quantity of virtual address slots equal to a maximum amount of batch request processing of a large language model as taught by Kumar since doing so would provide the benefits of avoiding memory contention and providing higher throughput (Kwon, Abstract, page 611) and facilitating efficient dynamic memory allocation (Kumar, pages 3, 7-8). Regarding all other Claims not specifically traversed above and whose rejections were upheld, the Applicant contends that the listed claims are allowable by virtue of their dependence on other allowable claims. As this dependence is the sole rationale put forth for the allowability of said dependent claims, the Applicant is directed to the Examiner's remarks above. 
Additionally, any other arguments the Applicant made that were not specifically addressed in this Office Action appeared to directly rely on an argument presented elsewhere in the Applicant’s response that was traversed, rendered moot or found persuasive above. All arguments by the applicant are believed to be covered in the body of the office action; thus, this action constitutes a complete response to the issues raised in the remarks dated 3/23/2026. CLOSING COMMENTS Applicant's amendment necessitated the new ground(s) of rejection presented in this Office action. Accordingly, THIS ACTION IS MADE FINAL. See MPEP § 706.07(a). Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a). A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action. In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any nonprovisional extension fee (37 CFR 1.17(a)) pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action. In no event, however, will the statutory period for reply expire later than SIX MONTHS from the mailing date of this final action. a. STATUS OF CLAIMS IN THE APPLICATION a(1) CLAIMS REJECTED IN THE APPLICATION Per the instant office action, claims 1-20 have received an action on the merits and are subject to a final rejection. b. DIRECTION OF FUTURE CORRESPONDENCES Any inquiry concerning this communication or earlier communications from the examiner should be directed to YAIMA RIGOL whose telephone number is (571)272-1232. The examiner can normally be reached Monday-Friday 9:00AM-5:00PM. 
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice. If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Jared I. Rutz can be reached on (571) 272-5535. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300. Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000. April 16, 2026 /YAIMA RIGOL/ Primary Examiner, Art Unit 2135

Prosecution Timeline

Nov 25, 2024
Application Filed
Dec 29, 2025
Non-Final Rejection mailed — §103
Mar 23, 2026
Response Filed
Apr 21, 2026
Final Rejection mailed — §103
May 11, 2026
Examiner Interview Summary
May 11, 2026
Applicant Interview (Telephonic)

Precedent Cases

Applications granted by this same examiner with similar technology

Patent 12639227
Transaction Method and Transaction System Capable of Reducing Dynamic Random-Access Memory Traffic
1y 12m to grant Granted May 26, 2026
Patent 12639233
DEVICE AND METHOD OF SECURE DECRYPTION BY VIRTUALIZATION AND TRANSLATION OF PHYSICAL ENCRYPTION KEYS
1y 8m to grant Granted May 26, 2026
Patent 12639011
ROW HAMMER TELEMETRY
1y 7m to grant Granted May 26, 2026
Patent 12632190
ACHIEVING UNIFORM BANDWIDTH USING BLENDED MEMORY BLOCKS FOR RELOCATION OPERATIONS
2y 3m to grant Granted May 19, 2026
Patent 12613658
Early Read Start Time For Random Access SSDs
2y 2m to grant Granted Apr 28, 2026
Study what changed to get past this examiner. Based on 5 most recent grants.

Prosecution Projections

3-4
Expected OA Rounds
75%
Grant Probability
93%
With Interview (+17.5%)
3y 3m (~1y 9m remaining)
Median Time to Grant
Moderate
PTA Risk
Based on 624 resolved cases by this examiner. Grant probability derived from career allowance rate.
