Prosecution Insights
Last updated: April 19, 2026
Application No. 18/470,827

REQUEST SEGMENTATION FOR REDUCED MEMORY CONSUMPTION BY TRAINED SEQUENTIAL MODELS

Non-Final OA (§103)
Filed: Sep 20, 2023
Examiner: TRAINOR, DANIEL BRENNAN
Art Unit: 2198
Tech Center: 2100 — Computer Architecture & Software
Assignee: Microsoft Technology Licensing, LLC
OA Round: 1 (Non-Final)
Grant Probability: 100% (Favorable)
OA Rounds: 1-2
To Grant: 3y 3m
With Interview: 99%

Examiner Intelligence

Career Allow Rate: 100% (3 granted / 3 resolved; +45.0% vs TC avg, above average)
Interview Lift: +100.0% (strong lift in resolved cases with interview)
Avg Prosecution (typical timeline): 3y 3m
Career History: 32 total applications across all art units (29 currently pending)

Statute-Specific Performance

§101: 22.6% (-17.4% vs TC avg)
§103: 48.0% (+8.0% vs TC avg)
§102: 10.9% (-29.1% vs TC avg)
§112: 17.7% (-22.3% vs TC avg)
Comparisons are against a Tech Center average estimate • Based on career data from 3 resolved cases

Office Action

§103
Detailed Action

1. This office action is in response to the communication filed September 20, 2023. Claims 1-20 are currently pending; claims 1, 8, and 16 are the independent claims.

Notice of Pre-AIA or AIA Status

2. The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Claim Objections

3. Claim 15 is objected to because of the following informality: “… wherein the request segmentation engine is executed on a same a client device that executes the client application” should be amended to remove the duplicated “a” in the claim language. Appropriate correction is required.

Claim Rejections - 35 USC § 103

4. Claims 1-13 and 15-20 are rejected under 35 U.S.C. 103 as being unpatentable over Cade et al. (NPL – How continuous batching enables 23x throughput in LLM inference while reducing P50 latency) – hereinafter “Cade”, in view of Qadrud-Din et al. (US Patent No. 11,995,411) – hereinafter “Qadrud-Din”.

Regarding independent claim 1, Cade discloses a method for reducing memory consumption of a trained sequential model, the method comprising: (Page 2, “Due to the large GPU memory footprint and compute cost of LLMs, serving dominates the compute cost for most real world applications. ML engineers often treat LLMs like "black boxes" that can only be optimized with internal changes such as quantization and custom CUDA kernels. However, this is not entirely the case. Because LLMs iteratively generate their output, and because LLM inference is often memory and not compute bound, there are surprising system-level batching optimizations that make 10x or more differences in real-world workloads.”) The citation is interpreted to read on the claimed invention because under broadest reasonable interpretation, the trained sequential model (an LLM) is often memory bound, so its memory consumption can be reduced through system-level batching optimizations.
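The segmentation concept at the heart of claim 1 can be sketched as follows. This is a minimal illustration only, not code from the application or from either cited reference; the function name and the even-split policy are assumptions. The point is simply that an initial output size parameter (a max_tokens budget) can be divided into per-request fractions.

```python
# Hypothetical sketch of request segmentation: the initial request's output
# size parameter is split into fractional budgets, one per partial processing
# request, which sum back to the initial value. Names are illustrative only.

def segment_budget(max_tokens: int, num_segments: int) -> list[int]:
    """Split an output-size budget into num_segments fractional budgets."""
    base = max_tokens // num_segments
    budgets = [base] * num_segments
    budgets[-1] += max_tokens - base * num_segments  # remainder to last segment
    return budgets

print(segment_budget(10, 3))  # three partial budgets that sum to 10
```

Each partial processing request would carry one of these fractional budgets in place of the initial value, which is the limitation the rejection maps to Qadrud-Din's chunker.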
receiving, from a client application, an initial processing request identifying an input sequence to be processed by the trained sequential model and an initial value for an output size parameter, the output size parameter representing a requested size of output from the trained sequential model; (Page 3, “For each request: 1. You start with a sequence of tokens (called the "prefix" or "prompt"). 2. The LLM produces a sequence of completion tokens, stopping only after producing a stop token or reaching a maximum sequence length. This is an iterative process. You get one additional completion token for each new forward pass of the model. For example, suppose you prompt with a sentence "What is the capital of California: ", it would take ten forward pass iterations to get back the full response of ["S", "a", "c", "r", "a", "m", "e", "n", "t", "o"]. This example simplifies things a little bit because in actuality tokens do not map 1:1 to ASCII characters (a popular token encoding technique is Byte-Pair Encoding which is beyond the scope of this blog post), but the iterative nature of generation is the same regardless of how you tokenize your sequences.” and Page 11 “To do this, we create a dataset containing 1000 sequences each with 512 input tokens. We configure our model to always emit a per-sequence generation length by ignoring the end-of-sequence token and configuring max_tokens. We then generate 1000 generation lengths, one for each request, sampled from an exponential distribution with mean=128 tokens. We use an exponential distribution as it is a good approximation of the generation lengths that one may encounter while serving an application like ChatGPT.”) The citation is interpreted to read on the claimed invention because under broadest reasonable interpretation, a received request such as “What is the capital of California” is paired with a maximum sequence length (max_tokens) that corresponds to the claimed output size parameter.
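The Page 11 benchmark setup quoted above can be sketched roughly as follows. This is a hypothetical reconstruction; the blog post does not publish this code, and the dictionary layout is an assumption. It shows the pairing the citation relies on: each request carries a fixed-length input sequence plus a max_tokens value (the output size parameter) sampled from an exponential distribution.

```python
# Rough sketch of Cade's benchmark: 1000 requests, each pairing a 512-token
# input with a max_tokens value drawn from an exponential distribution with
# mean 128 tokens. Structure is illustrative, not from the reference.
import random

def make_benchmark_requests(n: int = 1000, mean_tokens: float = 128.0):
    rng = random.Random(0)  # seeded so the sketch is reproducible
    return [
        {
            "prompt_tokens": 512,  # fixed input length per the quoted setup
            "max_tokens": max(1, round(rng.expovariate(1.0 / mean_tokens))),
        }
        for _ in range(n)
    ]

requests = make_benchmark_requests()
print(len(requests), requests[0]["prompt_tokens"])  # 1000 512
```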
sequentially transmitting, to the trained sequential model, multiple partial processing requests based on the initial processing request … (Page 3, “For each request: 1. You start with a sequence of tokens (called the "prefix" or "prompt"). 2. The LLM produces a sequence of completion tokens, stopping only after producing a stop token or reaching a maximum sequence length. This is an iterative process. You get one additional completion token for each new forward pass of the model. For example, suppose you prompt with a sentence "What is the capital of California: ", it would take ten forward pass iterations to get back the full response of ["S", "a", "c", "r", "a", "m", "e", "n", "t", "o"]. This example simplifies things a little bit because in actuality tokens do not map 1:1 to ASCII characters (a popular token encoding technique is Byte-Pair Encoding which is beyond the scope of this blog post), but the iterative nature of generation is the same regardless of how you tokenize your sequences.”) The citation is interpreted to read on the claimed invention because under broadest reasonable interpretation, the sequential model is requested/called for ten forward pass iterations to get back the full response of ["S", "a", "c", "r", "a", "m", "e", "n", "t", "o"].

receiving a sequence of output responses from the trained sequential model, each response in the sequence of output responses being generated in response to processing of a corresponding one of the multiple partial processing requests; and (Page 3, “For each request: 1. You start with a sequence of tokens (called the "prefix" or "prompt"). 2. The LLM produces a sequence of completion tokens, stopping only after producing a stop token or reaching a maximum sequence length. This is an iterative process. You get one additional completion token for each new forward pass of the model.
For example, suppose you prompt with a sentence "What is the capital of California: ", it would take ten forward pass iterations to get back the full response of ["S", "a", "c", "r", "a", "m", "e", "n", "t", "o"]. This example simplifies things a little bit because in actuality tokens do not map 1:1 to ASCII characters (a popular token encoding technique is Byte-Pair Encoding which is beyond the scope of this blog post), but the iterative nature of generation is the same regardless of how you tokenize your sequences.”) The citation is interpreted to read on the claimed invention because under broadest reasonable interpretation, the sequential model processes requests iteratively and gets one additional completion token for each new forward pass of the model, such that each letter in “Sacramento” is an output response of a partial processing request.

returning, to the client application, a final merged response including the sequence of output responses formatted to match expected response output associated with the initial processing request. (Page 3, “For each request: 1. You start with a sequence of tokens (called the "prefix" or "prompt"). 2. The LLM produces a sequence of completion tokens, stopping only after producing a stop token or reaching a maximum sequence length. This is an iterative process. You get one additional completion token for each new forward pass of the model. For example, suppose you prompt with a sentence "What is the capital of California: ", it would take ten forward pass iterations to get back the full response of ["S", "a", "c", "r", "a", "m", "e", "n", "t", "o"].
This example simplifies things a little bit because in actuality tokens do not map 1:1 to ASCII characters (a popular token encoding technique is Byte-Pair Encoding which is beyond the scope of this blog post), but the iterative nature of generation is the same regardless of how you tokenize your sequences.”) The citation is interpreted to read on the claimed invention because under broadest reasonable interpretation, the full response of ["S", "a", "c", "r", "a", "m", "e", "n", "t", "o"] is returned to the client application as a merged response “Sacramento”.

Cade does not explicitly disclose: … each [multiple partial processing request] specify a fraction of the initial value as the output size parameter; However, Qadrud-Din discloses: … each [multiple partial processing request] specify a fraction of the initial value as the output size parameter; (Col. 7, Lines 7-16 “In some implementations, the chunker 240 is configured to divide text into smaller portions. Dividing text into smaller portions may be needed at least in part to comply with one or more size limitations associated with the text. For instance, the text generation API 274 may impose a maximum size limit on prompts provided to the text generation model 276. The chunker may be used to subdivide text included in a request from a client, retrieved from a document, returned in a search result, or received from any other source.”) The citation is interpreted to read on the claimed invention because under broadest reasonable interpretation, the chunker divides text into smaller requests, which in turn divides the output size value among the smaller portions created by the chunker.
Therefore, it would have been obvious to a person having ordinary skill in the art before the effective filing date of the claimed invention to add each [multiple partial processing request] specify a fraction of the initial value as the output size parameter as seen in Qadrud-Din's invention into Cade's invention because these modifications allow an “obvious to try” solution with a reasonable expectation of success, such that the multiple partial requests that are split up from the initial request also require splitting the initial output size parameter into fractions.

Regarding claim 2, Cade discloses the method of claim 1, but does not explicitly disclose: wherein the fraction of the initial value specified for the output size parameter in each different one of the multiple partial processing requests collectively sum to the initial value. However, Qadrud-Din discloses: wherein the fraction of the initial value specified for the output size parameter in each different one of the multiple partial processing requests collectively sum to the initial value. (Col. 7, Lines 7-16 “In some implementations, the chunker 240 is configured to divide text into smaller portions. Dividing text into smaller portions may be needed at least in part to comply with one or more size limitations associated with the text. For instance, the text generation API 274 may impose a maximum size limit on prompts provided to the text generation model 276. The chunker may be used to subdivide text included in a request from a client, retrieved from a document, returned in a search result, or received from any other source.”) The citation is interpreted to read on the claimed invention because under broadest reasonable interpretation, the chunker divides text into smaller requests, which in turn divides the output size value among the smaller portions created by the chunker, and those portions still sum to the initial value.
Therefore, it would have been obvious to a person having ordinary skill in the art before the effective filing date of the claimed invention to add wherein the fraction of the initial value specified for the output size parameter in each different one of the multiple partial processing requests collectively sum to the initial value as seen in Qadrud-Din's invention into Cade's invention because these modifications allow an “obvious to try” solution with a reasonable expectation of success, such that splitting the initial output size parameter into fractions still requires those fractions to collectively sum to the initial value.

Regarding claim 3, Cade discloses the method of claim 1, wherein each of the multiple partial processing requests further includes the input sequence. (Page 3, “For each request: 1. You start with a sequence of tokens (called the "prefix" or "prompt"). 2. The LLM produces a sequence of completion tokens, stopping only after producing a stop token or reaching a maximum sequence length. This is an iterative process. You get one additional completion token for each new forward pass of the model. For example, suppose you prompt with a sentence "What is the capital of California: ", it would take ten forward pass iterations to get back the full response of ["S", "a", "c", "r", "a", "m", "e", "n", "t", "o"].
This example simplifies things a little bit because in actuality tokens do not map 1:1 to ASCII characters (a popular token encoding technique is Byte-Pair Encoding which is beyond the scope of this blog post), but the iterative nature of generation is the same regardless of how you tokenize your sequences.”) The citation is interpreted to read on the claimed invention because under broadest reasonable interpretation, the sequential model is requested/called for ten forward pass iterations to get back the full response of ["S", "a", "c", "r", "a", "m", "e", "n", "t", "o"] and each request includes the input sequence of "What is the capital of California: ".

Regarding claim 4, Cade discloses the method of claim 1, wherein the multiple partial processing requests include a first partial processing request and one or more additional partial processing requests, each of the one or more additional partial processing requests identifying a subset of the sequence of output responses generated during processing of previous requests of the multiple partial processing requests. (Page 3, “For each request: 1. You start with a sequence of tokens (called the "prefix" or "prompt"). 2. The LLM produces a sequence of completion tokens, stopping only after producing a stop token or reaching a maximum sequence length. This is an iterative process. You get one additional completion token for each new forward pass of the model. For example, suppose you prompt with a sentence "What is the capital of California: ", it would take ten forward pass iterations to get back the full response of ["S", "a", "c", "r", "a", "m", "e", "n", "t", "o"].
This example simplifies things a little bit because in actuality tokens do not map 1:1 to ASCII characters (a popular token encoding technique is Byte-Pair Encoding which is beyond the scope of this blog post), but the iterative nature of generation is the same regardless of how you tokenize your sequences.”) The citation is interpreted to read on the claimed invention because under broadest reasonable interpretation, the sequential model is requested/called for ten forward pass iterations to get back the full response of ["S", "a", "c", "r", "a", "m", "e", "n", "t", "o"], and each additional request builds on the subset of output tokens generated during the previous forward passes.

Regarding claim 5, Cade discloses the method of claim 1, wherein transmitting, to the trained sequential model, multiple partial processing requests includes: transmitting, to the trained sequential model, a first partial processing request that includes the input sequence; (Page 3, “For each request: 1. You start with a sequence of tokens (called the "prefix" or "prompt"). 2. The LLM produces a sequence of completion tokens, stopping only after producing a stop token or reaching a maximum sequence length. This is an iterative process. You get one additional completion token for each new forward pass of the model. For example, suppose you prompt with a sentence "What is the capital of California: ", it would take ten forward pass iterations to get back the full response of ["S", "a", "c", "r", "a", "m", "e", "n", "t", "o"].
This example simplifies things a little bit because in actuality tokens do not map 1:1 to ASCII characters (a popular token encoding technique is Byte-Pair Encoding which is beyond the scope of this blog post), but the iterative nature of generation is the same regardless of how you tokenize your sequences.”) The citation is interpreted to read on the claimed invention because under broadest reasonable interpretation, the sequential model is requested with the prompt “What is the capital of California: ” and the first request has an equivalence of one token to one letter output.

receiving a first processing result in response to processing of the first partial processing request, the first processing result including a first output sequence corresponding to a first sequential portion of the final merged response; (Page 3, “For each request: 1. You start with a sequence of tokens (called the "prefix" or "prompt"). 2. The LLM produces a sequence of completion tokens, stopping only after producing a stop token or reaching a maximum sequence length. This is an iterative process. You get one additional completion token for each new forward pass of the model. For example, suppose you prompt with a sentence "What is the capital of California: ", it would take ten forward pass iterations to get back the full response of ["S", "a", "c", "r", "a", "m", "e", "n", "t", "o"]. This example simplifies things a little bit because in actuality tokens do not map 1:1 to ASCII characters (a popular token encoding technique is Byte-Pair Encoding which is beyond the scope of this blog post), but the iterative nature of generation is the same regardless of how you tokenize your sequences.”) The citation is interpreted to read on the claimed invention because under broadest reasonable interpretation, the sequential model outputs ["S"] as the first sequential portion of the final merged response.
transmitting, to the trained sequential model, a second partial processing request that includes the input sequence and the first output sequence; and (Page 3, “For each request: 1. You start with a sequence of tokens (called the "prefix" or "prompt"). 2. The LLM produces a sequence of completion tokens, stopping only after producing a stop token or reaching a maximum sequence length. This is an iterative process. You get one additional completion token for each new forward pass of the model. For example, suppose you prompt with a sentence "What is the capital of California: ", it would take ten forward pass iterations to get back the full response of ["S", "a", "c", "r", "a", "m", "e", "n", "t", "o"]. This example simplifies things a little bit because in actuality tokens do not map 1:1 to ASCII characters (a popular token encoding technique is Byte-Pair Encoding which is beyond the scope of this blog post), but the iterative nature of generation is the same regardless of how you tokenize your sequences.”) The citation is interpreted to read on the claimed invention because under broadest reasonable interpretation, the sequential model is prompted with the sentence "What is the capital of California: " and the output after one completion token of [“S”].

receiving, from the trained sequential model, a second processing result in response to processing of the second partial processing request, the second processing result including the first output sequence and a second output sequence corresponding to a second sequential portion of the final merged response, wherein the final merged response includes the second output sequence appended to the first output sequence. (Page 3, “For each request: 1. You start with a sequence of tokens (called the "prefix" or "prompt"). 2. The LLM produces a sequence of completion tokens, stopping only after producing a stop token or reaching a maximum sequence length. This is an iterative process.
You get one additional completion token for each new forward pass of the model. For example, suppose you prompt with a sentence "What is the capital of California: ", it would take ten forward pass iterations to get back the full response of ["S", "a", "c", "r", "a", "m", "e", "n", "t", "o"]. This example simplifies things a little bit because in actuality tokens do not map 1:1 to ASCII characters (a popular token encoding technique is Byte-Pair Encoding which is beyond the scope of this blog post), but the iterative nature of generation is the same regardless of how you tokenize your sequences.”) The citation is interpreted to read on the claimed invention because under broadest reasonable interpretation, the sequential model outputs the response of [“S”, “a”] following two passes and by the end of its runtime contains the full response of ["S", "a", "c", "r", "a", "m", "e", "n", "t", "o"].

Regarding claim 6, Cade discloses the method of claim 1, wherein the trained sequential model is a generative transformer-based model. (Page 10, “NVIDIA’s FasterTransformer. This is a library which provides optimized implementations of various transformer models. It currently only provides static batching (the Triton inference server provides request-level dynamic batching, but not continuous batching yet). This provides us with an idea of how far an extremely optimized implementation of our model can get us with static batching -- it provides a more competitive baseline than the relatively unoptimized OPT-13B implementation available on Hugging Face Hub.”) The citation is interpreted to read on the claimed invention because under broadest reasonable interpretation, the trained sequential model uses the FasterTransformer framework.

Regarding claim 7, Cade discloses the method of claim 1, but does not explicitly disclose: wherein transmitting of the multiple partial processing requests is performed by a client device executing the client application.
However, Qadrud-Din discloses: wherein transmitting of the multiple partial processing requests is performed by a client device executing the client application. (Col. 2, Line 63 – Col. 3, Line 13 “According to various embodiments, techniques and mechanisms described herein provide for novel text generation in domain-specific contexts. A text generation interface system may take as input one or more arbitrary documents, process them via optical text recognition, segment them into portions, and process the segmented text via various tasks based on need. Different workflows are provided for different tasks, and this application describes a number of examples of such workflows. In many workflows, an input document is divided into chunks via a chunking technique. Then, chunks are inserted into prompt templates for processing by a large language model such as the GPT-3 or GPT-4 available from OpenAI. The large language model's response is then parsed and potentially used to trigger additional analysis, such as one or more database searches, one or more additional prompts sent back to the large language model, and/or a response returned to a client machine.” and Col. 6, Lines 5-10 “According to various embodiments, a client machine may interact with the text generation interface system in any of various ways. For example, a client machine may access the text generation interface system via a text editor plugin, a dedicated application, a web browser, other types of interactions techniques, or combinations thereof.”) The citation is interpreted to read on the claimed invention because under broadest reasonable interpretation, the multiple partial processing requests come from an input split into chunks by the text generation interface system, which can be accessed by a client machine via a dedicated application.
Therefore, it would have been obvious to a person having ordinary skill in the art before the effective filing date of the claimed invention to add wherein transmitting of the multiple partial processing requests is performed by a client device executing the client application as seen in Qadrud-Din's invention into Cade's invention because these modifications allow the simple substitution of one known element for another to obtain predictable results such that the transmitting of the multiple partial processing requests is performed on the same device that the client is using rather than exporting the requests to a separate device/server.

Regarding claim 8, it is a system claim having the same limitations as cited in method claim 1. Thus, claim 8 is also rejected under the same rationale as addressed in the rejection of claim 1 above.

Regarding claim 9, it is a system claim having the same limitations as cited in method claim 2. Thus, claim 9 is also rejected under the same rationale as addressed in the rejection of claim 2 above.

Regarding claim 10, it is a system claim having the same limitations as cited in method claim 3. Thus, claim 10 is also rejected under the same rationale as addressed in the rejection of claim 3 above.

Regarding claim 11, it is a system claim having the same limitations as cited in method claim 4. Thus, claim 11 is also rejected under the same rationale as addressed in the rejection of claim 4 above.

Regarding claim 12, it is a system claim having the same limitations as cited in method claims 3 and 4. Thus, claim 12 is also rejected under the same rationale as addressed in the rejection of claims 3 and 4 above.

Regarding claim 13, Cade discloses the system of claim 12, wherein the sequence of output responses includes the first output sequence and a second output sequence generated by the trained sequential model based on the second partial processing request. (Page 3, “For each request: 1.
You start with a sequence of tokens (called the "prefix" or "prompt"). 2. The LLM produces a sequence of completion tokens, stopping only after producing a stop token or reaching a maximum sequence length. This is an iterative process. You get one additional completion token for each new forward pass of the model. For example, suppose you prompt with a sentence "What is the capital of California: ", it would take ten forward pass iterations to get back the full response of ["S", "a", "c", "r", "a", "m", "e", "n", "t", "o"]. This example simplifies things a little bit because in actuality tokens do not map 1:1 to ASCII characters (a popular token encoding technique is Byte-Pair Encoding which is beyond the scope of this blog post), but the iterative nature of generation is the same regardless of how you tokenize your sequences.”) The citation is interpreted to read on the claimed invention because under broadest reasonable interpretation, the sequential model processes requests iteratively and gets one additional completion token for each new forward pass of the model, such that each letter in “Sacramento” is an output response of a partial processing request.

Regarding claim 15, Cade discloses the system of claim 8, but does not explicitly disclose: wherein the request segmentation engine is executed on a same a client device that executes the client application. However, Qadrud-Din discloses: wherein the request segmentation engine is executed on a same a client device that executes the client application. (Col. 2, Line 63 – Col. 3, Line 13 “According to various embodiments, techniques and mechanisms described herein provide for novel text generation in domain-specific contexts. A text generation interface system may take as input one or more arbitrary documents, process them via optical text recognition, segment them into portions, and process the segmented text via various tasks based on need.
Different workflows are provided for different tasks, and this application describes a number of examples of such workflows. In many workflows, an input document is divided into chunks via a chunking technique. Then, chunks are inserted into prompt templates for processing by a large language model such as the GPT-3 or GPT-4 available from OpenAI. The large language model's response is then parsed and potentially used to trigger additional analysis, such as one or more database searches, one or more additional prompts sent back to the large language model, and/or a response returned to a client machine.” and Col. 6, Lines 5-10 “According to various embodiments, a client machine may interact with the text generation interface system in any of various ways. For example, a client machine may access the text generation interface system via a text editor plugin, a dedicated application, a web browser, other types of interactions techniques, or combinations thereof.”) The citation is interpreted to read on the claimed invention because under broadest reasonable interpretation, the multiple partial processing requests come from an input split into chunks by the text generation interface system/request segmentation system, which can be accessed by a client machine via a dedicated application.

Therefore, it would have been obvious to a person having ordinary skill in the art before the effective filing date of the claimed invention to add wherein the request segmentation engine is executed on a same a client device that executes the client application as seen in Qadrud-Din's invention into Cade's invention because these modifications allow the simple substitution of one known element for another to obtain predictable results such that the request segmentation is performed on the same device that the client is using rather than exporting the segmentation to a separate device/server.
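The sequential partial-request pattern that the rejection maps onto Cade's iterative decoding (claims 1, 4, and 5) can be illustrated with a toy stand-in for the model. Everything below is hypothetical: a real system would invoke the trained sequential model, not index into a fixed string, and the names are illustrative only. Each partial request carries the input sequence plus the output generated so far, and the final merged response is the concatenation of the partial outputs.

```python
# Toy illustration of sequential partial processing requests. The stand-in
# "model" simply reveals the next letters of a fixed answer, mimicking the
# Cade example where each forward pass yields one more token of "Sacramento".

ANSWER = "Sacramento"  # full response from the Cade example

def toy_model(input_sequence: str, generated: str, budget: int) -> str:
    """Stand-in for the trained sequential model: emit up to `budget` tokens."""
    start = len(generated)
    return ANSWER[start:start + budget]

def run_segmented(input_sequence: str, total_budget: int, fraction: int) -> str:
    merged = ""
    while len(merged) < total_budget:
        # Each partial request includes the input sequence and prior output.
        merged += toy_model(input_sequence, merged, fraction)
    return merged

print(run_segmented("What is the capital of California: ", 10, 4))  # Sacramento
```

The loop mirrors the claim 5 mapping: the first partial request returns "Sacr", the second request (which includes that first output) returns "amen", and the final merged response appends each new output sequence to the previous ones.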
Regarding claim 16, it is a computer process claim having the same limitations as cited in method claim 1. Thus, claim 16 is also rejected under the same rationale as addressed in the rejection of claim 1 above.

Regarding claim 17, it is a computer process claim having the same limitations as cited in method claim 2. Thus, claim 17 is also rejected under the same rationale as addressed in the rejection of claim 2 above.

Regarding claim 18, it is a computer process claim having the same limitations as cited in method claim 3. Thus, claim 18 is also rejected under the same rationale as addressed in the rejection of claim 3 above.

Regarding claim 19, it is a computer process claim having the same limitations as cited in method claim 4. Thus, claim 19 is also rejected under the same rationale as addressed in the rejection of claim 4 above.

Regarding claim 20, it is a computer process claim having the same limitations as cited in method claim 5. Thus, claim 20 is also rejected under the same rationale as addressed in the rejection of claim 5 above.

5. Claim 14 is rejected under 35 U.S.C. 103 as being unpatentable over Cade et al. (NPL – How continuous batching enables 23x throughput in LLM inference while reducing P50 latency) – hereinafter “Cade”, in view of Qadrud-Din et al. (US Patent No. 11,995,411) – hereinafter “Qadrud-Din”, and further in view of van Hoorn – hereinafter “van Hoorn”.

Regarding claim 14, Cade discloses the system of claim 8, but does not explicitly disclose: wherein the trained sequential model is a transformer model with a decoder-only architecture. However, van Hoorn discloses: wherein the trained sequential model is a transformer model with a decoder-only architecture. (Introduction, “Large-language models (LLMs) have gained tons of popularity lately with the releases of ChatGPT, GPT-4, Bard, and more. All these LLMs are based on the transformer neural network architecture.
The transformer architecture was first introduced in the paper "Attention is All You Need" by Google Brain in 2017. LLMs/GPT models use a variant of this architecture called the 'decoder-only transformer'. The most popular variety of transformers are currently these GPT models. The only purpose of these models is to receive a prompt (an input) and predict the next token/word that comes after this input. Nothing more, nothing less.”) The citation is interpreted to read on the claimed invention because under broadest reasonable interpretation, trained sequential models such as GPT models are transformer models with a decoder-only architecture.

Therefore, it would have been obvious to a person having ordinary skill in the art before the effective filing date of the claimed invention to add wherein the trained sequential model is a transformer model with a decoder-only architecture as seen in van Hoorn’s invention into Cade's invention because these modifications allow combining prior art elements according to known methods to yield predictable results such that the trained sequential model is specifically a transformer model with a decoder-only architecture such as a GPT. Cade mentions GPT models but does not detail their decoder-only architecture, so an additional prior art reference is relied upon for this limitation.

Conclusion

6. The prior art made of record and not relied upon is considered pertinent to applicant's disclosure. Such prior art includes Yu et al. (US Patent No. 11,442,775), which discloses a machine-learning transformer model that batches split-up requests to a decoder, which then outputs a result to another decoder continuously until the request is ready for final output. Examiner has cited particular columns/paragraphs/sections and line numbers in the references applied and not relied upon to the claims above for the convenience of the applicant.
Although the specified citations are representative of the teachings of the art and are applied to specific limitations within the individual claim, other passages and figures may apply as well. It is respectfully requested of the applicant, in preparing responses, to fully consider the references in their entirety as potentially teaching all or part of the claimed invention, as well as the context of the passage as taught by the prior art or disclosed by the Examiner.

When responding to the Office action, applicant is advised to clearly point out the patentable novelty the claims present in view of the state of the art disclosed by the reference(s) cited or the objections made. A showing of how the amendments avoid such references or objections must also be present. See 37 C.F.R. 1.111(c). Applicant is further advised to provide the line and page numbers in the application and/or reference(s) cited to assist in locating the appropriate paragraphs.

Any inquiry concerning this communication or earlier communications from the examiner should be directed to DANIEL B TRAINOR, whose telephone number is (571) 272-3710. The examiner can normally be reached Monday-Friday, 9 AM-5 PM.

Examiner interviews are available via telephone, in person, and video conferencing using a USPTO-supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.

If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Pierre Vital, can be reached at (571) 272-4215. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.

Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users.
To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.

/D.T./Examiner, Art Unit 2198
/PIERRE VITAL/Supervisory Patent Examiner, Art Unit 2198
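The van Hoorn passage quoted in the rejection of claim 14 describes the core behavior of a decoder-only transformer: it receives a prompt and repeatedly predicts the next token, one per forward pass. A minimal sketch of that autoregressive loop, with a hypothetical `toy_next_token` function standing in for a real model's forward pass:

```python
def toy_next_token(tokens):
    """Hypothetical stand-in for a decoder-only model's forward pass:
    given the full sequence so far, return a single next token."""
    return (sum(tokens) + len(tokens)) % 100  # deterministic toy rule


def generate(prompt_tokens, max_new_tokens, eos_token=0):
    """Autoregressive generation: extend the sequence one token at a time,
    feeding the prompt plus all previously generated tokens back in."""
    tokens = list(prompt_tokens)
    for _ in range(max_new_tokens):
        nxt = toy_next_token(tokens)  # one full model pass per new token
        tokens.append(nxt)
        if nxt == eos_token:          # stop early on end-of-sequence
            break
    return tokens


out = generate([3, 1, 4], max_new_tokens=4)
```

Because each iteration reprocesses (or caches state for) the entire growing sequence, memory grows with every request in flight, which is why the batching and segmentation optimizations discussed in the cited art target this loop.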

Prosecution Timeline

Sep 20, 2023
Application Filed
Mar 10, 2026
Non-Final Rejection — §103 (current)


Prosecution Projections

1-2
Expected OA Rounds
100%
Grant Probability
99%
With Interview (+100.0%)
3y 3m
Median Time to Grant
Low
PTA Risk
Based on 3 resolved cases by this examiner. Grant probability derived from career allow rate.
