Prosecution Insights
Last updated: April 19, 2026
Application No. 17/585,380

INTERLOCKING BACKPROPAGATION FOR AUTOMATED TRAINING OF COMPUTER PREDICTIVE MODELS

Final Rejection — §103, §112
Filed: Jan 26, 2022
Examiner: GIROUX, GEORGE
Art Unit: 2128
Tech Center: 2100 — Computer Architecture & Software
Assignee: Cohere Inc.
OA Round: 2 (Final)
Grant Probability: 66% (Favorable)
Expected OA Rounds: 3-4
Time to Grant: 4y 6m
With Interview: 93%

Examiner Intelligence

Career Allow Rate: 66% (401 granted / 612 resolved; +10.5% vs TC avg) — above average
Interview Lift: +27.1% on resolved cases with interview — strong
Typical Timeline: 4y 6m average prosecution; 28 applications currently pending
Career History: 640 total applications across all art units

Statute-Specific Performance

§101: 11.0% (-29.0% vs TC avg)
§103: 45.5% (+5.5% vs TC avg)
§102: 16.0% (-24.0% vs TC avg)
§112: 15.5% (-24.5% vs TC avg)
Tech Center averages are estimates • Based on career data from 612 resolved cases

Office Action

Grounds of rejection: §103, §112
DETAILED ACTION

Notice of Pre-AIA or AIA Status

The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Response to Amendment

This Office Action is in response to applicant's communication filed 21 August 2025, in response to the Office Action mailed 21 March 2025. The applicant's remarks and any amendments to the claims or specification have been considered, with the results that follow.

Claim Rejections - 35 USC § 112

The following is a quotation of 35 U.S.C. 112(b):

(b) CONCLUSION.—The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the inventor or a joint inventor regards as the invention.

The following is a quotation of 35 U.S.C. 112 (pre-AIA), second paragraph:

The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the applicant regards as his invention.

Claims 1-20 are rejected under 35 U.S.C. 112(b) or 35 U.S.C. 112 (pre-AIA), second paragraph, as being indefinite for failing to particularly point out and distinctly claim the subject matter which the inventor or a joint inventor (or for applications subject to pre-AIA 35 U.S.C. 112, the applicant), regards as the invention.

Claim 1 recites the limitation "its own gradients" in line 2 of pg. 2. There is insufficient antecedent basis for this limitation in the claim. Claims 2-7 depend upon claim 1, and thus include the aforementioned limitation(s).

Claim 8 recites the limitation "its own gradients" in line 29. There is insufficient antecedent basis for this limitation in the claim. Claims 9-14 depend upon claim 8, and thus include the aforementioned limitation(s).

Claim 15 recites the limitation "its own gradients" in line 30. There is insufficient antecedent basis for this limitation in the claim. 
Claims 16-20 depend upon claim 15, and thus include the aforementioned limitation(s).

Claim Rejections - 35 USC § 103

In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis (i.e., changing from AIA to pre-AIA) for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.

This application currently names joint inventors. In considering patentability of the claims the examiner presumes that the subject matter of the various claims was commonly owned as of the effective filing date of the claimed invention(s) absent any evidence to the contrary. Applicant is advised of the obligation under 37 CFR 1.56 to point out the inventor and effective filing dates of each claim that was not commonly owned as of the effective filing date of the later invention in order for the examiner to consider the applicability of 35 U.S.C. 102(b)(2)(C) for any potential 35 U.S.C. 102(a)(2) prior art against the later invention.

The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:

A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

The factual inquiries for establishing a background for determining obviousness under 35 U.S.C. 103 are summarized as follows:

1. Determining the scope and contents of the prior art.
2. Ascertaining the differences between the prior art and the claims at issue.
3. Resolving the level of ordinary skill in the pertinent art.
4. Considering objective evidence present in the application indicating obviousness or nonobviousness.

Claim(s) 1-20 is/are rejected under 35 U.S.C. 103 as being unpatentable over Huang et al. (GPipe: Efficient Training of Giant Neural Networks using Pipeline Parallelism, Sept 2019, pgs. 1-10) in view of Clement (US 2021/0357762).

As per claim 1, Huang teaches a method for generating trained transformer models comprising: receiving input data, the input data containing a sequence of elements [Transformer models are trained on different training datasets with different input sizes (pgs. 4-5, section 3; pgs. 6-7, section 5; etc.)]; initializing a transformer model that comprises a plurality of neural network layers [the transformer model layers are initialized (pg. 7, section 5; etc.)]; training the transformer model by accessing and utilizing a plurality of processing units, wherein each processing unit comprises a contiguous group of functions within the transformer model [the transformer models are trained with different numbers of pipelined accelerators (processing units), where the layers of the models are divided into different numbers of partitions of accelerators (pg. 2, section 1; pg. 4, Table 1; pg. 5, Table 2; etc.); and GPipe assumes that each accelerator can fit at least one layer (pg. 8, section 6; etc.)], the training comprising: arranging the plurality of processing units sequentially [the transformer models are trained with different numbers of pipelined (sequentially arranged) accelerators (processing units), where the layers of the models are divided into different numbers of partitions of accelerators (pg. 2, section 1; pg. 4, Table 1; pg. 
5, Table 2; etc.)]; dividing the processing units into a plurality of interlocking subsets based on boundaries of encoders or decoders in the transformer model across layers of the transformer model [Transformer-L includes L encoders and decoders, and is partitioned into a pipeline of k accelerators (pg. 4, Table 1; pg. 5, Table 2; etc.) the transformer models are trained with different numbers of pipelined accelerators (processing units), where the layers of the models are divided into different numbers of partitions of accelerators (pg. 2, section 1; pg. 4, Table 1; pg. 5, Table 2; etc.); and GPipe assumes that each accelerator can fit at least one layer (pg. 8, section 6; etc.); where the division of encoders and decoders into layers and the layers into accelerators is based on boundaries of encoders or decoders in the transformer model across layers of the transformer model], such that certain processing units appear in more than one subset and each subset comprises more than one processing unit [Transformer-L includes L encoders and decoders, and is partitioned into a pipeline of k accelerators (pg. 4, Table 1; pg. 5, Table 2; etc.) the transformer models are trained with different numbers of pipelined accelerators (processing units), where the layers of the models are divided into different numbers of partitions of accelerators (pg. 2, section 1; pg. 4, Table 1; pg. 5, Table 2; etc.); further see Examiner’s Note, below]; and for each processing unit: repeatedly forward propagating one or more values calculated based on a set of parameters of the transformer model and based on one or more activation functions [Fk is the composite forward computation function of the k-th cell using forward computation function f with model parameters wi where, during the forward pass, GPipe first divides every mini-batch of size N into M equal micro-batches, which are pipelined through the K accelerators (pg. 3, fig. 2 and section 2.2; etc.) 
where each accelerator only stores output activations at partition boundaries (pg. 3, section 2.3; etc.)]; repeatedly backpropagating one or more error terms obtained from one or more loss functions [during the backward pass, gradients for each micro-batch are computed based on the same model parameters used for the forward pass. At the end of each mini-batch, gradients from all M micro-batches are accumulated and applied to update the model parameters across all accelerators, based on the calculated loss (pg. 3, fig. 2 and section 2.2; etc.)]; updating the set of parameters of the transformer model based on its own gradients and the one or more error terms backpropagated from a set of subsequent processing units [during the backward pass, gradients for each micro-batch are computed based on the same model parameters used for the forward pass. At the end of each mini-batch, gradients from all M micro-batches are accumulated and applied to update the model parameters across all accelerators, based on the calculated loss (pg. 3, fig. 2 and section 2.2; etc.), where fig. 2 shows the pipeline including subsequent accelerators (processing units)], wherein the set of subsequent processing units is determined based on the subsets [Transformer-L includes L encoders and decoders, and is partitioned into a pipeline of k accelerators (pg. 4, Table 1; pg. 5, Table 2; etc.) and the back-propagation function Bk of gradients depends on both Bk+1 from the upper layer and Fk (the composite forward computation (pg. 3, fig. 2; etc.); which determines how many accelerators each accelerator’s gradient is based on (for backpropagation), as does the number L of encoders/decoders and how many accelerators K they are partitioned among]; and outputting the trained transformer model [machine translation predictions are made with final, stored parameters of the trained model (pgs. 6-7, section 5; etc.)]. 
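The GPipe scheme mapped onto claim 1 above (each mini-batch split into M micro-batches, per-stage gradients accumulated during the backward pass, and one synchronized parameter update applied at the end of the mini-batch) can be illustrated with a toy sketch. This is an editorial illustration in which each pipeline stage is reduced to a single scalar weight; it is not code from the cited Huang reference.

```python
# Toy sketch of GPipe-style micro-batch gradient accumulation.
# Each pipeline "stage" is one scalar weight (y = w * x); real stages
# would be contiguous groups of transformer layers on separate accelerators.

def train_minibatch(stages, minibatch, lr=0.01):
    """Split a mini-batch into micro-batches, accumulate gradients per
    stage, and apply one synchronized update at the end of the mini-batch."""
    grads = [0.0] * len(stages)
    for x, target in minibatch:          # each (x, target) is one micro-batch
        # forward pass through the sequentially arranged stages
        acts = [x]
        for w in stages:
            acts.append(w * acts[-1])
        # squared-error loss; the backward pass propagates the error term
        err = 2.0 * (acts[-1] - target)
        for k in reversed(range(len(stages))):
            grads[k] += err * acts[k]    # dL/dw_k, accumulated across micro-batches
            err *= stages[k]             # error term passed to the preceding stage
    # single parameter update per mini-batch, applied across all stages
    return [w - lr * g for w, g in zip(stages, grads)]

stages = [0.5, 1.5]                      # two pipeline stages
minibatch = [(1.0, 1.0), (2.0, 2.0)]     # two micro-batches
stages = train_minibatch(stages, minibatch)
```

Here the `err *= stages[k]` step plays the role of the error term backpropagated from a subsequent processing unit to the preceding one, and `grads` holds the accumulated gradients applied in a single update across all stages.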
While Huang teaches set blocks of gradients to be backpropagated (mini-batches) and stopping predictions at a specified magnitude threshold (see, e.g., Huang: pg. 7, section 5), it has not been relied upon for teaching stopping the backpropagation after a predetermined number of updates.

Clement teaches stopping the backpropagation after a predetermined number of updates [the transformer model is trained in a number of iterations/epochs, where the training dataset is partitioned into batches and each iteration includes forward propagation, loss calculation, and backpropagation steps (paras. 0068, 0076, etc.) using a set of hyperparameters, including a hyperparameter to determine the number of gradient accumulation steps to perform (para. 0070, 0077-78, etc.); and can be trained for a specified number of epochs (paras. 0092, 0099, etc.)].

Huang and Clement are analogous art, as they are within the same field of endeavor, namely training a transformer model for translation. Because both Huang and Clement teach systems/methods of iteratively training a transformer model for translation tasks, it would have been obvious to one of ordinary skill in the art, before the effective filing date of the claimed invention, to include using a set number of iterations/epochs for training the transformer model (including backpropagation) in each stage, as taught by Clement, for ending the iterative training of the transformer model in the system taught by Huang, to achieve the predictable result of ending training so that the trained model can be used for inference. 
Clement provides further motivation as [after training iterations the model can be checked for a desired performance until the goal is reached, and if not training can begin again (para. 0078, etc.); and otherwise fine-tuning a model can be performed for a small, set number of epochs to decrease computation cost (paras. 0092, 0094, 0099, etc.)]. Huang also provides motivation as [if the model is trained for too long, over too many iterations, the predictions become peaky and vulnerable to noise (pg. 7, section 5, "Trainability Challenges with Deep Models"; etc.)].

Examiner's Note: while Huang teaches that the layers of encoders and decoders may be assigned to various numbers of pipelined accelerators (see, e.g., Huang: pg. 3, fig. 2; pg. 4, table 1; etc.), dividing the processing units such that certain processing units appear in more than one subset and each subset comprises more than one processing unit would have been obvious to one of ordinary skill in the art, before the effective filing date of the claimed invention, since it has been held that rearranging parts of an invention involves only routine skill in the art. In re Japikse, 86 USPQ 70. It has also been held that mere duplication of the essential working parts of the invention (e.g., the accelerators in each grouping) involves only routine skill in the art. St. Regis Paper Co. v. Bemis Co., 193 USPQ 8.

As per claim 2, Huang/Clement teaches storing the set of parameters of the transformer model obtained from a processing unit of the plurality of processing units; and making predictions using the stored set of parameters [machine translation predictions are made with final, stored parameters of the trained model (Huang: pgs. 6-7, section 5; etc.); and where the model parameters are stored, either locally or remotely, and the model can be used for inference (Clement: paras. 0100-103, etc.)]. 
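The "interlocking subsets" limitation at issue (overlapping groups of sequentially arranged processing units, where some units appear in more than one subset and every subset holds more than one unit) can be sketched as a simple sliding-window grouping. This is an editorial illustration of the claim language with hypothetical unit names, not code from the application or the cited art.

```python
def interlocking_subsets(units, size=2, stride=1):
    """Group sequentially arranged processing units into overlapping
    ("interlocking") subsets: with stride < size, adjacent subsets share
    units, so certain units appear in more than one subset."""
    return [units[i:i + size]
            for i in range(0, len(units) - size + 1, stride)]

subsets = interlocking_subsets(["PU0", "PU1", "PU2", "PU3"], size=2, stride=1)
# each subset has more than one unit; PU1 and PU2 each appear in two subsets
```

With a stride smaller than the subset size, adjacent subsets share units, which is what makes the subsets "interlock"; a stride equal to the size would instead produce the disjoint partitions that Huang's accelerator pipeline describes.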
As per claim 3, Huang/Clement teaches a first processing unit that is associated with a first set of subsequent processing units; a second processing unit that is associated with a second set of subsequent processing units, the second processing unit following the first processing unit; and the first set and the second set of subsequent processing units comprising common processing units [the layers of the network are partitioned across the accelerators, which includes subsequent accelerators for each accelerator (Huang: pg. 3, fig. 2; etc.), where, e.g., the figure shows that Devices 1-3 are subsequent to Device 0 and Devices 2-3 are subsequent to Device 1].

As per claim 4, Huang/Clement teaches wherein length of idling time for each processing unit is associated with the boundaries [the idle time per accelerator is measured as part of the bubble overhead (Huang: pg. 4, section 2.3, etc.); where the boundary number is associated with the number of accelerators (see above)].

As per claim 5, Huang/Clement teaches wherein the transformer model contains one or more decoders, the one or more decoders each containing a plurality of neural network layers [the Transformer L includes L encoder and L decoder layers (Huang: pg. 2, fig. 1; pg. 4, table 1; etc.) and each encoder and decoder includes a number of layers (Clement: figs. 2-3, etc.)].

As per claim 6, Huang/Clement teaches wherein at least one of the plurality of processing units comprises one or more decoders [the Transformer L includes L encoder and L decoder layers partitioned among k accelerators (Huang: pg. 2, fig. 1; pg. 4, table 1; etc.); where table 1 includes at least one instance of an accelerator (processing unit) comprising one or more decoders]. 
As per claim 7, Huang/Clement teaches wherein an auxiliary network layer is generated to backpropagate losses to preceding processing units [gradients from all M micro-batches are accumulated and applied to update the model parameters across all accelerators in a backward pass (back-propagation) (Huang: pg. 3, fig. 2 and section 2.2; etc.) using a set of hyperparameters, including a hyperparameter to determine the number of gradient accumulation steps to perform before backpropagating the weight update gradients (Clement: para. 0070, 0077-78, etc.)].

As per claim 8, see the rejection of claim 1, above, wherein Huang/Clement also teaches a non-transitory computer-readable medium storing executable computer instructions that, when executed by one or more processors, cause the one or more processors to perform operations to generate trained transformer models, the instructions comprising instructions to: [perform the method] [the Transformer models are trained using Cloud TPUv3s with 16GB memory per accelerator core (Huang: pg. 4, section 3; etc.); which includes one or more processors performing instructions from memory].

As per claim 9, see the rejection of claim 2, above. As per claim 10, see the rejection of claim 3, above. As per claim 11, see the rejection of claim 4, above. As per claim 12, see the rejection of claim 5, above. As per claim 13, see the rejection of claim 6, above. As per claim 14, see the rejection of claim 7, above.

As per claim 15, see the rejection of claim 1, above, wherein Huang/Clement also teaches a computing system to generate trained transformer models, the system comprising: a processor; a non-transitory computer-readable storage medium storing instructions, the instructions when executed by the processor cause the processor to perform steps [of the method] [the Transformer models are trained using Cloud TPUv3s with 16GB memory per accelerator core (Huang: pg. 
4, section 3; etc.); which includes one or more processors performing instructions from memory].

As per claim 16, see the rejection of claim 2, above. As per claim 17, see the rejection of claim 3, above. As per claim 18, see the rejection of claim 4, above. As per claim 19, see the rejection of claim 5, above. As per claim 20, see the rejection of claim 6, above.

Response to Arguments

The objection to the title has been withdrawn due to the amendment filed. The rejections of claims 1-20 under 35 U.S.C. 112(b) have been updated due to the amendments filed.

Applicant's arguments, see the remarks, filed 21 August 2025, with respect to the rejections under 35 U.S.C. 101 have been fully considered and are persuasive in view of the amendments made to the independent claims, which provide a practical application of the recited judicial exception and provide a process that could not reasonably be performed in the human mind. The rejections of claims 1-20 under 35 U.S.C. 101 have been withdrawn.

Applicant's arguments filed 21 August 2025, with respect to the rejections under 35 U.S.C. 103 have been fully considered but they are not persuasive. Applicant argues that the cited art does not teach interlocking backpropagation including grouping processing units, which results in overlapping subsets, such that certain processing units appear in more than one subset and each subset comprises more than one processing unit, and that Huang only teaches partitioning GPUs between DNN layers. However, as described above, Huang teaches that Transformer-L includes L encoders and decoders, and is partitioned into a pipeline of k accelerators (pg. 4, Table 1; pg. 5, Table 2; etc.); the transformer models are trained with different numbers of pipelined accelerators (processing units), where the layers of the models are divided into different numbers of partitions of accelerators (pg. 2, section 1; pg. 4, Table 1; pg. 5, Table 2; etc.); and GPipe assumes that each accelerator can fit at least one layer (pg. 
8, section 6; etc.); where the division of encoders and decoders into layers and the layers into accelerators is based on boundaries of encoders or decoders in the transformer model across layers of the transformer model, where dividing the processing units such that certain processing units appear in more than one subset and each subset comprises more than one processing unit would have been obvious to one of ordinary skill in the art, before the effective filing date of the claimed invention, since it has been held that rearranging parts (overlapping subsets of the accelerators) of an invention involves only routine skill in the art. In re Japikse, 86 USPQ 70. It has also been held that mere duplication of the essential working parts of the invention (e.g., the accelerators in each grouping) involves only routine skill in the art. St. Regis Paper Co. v. Bemis Co., 193 USPQ 8.

Conclusion

The following is a summary of the treatment and status of all claims in the application as recommended by M.P.E.P. 707.07(i): claims 1-20 are rejected.

The prior art made of record and not relied upon is considered pertinent to applicant's disclosure:

Liu et al. (Very Deep Transformers for Neural Machine Translation, Oct 2020, pgs. 1-7 – cited in an IDS) – discloses initializing a deep transformer model for training.

Lepikhin et al. (GShard: Scaling Giant Models with Conditional Computation and Automatic Sharding, June 2020, pgs. 1-35) – discloses a set of API extensions, etc., for distributing models for parallel computation.

Harlap et al. (PipeDream: Fast and Efficient Pipeline Parallel DNN Training, June 2018, pgs. 1-14) – discloses parallel, pipelined execution of DNN layers for training, by partitioning layers across multiple units and performing load/work balancing.

Jo (US 2018/0129901) – discloses assigning an encoder and decoder to different cores based upon processing complexity, as well as assigning post-processing to the cores based on complexity/situation. 
Nurvitadhi (US 2018/0293691) – discloses allocating EUs across different combinations of layers of a convolutional neural network.

Lewis (US 2018/0314935) – discloses adaptive runtime layer logic that determines whether to use data parallelism (assigning data across cores/nodes), model parallelism (assigning layers to cores/nodes), or a combination of both.

Smith (US 2022/0141026) – discloses assigning layers between compute engines of a GPU and multiple cores of a CPU.

The examiner requests, in response to this Office action, that support be shown for language added to any original claims on amendment and any new claims. That is, indicate support for newly added claim language by specifically pointing to page(s) and line number(s) in the specification and/or drawing figure(s). This will assist the examiner in prosecuting the application.

When responding to this Office action, Applicant is advised to clearly point out the patentable novelty which he or she thinks the claims present, in view of the state of the art disclosed by the references cited or the objections made. He or she must also show how the amendments avoid such references or objections. See 37 CFR 1.111(c).

Applicant's amendment necessitated the new ground(s) of rejection presented in this Office action. Accordingly, THIS ACTION IS MADE FINAL. See MPEP § 706.07(a). Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a).

A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action. 
In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any nonprovisional extension fee (37 CFR 1.17(a)) pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action. In no event, however, will the statutory period for reply expire later than SIX MONTHS from the mailing date of this final action.

Any inquiry concerning this communication or earlier communications from the examiner should be directed to GEORGE GIROUX whose telephone number is (571) 272-9769. The examiner can normally be reached M-F 10am-6pm.

Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO-supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.

If attempts to reach the examiner by telephone are unsuccessful, the examiner's supervisor, Omar Fernandez Rivas, can be reached on 571-272-2589. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.

Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). 
If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.

/GEORGE GIROUX/
Primary Examiner, Art Unit 2128

Prosecution Timeline

Jan 26, 2022
Application Filed
Mar 18, 2025
Non-Final Rejection — §103, §112
Aug 21, 2025
Response Filed
Dec 13, 2025
Final Rejection — §103, §112 (current)

Precedent Cases

Applications granted by this same examiner with similar technology

Patent 12572807: Neural Network Methods for Defining System Topology
Granted Mar 10, 2026 (2y 5m to grant)

Patent 12572818: DEVICE AND METHOD FOR RANDOM WALK SIMULATION
Granted Mar 10, 2026 (2y 5m to grant)

Patent 12554986: WEIGHT QUANTIZATION IN NEURAL NETWORKS
Granted Feb 17, 2026 (2y 5m to grant)

Patent 12554983: MACHINE LEARNING-BASED SYSTEMS AND METHODS FOR IDENTIFYING AND RESOLVING CONTENT ANOMALIES IN A TARGET DIGITAL ARTIFACT
Granted Feb 17, 2026 (2y 5m to grant)

Patent 12541696: ENHANCED VALIDITY MODELING USING MACHINE-LEARNING TECHNIQUES
Granted Feb 03, 2026 (2y 5m to grant)
Study what changed to get past this examiner. Based on 5 most recent grants.


Prosecution Projections

Expected OA Rounds: 3-4
Grant Probability: 66%
With Interview: 93% (+27.1%)
Median Time to Grant: 4y 6m
PTA Risk: Moderate

Based on 612 resolved cases by this examiner. Grant probability derived from career allow rate.
