Prosecution Insights
Last updated: April 19, 2026
Application No. 17/585,380

INTERLOCKING BACKPROPAGATION FOR AUTOMATED TRAINING OF COMPUTER PREDICTIVE MODELS

Final Rejection — §103, §112
Filed: Jan 26, 2022
Examiner: GIROUX, GEORGE
Art Unit: 2128
Tech Center: 2100 — Computer Architecture & Software
Assignee: Cohere Inc.
OA Round: 2 (Final)
Grant Probability: 66% (Favorable)
Expected OA Rounds: 3-4
Time to Grant: 4y 6m
With Interview: 93%

Examiner Intelligence

Career Allow Rate: 66% (401 granted / 612 resolved; +10.5% vs TC avg) — above average
Interview Lift: +27.1% on resolved cases with interview — strong
Typical Timeline: 4y 6m average prosecution; 28 applications currently pending
Career History: 640 total applications across all art units

Statute-Specific Performance

§101: 11.0% (-29.0% vs TC avg)
§103: 45.5% (+5.5% vs TC avg)
§102: 16.0% (-24.0% vs TC avg)
§112: 15.5% (-24.5% vs TC avg)
Tech Center averages are estimates • Based on career data from 612 resolved cases

Office Action

Grounds of rejection: §103, §112
DETAILED ACTION

Notice of Pre-AIA or AIA Status

The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Response to Amendment

This Office Action is in response to applicant's communication filed 21 August 2025, in response to the Office Action mailed 21 March 2025. The applicant's remarks and any amendments to the claims or specification have been considered, with the results that follow.

Claim Rejections - 35 USC § 112

The following is a quotation of 35 U.S.C. 112(b):

(b) CONCLUSION.—The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the inventor or a joint inventor regards as the invention.

The following is a quotation of 35 U.S.C. 112 (pre-AIA), second paragraph:

The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the applicant regards as his invention.

Claims 1-20 are rejected under 35 U.S.C. 112(b) or 35 U.S.C. 112 (pre-AIA), second paragraph, as being indefinite for failing to particularly point out and distinctly claim the subject matter which the inventor or a joint inventor (or for applications subject to pre-AIA 35 U.S.C. 112, the applicant), regards as the invention.

Claim 1 recites the limitation "its own gradients" in line 2 of pg. 2. There is insufficient antecedent basis for this limitation in the claim. Claims 2-7 depend upon claim 1, and thus include the aforementioned limitation(s).

Claim 8 recites the limitation "its own gradients" in line 29. There is insufficient antecedent basis for this limitation in the claim. Claims 9-14 depend upon claim 8, and thus include the aforementioned limitation(s).

Claim 15 recites the limitation "its own gradients" in line 30. There is insufficient antecedent basis for this limitation in the claim. 
Claims 16-20 depend upon claim 15, and thus include the aforementioned limitation(s).

Claim Rejections - 35 USC § 103

In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis (i.e., changing from AIA to pre-AIA) for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.

This application currently names joint inventors. In considering patentability of the claims the examiner presumes that the subject matter of the various claims was commonly owned as of the effective filing date of the claimed invention(s) absent any evidence to the contrary. Applicant is advised of the obligation under 37 CFR 1.56 to point out the inventor and effective filing dates of each claim that was not commonly owned as of the effective filing date of the later invention in order for the examiner to consider the applicability of 35 U.S.C. 102(b)(2)(C) for any potential 35 U.S.C. 102(a)(2) prior art against the later invention.

The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:

A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

The factual inquiries for establishing a background for determining obviousness under 35 U.S.C. 103 are summarized as follows:

1. Determining the scope and contents of the prior art.
2. Ascertaining the differences between the prior art and the claims at issue.
3. Resolving the level of ordinary skill in the pertinent art.
4. Considering objective evidence present in the application indicating obviousness or nonobviousness.

Claim(s) 1-20 is/are rejected under 35 U.S.C. 103 as being unpatentable over Huang et al. (GPipe: Efficient Training of Giant Neural Networks using Pipeline Parallelism, Sept 2019, pgs. 1-10) in view of Clement (US 2021/0357762).

As per claim 1, Huang teaches a method for generating trained transformer models comprising: receiving input data, the input data containing a sequence of elements [Transformer models are trained on different training datasets with different input sizes (pgs. 4-5, section 3; pgs. 6-7, section 5; etc.)]; initializing a transformer model that comprises a plurality of neural network layers [the transformer model layers are initialized (pg. 7, section 5; etc.)]; training the transformer model by accessing and utilizing a plurality of processing units, wherein each processing unit comprises a contiguous group of functions within the transformer model [the transformer models are trained with different numbers of pipelined accelerators (processing units), where the layers of the models are divided into different numbers of partitions of accelerators (pg. 2, section 1; pg. 4, Table 1; pg. 5, Table 2; etc.); and GPipe assumes that each accelerator can fit at least one layer (pg. 8, section 6; etc.)], the training comprising: arranging the plurality of processing units sequentially [the transformer models are trained with different numbers of pipelined (sequentially arranged) accelerators (processing units), where the layers of the models are divided into different numbers of partitions of accelerators (pg. 2, section 1; pg. 4, Table 1; pg. 
5, Table 2; etc.)]; dividing the processing units into a plurality of interlocking subsets based on boundaries of encoders or decoders in the transformer model across layers of the transformer model [Transformer-L includes L encoders and decoders, and is partitioned into a pipeline of k accelerators (pg. 4, Table 1; pg. 5, Table 2; etc.) the transformer models are trained with different numbers of pipelined accelerators (processing units), where the layers of the models are divided into different numbers of partitions of accelerators (pg. 2, section 1; pg. 4, Table 1; pg. 5, Table 2; etc.); and GPipe assumes that each accelerator can fit at least one layer (pg. 8, section 6; etc.); where the division of encoders and decoders into layers and the layers into accelerators is based on boundaries of encoders or decoders in the transformer model across layers of the transformer model], such that certain processing units appear in more than one subset and each subset comprises more than one processing unit [Transformer-L includes L encoders and decoders, and is partitioned into a pipeline of k accelerators (pg. 4, Table 1; pg. 5, Table 2; etc.) the transformer models are trained with different numbers of pipelined accelerators (processing units), where the layers of the models are divided into different numbers of partitions of accelerators (pg. 2, section 1; pg. 4, Table 1; pg. 5, Table 2; etc.); further see Examiner’s Note, below]; and for each processing unit: repeatedly forward propagating one or more values calculated based on a set of parameters of the transformer model and based on one or more activation functions [Fk is the composite forward computation function of the k-th cell using forward computation function f with model parameters wi where, during the forward pass, GPipe first divides every mini-batch of size N into M equal micro-batches, which are pipelined through the K accelerators (pg. 3, fig. 2 and section 2.2; etc.) 
where each accelerator only stores output activations at partition boundaries (pg. 3, section 2.3; etc.)]; repeatedly backpropagating one or more error terms obtained from one or more loss functions [during the backward pass, gradients for each micro-batch are computed based on the same model parameters used for the forward pass. At the end of each mini-batch, gradients from all M micro-batches are accumulated and applied to update the model parameters across all accelerators, based on the calculated loss (pg. 3, fig. 2 and section 2.2; etc.)]; updating the set of parameters of the transformer model based on its own gradients and the one or more error terms backpropagated from a set of subsequent processing units [during the backward pass, gradients for each micro-batch are computed based on the same model parameters used for the forward pass. At the end of each mini-batch, gradients from all M micro-batches are accumulated and applied to update the model parameters across all accelerators, based on the calculated loss (pg. 3, fig. 2 and section 2.2; etc.), where fig. 2 shows the pipeline including subsequent accelerators (processing units)], wherein the set of subsequent processing units is determined based on the subsets [Transformer-L includes L encoders and decoders, and is partitioned into a pipeline of k accelerators (pg. 4, Table 1; pg. 5, Table 2; etc.) and the back-propagation function Bk of gradients depends on both Bk+1 from the upper layer and Fk (the composite forward computation (pg. 3, fig. 2; etc.); which determines how many accelerators each accelerator’s gradient is based on (for backpropagation), as does the number L of encoders/decoders and how many accelerators K they are partitioned among]; and outputting the trained transformer model [machine translation predictions are made with final, stored parameters of the trained model (pgs. 6-7, section 5; etc.)]. 
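The GPipe scheme mapped onto claim 1 above (each mini-batch split into M micro-batches, per-stage gradients accumulated during the backward pass, and one synchronized parameter update applied at the end of the mini-batch) can be illustrated with a toy sketch. This is an editorial illustration in which each pipeline stage is reduced to a single scalar weight; it is not code from the cited Huang reference.

```python
# Toy sketch of GPipe-style micro-batch gradient accumulation.
# Each pipeline "stage" is one scalar weight (y = w * x); real stages
# would be contiguous groups of transformer layers on separate accelerators.

def train_minibatch(stages, minibatch, lr=0.01):
    """Split a mini-batch into micro-batches, accumulate gradients per
    stage, and apply one synchronized update at the end of the mini-batch."""
    grads = [0.0] * len(stages)
    for x, target in minibatch:          # each (x, target) is one micro-batch
        # forward pass through the sequentially arranged stages
        acts = [x]
        for w in stages:
            acts.append(w * acts[-1])
        # squared-error loss; the backward pass propagates the error term
        err = 2.0 * (acts[-1] - target)
        for k in reversed(range(len(stages))):
            grads[k] += err * acts[k]    # dL/dw_k, accumulated across micro-batches
            err *= stages[k]             # error term passed to the preceding stage
    # single parameter update per mini-batch, applied across all stages
    return [w - lr * g for w, g in zip(stages, grads)]

stages = [0.5, 1.5]                      # two pipeline stages
minibatch = [(1.0, 1.0), (2.0, 2.0)]     # two micro-batches
stages = train_minibatch(stages, minibatch)
```

Here the `err *= stages[k]` step plays the role of the error term backpropagated from a subsequent processing unit to the preceding one, and `grads` holds the accumulated gradients applied in a single update across all stages.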
While Huang teaches set blocks of gradients to be backpropagated (mini-batches) and stopping predictions at a specified magnitude threshold (see, e.g., Huang: pg. 7, section 5), it has not been relied upon for teaching stopping the backpropagation after a predetermined number of updates.

Clement teaches stopping the backpropagation after a predetermined number of updates [the transformer model is trained in a number of iterations/epochs, where the training dataset is partitioned into batches and each iteration includes forward propagation, loss calculation, and backpropagation steps (paras. 0068, 0076, etc.) using a set of hyperparameters, including a hyperparameter to determine the number of gradient accumulation steps to perform (para. 0070, 0077-78, etc.); and can be trained for a specified number of epochs (paras. 0092, 0099, etc.)].

Huang and Clement are analogous art, as they are within the same field of endeavor, namely training a transformer model for translation. Because both Huang and Clement teach systems/methods of iteratively training a transformer model for translation tasks, it would have been obvious to one of ordinary skill in the art, before the effective filing date of the claimed invention, to include using a set number of iterations/epochs for training the transformer model (including backpropagation) in each stage, as taught by Clement, for ending the iterative training of the transformer model in the system taught by Huang, to achieve the predictable result of ending training so that the trained model can be used for inference. 
Clement provides further motivation as [after training iterations the model can be checked for a desired performance until the goal is reached, and if not training can begin again (para. 0078, etc.); and otherwise fine-tuning a model can be performed for a small, set number of epochs to decrease computation cost (paras. 0092, 0094, 0099, etc.)]. Huang also provides motivation as [if the model is trained for too long, over too many iterations, the predictions become peaky and vulnerable to noise (pg. 7, section 5, "Trainability Challenges with Deep Models"; etc.)].

Examiner's Note: while Huang teaches that the layers of encoders and decoders may be assigned to various numbers of pipelined accelerators (see, e.g., Huang: pg. 3, fig. 2; pg. 4, table 1; etc.), dividing the processing units such that certain processing units appear in more than one subset and each subset comprises more than one processing unit would have been obvious to one of ordinary skill in the art, before the effective filing date of the claimed invention, since it has been held that rearranging parts of an invention involves only routine skill in the art. In re Japikse, 86 USPQ 70. It has also been held that mere duplication of the essential working parts of the invention (e.g., the accelerators in each grouping) involves only routine skill in the art. St. Regis Paper Co. v. Bemis Co., 193 USPQ 8.

As per claim 2, Huang/Clement teaches storing the set of parameters of the transformer model obtained from a processing unit of the plurality of processing units; and making predictions using the stored set of parameters [machine translation predictions are made with final, stored parameters of the trained model (Huang: pgs. 6-7, section 5; etc.); and where the model parameters are stored, either locally or remotely, and the model can be used for inference (Clement: paras. 0100-103, etc.)]. 
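The "interlocking subsets" limitation at issue (overlapping groups of sequentially arranged processing units, where some units appear in more than one subset and every subset holds more than one unit) can be sketched as a simple sliding-window grouping. This is an editorial illustration of the claim language with hypothetical unit names, not code from the application or the cited art.

```python
def interlocking_subsets(units, size=2, stride=1):
    """Group sequentially arranged processing units into overlapping
    ("interlocking") subsets: with stride < size, adjacent subsets share
    units, so certain units appear in more than one subset."""
    return [units[i:i + size]
            for i in range(0, len(units) - size + 1, stride)]

subsets = interlocking_subsets(["PU0", "PU1", "PU2", "PU3"], size=2, stride=1)
# each subset has more than one unit; PU1 and PU2 each appear in two subsets
```

With a stride smaller than the subset size, adjacent subsets share units, which is what makes the subsets "interlock"; a stride equal to the size would instead produce the disjoint partitions that Huang's accelerator pipeline describes.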
As per claim 3, Huang/Clement teaches a first processing unit that is associated with a first set of subsequent processing units; a second processing unit that is associated with a second set of subsequent processing units, the second processing unit following the first processing unit; and the first set and the second set of subsequent processing units comprising common processing units [the layers of the network are partitioned across the accelerators, which includes subsequent accelerators for each accelerator (Huang: pg. 3, fig. 2; etc.), where, e.g., the figure shows that Devices 1-3 are subsequent to Device 0 and Devices 2-3 are subsequent to Device 1].

As per claim 4, Huang/Clement teaches wherein length of idling time for each processing unit is associated with the boundaries [the idle time per accelerator is measured as part of the bubble overhead (Huang: pg. 4, section 2.3, etc.); where the boundary number is associated with the number of accelerators (see above)].

As per claim 5, Huang/Clement teaches wherein the transformer model contains one or more decoders, the one or more decoders each containing a plurality of neural network layers [the Transformer L includes L encoder and L decoder layers (Huang: pg. 2, fig. 1; pg. 4, table 1; etc.) and each encoder and decoder includes a number of layers (Clement: figs. 2-3, etc.)].

As per claim 6, Huang/Clement teaches wherein at least one of the plurality of processing units comprises one or more decoders [the Transformer L includes L encoder and L decoder layers partitioned among k accelerators (Huang: pg. 2, fig. 1; pg. 4, table 1; etc.); where table 1 includes at least one instance of an accelerator (processing unit) comprising one or more decoders]. 
As per claim 7, Huang/Clement teaches wherein an auxiliary network layer is generated to backpropagate losses to preceding processing units [gradients from all M micro-batches are accumulated and applied to update the model parameters across all accelerators in a backward pass (back-propagation) (Huang: pg. 3, fig. 2 and section 2.2; etc.) using a set of hyperparameters, including a hyperparameter to determine the number of gradient accumulation steps to perform before backpropagating the weight update gradients (Clement: para. 0070, 0077-78, etc.)].

As per claim 8, see the rejection of claim 1, above, wherein Huang/Clement also teaches a non-transitory computer-readable medium storing executable computer instructions that, when executed by one or more processors, cause the one or more processors to perform operations to generate trained transformer models, the instructions comprising instructions to: [perform the method] [the Transformer models are trained using Cloud TPUv3s with 16GB memory per accelerator core (Huang: pg. 4, section 3; etc.); which includes one or more processors performing instructions from memory].

As per claim 9, see the rejection of claim 2, above. As per claim 10, see the rejection of claim 3, above. As per claim 11, see the rejection of claim 4, above. As per claim 12, see the rejection of claim 5, above. As per claim 13, see the rejection of claim 6, above. As per claim 14, see the rejection of claim 7, above.

As per claim 15, see the rejection of claim 1, above, wherein Huang/Clement also teaches a computing system to generate trained transformer models, the system comprising: a processor; a non-transitory computer-readable storage medium storing instructions, the instructions when executed by the processor cause the processor to perform steps [of the method] [the Transformer models are trained using Cloud TPUv3s with 16GB memory per accelerator core (Huang: pg. 
4, section 3; etc.); which includes one or more processors performing instructions from memory].

As per claim 16, see the rejection of claim 2, above. As per claim 17, see the rejection of claim 3, above. As per claim 18, see the rejection of claim 4, above. As per claim 19, see the rejection of claim 5, above. As per claim 20, see the rejection of claim 6, above.

Response to Arguments

The objection to the title has been withdrawn due to the amendment filed. The rejections of claims 1-20 under 35 U.S.C. 112(b) have been updated due to the amendments filed.

Applicant's arguments, see the remarks, filed 21 August 2025, with respect to the rejections under 35 U.S.C. 101 have been fully considered and are persuasive in view of the amendments made to the independent claims, which provide a practical application of the recited judicial exception and provide a process that could not reasonably be performed in the human mind. The rejections of claims 1-20 under 35 U.S.C. 101 have been withdrawn.

Applicant's arguments filed 21 August 2025, with respect to the rejections under 35 U.S.C. 103 have been fully considered but they are not persuasive. Applicant argues that the cited art does not teach interlocking backpropagation including grouping processing units, which results in overlapping subsets, such that certain processing units appear in more than one subset and each subset comprises more than one processing unit, and that Huang only teaches partitioning GPUs between DNN layers. However, as described above, Huang teaches that Transformer-L includes L encoders and decoders, and is partitioned into a pipeline of k accelerators (pg. 4, Table 1; pg. 5, Table 2; etc.); the transformer models are trained with different numbers of pipelined accelerators (processing units), where the layers of the models are divided into different numbers of partitions of accelerators (pg. 2, section 1; pg. 4, Table 1; pg. 5, Table 2; etc.); and GPipe assumes that each accelerator can fit at least one layer (pg. 
8, section 6; etc.); where the division of encoders and decoders into layers and the layers into accelerators is based on boundaries of encoders or decoders in the transformer model across layers of the transformer model, where dividing the processing units such that certain processing units appear in more than one subset and each subset comprises more than one processing unit would have been obvious to one of ordinary skill in the art, before the effective filing date of the claimed invention, since it has been held that rearranging parts (overlapping subsets of the accelerators) of an invention involves only routine skill in the art. In re Japikse, 86 USPQ 70. It has also been held that mere duplication of the essential working parts of the invention (e.g., the accelerators in each grouping) involves only routine skill in the art. St. Regis Paper Co. v. Bemis Co., 193 USPQ 8.

Conclusion

The following is a summary of the treatment and status of all claims in the application as recommended by M.P.E.P. 707.07(i): claims 1-20 are rejected.

The prior art made of record and not relied upon is considered pertinent to applicant's disclosure:

Liu et al. (Very Deep Transformers for Neural Machine Translation, Oct 2020, pgs. 1-7 – cited in an IDS) – discloses initializing a deep transformer model for training.

Lepikhin et al. (GShard: Scaling Giant Models with Conditional Computation and Automatic Sharding, June 2020, pgs. 1-35) – discloses a set of API extensions, etc., for distributing models for parallel computation.

Harlap et al. (PipeDream: Fast and Efficient Pipeline Parallel DNN Training, June 2018, pgs. 1-14) – discloses parallel, pipelined execution of DNN layers for training, by partitioning layers across multiple units and performing load/work balancing.

Jo (US 2018/0129901) – discloses assigning an encoder and decoder to different cores based upon processing complexity, as well as assigning post-processing to the cores based on complexity/situation. 
Nurvitadhi (US 2018/0293691) – discloses allocating EUs across different combinations of layers of a convolutional neural network.

Lewis (US 2018/0314935) – discloses adaptive runtime layer logic that determines whether to use data parallelism (assigning data across cores/nodes), model parallelism (assigning layers to cores/nodes), or a combination of both.

Smith (US 2022/0141026) – discloses assigning layers between compute engines of a GPU and multiple cores of a CPU.

The examiner requests, in response to this Office action, that support be shown for language added to any original claims on amendment and any new claims. That is, indicate support for newly added claim language by specifically pointing to page(s) and line number(s) in the specification and/or drawing figure(s). This will assist the examiner in prosecuting the application.

When responding to this Office action, Applicant is advised to clearly point out the patentable novelty which he or she thinks the claims present, in view of the state of the art disclosed by the references cited or the objections made. He or she must also show how the amendments avoid such references or objections. See 37 CFR 1.111(c).

Applicant's amendment necessitated the new ground(s) of rejection presented in this Office action. Accordingly, THIS ACTION IS MADE FINAL. See MPEP § 706.07(a). Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a).

A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action. 
In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any nonprovisional extension fee (37 CFR 1.17(a)) pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action. In no event, however, will the statutory period for reply expire later than SIX MONTHS from the mailing date of this final action.

Any inquiry concerning this communication or earlier communications from the examiner should be directed to GEORGE GIROUX whose telephone number is (571) 272-9769. The examiner can normally be reached M-F 10am-6pm.

Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO-supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.

If attempts to reach the examiner by telephone are unsuccessful, the examiner's supervisor, Omar Fernandez Rivas, can be reached on 571-272-2589. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.

Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). 
If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.

/GEORGE GIROUX/
Primary Examiner, Art Unit 2128

Prosecution Timeline

Jan 26, 2022
Application Filed
Mar 18, 2025
Non-Final Rejection — §103, §112
Aug 21, 2025
Response Filed
Dec 13, 2025
Final Rejection — §103, §112 (current)

Precedent Cases

Applications granted by this same examiner with similar technology

Patent 12572807: Neural Network Methods for Defining System Topology
Granted Mar 10, 2026 (2y 5m to grant)

Patent 12572818: DEVICE AND METHOD FOR RANDOM WALK SIMULATION
Granted Mar 10, 2026 (2y 5m to grant)

Patent 12554986: WEIGHT QUANTIZATION IN NEURAL NETWORKS
Granted Feb 17, 2026 (2y 5m to grant)

Patent 12554983: MACHINE LEARNING-BASED SYSTEMS AND METHODS FOR IDENTIFYING AND RESOLVING CONTENT ANOMALIES IN A TARGET DIGITAL ARTIFACT
Granted Feb 17, 2026 (2y 5m to grant)

Patent 12541696: ENHANCED VALIDITY MODELING USING MACHINE-LEARNING TECHNIQUES
Granted Feb 03, 2026 (2y 5m to grant)
Study what changed to get past this examiner. Based on 5 most recent grants.


Prosecution Projections

Expected OA Rounds: 3-4
Grant Probability: 66%
With Interview: 93% (+27.1%)
Median Time to Grant: 4y 6m
PTA Risk: Moderate

Based on 612 resolved cases by this examiner. Grant probability derived from career allow rate.
