Prosecution Insights
Last updated: April 18, 2026
Application No. 17/814,041

System and Method for Low Rank Training of Neural Networks

Status: Final Rejection (§103)
Filed: Jul 21, 2022
Examiner: WU, NICHOLAS S
Art Unit: 2148
Tech Center: 2100 — Computer Architecture & Software
Assignee: Cohere Inc.
OA Round: 2 (Final)
Grant Probability: 47% (Moderate)
Expected OA Rounds: 3-4
Time to Grant: 3y 9m
Grant Probability With Interview: 90%

Examiner Intelligence

Career Allow Rate: 47% of resolved cases (18 granted / 38 resolved; -7.6% vs TC avg)
Interview Lift: +43.1% for resolved cases with interview (strong)
Typical Timeline: 3y 9m average prosecution; 44 applications currently pending
Career History: 82 total applications across all art units

Statute-Specific Performance

§101: 26.7% (-13.3% vs TC avg)
§103: 52.6% (+12.6% vs TC avg)
§102: 3.1% (-36.9% vs TC avg)
§112: 17.4% (-22.6% vs TC avg)

Tech Center averages are estimates. Based on career data from 38 resolved cases.

Office Action

§103
DETAILED ACTION

Notice of Pre-AIA or AIA Status

The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Response to Arguments

Applicant's arguments filed 10/15/2025 have been fully considered but are not fully persuasive.

Regarding the §101 rejections, applicant's arguments and amendments to the independent claims are persuasive and overcome the previous §101 rejections. Specifically, the amended limitations to the independent claims provide a technical improvement because the amended claims now recite a specific two-stage training process that reduces model complexity by leveraging singular value decomposition for the weight matrices of the layers of a neural network. See pp. 3-5 of "Remarks":

"Amended claim 1 is directed to a method of training a neural network model including a two-phase training technique. As noted in the present application, the claimed subject matter provides an efficient training system for training the neural network models, both in terms of memory consumption and training time, while keeping the performance loss minimal. Please see paragraphs [0022]-[0025] of the present application… Applicant respectfully submits that while the claimed method may involve a mathematical concept or an abstract idea, the overall training method, claimed in the amended claims, involves highly technical steps that result in an efficient way of training neural networks, which improves memory usage and the time required for such training. This contributes to a technical improvement of systems that are used for training deep neural networks that have a plurality of layers. Accordingly, Applicant submits that the amended claims, when considered as a whole, do not recite an abstract idea… As explained above, the amended claims clearly recite a technical integration of a method to train neural networks having a plurality of layers in a more efficient manner, both in terms of memory usage and the time required to complete training. Training deep neural networks has traditionally been difficult to optimize without loss in performance, as conventional systems for large vision and language models are inefficient due to poorly understood training dynamics in deep non-linear networks. See, e.g., paragraphs [0003]-[0008] of the present application. The claimed two-phase training technique provides a unique technical solution to these challenges by stabilizing optimization during initial training and then continuing training in a structured manner. This approach reduces complexity, lowers resource consumption, and accelerates training while maintaining model accuracy, thereby improving the overall functioning of neural network training systems. In view of the above, Applicant submits that the amended claims 1, 13 and 20, when considered as a whole, provide a technical solution to a technical problem and thus clearly include elements that amount to significantly more than the abstract idea."

Applicant's amendments and corresponding arguments that the claimed invention provides a technical improvement to the field of neural networks are persuasive. Therefore, the §101 rejections are withdrawn.

Regarding the §103 rejections, applicant's arguments with respect to the prior art rejections have been fully considered but they are moot. Applicant has amended the claims to recite new combinations of limitations, and applicant's arguments are directed at the amendment. Please see below for new grounds of rejection, necessitated by amendment.

Claim Rejections - 35 USC § 103

The following is a quotation of 35 U.S.C. 103, which forms the basis for all obviousness rejections set forth in this Office action:

A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

Claims 1, 9-13, and 19-20 are rejected under 35 U.S.C. 103 as being unpatentable over Yang et al., non-patent literature "Learning Low-rank Deep Neural Networks via Singular Vector Orthogonality Regularization and Singular Value Sparsification" ("Yang"), in view of Lin et al., US Patent 10,157,343 B1 ("Lin"), further in view of Mayer et al., US Pre-Grant Publication 2021/0287089 A1 ("Mayer"), and Math Stack Exchange, non-patent literature "What does it mean/imply that all my singular values are ones?" ("MSE").

Regarding claim 1 and analogous claims 13 and 20, Yang discloses:

A method of training a neural network model including a plurality of layers, (Yang, abstract: "Modern deep neural networks (DNNs) [a method of training a neural network model including a plurality of layers] often require high memory consumption and large computational loads. In order to deploy DNN algorithms efficiently on edge or mobile devices, a series of DNN compression algorithms have been explored, including factorization methods.").

the method comprising: providing a training data set comprising a plurality of inputs and a corresponding plurality of expected results; (Yang, pg. 5 col. 1: "With the analysis above, we propose the overall objective function of the decomposed training as: L(U, s, V) = L_T(diag(sqrt(|s|)) V^T, U diag(sqrt(|s|))) + λ_o Σ_{l=1}^{D} L_o(U_l, V_l) + λ_s Σ_{l=1}^{D} L_s(s_l). (3) Here L_T is the training loss computed on the model with decomposed layers"; training with a loss is interpreted as having training inputs with corresponding expected results (i.e., providing a training data set comprising a plurality of inputs and a corresponding plurality of expected results).).

…decomposing, based on a singular value decomposition scheme, the weight matrix W of each layer into two singular vectors, U and V, that encode direction, and a diagonal matrix of singular values,… (Yang, pg. 3 col. 1: "In this work, we propose to train the neural network in its singular value decomposition form [decomposing, based on a singular value decomposition scheme], where each layer is decomposed into two consecutive layers with no additional operations in between. For a fully connected layer, the weight W is a 2-D matrix with dimension W ∈ R^{m×n}. Following the form of SVD, W can be directly decomposed into three variables U, V, s as U diag(s) V^T [the weight matrix W of each layer into two singular vectors, U and V, that encode direction, and a diagonal matrix of singular values], with dimension U ∈ R^{m×r}, V ∈ R^{n×r} and s ∈ R^r. Both U and V shall be unitary matrices.").

iteratively updating, …, the singular vectors U and V based on an error determined by comparing a prediction of the neural network model based on the input to the expected result corresponding to the input upon which the prediction is based; (Yang, pg. 3 col. 2: "During the SVD training, for each layer we use the variables from the decomposition, i.e., U, s, V [iteratively updating the singular vectors U and V], instead of the original kernel K or weight W as the trainable variables in the network. The forward pass will be executed by converting the U, s, V into a form of the two consecutive layers as demonstrated above, and the back propagation and optimization will be done directly with respect to the U, s, V of each layer"; backpropagation and optimization is interpreted as training with a loss function, and a loss function is interpreted as the difference between the predicted and expected predictions (i.e., based on an error determined by comparing (i) a prediction of the low rank neural network model based on the input to (ii) the expected result corresponding to the input upon which the prediction is based).).

While Yang teaches low-rank training using singular value decomposition, Yang does not explicitly teach:

training the neural network model for a predefined total number of training iterations, using the training data set and a two-phase training technique by: pre-training, in a first phase, the neural network model for a subset of the predefined total number of training iterations, the pre-training comprising updating a weight matrix W of each layer of the plurality of layers; when the pre-training is complete,… …wherein the singular value decomposition scheme is constrained by setting all singular values of the diagonal matrix to be one such that the decomposing depends only on the singular vectors;… …in a second phase and for the remainder of the predefined total number of training iterations… and storing the trained neural network model.

Lin teaches and storing the trained neural network model. (Lin, col. 4 lines 50-53: "User data files such as model representations (e.g., model representation 116) that are accessed, modified or created by the model importer 110 can be stored in repositories [storing the trained neural network model] that are accessible to the servers.").

Yang and Lin are both in the same field of endeavor (i.e., machine learning).
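As an aside, the decomposed-layer form Yang relies on (a fully connected weight W factored as U diag(s) V^T and carried as two consecutive layers) can be sketched in numpy. This is an illustrative sketch only; the matrix sizes, the retained rank, and all variable names are assumptions, not taken from Yang or the application.

```python
import numpy as np

# Sketch of the SVD form Yang describes: a fully connected layer's weight
# W (m x n) is decomposed as U diag(s) V^T, and the single layer is
# replaced by two consecutive linear layers with nothing in between.
rng = np.random.default_rng(0)
m, n, r = 8, 6, 4  # hypothetical layer sizes; r = retained rank

W = rng.standard_normal((m, n))
U, s, Vt = np.linalg.svd(W, full_matrices=False)

# Truncate to rank r and split sqrt(s) between the two layers, mirroring
# the diag(sqrt(|s|)) V^T and U diag(sqrt(|s|)) factors in Eq. (3).
U_r, s_r, Vt_r = U[:, :r], s[:r], Vt[:r, :]
layer_out = U_r * np.sqrt(s_r)            # shape (m, r)
layer_in = np.sqrt(s_r)[:, None] * Vt_r   # shape (r, n)

x = rng.standard_normal(n)
y_full = W @ x                  # original single layer
y_low = layer_out @ (layer_in @ x)  # two consecutive low-rank layers
# For r < min(m, n) the two outputs agree only approximately.
```

The split of sqrt(s) across the two factors keeps the pair balanced, which is one common way to realize the "two consecutive layers" Yang describes.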
It would have been obvious for a person having ordinary skill in the art before the effective filing date of the claimed invention to combine Yang and Lin to teach the above limitation(s). The motivation for doing so is that storing multiple models in a model repository allows for improved model selection for future tasks because the stored models can be used as base models to speed up training (cf. Lin, col. 3 lines 4-18).

While Yang in view of Lin teaches training a low-rank model using singular value decomposition and storing the model, the combination does not explicitly teach:

training the neural network model for a predefined total number of training iterations, using the training data set and a two-phase training technique by: pre-training, in a first phase, the neural network model for a subset of the predefined total number of training iterations, the pre-training comprising updating a weight matrix W of each layer of the plurality of layers; when the pre-training is complete,… …wherein the singular value decomposition scheme is constrained by setting all singular values of the diagonal matrix to be one such that the decomposing depends only on the singular vectors;… …in a second phase and for the remainder of the predefined total number of training iterations…

Mayer teaches: training the neural network model for a predefined total number of training iterations, using the training data set and a two-phase training technique by: pre-training, in a first phase, the neural network model for a subset of the predefined total number of training iterations, the pre-training comprising updating a weight matrix W of each layer of the plurality of layers; when the pre-training is complete,… (Mayer, ¶147: "After completion of the preliminary training schedule 1000, the total number of cycles and/or iterations performed during the preliminary training session [preliminary training is interpreted as pre-training in a first phase] can be used to define the number of epochs to be used during the subsequent or final training of the neural network model [when the pre-training is complete]. For example, the number of epochs used for the final training session can be equal to the number of cycles performed during the preliminary training session. Additionally or alternatively, the number of epochs can be defined as the number of cycles performed plus or minus some integer number, such as 1 or 2 [the neural network model for a subset of the predefined total number of training iterations]"; and Mayer, ¶80: "For example, the training processes can repeatedly take a small batch of data (e.g., a mini-batch of training data), calculate a difference between predictions and actuals, and adjust weights (e.g., parameters within a neural network that transform input data within each of the network's hidden layers) in the model by a small amount, layer by layer, to generate predictions closer to actual values [the pre-training comprising updating a weight matrix W of each layer of the plurality of layers].").

…in a second phase and for the remainder of the predefined total number of training iterations… (Mayer, ¶147: "After completion of the preliminary training schedule 1000, the total number of cycles and/or iterations performed during the preliminary training session can be used to define the number of epochs to be used during the subsequent or final training of the neural network model [in a second phase and for the remainder of the predefined total number of training iterations].").

Yang, in view of Lin, and Mayer are in the same field of endeavor (i.e., machine learning).
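The claimed two-phase schedule (pre-train the dense weight W for a subset of the total iterations, then decompose and continue training only the singular vectors) can be sketched on a toy single linear layer. This is a hypothetical illustration of the schedule the claims recite, not the applicant's or Mayer's implementation; the model, sizes, learning rate, and iteration counts are all invented.

```python
import numpy as np

# Toy two-phase schedule: phase 1 trains dense W; phase 2 decomposes W
# via SVD, drops the singular values (the claimed "all ones" constraint),
# and continues training only the factors U and V^T.
rng = np.random.default_rng(1)
n_in, n_out, total_iters, phase1_iters, lr = 6, 4, 200, 50, 0.05

X = rng.standard_normal((100, n_in))
W_true = rng.standard_normal((n_out, n_in))
Y = X @ W_true.T

# Phase 1: pre-train the full weight matrix W by gradient descent.
W = np.zeros((n_out, n_in))
for _ in range(phase1_iters):
    grad = (X @ W.T - Y).T @ X / len(X)  # dL/dW for 0.5 * mean sq. error
    W -= lr * grad

# Decompose; discarding s enforces "all singular values equal one", so
# the factorization depends only on the singular vectors U and V.
U, _, Vt = np.linalg.svd(W, full_matrices=False)

# Phase 2: continue for the remaining iterations, updating only U and Vt.
for _ in range(total_iters - phase1_iters):
    W_hat = U @ Vt
    grad = (X @ W_hat.T - Y).T @ X / len(X)
    U -= lr * grad @ Vt.T   # chain rule: dL/dU = dL/dW @ Vt^T
    Vt -= lr * U.T @ grad   # sequential (Gauss-Seidel style) update

W_final = U @ Vt
```

Dropping the diagonal factor after phase 1 is the part that matches the "constrained by setting all singular values... to be one" limitation; everything else is a generic training loop.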
It would have been obvious for a person having ordinary skill in the art before the effective filing date of the claimed invention to combine Yang, in view of Lin, and Mayer to teach the above limitation(s). The motivation for doing so is that using a pre-training schedule improves model performance by determining the amount of additional training needed (cf. Mayer, ¶6-7).

While Yang in view of Lin and Mayer teaches low-rank training using singular value decomposition with pre-training and model storing, the combination does not explicitly teach:

…wherein the singular value decomposition scheme is constrained by setting all singular values of the diagonal matrix to be one such that the decomposing depends only on the singular vectors;…

MSE teaches this limitation (MSE, pg. 1: "Recall that the singular values of a real matrix M are precisely the eigenvalues of the positive-semidefinite real matrix M^T M. Then the singular values of M are all 1 if and only if the eigenvalues of M^T M are all 1, if and only if M^T M = I, if and only if M is orthogonal. EDIT: As Vedran Šego points out, there is, of course, a more straightforward way: On the one hand, if the singular values of M are all 1, then any singular value decomposition M = U S V^T of M will have S = I [the singular value decomposition scheme is constrained by setting all singular values of the diagonal matrix to be one such that the decomposing depends only on the singular vectors], so that M = U V^T is a product of orthogonal matrices, and thus orthogonal. On the other hand, if M is orthogonal, then M^T M = I, so that the singular values of M are all 1.").

Yang, in view of Lin and Mayer, and MSE are in the same field of endeavor (i.e., singular value decomposition).
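The equivalence MSE states (all singular values equal to 1 if and only if the matrix is orthogonal) is easy to verify numerically. The sketch below assumes nothing beyond standard numpy; the sizes are arbitrary.

```python
import numpy as np

# A real square matrix has all singular values equal to 1 exactly when it
# is orthogonal (M^T M = I), so fixing the SVD's diagonal to ones leaves
# only the singular vectors, as the rejection's MSE citation explains.
rng = np.random.default_rng(2)

# Build an orthogonal matrix via QR decomposition.
M, _ = np.linalg.qr(rng.standard_normal((5, 5)))
sv = np.linalg.svd(M, compute_uv=False)
print(np.round(sv, 12))                  # expect all ones
print(np.allclose(M.T @ M, np.eye(5)))   # True: M is orthogonal

# Conversely, forcing S = I in any SVD U S V^T yields M2 = U V^T, a
# product of orthogonal matrices, hence itself orthogonal.
A = rng.standard_normal((5, 5))
U, _, Vt = np.linalg.svd(A)
M2 = U @ Vt
print(np.allclose(np.linalg.svd(M2, compute_uv=False), np.ones(5)))  # True
```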
It would have been obvious for a person having ordinary skill in the art before the effective filing date of the claimed invention to combine Yang, in view of Lin and Mayer, with MSE to teach the above limitation. The motivation for doing so is that a diagonal matrix of all ones ensures orthogonality (cf. MSE, pg. 1: "Then the singular values of M are all 1 if and only if the eigenvalues of M^T M are all 1, if and only if M^T M = I, if and only if M is orthogonal."). Note that Yang states that orthogonality improves the decomposed training process (cf. Yang, pg. 4 col. 1: "Beyond maintaining valid SVD form, the orthogonality regularization also bring additional benefit to the performance of the decomposed training process.").

Regarding claim 9, Yang in view of Lin, Mayer, and MSE teaches the method of claim 1. Lin further teaches further comprising transmitting the trained neural network model to a third party. (Lin, col. 5 lines 31-36: "FIG. 2 illustrates an example predictive modeling system 200. The system 200 includes one or more clients (clients 202, 204 and 206) that can communicate through one or more networks 106 with a collection of remote servers, such as servers deployed in a data center 108 or in different geographic locations"; the model repository communicating with multiple users at different locations is interpreted as sending a trained model to a third party.). It would have been obvious to one of ordinary skill in the art before the effective filing date of the present application to combine the teachings of Lin with the teachings of Yang, Mayer, and MSE for the same reasons given for claim 1.

Regarding claim 10, Yang in view of Lin, Mayer, and MSE teaches the method of claim 1. Lin further teaches further comprising receiving the training data from a third party. (Lin, col. 3 lines 4-8: "Particular embodiments of the subject matter described in this specification can be implemented so as to realize one or more of the following advantages. Predictive models can be trained in third party systems and imported for use in systems described herein [receiving the training data from a third party]."). It would have been obvious to one of ordinary skill in the art before the effective filing date of the present application to combine the teachings of Lin with the teachings of Yang, Mayer, and MSE for the same reasons given for claim 1.

Regarding claim 11 and analogous claim 19, Yang in view of Lin, Mayer, and MSE teaches the method of claim 1. Yang further teaches further comprising processing one or more new data points with the trained neural network model to generate a new prediction. (Yang, pg. 8 col. 2: "We further apply the proposed method [with the trained neural network model to generate a new prediction] to various depth of ResNet models on both CIFAR-10 and ImageNet dataset [processing one or more new data points], where we find our accuracy-#FLOPs tradeoff constantly stays above the Pareto frontier of previous methods, including both factorization and structural pruning methods.").

Regarding claim 12, Yang in view of Lin, Mayer, and MSE teaches the method of claim 1. Yang further teaches wherein the neural network model processes at least one of language or image data. (Yang, pg. 1 col. 1-2: "The booming development in deep learning models and applications has enabled beyond human performance in tasks like large-scale image classification [18, 9, 12, 13] [wherein the neural network model processes at least one of language or image data], object detection [27, 22, 8], and semantic segmentation [24, 3].").

Conclusion

The prior art made of record and not relied upon is considered pertinent to applicant's disclosure.
Bermeitinger et al., "Singular Value Decomposition and Neural Networks," discloses using singular value decomposition to compress neural networks.

Applicant's amendment necessitated the new ground(s) of rejection presented in this Office action. Accordingly, THIS ACTION IS MADE FINAL. See MPEP § 706.07(a). Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a).

A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action. In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any nonprovisional extension fee (37 CFR 1.17(a)) pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action. In no event, however, will the statutory period for reply expire later than SIX MONTHS from the mailing date of this final action.

Any inquiry concerning this communication or earlier communications from the examiner should be directed to NICHOLAS S WU, whose telephone number is (571) 270-0939. The examiner can normally be reached Monday - Friday, 8:00 am - 4:00 pm EST.

Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO-supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.

If attempts to reach the examiner by telephone are unsuccessful, the examiner's supervisor, Michelle Bechtold, can be reached at 571-431-0762. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.

Information regarding the status of published or unpublished applications may be obtained from Patent Center.
Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.

/N.S.W./ Examiner, Art Unit 2148
/MICHELLE T BECHTOLD/ Supervisory Patent Examiner, Art Unit 2148

Prosecution Timeline

Jul 21, 2022 — Application Filed
Jun 12, 2025 — Non-Final Rejection (§103)
Oct 15, 2025 — Response Filed
Mar 25, 2026 — Final Rejection (§103, current)

Precedent Cases

Applications granted by this same examiner with similar technology

Patent 12488244 — APPARATUS AND METHOD FOR DATA GENERATION FOR USER ENGAGEMENT (granted Dec 02, 2025; 2y 5m to grant)
Patent 12423576 — METHOD AND APPARATUS FOR UPDATING PARAMETER OF MULTI-TASK MODEL, AND STORAGE MEDIUM (granted Sep 23, 2025; 2y 5m to grant)
Patent 12361280 — METHOD AND DEVICE FOR TRAINING A MACHINE LEARNING ROUTINE FOR CONTROLLING A TECHNICAL SYSTEM (granted Jul 15, 2025; 2y 5m to grant)
Patent 12354017 — ALIGNING KNOWLEDGE GRAPHS USING SUBGRAPH TYPING (granted Jul 08, 2025; 2y 5m to grant)
Patent 12333425 — HYBRID GRAPH NEURAL NETWORK (granted Jun 17, 2025; 2y 5m to grant)

Study what changed to get past this examiner. Based on the 5 most recent grants.


Prosecution Projections

Expected OA Rounds: 3-4
Grant Probability: 47%
With Interview: 90% (+43.1%)
Median Time to Grant: 3y 9m
PTA Risk: Moderate

Based on 38 resolved cases by this examiner. Grant probability derived from career allow rate.
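The projection figures appear to compose as simple arithmetic on the examiner's career data. The sketch below shows our reading, assuming the dashboard adds the observed interview lift directly to the base allow rate; this is not a documented formula.

```python
# Reconstructing the headline numbers from the stats shown above
# (assumption: "with interview" = career allow rate + interview lift).
granted, resolved = 18, 38
allow_rate = granted / resolved               # ~0.474, shown as 47%
interview_lift = 0.431                        # +43.1% observed lift
with_interview = allow_rate + interview_lift  # ~0.905, shown as 90%
print(f"{allow_rate:.0%}, {with_interview:.0%}")  # 47%, 90%
```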
