Prosecution Insights
Last updated: April 19, 2026
Application No. 17/958,080

METHOD AND SYSTEM FOR TRAINING LARGE-SCALE LANGUAGE MODELS

Final Rejection: §101, §102, §103
Filed: Sep 30, 2022
Examiner: SUSSMAN MOSS, JACOB ZACHARY
Art Unit: 2122
Tech Center: 2100 — Computer Architecture & Software
Assignee: Huawei Technologies Co., Ltd.
OA Round: 2 (Final)
Grant Probability: 14% (At Risk)
OA Rounds: 3-4
To Grant: 3y 3m
With Interview: -6%

Examiner Intelligence

Career Allow Rate: 14% (grants only 14% of cases; 1 granted / 7 resolved; -40.7% vs TC avg)
Interview Lift: -20.0% (minimal lift; based on resolved cases with interview)
Avg Prosecution: 3y 3m (typical timeline; 26 currently pending)
Total Applications: 33 (career history; across all art units)

Statute-Specific Performance

§101: 37.3% (-2.7% vs TC avg)
§103: 35.2% (-4.8% vs TC avg)
§102: 11.9% (-28.1% vs TC avg)
§112: 15.5% (-24.5% vs TC avg)
Tech Center averages are estimates. Based on career data from 7 resolved cases.

Office Action

§101, §102, §103
DETAILED ACTION

Notice of Pre-AIA or AIA Status

The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA. This action is in response to amendments filed December 8, 2025, in which claims 1-9 and 11-20 have been amended. No claims have been cancelled or added. The amendments have been entered, and claims 1-20 are currently pending in the case. Claims 1, 14 and 20 are independent claims.

Claim Rejections - 35 USC § 101

35 U.S.C. 101 reads as follows: Whoever invents or discovers any new and useful process, machine, manufacture, or composition of matter, or any new and useful improvement thereof, may obtain a patent therefor, subject to the conditions and requirements of this title.

Claims 1-20 are rejected under 35 U.S.C. 101 because the claimed invention is directed to an abstract idea without significantly more.

Regarding claim 1: Step 1: Claim 1 is directed to [a] computer-implemented method; therefore it falls under the statutory category of a process. Step 2A Prong 1: The claim recites, in part: “determining a set of first weights based on a first matrix” This encompasses the mental determination of first weights based on an observed metric. “determining a set of second weights based on the set of first weights” This encompasses the mental determination of a second set of weights based on an observed first set of weights. “forming, based on the set of first weights and the set of second weights, a second matrix” This encompasses the mental formation of a second matrix based on an observed first and second set of weights. “the second matrix is obtained, during the forming, by modifying a dimension of the first matrix by performing at least one matrix transformation on the first matrix taken from the group consisting of: widening a row of the first matrix corresponding to the embedding layer, widening a row of the first matrix corresponding to the classifier layer, or widening a row of the first matrix corresponding to the transformer layers and adding a new row to a set of rows of the first matrix corresponding to the transformer layers” This encompasses the mental modification of an observed matrix by widening or adding rows to it. Further, this limitation is a mathematical concept. Step 2A Prong 2: The judicial exception is not integrated into a practical application; the remaining limitations of the claim are as follows: “model training, performed by a processing system”, “associated with a source model”, “associated with a target model”, “the source model and the target model both comprise: an embedding layer, a plurality of transformer layers, and a classifier layer” These limitations are additional elements that generally link the use of the judicial exception to a particular technological environment or field of use. See MPEP § 2106.05(h). “initializing the target model based on the second matrix”, “training the target model” These limitations are additional elements that amount to adding the words “apply it” (or an equivalent) with the judicial exception, or merely use a computer in its ordinary capacity as a tool to perform an existing process. See MPEP § 2106.05(f)(2). Step 2B: The additional elements, taken individually and in combination, do not provide an inventive concept of significantly more than the abstract idea itself for the reasons set forth in Step 2A Prong 2 above. Therefore, the claim is ineligible.
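Editor's note: for readers who want a concrete picture of the "widening a row of the first matrix" limitation discussed above, the following is a minimal, hypothetical numpy sketch of widening the rows of a weight matrix by duplicating a column. The values and the choice of which column to duplicate are illustrative assumptions only; the sketch is not taken from the application or the cited art.

    # Minimal sketch (not from the application): widening each row of a
    # small weight matrix from 2 to 3 entries by repeating the last column.
    import numpy as np

    W = np.array([[0.1, 0.2],
                  [0.3, 0.4]])                      # first matrix, 2 nodes wide

    W_wide = np.concatenate([W, W[:, -1:]], axis=1)  # second matrix, wider rows

    print(W_wide)
    # [[0.1 0.2 0.2]
    #  [0.3 0.4 0.4]]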
Regarding claim 2, the rejection of claim 1 is incorporated and further: Step 2A Prong 1: The claim recites, in part: “the first matrix comprises weights associated with connections between nodes in a current layer and nodes in an upper layer, and wherein determining a set of first weights based on a first matrix” a continuation of the abstract idea identified in the parent claim. “sampling weights associated with the nodes in the current layer among the weights in the first matrix” this encompasses the mental sampling of observed weights. “determining the set of first weights based on the sampled weights associated with one node among the nodes in the current layer” this encompasses the mental determination of a set of first weights based on observed sampled weights. “forming a first intermediate matrix based on the set of first weights and the first matrix, wherein the determining a set of second weights is based on the intermediate matrix” this encompasses the mental formation of an intermediate matrix based on observed data in order to determine a second set of weights. Step 2A Prong 2: The judicial exception is not integrated into a practical application; the remaining limitations of the claim are as follows: “in the source model”, “associated with a source model” these limitations are an additional element that generally links the use of the judicial exception to a particular technological environment or field of use. See MPEP §2106.05(h). Step 2B: The additional elements, taken individually and in combination, do not provide an inventive concept of significantly more than the abstract idea itself for the reasons set forth in step 2A prong 2 above. Therefore, the claim is ineligible. Regarding claim 3, the rejection of claim 2 is incorporated and further: Step 2A Prong 1: The claim recites, in part: “sampling weights associated with the nodes in the upper layer among the weights in the first intermediate matrix” this encompasses the mental sampling of observed weights associated with nodes. “determining the set of second weights based on the sampled weights associated with one node among the nodes in the upper layer” this encompasses the mental determination of second weights based on observed sampled weights. “forming the second matrix is based on the first intermediate matrix and the set of second weights” this encompasses the mental formation of a second matrix based on observed data. Step 2A Prong 2: The claim does not recite any additional limitations, thus does not further recite any additional elements that integrates the judicial exception into a practical application or amount to significantly more. Regarding claim 4, the rejection of claim 2 is incorporated and further: Step 2A Prong 1: a continuation of the abstract idea identified in the parent claim. Step 2A Prong 2: The judicial exception is not integrated into a practical application; the remaining limitations of the claim are as follows: “the current layer is comprised in a multi- head attention (MHA) module in a transformer network, and wherein the nodes in the current layer are neurons for multiple attention heads” the limitation is an additional element that generally links the use of the judicial exception to a particular technological environment or field of use. See MPEP §2106.05(h). Step 2B: The additional elements, taken individually and in combination, do not provide an inventive concept of significantly more than the abstract idea itself for the reasons set forth in step 2A prong 2 above. 
Therefore, the claim is ineligible. Regarding claim 5, the rejection of claim 2 is incorporated and further: Step 2A Prong 1: The claim recites, in part: “sampling weights associated with the nodes in the upper layer among the weights in the third matrix” this encompasses the mental sampling of weights associated with observed data. “determining a set of third weights based on the sampled weights associated with one node among the nodes in the upper layer” this encompasses the mental determination of a set of third weights based on observed data. “forming a second intermediate matrix based on the set of third weights and the third matrix” this encompasses the mental formation of a second intermediate matrix based on observed data. Step 2A Prong 2: The claim does not recite any additional limitations, thus does not further recite any additional elements that integrates the judicial exception into a practical application or amount to significantly more. Regarding claim 6, the rejection of claim 5 is incorporated and further: Step 2A Prong 1: The claim recites, in part: “sampling weights associated with the nodes in the third layer among the weights in the second intermediate matrix” this encompasses the mental sampling of observed weights associated with nodes. “determining the set of second weights based on the sampled weights associated with one node among the nodes in the third layer” this encompasses the mental determination of second weights based on observed sampled weights. “the forming, based on the second matrix based on the first intermediate matrix and the set of second weights, the second matrix is based on the first intermediate matrix and the set of second weights” this encompasses the mental formation of a second matrix based on observed data. Step 2A Prong 2: The judicial exception is not integrated into a practical application; the remaining limitations of the claim are as follows: Step 2B: The additional elements, taken individually and in combination, do not provide an inventive concept of significantly more than the abstract idea itself for the reasons set forth in step 2A prong 2 above. Therefore, the claim is ineligible. Regarding claim 7, the rejection of claim 1 is incorporated and further: Step 2A Prong 1: The claim recites, in part: “generating multiple copies of the second matrix by duplicating the second matrix multiple times” this encompasses the mental copying of an observed matrix. Step 2A Prong 2: The judicial exception is not integrated into a practical application; the remaining limitations of the claim are as follows: “initializing the target model using the multiple copies of the second matrix” the limitation is an additional element that amounts to adding the words “apply it” (or an equivalent) with the judicial exception, or merely uses a computer in its ordinary capacity as a tool to perform an existing process. See MPEP § 2106.05(f)(2). Step 2B: The additional elements, taken individually and in combination, do not provide an inventive concept of significantly more than the abstract idea itself for the reasons set forth in step 2A prong 2 above. Therefore, the claim is ineligible. 
Regarding claim 8, the rejection of claim 7 is incorporated and further: Step 2A Prong 1: The claim recites, in part: “obtaining the multiple copies of the second matrix having target dimensions of the target model by carrying out multiple iterations of a), b), c), and f)” this encompasses the mental creation of copies of observed matrices by repeating steps a), b), c), and f). Step 2A Prong 2: The claim does not recite any additional limitations, thus does not further recite any additional elements that integrates the judicial exception into a practical application or amount to significantly more. Regarding claim 9, the rejection of claim 1 is incorporated and further: Step 2A Prong 1: The claim recites, in part: “forming, by carrying out multiple iterations of a), b) and c),multiple instances of the a second matrix, each of the multiple instances of the second matrix being associated with one of each of the other modules” this encompasses the mental formation of a second matrix by carrying out steps a) through c). Step 2A Prong 2: The judicial exception is not integrated into a practical application; the remaining limitations of the claim are as follows: “the plurality of moduls in the source model” the limitation is an additional element that generally links the use of the judicial exception to a particular technological environment or field of use. See MPEP §2106.05(h). Step 2B: The additional elements, taken individually and in combination, do not provide an inventive concept of significantly more than the abstract idea itself for the reasons set forth in step 2A prong 2 above. Therefore, the claim is ineligible. Regarding claim 10, the rejection of claim 1 is incorporated and further: Step 2A Prong 1: a continuation of the abstract idea identified in the parent claim. Step 2A Prong 2: The judicial exception is not integrated into a practical application; the remaining limitations of the claim are as follows: “the trained target model is used as a second source model to initialize a second target model” the limitation is an additional element that amounts to adding the words “apply it” (or an equivalent) with the judicial exception, or merely uses a computer in its ordinary capacity as a tool to perform an existing process. See MPEP § 2106.05(f)(2). Step 2B: The additional elements, taken individually and in combination, do not provide an inventive concept of significantly more than the abstract idea itself for the reasons set forth in step 2A prong 2 above. Therefore, the claim is ineligible. Regarding claim 11, the rejection of claim 1 is incorporated and further: Step 2A Prong 1: The claim recites, in part: “determining a plurality of sub-models based on the target model, wherein the target model comprises a plurality of layers and each sub-model is used for updating a subset of layers among the plurality of layers in the target model” this encompasses the mental determination of a plurality of sub-models based on an observed target model. Step 2A Prong 2: The judicial exception is not integrated into a practical application; the remaining limitations of the claim are as follows: “updating the plurality of layers in the target model by training the plurality of sub-models”, “training the target model to update the plurality of layers thereof” the limitations are an additional element that amounts to adding the words “apply it” (or an equivalent) with the judicial exception, or merely uses a computer in its ordinary capacity as a tool to perform an existing process. 
See MPEP § 2106.05(f)(2). Step 2B: The additional elements, taken individually and in combination, do not provide an inventive concept of significantly more than the abstract idea itself for the reasons set forth in Step 2A Prong 2 above. Therefore, the claim is ineligible.

Regarding claim 12, the rejection of claim 11 is incorporated and further: Step 2A Prong 1: The claim recites, in part: “sampling the plurality of sub-models” This encompasses the mental sampling of observed sub-models. Step 2A Prong 2: The judicial exception is not integrated into a practical application; the remaining limitations of the claim are as follows: “training the sampled sub-model by using a training dataset”, “updating a corresponding subset of layers among the plurality of layers in the target model” These limitations are additional elements that amount to adding the words “apply it” (or an equivalent) with the judicial exception, or merely use a computer in its ordinary capacity as a tool to perform an existing process. See MPEP § 2106.05(f)(2). Step 2B: The additional elements, taken individually and in combination, do not provide an inventive concept of significantly more than the abstract idea itself for the reasons set forth in Step 2A Prong 2 above. Therefore, the claim is ineligible.

Regarding claim 13, the rejection of claim 12 is incorporated and further: Step 2A Prong 1: The claim recites, in part: “computing, by using all of the layers in the corresponding sub-model, a training loss based on the training dataset” This encompasses the mental computation of a training loss based on an observed dataset. Step 2A Prong 2: The judicial exception is not integrated into a practical application; the remaining limitations of the claim are as follows: “updating the subset of layers in the corresponding sub-model based on the training loss” This limitation is an additional element that amounts to adding the words “apply it” (or an equivalent) with the judicial exception, or merely uses a computer in its ordinary capacity as a tool to perform an existing process. See MPEP § 2106.05(f)(2). Step 2B: The additional elements, taken individually and in combination, do not provide an inventive concept of significantly more than the abstract idea itself for the reasons set forth in Step 2A Prong 2 above. Therefore, the claim is ineligible.

Regarding claim 14: Step 1: Claim 14 is directed to [a] system; therefore it falls under the statutory category of a machine. Step 2A Prong 1: The claim recites, in part: “determining a set of first weights based on a first matrix” This encompasses the mental determination of first weights based on an observed metric. “determining a set of second weights based on the set of first weights” This encompasses the mental determination of a second set of weights based on an observed first set of weights. “forming, based on the set of first weights and the set of second weights, a second matrix” This encompasses the mental formation of a second matrix based on an observed first and second set of weights.
“the second matrix is obtained, during the forming, by modifying a dimension of the first matrix by performing at least one matrix transformation on the first matrix taken from the group consisting of: widening a row of the first matrix corresponding to the embedding layer, widening a row of the first matrix corresponding to the classifier layer, or widening a row of the first matrix corresponding to the transformer layers and adding a new row to a set of rows of the first matrix corresponding to the transformer layers” This encompasses the mental modification of an observed matrix by widening or adding rows to it. Further, this limitation is a mathematical concept. Step 2A Prong 2: The judicial exception is not integrated into a practical application; the remaining limitations of the claim are as follows: “one or more processors”, “a non-transitory computer-readable medium, having computer-executable instructions stored thereon”, “associated with a source model”, “associated with a target model”, “the source model and the target model both comprise: an embedding layer, a plurality of transformer layers, and a classifier layer” These limitations are additional elements that generally link the use of the judicial exception to a particular technological environment or field of use. See MPEP § 2106.05(h). “the computer-executable instructions, when executed by one or more processors, causing the one or more processors to facilitate”, “initializing the target model based on the second matrix”, “training the target model” These limitations are additional elements that amount to adding the words “apply it” (or an equivalent) with the judicial exception, or merely use a computer in its ordinary capacity as a tool to perform an existing process. See MPEP § 2106.05(f)(2). Step 2B: The additional elements, taken individually and in combination, do not provide an inventive concept of significantly more than the abstract idea itself for the reasons set forth in Step 2A Prong 2 above. Therefore, the claim is ineligible.

Regarding claims 15-19: The rejection of claim 14 is incorporated, and the rejections of claims 2, 3, 5, 6 and 11 are applicable to claims 15-19, respectively.

Regarding claim 20: Step 1: Claim 20 is directed to [a] non-transitory computer-readable medium; therefore it falls under the statutory category of manufacture. Step 2A Prong 1: The claim recites, in part: “determining a set of first weights based on a first matrix” This encompasses the mental determination of first weights based on an observed metric. “determining a set of second weights based on the set of first weights” This encompasses the mental determination of a second set of weights based on an observed first set of weights. “forming, based on the set of first weights and the set of second weights, a second matrix” This encompasses the mental formation of a second matrix based on an observed first and second set of weights.
“the second matrix is obtained, during the forming, by modifying a dimension of the first matrix by performing at least one matrix transformation on the first matrix taken from the group consisting of: widening a row of the first matrix corresponding to the embedding layer, widening a row of the first matrix corresponding to the classifier layer, or widening a row of the first matrix corresponding to the transformer layers and adding a new row to a set of rows of the first matrix corresponding to the transformer layers” this encompasses the mental modification of an observed matrix by widening or adding rows to it. Further, this limitation is a mathematical concept. Step 2A Prong 2: The judicial exception is not integrated into a practical application; the remaining limitations of the claim are as follows: “having computer-executable instructions stored thereon, for model training”, “associated with a source model”, “associated with a target model”, “the source model and the target model both comprise: an embedding layer, a plurality of transformer layers, and a classifier layer” these limitations are an additional element that generally links the use of the judicial exception to a particular technological environment or field of use. See MPEP §2106.05(h). “the computer-executable instructions, when executed by one or more processors, causing the one or more processors to facilitate”, “initializing the target model based on the second matrix”, “training the target model” the limitations are an additional element that amounts to adding the words “apply it” (or an equivalent) with the judicial exception, or merely uses a computer in its ordinary capacity as a tool to perform an existing process. See MPEP § 2106.05(f)(2). Step 2B: The additional elements, taken individually and in combination, do not provide an inventive concept of significantly more than the abstract idea itself for the reasons set forth in step 2A prong 2 above. Therefore, the claim is ineligible. Claim Rejections - 35 USC § 103 The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action: A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made. Claims 1-3, 5-10, 14-18 and 20 are rejected under 35 U.S.C. 103 as being unpatentable over Chen et al. ("Net2Net: Accelerating Learning via Knowledge Transfer", Chen et al., 23 April 2016) (hereinafter “Chen”) in view of Chen et al. ("bert2BERT: Towards Reusable Pretrained Language Models", Chen et al., 14 Oct 2021) (hereinafter "Chen 2"). 
Regarding claim 1: Chen teaches [a] computer-implemented method for model training, performed by a processing system (Chen, page 6, section 3.2, ¶1 “To simplify the software for our experiments, we did not modify any component of the network other than the Inception modules.” The system is run via software which is computer implemented), comprising: a) determining a set of first weights based on a first matrix associated with a source model (Chen, page 3, Algorithm 1 “{W(i) |i = 1, 2, …n}, the weight matrix of teacher net” here, the weight matrix of the teacher net can be considered the first matrix associated with a source model, and weights W(i) can be considered the first weights); b) determining a set of second weights based on the set of first weights (Chen, page 3, ¶1 “Our strategy is to choose a new set of parameters θ ' for a student network g x ; θ ' such that ∀ x , f x ; θ = g x ; θ ' .” Here, the new set of parameters for a student network can be considered the second set of weights); c) forming, based on the set of first weights and the set of second weights, a second matrix associated with a target model (Chen, page 3, Algorithm 1 “{U(i) |i = 1, 2, … n}: the transformed weight matrix for wider net” here, U(i) the transformed weight matrix can be considered a second matrix and the wider net can be considered the target model it is associated with); d) initializing the target model based on the second matrix (Chen, page 1, section 1, ¶1 “Specifically, we initialize the student to be a neural network that represents the same function as the teacher, but using a different parameterization.”); and e) training the target model (Chen, page 1, section 1, ¶1 “After initializing the larger network to contain all of the knowledge previously acquired by the smaller network, the larger network may be trained to improve its performance.”). Chen does not teach "wherein the source model and the target model both comprise: an embedding layer, a plurality of transformer layers, and a classifier layer; and wherein the second matrix is obtained, during the forming, by modifying a dimension of the first matrix by performing at least one matrix transformation on the first matrix taken from the group consisting of: widening a row of the first matrix corresponding to the embedding layer, widening a row of the first matrix corresponding to the classifier layer, or widening a row of the first matrix corresponding to the transformer layers and adding a new row to a set of rows of the first matrix corresponding to the transformer layers." 
However, Chen 2 teaches wherein the source model and the target model both comprise (Chen, page 3, col 2, section 3.1, ¶1 “We aim to accelerate the pre-training of target model T(Lt,Dt) by transferring the knowledge of an existing pre-trained model S(Ls,Ds),where Ls|t means the numbers of Transformer layer and Ds|t means the model width(i.e., hidden size),satisfying Ls≤Lt and Ds≤Dt.”): an embedding layer, a plurality of transformer layers, and a classifier layer (Chen 2, page 2, col 2, section 2, ¶1 “Before presenting our method, we first introduce some details about the BERT architecture, consisting of one embedding layer and multiple Trans former (Vaswani et al., 2017) layers.” Furthermore, Chen 2, page 6, col 1, section 3.3.4, ¶1 “These sub structures are built with bottom Transformer layers of target model and share one classification head.”); and wherein the second matrix is obtained, during the forming, by modifying a dimension of the first matrix by performing at least one matrix transformation on the first matrix taken from the group consisting of: widening a row of the first matrix corresponding to the embedding layer (Chen 2, page 4, col 2, ¶2 “Specifically, for the embedding matrix UE, we only conduct the out-dimension expansion: PNG media_image1.png 46 335 media_image1.png Greyscale ”), widening a row of the first matrix corresponding to the classifier layer, or widening a row of the first matrix corresponding to the transformer layers (Chen 2, page 4, col 4, ¶4 “The out-dimension expansion is: PNG media_image2.png 120 590 media_image2.png Greyscale where a s | t mean that head numbers of source model and target model respectively.”) and adding a new row to a set of rows of the first matrix corresponding to the transformer layers. It is noted the claim recites alternative language, and Chen in view of Chen 2 teaches at least one of the alternatives. Chen and Chen 2 are analogous art because both references concern methods for model knowledge transfer. Accordingly, it would have been obvious to a person of ordinary skill in the art, before the effective filing date of the claimed invention, to modify Chen’s system to incorporate the widening taught by Chen 2. The motivation for doing so would have been to save computation using a model-agnostic method, as stated in Chen 2, page 2, col 1, ¶3 “The results show that: (1) our method can save a significant amount of computation in pre training compared to the traditional way of learning from scratch and progressive stacking methods such as StackBERT (Gong et al., 2019) and MSLT(Yanget al., 2020); (2) our method is model agnostic, which can be applied on a wide range of Transformer-based PLMs.” Regarding claim 2: Chen in view of Chen 2 teaches [t]he method according to claim 1, wherein the first matrix comprises weights associated with connections between nodes in a current layer and nodes in an upper layer in the source model, wherein the determining a set of first weights based on a first matrix comprises: sampling weights associated with the nodes in the current layer among the weights in the first matrix (Chen, page 4, ¶3 “Columns n+1 through q of U(i) are created by choosing a random as defined in g. 
The random selection is performed with replacement, so each column of W(i) is copied potentially many times.” Further, Chen, page 4, ¶1 “We will introduce a random mapping function g : {1, 2, … , q} → {1, 2, … , n}” here, the random mapping function can be considered sampling weights and W(i) is the first matrix); determining the set of first weights based on the sampled weights associated with one node among the nodes in the current layer (Chen, page 4, ¶6 “The random remapping for the multiplication parameters must match the random remapping for the weight matrix. Otherwise we could generate a new unit that uses the weight vector for pre-existing unit i but is scaled by the multiplication parameter for unit j. The new unit would not implement the same function as the old unit i or as the old unit j.” here the units can be considered the nodes that are associated with weights); and forming a first intermediate matrix and the first matrix, wherein determining a set of second weights based on the set of first weights is based on the intermediate matrix (Chen, page 4, ¶3 “Here, the first n columns of W(i) are copied directly into U(i). Columns n+1 through q of U(i) are created by choosing a random as defined in g. The random selection is performed with replacement, so each column of W(i) is copied potentially many times.” Here, each column of weights can be considered an intermediate weight matrix based on the first set of weights and used to determine the second). Regarding claim 3: Chen in view of Chen 2 teaches [t]he method according to claim 2, wherein the determining a set of second weights based on the set of first weights comprises: sampling weights associated with the nodes in the upper layer among the weights in the first intermediate matrix (Chen, page 4, ¶3 “Columns n+1 through q of U(i) are created by choosing a random as defined in g. The random selection is performed with replacement, so each column of W(i) is copied potentially many times. For weights in U(i+1), we must account for the replication by dividing the weight by replication factor given by 1 { x | g x = g ( j ) } , so all the units have the exactly the same value as the unit in the original net.” Here, U(i+1) can be considered an upper layer and the columns the intermediate matrix and the random selection the sampling); and determining the set of second weights based on the sampled weights associated with one node among the nodes in the upper layer (Chen, page 4, ¶6 “The random remapping for the multiplication parameters must match the random remapping for the weight matrix. Otherwise we could generate a new unit that uses the weight vector for pre-existing unit i but is scaled by the multiplication parameter for unit j. The new unit would not implement the same function as the old unit i or as the old unit j.” here, the units can be considered the nodes that the associated with weights), wherein the forming the second matrix is based on the first intermediate matrix and the set of second weights (Chen, page 4, ¶3 “Here, the first n columns of W(i) are copied directly into U(i). Columns n+1 through q of U(i) are created by choosing a random as defined in g. The random selection is performed with replacement, so each column of W(i) is copied potentially many times.” Here, the column of weights can be considered an intermediate weight matrix based on the first set of weights and used to determine the second weight matrix U(i)). 
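Editor's note: to make the Net2WiderNet passages quoted above easier to follow, here is a rough, hedged sketch of the random-mapping widening they describe: columns of W(i) are copied according to a mapping g, and the corresponding rows of W(i+1) are divided by the replication count so the widened network computes the same function. The function name, shapes, and the linear-layer check below are assumptions made for illustration; they are not lifted from Chen.

    import numpy as np

    def widen_layer(W_i, W_next, new_width, rng=np.random.default_rng(0)):
        # Rough Net2WiderNet-style widening sketch (illustrative assumptions only).
        # W_i:    (in_dim, n)  weights into the current layer
        # W_next: (n, out_dim) weights from the current layer to the layer above
        n = W_i.shape[1]
        # Mapping g: the first n units map to themselves, extra units copy a random unit.
        g = np.concatenate([np.arange(n), rng.integers(0, n, new_width - n)])
        U_i = W_i[:, g]                              # copy columns of W(i)
        counts = np.bincount(g, minlength=n)         # replication factor per original unit
        U_next = W_next[g, :] / counts[g][:, None]   # divide replicated rows of W(i+1)
        return U_i, U_next

    # Toy check that the composed function is preserved for a linear network.
    rng = np.random.default_rng(1)
    W1, W2 = rng.normal(size=(4, 3)), rng.normal(size=(3, 2))
    U1, U2 = widen_layer(W1, W2, new_width=5)
    x = rng.normal(size=(1, 4))
    print(np.allclose(x @ W1 @ W2, x @ U1 @ U2))     # True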
Regarding claim 5: Chen in view of Chen 2 teaches [t]he method according to claim 2, wherein a third matrix comprises weights associated with connections between the nodes in the upper layer and nodes in a third layer in the source model, wherein the third layer is the next layer of the upper layer (Chen, page 4, ¶4 “This general procedure is illustrated by Fig. 2. So far we have only discussed the use of a single random mapping function to expand one layer. We can in fact introduce a random mapping function g(i) for every non-output layer.” and further Chen, page 5, ¶3 “This operator can be applied arbitrarily many times; we can expand only one layer of the network, or we can expand all non-output layers.” Here, it is shown that the method can be repeated resulting in further i+1 layers, each of which can be considered an upper layer to the pervious i), and wherein the method comprises: sampling weights associated with the nodes in the upper layer among the weights in the third matrix (Chen, page 4, ¶3 “Columns n+1 through q of U(i) are created by choosing a random as defined in g. The random selection is performed with replacement, so each column of W(i) is copied potentially many times.” Further, Chen, page 4, ¶1 “We will introduce a random mapping function g : {1, 2, … , q} → {1, 2, … , n}” here, the random mapping function can be considered sampling weights and W(i) is the third matrix); determining a set of third weights based on the sampled weights associated with one node among the nodes in the upper layer (Chen, page 4, ¶6 “The random remapping for the multiplication parameters must match the random remapping for the weight matrix. Otherwise we could generate a new unit that uses the weight vector for pre-existing unit i but is scaled by the multiplication parameter for unit j. The new unit would not implement the same function as the old unit i or as the old unit j.” here the units can be considered the nodes that are associated with weights); and forming a second intermediate matrix based on the set of third weights and the third matrix (Chen, page 4, ¶3 “Here, the first n columns of W(i) are copied directly into U(i). Columns n+1 through q of U(i) are created by choosing a random as defined in g. The random selection is performed with replacement, so each column of W(i) is copied potentially many times.” Here, each column of weights can be considered an intermediate weight matrix based on the third set of weights and used to determine the third). Regarding claim 6: Chen in view of Chen 2 teaches [t]he method according to claim 5, wherein determining a set of second weights based on the set of first weights comprises: sampling weights associated with the nodes in the third layer among the weights in the second intermediate matrix (Chen, page 4, ¶3 “Columns n+1 through q of U(i) are created by choosing a random as defined in g. The random selection is performed with replacement, so each column of W(i) is copied potentially many times. 
For weights in U(i+1), we must account for the replication by dividing the weight by replication factor given by 1 { x | g x = g ( j ) } , so all the units have the exactly the same value as the unit in the original net.” Here, U(i+1) can be considered an upper layer and the columns the intermediate matrix and the random selection the sampling); and determining the set of second weights based on the sampled weights associated with one node among the nodes in the third layer (Chen, page 4, ¶6 “The random remapping for the multiplication parameters must match the random remapping for the weight matrix. Otherwise we could generate a new unit that uses the weight vector for pre-existing unit i but is scaled by the multiplication parameter for unit j. The new unit would not implement the same function as the old unit i or as the old unit j.” here the units can be considered the nodes that the associated with weights), wherein forming, based on the set of first weights and the set of second weights, a second matrix associated with a target model comprises forming the second matrix based on the first intermediate matrix and the set of second weights (Chen, page 4, ¶3 “Here, the first n columns of W(i) are copied directly into U(i). Columns n+1 through q of U(i) are created by choosing a random as defined in g. The random selection is performed with replacement, so each column of W(i) is copied potentially many times.” Here, the column of weights can be considered an intermediate weight matrix based on the first set of weights and used to determine the second weight matrix U(i)). Regarding claim 7: Chen in view of Chen 2 teaches [t]he method according to claim 1, comprising: f) generating multiple copies of the second matrix by duplicating the second matrix multiple times, wherein the initializing the target model based on the second matrix comprises initializing the target model using the multiple copies of the second matrix (Chen page 2, figure 1 “This description can be generalized to making multiple layers wider, with the layers composed as described by an arbitrary directed acyclic computation graph.” Here, by making multiple copies wider it can be considered to use multiple copies of the second matrix). Regarding claim 8: Chen in view of Chen 2 teaches The method according to claim 7, comprising: obtaining the multiple copies of the second matrix having target dimensions (Chen, page 6, ¶2 “However, Net2WiderNet may be composed with Net2DeeperNet, so we may in fact add any hidden layer that is at least as wide as the layer below it.” Here the width of the layer below can be considered the target dimension. The target dimension has been interpreted as a target dimension in light of the 112(b) rejection of claim 8.) of the target model by carrying out multiple iterations of a), b), c), and f) (Chen, page 4, figure 2 “This is a simple example intended to illustrate the conceptual idea. For a practical application, we would simultaneously replicate many randomly chosen units, and we would add a small amount of noise to break symmetry after the replication. We also typically widen many layers rather than just one layer, by recursively applying the Net2WiderNet operator.”). 
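Editor's note: the claim 7-9 mapping above leans on Chen's remark that the widening operator can be applied recursively to many layers. As a purely illustrative follow-on, reusing the hypothetical widen_layer() helper from the previous sketch (same assumptions, not Chen's actual implementation), iterating the operator layer by layer until each hidden layer reaches a target width might look like this:

    # Sketch only: recursively applying the widening operator across several layers.
    import numpy as np

    def widen_network(weights, target_widths):
        # weights: list of matrices [W1, ..., Wk]; widen each hidden layer in turn.
        weights = list(weights)
        for i, width in enumerate(target_widths):
            weights[i], weights[i + 1] = widen_layer(weights[i], weights[i + 1], width)
        return weights

    rng = np.random.default_rng(2)
    net = [rng.normal(size=(8, 4)), rng.normal(size=(4, 4)), rng.normal(size=(4, 2))]
    wider = widen_network(net, target_widths=[6, 6])   # both hidden layers: 4 -> 6
    x = rng.normal(size=(1, 8))
    print(np.allclose(x @ net[0] @ net[1] @ net[2],
                      x @ wider[0] @ wider[1] @ wider[2]))   # True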
Regarding claim 9: Chen in view of Chen 2 teaches [t]he method according to claim 1, wherein the first matrix is associated with one module among a plurality of modules in the source model, and wherein the method comprises; forming, by carrying out multiple iterations of a), b) and c), multiple instances of the second matrix, each of the multiple instances of the second matrix being associated with one of each of the other modules among the plurality of modules in the source model (Chen, page 4, figure 2 “This is a simple example intended to illustrate the conceptual idea. For a practical application, we would simultaneously replicate many randomly chosen units, and we would add a small amount of noise to break symmetry after the replication. We also typically widen many layers rather than just one layer, by recursively applying the Net2WiderNet operator.”). Regarding claim 10: Chen in view of Chen 2 teaches [t]he method according to claim 1, wherein the trained target model is used as a second source model to initialize a second target model (Chen, page 1, section 1, ¶4 “Net2Net operations accelerate these workflows by rapidly transferring knowledge from the previous best model into each new model that an experimenter proposes.”). Regarding claim 14: Chen teaches [a] system for model training, comprising: one or more processors; and a non-transitory computer-readable medium, having computer-executable instructions stored thereon, the computer-executable instructions, when executed by one or more processors, causing the one or more processors to facilitate (Chen, page 6, section 3.2, ¶1 “To simplify the software for our experiments, we did not modify any component of the network other than the Inception modules.” The system is run via software which is computer implemented and involves processors in communication with memory): a) determining a set of first weights based on a first matrix associated with a source model (Chen, page 3, Algorithm 1 “{W(i) |i = 1, 2, …n}, the weight matrix of teacher net” here, the weight matrix of the teacher net can be considered the first matrix associated with a source model, and weights W(i) can be considered the first weights); b) determining a set of second weights based on the set of first weights (Chen, page 3, ¶1 “Our strategy is to choose a new set of parameters θ ' for a student network g x ; θ ' such that ∀ x , f x ; θ = g x ; θ ' .” Here, the new set of parameters for a student network can be considered the second set of weights); c) forming, based on the set of first weights and the set of second weights, a second matrix associated with a target model (Chen, page 3, Algorithm 1 “{U(i) |i = 1, 2, … n}: the transformed weight matrix for wider net” here, U(i) the transformed weight matrix can be considered a second matrix and the wider net can be considered the target model it is associated with); d) initializing the target model based on the second matrix (Chen, page 1, section 1, ¶1 “Specifically, we initialize the student to be a neural network that represents the same function as the teacher, but using a different parameterization.”); and e) training the target model (Chen, page 1, section 1, ¶1 “After initializing the larger network to contain all of the knowledge previously acquired by the smaller network, the larger network may be trained to improve its performance.”) Chen does not teach "wherein the source model and the target model both comprise: an embedding layer, a plurality of transformer layers, and a classifier layer; and wherein the 
second matrix is obtained, during the forming, by modifying a dimension of the first matrix by performing at least one matrix transformation on the first matrix taken from the group consisting of: widening a row of the first matrix corresponding to the embedding layer, widening a row of the first matrix corresponding to the classifier layer, or widening a row of the first matrix corresponding to the transformer layers and adding a new row to a set of rows of the first matrix corresponding to the transformer layers." However, Chen 2 teaches wherein the source model and the target model both comprise (Chen, page 3, col 2, section 3.1, ¶1 “We aim to accelerate the pre-training of target model T(Lt,Dt) by transferring the knowledge of an existing pre-trained model S(Ls,Ds),where Ls|t means the numbers of Transformer layer and Ds|t means the model width(i.e., hidden size),satisfying Ls≤Lt and Ds≤Dt.”): an embedding layer, a plurality of transformer layers, and a classifier layer (Chen 2, page 2, col 2, section 2, ¶1 “Before presenting our method, we first introduce some details about the BERT architecture, consisting of one embedding layer and multiple Trans former (Vaswani et al., 2017) layers.” Furthermore, Chen 2, page 6, col 1, section 3.3.4, ¶1 “These sub structures are built with bottom Transformer layers of target model and share one classification head.”); and wherein the second matrix is obtained, during the forming, by modifying a dimension of the first matrix by performing at least one matrix transformation on the first matrix taken from the group consisting of: widening a row of the first matrix corresponding to the embedding layer (Chen 2, page 4, col 2, ¶2 “Specifically, for the embedding matrix UE, we only conduct the out-dimension expansion: PNG media_image1.png 46 335 media_image1.png Greyscale ”), widening a row of the first matrix corresponding to the classifier layer, or widening a row of the first matrix corresponding to the transformer layers (Chen 2, page 4, col 4, ¶4 “The out-dimension expansion is: PNG media_image2.png 120 590 media_image2.png Greyscale where a s | t mean that head numbers of source model and target model respectively.”) and adding a new row to a set of rows of the first matrix corresponding to the transformer layers. It is noted the claim recites alternative language, and Chen in view of Chen 2 teaches at least one of the alternatives. Chen and Chen 2 are analogous art because both references concern methods for model knowledge transfer. Accordingly, it would have been obvious to a person of ordinary skill in the art, before the effective filing date of the claimed invention, to modify Chen’s system to incorporate the widening taught by Chen 2. The motivation for doing so would have been to save computation using a model-agnostic method, as stated in Chen 2, page 2, col 1, ¶3 “The results show that: (1) our method can save a significant amount of computation in pre training compared to the traditional way of learning from scratch and progressive stacking methods such as StackBERT (Gong et al., 2019) and MSLT(Yanget al., 2020); (2) our method is model agnostic, which can be applied on a wide range of Transformer-based PLMs.” Regarding claims 15-18: Claims 15-18 are rejected based on the same rationale as analogous claims 2, 3, 5 and 6, respectively. 
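Editor's note: the embedding-expansion equations from Chen 2 are reproduced above only as image placeholders ("media_image1.png", "media_image2.png"), so the following is a hedged sketch of what an "out-dimension expansion" of an embedding matrix can look like: the hidden dimension is widened by reusing existing columns. The selection rule, sizes, and function name are assumptions for illustration and are not asserted to be Chen 2's exact construction.

    # Illustrative sketch only: out-dimension expansion of an embedding matrix.
    import numpy as np

    def expand_out_dim(E, target_dim, rng=np.random.default_rng(0)):
        # E: (vocab_size, d_source) embedding matrix; returns (vocab_size, target_dim).
        d_source = E.shape[1]
        extra = rng.integers(0, d_source, target_dim - d_source)   # columns to reuse
        return np.concatenate([E, E[:, extra]], axis=1)

    E_src = np.random.default_rng(3).normal(size=(30522, 128))     # hypothetical small embedding
    E_tgt = expand_out_dim(E_src, target_dim=256)
    print(E_tgt.shape)   # (30522, 256)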
Regarding claim 20: Chen teaches [a] non-transitory computer-readable medium, having computer-executable instructions stored thereon, for model training, the computer-executable instructions, when executed by one or more processors, causing the one or more processors to facilitate (Chen, page 6, section 3.2, ¶1 “To simplify the software for our experiments, we did not modify any component of the network other than the Inception modules.” The system is run via software which is computer implemented and involves processors in communication with memory): a) determining a set of first weights based on a first matrix associated with a source model (Chen, page 3, Algorithm 1 “{W(i) |i = 1, 2, …n}, the weight matrix of teacher net” here, the weight matrix of the teacher net can be considered the first matrix associated with a source model, and weights W(i) can be considered the first weights); b) determining a set of second weights based on the set of first weights (Chen, page 3, ¶1 “Our strategy is to choose a new set of parameters θ ' for a student network g x ; θ ' such that ∀ x , f x ; θ = g x ; θ ' .” Here, the new set of parameters for a student network can be considered the second set of weights); c) forming, based on the set of first weights and the set of second weights, a second matrix associated with a target model (Chen, page 3, Algorithm 1 “{U(i) |i = 1, 2, … n}: the transformed weight matrix for wider net” here, U(i) the transformed weight matrix can be considered a second matrix and the wider net can be considered the target model it is associated with); d) initializing the target model based on the second matrix (Chen, page 1, section 1, ¶1 “Specifically, we initialize the student to be a neural network that represents the same function as the teacher, but using a different parameterization.”); and e) training the target model (Chen, page 1, section 1, ¶1 “After initializing the larger network to contain all of the knowledge previously acquired by the smaller network, the larger network may be trained to improve its performance.”) Chen does not teach "wherein the source model and the target model both comprise: an embedding layer, a plurality of transformer layers, and a classifier layer; and wherein the second matrix is obtained, during the forming, by modifying a dimension of the first matrix by performing at least one matrix transformation on the first matrix taken from the group consisting of: widening a row of the first matrix corresponding to the embedding layer, widening a row of the first matrix corresponding to the classifier layer, or widening a row of the first matrix corresponding to the transformer layers and adding a new row to a set of rows of the first matrix corresponding to the transformer layers." 
However, Chen 2 teaches wherein the source model and the target model both comprise (Chen, page 3, col 2, section 3.1, ¶1 “We aim to accelerate the pre-training of target model T(Lt,Dt) by transferring the knowledge of an existing pre-trained model S(Ls,Ds),where Ls|t means the numbers of Transformer layer and Ds|t means the model width(i.e., hidden size),satisfying Ls≤Lt and Ds≤Dt.”): an embedding layer, a plurality of transformer layers, and a classifier layer (Chen 2, page 2, col 2, section 2, ¶1 “Before presenting our method, we first introduce some details about the BERT architecture, consisting of one embedding layer and multiple Trans former (Vaswani et al., 2017) layers.” Furthermore, Chen 2, page 6, col 1, section 3.3.4, ¶1 “These sub structures are built with bottom Transformer layers of target model and share one classification head.”); and wherein the second matrix is obtained, during the forming, by modifying a dimension of the first matrix by performing at least one matrix transformation on the first matrix taken from the group consisting of: widening a row of the first matrix corresponding to the embedding layer (Chen 2, page 4, col 2, ¶2 “Specifically, for the embedding matrix UE, we only conduct the out-dimension expansion: PNG media_image1.png 46 335 media_image1.png Greyscale ”), widening a row of the first matrix corresponding to the classifier layer, or widening a row of the first matrix corresponding to the transformer layers (Chen 2, page 4, col 4, ¶4 “The out-dimension expansion is: PNG media_image2.png 120 590 media_image2.png Greyscale where a s | t mean that head numbers of source model and target model respectively.”) and adding a new row to a set of rows of the first matrix corresponding to the transformer layers. It is noted the claim recites alternative language, and Chen in view of Chen 2 teaches at least one of the alternatives. Chen and Chen 2 are analogous art because both references concern methods for model knowledge transfer. Accordingly, it would have been obvious to a person of ordinary skill in the art, before the effective filing date of the claimed invention, to modify Chen’s system to incorporate the widening taught by Chen 2. The motivation for doing so would have been to save computation using a model-agnostic method, as stated in Chen 2, page 2, col 1, ¶3 “The results show that: (1) our method can save a significant amount of computation in pre training compared to the traditional way of learning from scratch and progressive stacking methods such as StackBERT (Gong et al., 2019) and MSLT(Yanget al., 2020); (2) our method is model agnostic, which can be applied on a wide range of Transformer-based PLMs.” Claim 4 is rejected under 35 U.S.C. 103 as being unpatentable over Chen in view of Chen 2 in view of Gong et al. ("Efficient Training of BERT by Progressively Stacking", Gong et al., 2019) (hereinafter “Gong”). 
Regarding claim 4: Chen teaches [t]he method according to claim 2 Chen does not teach “wherein the current layer is comprised in a multi-head attention (MHA) module in a transformer network, and wherein the nodes in the current layer are neurons for multiple attention heads.” However, Gong teaches wherein the current layer is comprised in a multi- head attention (MHA) module in a transformer network, and wherein the nodes in the current layer are neurons for multiple attention heads (Gong, page 2, col 2, section 3, ¶1 “The BERT (Bidirectional Encoder Representation from Transformers) model is developed on a multi-layer bidirectional Transformer (Vaswani et al., 2017) encoder. The architecture is shown in Figure 1. The encoder consists of L encoder layers, each of which consists of a multi-head self-attention sub-layer and a feed forward sub-layer: both of them have residual connections (He et al., 2015).”). Chen in view of Chen 2 and Gong are analogous art because both references concern methods for training machine learning models. Accordingly, it would have been obvious to a person of ordinary skill in the art, before the effective filing date of the claimed invention, to modify Chens’s training system to incorporate the multi-head self-attention sub-layer taught by Gong. The motivation for doing so would have been to accelerate training as stated in Gong, page 1, Abstract “By visualizing the self-attention distributions of different layers at different positions in a well-trained BERT model, we find that in most layers, the self-attention distribution will concentrate locally around its position and the start-of-sentence token. Motivated by this, we propose the stacking algorithm to transfer knowledge from a shallow model to a deep model; then we apply stacking progressively to accelerate BERT training.” Claims 11-13 and 19 are rejected under 35 U.S.C. 103 as being unpatentable over Chen in view of Chen 2 in further view of Srivastava ("Improving Neural Networks with Dropout", Srivastava, 2013). Regarding claim 11: Chen teaches method according to claim 1 Chen does not teach “wherein the training the target model comprises: determining a plurality of sub-models based on the target model, wherein the target model comprises a plurality of layers and each sub-model is used for updating a subset of layers among the plurality of layers in the target model; updating the plurality of layers in the target model by training the plurality of sub-models; and training the target model to update the plurality of layers thereof” However, Srivastava teaches wherein the training the target model comprises: determining a plurality of sub-models based on the target model, wherein the target model comprises a plurality of layers and each sub-model is used for updating a subset of layers among the plurality of layers in the target model (Srivastava, page 23, ¶2 “The central idea of dropout is to take a large model that overfits easily and repeatedly sample and train smaller sub-models from it.” Here, each smaller sub-model can be considered to have a subset of layers of the large model); updating the plurality of layers in the target model by training the plurality of sub-models (Srivastava, page 23, ¶2 “The central idea of dropout is to take a large model that overfits easily and repeatedly sample and train smaller sub-models from it. 
Since all the sub-models share parameters with the large model, this process trains the large model which is then used at test time.”); and training the target model to update the plurality of layers thereof (Srivastava, page 23, ¶2 “Since all the sub-models share parameters with the large model, this process trains the large model which is then used at test time.”). Chen in view of Chen 2 and Srivastava are analogous art because both references concern methods for training machine learning models. Accordingly, it would have been obvious to a person of ordinary skill in the art, before the effective filing date of the claimed invention, to modify Chen’s training system to incorporate the dropout taught by Srivastava. The motivation for doing so would have been to improve the performance of the neural network as stated in Srivastava, page 23, ¶1 “Dropout improves performance of neural nets in a wide variety of application domains including object classification, digit recognition, speech recognition, document classification and analysis of bio-medical data.”

Regarding claim 12: Chen in view of Srivastava teaches [t]he method according to claim 11, wherein the updating the plurality of layers in the target model by training the plurality of sub-models comprises: sampling the plurality of sub-models (Srivastava, page 18, ¶1 “The core idea behind dropout is to sample smaller sub-models from a large model, train them and then combine them at test time.”); training the sampled sub-model by using a training dataset (Srivastava, page 18, ¶1 “The core idea behind dropout is to sample smaller sub-models from a large model, train them and then combine them at test time.”); and updating a corresponding subset of layers among the plurality of layers in the target model (Srivastava, page 18, ¶1 “The core idea behind dropout is to sample smaller sub-models from a large model, train them and then combine them at test time.”). It would have been obvious to combine the teachings of Chen in view of Chen 2 and Srivastava for the reasons set forth in connection with claim 11 above.

Regarding claim 13: Chen in view of Srivastava teaches [t]he method according to claim 12, wherein each of the plurality of sub-models comprises all or part of the plurality of layers in the target model, wherein the subset of layers in a corresponding sub-model is a portion or all layers in the corresponding sub-model (Srivastava, page 23, ¶2 “The central idea of dropout is to take a large model that overfits easily and repeatedly sample and train smaller sub-models from it. Since all the sub-models share parameters with the large model, this process trains the large model which is then used at test time.” Here, the smaller model can be considered to have a portion of the layers); and wherein the training the sampled sub-model by using a training dataset comprises: computing, by using all of the layers in the corresponding sub-model, a training loss based on the training dataset; and updating the subset of layers in the corresponding sub-model based on the training loss (Srivastava, page 6, section 2.1, ¶2 “For learning, the derivatives of the loss function are backpropagated through the thinned network.” Here, the backpropagation using the loss can be considered an update.). It would have been obvious to combine the teachings of Chen in view of Chen 2 and Srivastava for the reasons set forth in connection with claim 11 above.

Regarding claim 19: Claim 19 is rejected based on the same rationale as analogous claim 11.
Response to Arguments

Applicant's arguments filed December 8th, 2025 (hereinafter "Remarks") have been fully considered but they are not persuasive.

Applicant's arguments regarding the 35 U.S.C. 112(b) rejections of the previous Office action have been fully considered and are persuasive. The rejections have been withdrawn due to the claim amendments.

Applicant's arguments regarding the 35 U.S.C. 112(d) rejections of the previous Office action have been fully considered and are persuasive. The rejections have been withdrawn due to the claim amendments.

Rejections under 35 U.S.C. § 101:

Argument 1: "the currently claimed invention is directed to a technological innovation addressing a technological problem in the art of generating trained large language models." (Remarks, page 12).

Examiner's Response: Examiner respectfully disagrees. The MPEP states "it is important to keep in mind that an improvement in the abstract idea itself (e.g. a recited fundamental economic concept) is not an improvement in technology." See MPEP § 2106.05(a)(II). An improvement to generating trained large language models may be an improvement to the abstract idea, but it is not an improvement in technology.

Argument 2: "The Office action, at page 4, asserts (without factual support) that each of Applicant's recited (computer implemented) operations falls under the category of a "mental process". However, such conclusion is implicitly based upon the essential premise that training a language model can be implemented by mental processes - which of course it cannot." (Remarks, page 13).

Examiner's Response: Examiner respectfully disagrees. The MPEP states "When performing the analysis at Step 2A Prong One, it is sufficient for the examiner to provide a reasoned rationale that identifies the judicial exception recited in the claim and explains why it is considered a judicial exception (e.g., that the claim limitation(s) falls within one of the abstract idea groupings). Therefore, there is no requirement for the examiner to rely on evidence, such as publications or an affidavit or declaration under 37 CFR 1.104(d)(2), to find that a claim recites a judicial exception." See MPEP § 2106.07(a)(III). The recitation that certain additional elements are performed by a computer amounts to adding the words "apply it" (or an equivalent) with the judicial exception, or merely uses a computer in its ordinary capacity as a tool to perform an existing process. See MPEP § 2106.05(f)(2).
Argument 3: "Applicant's disclosed/claimed invention itself is addressing the already extremely computer resource intensive process of training large language models. Such process itself is a monumental task when performed by computer systems. The task becomes insurmountable if carried out by "mental processes" (i.e., by human computations)." (Remarks, page 13).

Examiner's Response: Examiner respectfully disagrees. The MPEP states "(2) Whether the claim invokes computers or other machinery merely as a tool to perform an existing process. Use of a computer or other machinery in its ordinary capacity for economic or other tasks (e.g., to receive, store, or transmit data) or simply adding a general purpose computer or computer components after the fact to an abstract idea (e.g., a fundamental economic practice or mathematical equation) does not integrate a judicial exception into a practical application or provide significantly more. See Affinity Labs v. DirecTV, 838 F.3d 1253, 1262, 120 USPQ2d 1201, 1207 (Fed. Cir. 2016) (cellular telephone); TLI Communications LLC v. AV Automotive, LLC, 823 F.3d 607, 613, 118 USPQ2d 1744, 1748 (Fed. Cir. 2016) (computer server and telephone unit). Similarly, "claiming the improved speed or efficiency inherent with applying the abstract idea on a computer" does not integrate a judicial exception into a practical application or provide an inventive concept. Intellectual Ventures I LLC v. Capital One Bank (USA), 792 F.3d 1363, 1367, 115 USPQ2d 1636, 1639 (Fed. Cir. 2015). In contrast, a claim that purports to improve computer capabilities or to improve an existing technology may integrate a judicial exception into a practical application or provide significantly more. McRO, Inc. v. Bandai Namco Games Am. Inc., 837 F.3d 1299, 1314-15, 120 USPQ2d 1091, 1101-02 (Fed. Cir. 2016); Enfish, LLC v. Microsoft Corp., 822 F.3d 1327, 1335-36, 118 USPQ2d 1684, 1688-89 (Fed. Cir. 2016). See MPEP §§ 2106.04(d)(1) and 2106.05(a) for a discussion of improvements to the functioning of a computer or to another technology or technical field." See MPEP § 2106.05(f). While the calculations required might be intensive, as currently claimed, the claimed invention is directed to an abstract idea that can be performed in the human mind, or by a human using pen and paper.

Argument 4: "Applicant's claimed invention, per Step 2A, prong 2, is indeed integrated into a practical application (i.e., training an expanded/enhanced large language model) that encompasses ALL the recited operations, not merely the last operation of "training" the model generated in accordance with the combination of preceding (computer-implemented) operations." (Remarks, page 13).

Examiner's Response: Examiner respectfully disagrees. The recitations of computer-implemented operations are merely an additional element that amounts to adding the words "apply it" (or an equivalent) with the judicial exception, or merely uses a computer in its ordinary capacity as a tool to perform an existing process. See MPEP § 2106.05(f)(2). Further, the use of a model in the mental process amounts to generally linking the use of the judicial exception to a particular technological environment or field of use. See MPEP § 2106.05(h).

Rejections under 35 U.S.C. §§ 102/103:

Argument 5: "Moreover, the Office action does not establish a proper reason for one skilled in the art to modify Chen's teachings, in a way that would result in Applicant's claimed invention. Notably, Chen does not demonstrate any awareness of a particularized problem that would provide a reason to modify Chen in a way that resulted in Applicant's claimed invention." (Remarks, page 14).
Examiner's Response: Examiner respectfully disagrees. In response to applicant's argument that there is no teaching, suggestion, or motivation to combine the references, the examiner recognizes that obviousness may be established by combining or modifying the teachings of the prior art to produce the claimed invention where there is some teaching, suggestion, or motivation to do so found either in the references themselves or in the knowledge generally available to one of ordinary skill in the art. See In re Fine, 837 F.2d 1071, 5 USPQ2d 1596 (Fed. Cir. 1988); In re Jones, 958 F.2d 347, 21 USPQ2d 1941 (Fed. Cir. 1992); and KSR International Co. v. Teleflex Inc., 550 U.S. 398, 82 USPQ2d 1385 (2007).

Conclusion

Applicant's amendment necessitated the new ground(s) of rejection presented in this Office action. Accordingly, THIS ACTION IS MADE FINAL. See MPEP § 706.07(a). Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a).

A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action. In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any nonprovisional extension fee (37 CFR 1.17(a)) pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action. In no event, however, will the statutory period for reply expire later than SIX MONTHS from the mailing date of this final action.

Any inquiry concerning this communication or earlier communications from the examiner should be directed to JACOB Z SUSSMAN MOSS, whose telephone number is (571) 272-1579. The examiner can normally be reached Monday - Friday, 9 a.m. - 5 p.m. ET. Examiner interviews are available via telephone, in person, and video conferencing using a USPTO-supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice. If attempts to reach the examiner by telephone are unsuccessful, the examiner's supervisor, Kakali Chaki, can be reached at (571) 272-3719. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.

Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.

/J.S.M./
Examiner, Art Unit 2122

/KAKALI CHAKI/
Supervisory Patent Examiner, Art Unit 2122

Prosecution Timeline

Sep 30, 2022
Application Filed
Aug 29, 2025
Non-Final Rejection — §101, §102, §103
Dec 08, 2025
Response Filed
Mar 21, 2026
Final Rejection — §101, §102, §103 (current)


Prosecution Projections

3-4
Expected OA Rounds
14%
Grant Probability
-6%
With Interview (-20.0%)
3y 3m
Median Time to Grant
Moderate
PTA Risk
Based on 7 resolved cases by this examiner. Grant probability derived from career allow rate.
