Last updated: May 29, 2026
Application No. 18/060,341
System and Method for Training Language Models Using Already Trained Language Models

Final Rejection §103§112
Filed
Nov 30, 2022
Priority
Dec 03, 2021 — provisional 63/285,516
Examiner
WITHEY, THEODORE JOHN
Art Unit
2655
Tech Center
2600 — Communications
Assignee
Cohere Inc.
OA Round
4 (Final)
This examiner grants 44% of cases after interview

— +51.3% interview lift. A telephonic interview to clarify the technical implementation could significantly improve the outcome.
Based on 25 resolved cases, 2023–2026
Examiner Intelligence

WITHEY, THEODORE JOHN View full profile →
Grants 44% of resolved cases
Career Allowance Rate
11 granted / 25 resolved
-18.0% vs TC avg
Strong +51% interview lift
Without
With
+51.3%
Interview Lift
resolved cases with interview
Typical timeline
2y 11m
Avg Prosecution
17 currently pending
Career history
Total Applications
across all art units
Statute-Specific Performance

§101
0.6%
-39.4% vs TC avg
§103
99.4%
+59.4% vs TC avg
Black line = Tech Center average estimate • Based on career data from 25 resolved cases
Office Action

§103 §112
DETAILED ACTION
	This office action is in response to Applicant’s Amendment/Request for Reconsideration, received on 01/12/2026. Claims 1, 12, and 20 have been amended. Claims 1, 4, 6-12, 15, 17-26 are pending and have been considered.

Notice of Pre-AIA  or AIA  Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA .

Response to Arguments
Applicant’s arguments, see pgs. 7-9, filed 01/12/2026, with respect to the rejection(s) of claim(s) 1, 4, 6-12, 15, 17-26 under 35 U.S.C. 103 (rejected under the combination of Song in view of Luong) have been fully considered and are persuasive.  Therefore, the rejection has been withdrawn.  However, upon further consideration, a new ground(s) of rejection is made in view of Zhang et al. (US-20230229912-A1), hereinafter Zhang. Zhang discloses updating a neural network model based on a combined output from distinct, separate other neural network models (see Fig. 10) based on loss and self-attention. See updated rejections below.

Claim Rejections - 35 USC § 112
The following is a quotation of 35 U.S.C. 112(b):
(b)  CONCLUSION.—The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the inventor or a joint inventor regards as the invention.


The following is a quotation of 35 U.S.C. 112 (pre-AIA ), second paragraph:
The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the applicant regards as his invention.


Claims 1, 4, 6-12, 15, 17-26 are rejected under 35 U.S.C. 112(b) or 35 U.S.C. 112 (pre-AIA ), second paragraph, as being indefinite for failing to particularly point out and distinctly claim the subject matter which the inventor or a joint inventor (or for applications subject to pre-AIA  35 U.S.C. 112, the applicant), regards as the invention.
Claims 1, 12, 20 recites the limitation "the updated attention mechanism and loss mechanism being different from those used for training the first language model" (emphasis added to underlined portion).  There is insufficient antecedent basis for this limitation in the claim. There is no prior definition of the first language model being trained using attention and loss mechanisms; therefore, it is unclear to the examiner what “those” is referring to. Further, there is no previously recited training step for the first language model. It is unclear how obtaining a “trained first language model” results in a training operation from which attention and loss mechanisms are gathered to be updated and applied to the second language model. For further analysis of the claims, updated attention and loss mechanisms will be any attention and/or loss mechanisms present in a second language model, wherein an updated mechanism tracks to any generated value/metric representing loss/attention. Claims 4, 6-11, 15, 17-19, 21-26 are rejected as being dependent upon rejected base claims.

Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.


Claim(s) 1, 4, 6-12, 15 and 17-26 is/are rejected under 35 U.S.C. 103 as being unpatentable over Song et al. (US-20210224660-A1), hereinafter Song, in view of Luong et al. (US-20230015737-A1), hereinafter Luong, further in view of Zhang et al. (US-20230229912-A1), hereinafter Zhang.

	Regarding claim 1, Song discloses a computer-implemented ([Fig. 2B, Computing device 10]) method of training language models (Abstract, Knowledge distillation technique for training a student language model), the method comprising:
	obtaining, using a computing device ([Fig. 2A, “User Computing Device 102”, “Language Model(s) 120”], [Storing language models on a user device to communicate with other computing systems indicates the models are obtained for use in these systems]), a trained first language model ([0040] Although FIG. 1 illustrates simultaneous training of both the teacher language model and the student language model, each model can be individually trained within the illustrated scheme [Individual training indicates a first, i.e. teacher, model is obtained through this training]);
	using, by the computing device, the trained first language model to determine a set of weights of the first language model ([0020] In addition, instead of distilling solely on the teacher language model's final-layer outputs, in some implementations, the proposed techniques leverage layer-wise teacher language model parameters to directly optimize the parameters of the corresponding layers in the student language model [Using teacher model parameters, i.e. weights (see weight decay of [0060]), to adjust a student model indicates a required determination of weights for applying to the student]); and,
	initializing a second language model of the same size using the determined set of weights of the trained first language model ([0024] The student model's architecture can be the same as or different from the teacher model's architecture. A model's architecture can include the type, number, and dimensionality of layers, among other characteristics, [Wherein weights reasonably track to characteristics of layers, keeping the same number and/or dimensionality of layers indicates the same size between the first, i.e. teacher, and second, i.e. student, models]), 	the first and second language model being different model types ([0021] In the dual training technique, a teacher language model and a student language model have different vocabularies and incompatible tokenizations for the same sequence [Incompatibilities between different models for the same input indicates difference between the two models]), 
wherein the first language model is a generation model type and the second language model is a representational model type, or the first language model is a representation model and the second language model is a generation model ([0026] Example language models include ELMo, GPT, BERT, XLNet, and derivatives thereof. A language model can but is not required to include one or more transformer layers. Another example language model is a neural network (e.g., a recurrent neural network such as an LSTM network) [Song provides example models of both representational, i.e. BERT, and generational, i.e. unidirectional LSTM, and the inclusion of these indicates that any language model disclosed could be used for any language model defined in the instant application, i.e. generative or representational, regardless of order]).
Song does not disclose:
initializing the second language model by updating an attention mechanism and a loss mechanism for the second language model to adapt the determined set of weights.
Luong discloses:
initializing the second language model by updating an attention mechanism and a loss mechanism for the second language model to adapt the determined set of weights ([0047] one example approach trains two models (e.g., neural networks), a generator G 14 and a discriminator D 12. Each one can be or include, for example, an encoder… each encoder can be or include a Transformer network or other network that includes self-attention, [0071] a loss function can be backpropagated through the model(s) to update one or more parameters of the model(s) [Defining models, i.e. either of which can be representative of a second model, to be trained both on self-attention and loss indicates a combination of these techniques in not outside the scope of this disclosure. Further, backpropagation indicates an “initialization” through updated, i.e. adapted, mechanisms based on calculated loss and attention to update model parameters for reduced loss (see minimized combined loss learning objective between generator and discriminator of Luong, [0051])]).
Song and Luong are considered analogous art within language task training. Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the teachings of Song to incorporate the teachings of Luong, because of the novel way to maintain a high level of accuracy by keeping the same large number of parameters between two models (Luong, [0004]).
Song in view of Luong does not disclose:
the updated attention mechanism and loss mechanism being different from those used for training the first language model.
Zhang discloses:
the updated attention mechanism and loss mechanism being different from those used for training the first language model ([Fig. 10, Gradient Feedback and Parameter Update], [0104] The self-attention mechanism is an attention mechanism that occurs between elements in a source or between elements in a target, and may also be understood as an attention calculation mechanism in a special case of Target=Source. A specific calculation process of the self-attention mechanism is the same except that a calculation object changes, [0157] Gradient feedbackward and parameter update are performed on the second neural network model based on the target loss, [0160] To improve model processing precision of the third neural network model, the target loss may be constructed based on the outputs of the first neural network model and the third neural network model, and a model parameter of the second neural network model may be updated based on the target loss, [0165] A second target loss may be determined based on the third output and the fourth output, and the updated second neural network model may be updated based on the second target loss, to obtain a fourth neural network model, [Performing model parameter updates, wherein the models are neural networks containing self-attention, indicating applying self-attention and loss to models (an updated model parameter effectively makes that model distinct from its previous iteration of parameters) with differing parameters indicates the outputs from the self-attention and loss, i.e. reasonably understood to be mechanisms, e.g. for determining whether further training is required, are different for different updates of the model(s) in view of the plurality of models of Fig. 10 (the first/third neural networks correspond to a first language model and the second neural network corresponds to a second language model) in view of the model types of Luong both containing loss and attention mechanisms. Further, the loss mechanism for the third neural network is gathered from itself, the loss mechanism for the second neural network is a combination of the first and third losses, indicating two distinct loss mechanisms similar to two distinct self-attention mechanisms for distinct models.]).
 Song, Luong, and Zhang are considered analogous art within language task training. Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the teachings of Song in view of Luong to incorporate the teachings of Zhang, because of the novel way to improve model precision through updating parameters of a target model at a small granularity based upon a target loss gathered from training models, reducing required data processing precision at the target model (Zhang, [0161]).
Song further discloses:
	training the initialized second language model ([0033] train a general-purpose student language model, in some implementations, the teacher model's original training objective can be re-used to optimize the student model, [Re-optimization indicates an original optimization tracks to an initialization in view of the backpropagation for optimization of Luong]); and,
	applying, by the computing device, the second language model to perform an operation ([0090] At 408, the computing system can receive a student output generated by the student language model based on the second sub-word version of the natural language training input [Output generation tracks to an operation being performed]).

Regarding claim 4, Song in view of Luong, further in view of Zhang discloses: the method of claim 1.
Song further discloses:
wherein the second language model is trained further based on training samples relevant to the operation ([0031] The size of the teacher and/or student model can be selected by the user in furtherance of various design goals. Using the same WordPiece algorithm and training corpus as BERT, a student vocabulary of 4,928 WordPieces can be obtained and used for the student model [Selecting a model size based on design goals indicates the second, i.e. student, model is trained to a relevant size, WordPiece vocabulary tracks to relevant samples when the operation is obtaining optimal word embeddings as disclosed in the abstract of Song]).

Regarding claim 6, Song in view of Luong, further in view of Zhang discloses: the method of claim 1.
	Song further discloses:
	wherein the attention mechanism is one of a unidirectional attention mechanism or a bi-directional attention mechanism (As previously discussed, Song discloses both LSTM, i.e. unidirectional, and BERT, i.e. bi-directional, as potential models to be used in their implementation [0026]. Song further discloses an attention mechanism which takes parameters from a first model to update a second model [0020]. With unidirectional and bidirectional models being disclosed in addition to an attention mechanism, Song indicates both unidirectional and bidirectional attention mechanisms when the attention mechanism is applies to the models defined in Song).

Regarding claim 7, Song in view of Luong, further in view of Zhang discloses: the method of claim 1.
	Song further discloses:
	wherein the loss mechanism is one of an auto-regressive loss ([0063] Likelihood loss [Likelihood tracks to probability which is how applicant defines auto-regressive lost]), a masked token loss ([0050] One example final loss function includes, in addition to an optional projection loss, masked language modeling cross-entropy losses for the student as well as the teacher models), and a contrastive loss ([0063] Mean squared error [MSE tracks to distance and/or similarity between embeddings, i.e. teacher and student, which aligns with applicant’s definition], also see two-way sentence pair classification [0081]).

Regarding claim 8, Song in view of Luong, further in view of Zhang discloses: the method of claim 1.
	Song further discloses:
	wherein the operation is one of paragraph completion ([0081] next sentence prediction [Next sentence prediction tracks to being a next sentence to end a paragraph, i.e. completion]), text classification ([0081] two-way sentence pair classification), sentiment analysis ([0081] two-way sentence sentiment classification).
	Luong further discloses:
Wherein the operation is one of semantic textual similarity analysis ([0040] a second loss function 28 that evaluates a difference between the one or more replacement tokens 23a and 23b and the one or more tokens selected to serve as masked tokens [Evaluating a difference between selected and replacement tokens indicates the evaluation takes semantic similarity into account between those two words in the context of the sentence]) and question answering ([0057] In one example, a Q&A model can be trained).

Regarding claim 9, Song in view of Luong, further in view of Zhang discloses: the method of claim 1.
Song further discloses:
training the first language model ([0037] In some implementations of the present disclosure, during distillation and for a given training sequence input to the teacher model, the teacher and student vocabularies can be mixed [A training sequence input to a teacher, i.e. first, language model indicates the teacher model is being trained]), storing the first language model ([0054] the user computing device 102 can store or include one or more machine-learned models), and retrieving the first language model for use in initializing the second model ([0020] In addition, instead of distilling solely on the teacher language model's final-layer outputs, in some implementations, the proposed techniques leverage layer-wise teacher language model parameters to directly optimize the parameters of the corresponding layers in the student language model [Using parameters from one model to be sent to another model indicates retrieval of that model to gather the parameters]).

Regarding claim 10, Song in view of Luong, further in view of Zhang discloses: the method of claim 1.
Song further discloses:
transmitting the second language model to perform the operation ([Fig. 4, 408], [0090] At 408, the computing system can receive a student output generated by the student language model based on the second sub-word version of the natural language training input. For example, the output can be an output for any number of training tasks such as masked language modeling, two-way sentence sentiment classification, two-way sentence pair classification, next sentence prediction, and/or others [Performing an operation with a second, i.e. student, language model indicates the second language model was transmitted from somewhere to be used in this context]).

Regarding claim 11, Song in view of Luong, further in view of Zhang discloses: the method of claim 1.
Song further discloses:
storing the second language model ([0054] the user computing device 102 can store or include one or more machine-learned models, where, [0054] Machine-learned models 120 can include, for example, student models and/or teacher models); and,
retrieving the second language model for use in the operation ([Fig. 4, 404], [Generation of sub-word vectors based on a student vocabulary indicates the student model was retrieved to generate these embeddings]).

Regarding claim 12, Song discloses: a computing system ([Fig. 2B, Computing Device 10]) for training language models (Abstract, Knowledge distillation technique for training a student language model), the system comprising:
A processor ([0053] one or more processors 112);
A memory in communication with the processor ([0053] a memory 114, The memory 114 can store data 116 and instructions 118 which are executed by the processor 112), the memory comprising computer executable instructions ([0053] The memory 114 can include one or more non-transitory computer-readable storage mediums) that when executed by the processor cause the processor to:
obtain, using a computing device ([Fig. 2A, “User Computing Device 102”, “Language Model(s) 120”], [Storing language models on a user device to communicate with other computing systems indicates the models are obtained for use in these systems]), a trained first language model ([0040] Although FIG. 1 illustrates simultaneous training of both the teacher language model and the student language model, each model can be individually trained within the illustrated scheme [Individual training indicates a first, i.e. teacher, model is obtained through this training]);
	use, by the computing device, the trained first language model to determine a set of weights of the first language model ([0020] In addition, instead of distilling solely on the teacher language model's final-layer outputs, in some implementations, the proposed techniques leverage layer-wise teacher language model parameters to directly optimize the parameters of the corresponding layers in the student language model [Using teacher model parameters, i.e. weights (see weight decay of [0060]), to adjust a student model indicates a required determination of weights for applying to the student]); and,
	initialize a second language model of the same size using the determined set of weights of the trained first language model ([0024] The student model's architecture can be the same as or different from the teacher model's architecture. A model's architecture can include the type, number, and dimensionality of layers, among other characteristics, [Wherein weights reasonably track to characteristics of layers, keeping the same number and/or dimensionality of layers indicates the same size between the first, i.e. teacher, and second, i.e. student, models]), 	the first and second language model being different model types ([0021] In the dual training technique, a teacher language model and a student language model have different vocabularies and incompatible tokenizations for the same sequence [Incompatibilities between different models for the same input indicates difference between the two models]), 
wherein the first language model is a generation model type and the second language model is a representational model type, or the first language model is a representation model and the second language model is a generation model ([0026] Example language models include ELMo, GPT, BERT, XLNet, and derivatives thereof. A language model can but is not required to include one or more transformer layers. Another example language model is a neural network (e.g., a recurrent neural network such as an LSTM network) [Song provides example models of both representational, i.e. BERT, and generational, i.e. unidirectional LSTM, and the inclusion of these indicates that any language model disclosed could be used for any language model defined in the instant application, i.e. generative or representational, regardless of order]).
Song does not disclose:
initialize the second language model by updating an attention mechanism and a loss mechanism to adapt the determined set of weights for the second language model.
Luong discloses:
initialize the second language model by updating an attention mechanism and a loss mechanism to adapt the determined set of weights for the second language model ([0047] one example approach trains two models (e.g., neural networks), a generator G 14 and a discriminator D 12. Each one can be or include, for example, an encoder… each encoder can be or include a Transformer network or other network that includes self-attention, [0071] a loss function can be backpropagated through the model(s) to update one or more parameters of the model(s) [Defining models, i.e. either of which can be representative of a second model, to be trained both on self-attention and loss indicates a combination of these techniques in not outside the scope of this disclosure. Further, backpropagation indicates an “initialization” through updated, i.e. adapted, mechanisms based on calculated loss and attention to update model parameters for reduced loss (see minimized combined loss learning objective between generator and discriminator of Luong, [0051])]).
Song and Luong are considered analogous art within language task training. Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the teachings of Song to incorporate the teachings of Luong, because of the novel way to maintain a high level of accuracy by keeping the same large number of parameters between two models (Luong, [0004]).
Song in view of Luong does not disclose:
the updated attention mechanism and loss mechanism being different from those used for training the first language model.
Zhang discloses:
the updated attention mechanism and loss mechanism being different from those used for training the first language model ([Fig. 10, Gradient Feedback and Parameter Update], [0104] The self-attention mechanism is an attention mechanism that occurs between elements in a source or between elements in a target, and may also be understood as an attention calculation mechanism in a special case of Target=Source. A specific calculation process of the self-attention mechanism is the same except that a calculation object changes, [0157] Gradient feedbackward and parameter update are performed on the second neural network model based on the target loss, [0160] To improve model processing precision of the third neural network model, the target loss may be constructed based on the outputs of the first neural network model and the third neural network model, and a model parameter of the second neural network model may be updated based on the target loss, [0165] A second target loss may be determined based on the third output and the fourth output, and the updated second neural network model may be updated based on the second target loss, to obtain a fourth neural network model, [Performing model parameter updates, wherein the models are neural networks containing self-attention, indicating applying self-attention and loss to models (an updated model parameter effectively makes that model distinct from its previous iteration of parameters) with differing parameters indicates the outputs from the self-attention and loss, i.e. reasonably understood to be mechanisms, e.g. for determining whether further training is required, are different for different updates of the model(s) in view of the plurality of models of Fig. 10 (the first/third neural networks correspond to a first language model and the second neural network corresponds to a second language model) in view of the model types of Luong both containing loss and attention mechanisms. Further, the loss mechanism for the third neural network is gathered from itself, the loss mechanism for the second neural network is a combination of the first and third losses, indicating two distinct loss mechanisms similar to two distinct self-attention mechanisms for distinct models.]).
 Song, Luong, and Zhang are considered analogous art within language task training. Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the teachings of Song in view of Luong to incorporate the teachings of Zhang, because of the novel way to improve model precision through updating parameters of a target model at a small granularity based upon a target loss gathered from training models, reducing required data processing precision at the target model (Zhang, [0161]).
Song further discloses:
	train the initialized second language model ([0033] train a general-purpose student language model, in some implementations, the teacher model's original training objective can be re-used to optimize the student model, [Re-optimization indicates an original optimization tracks to an initialization in view of the backpropagation for optimization of Luong]); and,
	apply, by the computing device, the second language model to perform an operation ([0090] At 408, the computing system can receive a student output generated by the student language model based on the second sub-word version of the natural language training input [Output generation tracks to an operation being performed]).

Regarding claim 15, Song in view of Luong, further in view of Zhang discloses: the system of claim 12.
Song further discloses:
wherein the second language model is trained further based on training samples relevant to the operation ([0031] The size of the teacher and/or student model can be selected by the user in furtherance of various design goals. Using the same WordPiece algorithm and training corpus as BERT, a student vocabulary of 4,928 WordPieces can be obtained and used for the student model [Selecting a model size based on design goals indicates the second, i.e. student, model is trained to a relevant size, WordPiece vocabulary tracks to relevant samples when the operation is obtaining optimal word embeddings as disclosed in the abstract of Song]).

Regarding claim 17, Song in view of Luong, further in view of Zhang discloses: the system of claim 12.
	Song further discloses:
	wherein the attention mechanism is one of a unidirectional attention mechanism or a bi-directional attention mechanism (As previously discussed, Song discloses both LSTM, i.e. unidirectional, and BERT, i.e. bi-directional, as potential models to be used in their implementation [0026]. Song further discloses an attention mechanism which takes parameters from a first model to update a second model [0020]. With unidirectional and bidirectional models being disclosed in addition to an attention mechanism, Song indicates both unidirectional and bidirectional attention mechanisms when the attention mechanism is applies to the models defined in Song).

Regarding claim 18, Song in view of Luong, further in view of Zhang discloses: the system of claim 12.
	Song further discloses:
	wherein the loss mechanism is one of an auto-regressive loss ([0063] Likelihood loss [Likelihood tracks to probability which is how applicant defines auto-regressive lost]), a masked token loss ([0050] One example final loss function includes, in addition to an optional projection loss, masked language modeling cross-entropy losses for the student as well as the teacher models), and a contrastive loss ([0063] Mean squared error [MSE tracks to distance and/or similarity between embeddings, i.e. teacher and student, which aligns with applicant’s definition], also see two-way sentence pair classification [0081]).

Regarding claim 19, Song in view of Luong, further in view of Zhang discloses: the system of claim 12.
	Song further discloses:
	wherein the operation is one of paragraph completion ([0081] next sentence prediction [Next sentence prediction tracks to being a next sentence to end a paragraph, i.e. completion]), text classification ([0081] two-way sentence pair classification), sentiment analysis ([0081] two-way sentence sentiment classification).
	Luong further discloses:
Wherein the operation is one of semantic textual similarity analysis ([0040] a second loss function 28 that evaluates a difference between the one or more replacement tokens 23a and 23b and the one or more tokens selected to serve as masked tokens [Evaluating a difference between selected and replacement tokens indicates the evaluation takes semantic similarity into account between those two words in the context of the sentence]) and question answering ([0057] In one example, a Q&A model can be trained).

Regarding claim 20, Song discloses a non-transitory computer readable medium for training language models ([0062] one or more non-transitory computer-readable storage mediums, where, [0054] For example, the machine-learned models 120 can be or can otherwise include various machine-learned models such as neural networks [The neural networks defined in the application (recurrent, LSTM, convolutional, BERT, etc.) are used for language model training tasks]), the computer readable medium comprising computer executable instructions that, when executed by a processor of a computing system ([0053] instructions 118 which are executed by the processor 112), cause the computing system to: 
obtain, using a computing device ([Fig. 2A, “User Computing Device 102”, “Language Model(s) 120”], [Storing language models on a user device to communicate with other computing systems indicates the models are obtained for use in these systems]), a trained first language model ([0040] Although FIG. 1 illustrates simultaneous training of both the teacher language model and the student language model, each model can be individually trained within the illustrated scheme [Individual training indicates a first, i.e. teacher, model is obtained through this training]);
	use, by the computing device, the trained first language model to determine a set of weights of the first language model ([0020] In addition, instead of distilling solely on the teacher language model's final-layer outputs, in some implementations, the proposed techniques leverage layer-wise teacher language model parameters to directly optimize the parameters of the corresponding layers in the student language model [Using teacher model parameters, i.e. weights (see weight decay of [0060]), to adjust a student model indicates a required determination of weights for applying to the student]); and,
	initialize a second language model of the same size using the determined set of weights of the trained first language model ([0024] The student model's architecture can be the same as or different from the teacher model's architecture. A model's architecture can include the type, number, and dimensionality of layers, among other characteristics, [Wherein weights reasonably track to characteristics of layers, keeping the same number and/or dimensionality of layers indicates the same size between the first, i.e. teacher, and second, i.e. student, models]), 	the first and second language model being different model types ([0021] In the dual training technique, a teacher language model and a student language model have different vocabularies and incompatible tokenizations for the same sequence [Incompatibilities between different models for the same input indicates difference between the two models]), 
wherein the first language model is a generation model type and the second language model is a representational model type, or the first language model is a representation model and the second language model is a generation model ([0026] Example language models include ELMo, GPT, BERT, XLNet, and derivatives thereof. A language model can but is not required to include one or more transformer layers. Another example language model is a neural network (e.g., a recurrent neural network such as an LSTM network) [Song provides example models of both representational, i.e. BERT, and generational, i.e. unidirectional LSTM, and the inclusion of these indicates that any language model disclosed could be used for any language model defined in the instant application, i.e. generative or representational, regardless of order]).
Song does not disclose:
initialize the second language model by updating an attention mechanism and a loss mechanism to adapt the determined set of weights for the second language model.
Luong discloses:
initialize the second language model by updating an attention mechanism and a loss mechanism to adapt the determined set of weights for the second language model ([0047] one example approach trains two models (e.g., neural networks), a generator G 14 and a discriminator D 12. Each one can be or include, for example, an encoder… each encoder can be or include a Transformer network or other network that includes self-attention, [0071] a loss function can be backpropagated through the model(s) to update one or more parameters of the model(s) [Defining models, i.e. either of which can be representative of a second model, to be trained both on self-attention and loss indicates a combination of these techniques in not outside the scope of this disclosure. Further, backpropagation indicates an “initialization” through updated, i.e. adapted, mechanisms based on calculated loss and attention to update model parameters for reduced loss (see minimized combined loss learning objective between generator and discriminator of Luong, [0051])]).
Song and Luong are considered analogous art within language task training. Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the teachings of Song to incorporate the teachings of Luong, because of the novel way to maintain a high level of accuracy by keeping the same large number of parameters between two models (Luong, [0004]).
Song in view of Luong does not disclose:
the updated attention mechanism and loss mechanism being different from those used for training the first language model.
Zhang discloses:
the updated attention mechanism and loss mechanism being different from those used for training the first language model ([Fig. 10, Gradient Feedback and Parameter Update], [0104] The self-attention mechanism is an attention mechanism that occurs between elements in a source or between elements in a target, and may also be understood as an attention calculation mechanism in a special case of Target=Source. A specific calculation process of the self-attention mechanism is the same except that a calculation object changes, [0157] Gradient feedbackward and parameter update are performed on the second neural network model based on the target loss, [0160] To improve model processing precision of the third neural network model, the target loss may be constructed based on the outputs of the first neural network model and the third neural network model, and a model parameter of the second neural network model may be updated based on the target loss, [0165] A second target loss may be determined based on the third output and the fourth output, and the updated second neural network model may be updated based on the second target loss, to obtain a fourth neural network model, [Performing model parameter updates, wherein the models are neural networks containing self-attention, indicating applying self-attention and loss to models (an updated model parameter effectively makes that model distinct from its previous iteration of parameters) with differing parameters indicates the outputs from the self-attention and loss, i.e. reasonably understood to be mechanisms, e.g. for determining whether further training is required, are different for different updates of the model(s) in view of the plurality of models of Fig. 10 (the first/third neural networks correspond to a first language model and the second neural network corresponds to a second language model) in view of the model types of Luong both containing loss and attention mechanisms. Further, the loss mechanism for the third neural network is gathered from itself, the loss mechanism for the second neural network is a combination of the first and third losses, indicating two distinct loss mechanisms similar to two distinct self-attention mechanisms for distinct models.]).
 Song, Luong, and Zhang are considered analogous art within language task training. Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the teachings of Song in view of Luong to incorporate the teachings of Zhang, because of the novel way to improve model precision through updating parameters of a target model at a small granularity based upon a target loss gathered from training models, reducing required data processing precision at the target model (Zhang, [0161]).
Song further discloses:
	train the initialized second language model ([0033] train a general-purpose student language model, in some implementations, the teacher model's original training objective can be re-used to optimize the student model, [Re-optimization indicates an original optimization tracks to an initialization in view of the backpropagation for optimization of Luong]); and,
	apply, by the computing device, the second language model to perform an operation ([0090] At 408, the computing system can receive a student output generated by the student language model based on the second sub-word version of the natural language training input [Output generation tracks to an operation being performed]).
	
	Regarding claim 21, Song in view of Luong, further in view of Zhang discloses: the system of claim 12.
	Song further discloses:
	instructions to train the first language model ([0034] The teacher can optionally be pre-trained), store the first language model ([0054] the user computing device 102 can store or include one or more machine-learned models 120 [In view of the teacher model]), and retrieve the first language model for use in initializing the second model ([0055] then used or otherwise implemented by the one or more processors 112 [Models stored to then be later implemented indicates retrieval, in view of the teacher parameter distillation for student optimization [0020] indicating use for initializing the second model]).

	Regarding claim 22, Song in view of Luong, further in view of Zhang discloses: the system of claim 12.
	Song further discloses:
	instructions to transmit the second language model to perform the operation ([0054] the user computing device 102 can store or include one or more machine-learned models 120, where, [0090] At 408, the computing system can receive a student output generated by the student language model [Models in storage, i.e. student language model, to also be used to perform output generating operations 406, see Fig. 4, indicates that the second model is retrieved and transmitted from storage into an active system to perform the operation]).

	Regarding claim 23, Song in view of Luong, further in view of Zhang discloses: the system of claim 12.
	Song further discloses:
	instructions to:
	store the second language model ([0054] the user computing device 102 can store or include one or more machine-learned models 120 [In view of the student language model]); and,
	retrieve the second language model for use in the operation ([0090] At 408, the computing system can receive a student output generated by the student language model [Models in storage, i.e. student language model, to also be used to perform output generating operations 406, see Fig. 4, indicates that the second model is retrieved to perform the operation]).

	Regarding claim 24, Song in view of Luong, further in view of Zhang discloses: the computer readable medium of claim 20.
Song further discloses:
	instructions to train the first language model ([0034] The teacher can optionally be pre-trained), store the first language model ([0054] the user computing device 102 can store or include one or more machine-learned models 120 [In view of the teacher model]), and retrieve the first language model for use in initializing the second model ([0055] then used or otherwise implemented by the one or more processors 112 [Models stored to then be later implemented indicates retrieval, in view of the teacher parameter distillation for student optimization [0020] indicating use for initializing the second model]).

Regarding claim 25, Song in view of Luong, further in view of Zhang discloses: the computer readable medium of claim 20.
	Song further discloses:
	instructions to transmit the second language model to perform the operation ([0054] the user computing device 102 can store or include one or more machine-learned models 120, where, [0090] At 408, the computing system can receive a student output generated by the student language model [Models in storage, i.e. student language model, to also be used to perform output generating operations 406, see Fig. 4, indicates that the second model is retrieved and transmitted from storage into an active system to perform the operation]).

	Regarding claim 26, Song in view of Luong, further in view of Zhang discloses: the computer readable medium of claim 20.
	Song further discloses:
	instructions to:
	store the second language model ([0054] the user computing device 102 can store or include one or more machine-learned models 120 [In view of the student language model]); and,
	retrieve the second language model for use in the operation ([0090] At 408, the computing system can receive a student output generated by the student language model [Models in storage, i.e. student language model, to also be used to perform output generating operations 406, see Fig. 4, indicates that the second model is retrieved to perform the operation]).

Conclusion
Applicant's amendment necessitated the new ground(s) of rejection presented in this Office action. Accordingly, THIS ACTION IS MADE FINAL. See MPEP § 706.07(a). Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a).
A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action. In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any nonprovisional extension fee (37 CFR 1.17(a)) pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action. In no event, however, will the statutory period for reply expire later than SIX MONTHS from the mailing date of this final action.

The prior art made of record and not relied upon is considered pertinent to applicant's disclosure.
Wang et al. (JP-2019215861-A) discloses “To provide a knowledge transfer method, an information processing apparatus, and a storage medium. SOLUTION: A knowledge transfer method includes: acquiring a first model trained in advance for a predetermined task; and performing training for a second model for the predetermined task by using an overall loss function to cause the second model to have knowledge of the first model. The overall loss function is on the basis of a first loss function and a second loss function that are weighted with accuracy of an output result of the predetermined task for a training sample of the first model; the first loss function represents a difference in result of processing on the training sample between the second model and first model, and the second loss function represents the accuracy of an output result of the predetermined task for a training sample of the second model” (abstract). See entire document.
	Clement et al. (US-20210357762-A1) discloses “A transfer learning system is used for the development of neural transformer models pertaining to software engineering tasks. The transfer learning system trains source code domain neural transformer models with attention in various configurations on a large corpus of unsupervised training dataset of source code programs and/or source code-related natural language text. A web service provides the trained models for use in developing a model that may be fine-tuned on a supervised training dataset associated with a software engineering task thereby generating a tool to perform the software engineering task” (abstract). See entire document.
	Clement et al. (US-20230161567-A1) discloses “Custom source code generation models are generated by tuning a pre-trained deep learning model by freezing the model parameters and optimizing a prefix. The tuning process is distributed across a user space and a model space where the embedding and output layers are performed in the user space and the execution of the model is performed in a model space that is isolated from the user space. The tuning process updates the embeddings of the prefix across the separate execution spaces in a manner that preserves the privacy of the data used in the tuning process” (abstract). See entire document.
	Li et al. (“Dynamic Knowledge Distillation for Pre-trained Language Models”) discloses “we explore whether a dynamic knowledge distillation that empowers the student to adjust the learning procedure according to its competency, regarding the student performance and learning efficiency. We explore the dynamical adjustments on three aspects: teacher model adoption, data selection, and KD objective adaptation” (abstract). See entire document.
	Wu et al. (“Learning to Teach with Dynamic Loss Functions”) discloses “we explore the possibility of imitating human teaching behaviors by dynamically and automatically outputting appropriate loss functions to train machine learning models. Different from typical learning settings in which the loss function of a machine learning model is predefined and fixed, in our framework, the loss function of a machine learning model (we call it student) is defined by another machine learning model (we call it teacher). The ultimate goal of teacher model is cultivating the student to have better performance measured on development dataset. Towards that end, similar to human teaching, the teacher, a parametric model, dynamically outputs different loss functions that will be used and optimized by its student model at different training stages.” (abstract). See entire document.

Any inquiry concerning this communication or earlier communications from the examiner should be directed to THEODORE JOHN WITHEY whose telephone number is (703)756-1754. The examiner can normally be reached Monday - Friday, 8am-5pm.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Andrew Flanders can be reached on (571) 272-7516. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.




/THEODORE WITHEY/Examiner, Art Unit 2655   

/ANDREW C FLANDERS/Supervisory Patent Examiner, Art Unit 2655
Read full office action
Prosecution Timeline

Show 2 earlier events
Feb 04, 2025
Response Filed
Mar 20, 2025
Final Rejection mailed — §103, §112
May 20, 2025
Response after Non-Final Action
Jun 16, 2025
Request for Continued Examination
Jun 17, 2025
Response after Non-Final Action
Sep 11, 2025
Non-Final Rejection mailed — §103, §112
Jan 12, 2026
Response Filed
Feb 17, 2026
Final Rejection mailed — §103, §112 (current)
Precedent Cases

Applications granted by this same examiner with similar technology

18/179,842
Patent 12632670
Natural Language Processing for Identifying Bias in a Span of Text
3y 2m to grant Granted May 19, 2026
17/655,770
Patent 12591744
METHOD FOR TRAINING SEMANTIC REPRESENTATION MODEL, DEVICE AND STORAGE MEDIUM
4y 0m to grant Granted Mar 31, 2026
18/113,192
Patent 12536994
APPARATUS FOR CLASSIFYING SOUNDS BASED ON NEURAL CODE IN SPIKING NEURAL NETWORK AND METHOD THEREOF
2y 9m to grant Granted Jan 27, 2026
17/956,558
Patent 12475330
METHOD FOR IDENTIFYING NOISE SAMPLES, ELECTRONIC DEVICE, AND STORAGE MEDIUM
3y 1m to grant Granted Nov 18, 2025
17/813,944
Patent 12417759
SPEECH RECOGNITION USING CADENCE PATTERNS
3y 1m to grant Granted Sep 16, 2025
Study what changed to get past this examiner. Based on 5 most recent grants.
Strategy Recommendation AI-generated — please review before filing

Get a prosecution strategy drawn from examiner precedents, rejection analysis, and claim mapping.
Typically takes 5-10 seconds — AI-generated, attorney review required before filing
Prosecution Projections

5-6
Expected OA Rounds
44%
Grant Probability
95%
With Interview (+51.3%)
2y 11m (~0m remaining)
Median Time to Grant
High
PTA Risk
Based on 25 resolved cases by this examiner. Grant probability derived from career allowance rate.