Prosecution Insights
Last updated: April 19, 2026
Application No. 17/313,772

Model Processing Method, Apparatus, Storage Medium, and Processor

Non-Final OA (§101, §103)
Filed: May 06, 2021
Examiner: GERMICK, JOHNATHAN R
Art Unit: 2122
Tech Center: 2100 — Computer Architecture & Software
Assignee: Alibaba Group Holding Limited
OA Round: 3 (Non-Final)
Grant Probability: 47% (Moderate)
Projected OA Rounds: 3-4
Projected Time to Grant: 4y 2m
Grant Probability With Interview: 79%

Examiner Intelligence

Career Allow Rate: 47% (43 granted / 91 resolved; -7.7% vs TC avg)
Interview Lift: strong, +32.1% among resolved cases with interview
Typical Timeline: 4y 2m average prosecution; 28 currently pending
Career History: 119 total applications across all art units

Statute-Specific Performance

§101: 29.0% (-11.0% vs TC avg)
§103: 38.5% (-1.5% vs TC avg)
§102: 17.3% (-22.7% vs TC avg)
§112: 14.3% (-25.7% vs TC avg)
Tech Center averages are estimates. Based on career data from 91 resolved cases.

Office Action

§101, §103
DETAILED ACTION

This action is responsive to the claims filed 10/23/2025. Claims 1, 2, 5-14, 18 and 19 are pending in the case. Claims 1, 12, and 19 are independent claims. Claims 1, 2, 5, 8, 10, 12, 18 and 19 have been amended.

Notice of Pre-AIA or AIA Status

The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Continued Examination Under 37 CFR 1.114

A request for continued examination under 37 CFR 1.114, including the fee set forth in 37 CFR 1.17(e), was filed in this application after final rejection. Since this application is eligible for continued examination under 37 CFR 1.114, and the fee set forth in 37 CFR 1.17(e) has been timely paid, the finality of the previous Office action has been withdrawn pursuant to 37 CFR 1.114. Applicant's submission filed on 10/23/2025 has been entered.

Response to Arguments

Applicant's arguments filed 10/23/2025 have been fully considered but they are not persuasive.

With respect to 35 U.S.C. 101: Applicant argues the claim includes features that are significantly more and integrate a practical application, and cites the specification to suggest improvements over large-scale models, noting that obtaining a target language model reduces the computational cost and improves the speed by decreasing the number of parameters compared to the original model. Examiner disagrees. The MPEP notes that the improvement should be reflected in the additional elements; a mere description in the specification that the invention solves a problem does not make the claims eligible. Obtaining and converting a model to obtain a model with fewer parameters does not reflect an improvement. Critically, an improvement should be to the functioning or inner workings of the technology. The claim recites a mental step (searching for a model based on loss features) without reciting additional elements which describe any specific functioning that imparts the improvement.
The improvement cannot be a result of a recited abstract idea alone. See MPEP 2106.05(a). For example, the cited paragraph (0019) of the specification suggests that automatic compression for a target task enables the improvement. However, the claims do not describe the compression as an additional element which particularly enables the improvement, but rather as the result of applying judicial exceptions (determining, extracting, searching) to obtained and trained models. The claims recite that the compression is a direct result of the "performing a search…[based on features]". Therefore, the rejection is maintained and updated in view of the amendments.

With respect to prior art: Applicant traverses the §102 and §103 rejections for similar reasons. Namely, Applicant notes that Stickland does not teach the claims, and that Hou does not resolve the deficiencies. Applicant's principal argument appears to be that the art does not teach the claims because "Stickland merely describes obtaining a separate model for each task from a base pre-trained model (BERT model)…Stickland, however, fails to disclose… obtaining a target language model based on two different language models…". Applicant continues to note the art does not teach the cited amended claims without any further specific argument. Examiner disagrees. Firstly, none of the independent claims disallow a separate model for each task. In fact, claim 1 highlights that the target language model is for a particular determined task, rather than a plurality of parallel tasks. Stickland having task-specific models for each task does not indicate that the target language model is not based on two different language models as required by the claims. In the claims, the two different language models are the "original language model" and "first language model", which are used in part to obtain the target language model.
As clearly noted in the rejection, the original model is the original pre-trained received BERT model, while the "first language model" is the BERT+PAL layers, and the target language model is the fine-tuned model, which is based on the two different language models. Examiner notes Applicant provided no further explanation as to why the final obtained model in Stickland cannot correspond to the claimed "target language model".

Claim Rejections - 35 USC § 101

35 U.S.C. 101 reads as follows:

Whoever invents or discovers any new and useful process, machine, manufacture, or composition of matter, or any new and useful improvement thereof, may obtain a patent therefor, subject to the conditions and requirements of this title.

Claims 1, 2, 5-14, 18 and 19 are rejected under 35 U.S.C. 101 because the claims are directed to an abstract idea without significantly more.

Regarding Claim 1

Under step 1, the claim is directed to "A method implemented by a computing device," which is directed to a process, one of the statutory categories. Under Step 2A Prong 1, the claim recites the following limitations, which are considered mental evaluations: "determining a task that needs to be processed by the original language model; … extracting common knowledge in the original language model as a first knowledge loss, and extracting knowledge corresponding to the task in the first language model as a second knowledge loss;… performing a search in a neural architecture search based at least in part on the first knowledge loss and the second knowledge loss to obtain the target language model,". Determinations and compression are decisions about data that can be performed in the human mind. The specification does not describe how any compression is performed; instead, it describes the result of compressing. Further, extracting knowledge and performing a search based on data features are activities performed in the mind, as they describe generalized data analysis.
These steps are not described as computer-confined analysis such as computer-confined memory manipulation.

Step 2A Prong Two Analysis: The judicial exception is not integrated into a practical application. The claims recite the additional element(s) "… the target language model being a compressed version of the original language model with a fewer number of parameters … training the original language model based at least in part on features of the task to obtain a first language model, the first language model being a fine-tuned version of the original language model;", which describe an application at a high degree of generality which makes use of the recited exception, see MPEP 2106.05(f). In addition, the claim recites additional element(s) "obtaining a target language model to be deployed in a real-time application that has strict limitations on computing resources and inference times, obtaining the target language model…obtaining an original language model, the original language model being a pre-trained context characterization encoder;" that amount to adding insignificant extra-solution activity to the judicial exception. See MPEP 2106.05(g). Accordingly, the additional elements do not integrate the abstract idea into a practical application because they do not impose any meaningful limits on practicing the abstract idea. The claim is directed to an abstract idea.

Step 2B Analysis: The claim does not include additional elements that are sufficient to amount to significantly more than the judicial exception. Further, the additional elements, "obtaining a target language model to be deployed in a real-time application that has strict limitations on computing resources and inference times, obtaining the target language model…obtaining an original language model", are insignificant extra-solution activities that are considered well-understood, routine, conventional activities, for the following reasons.
Examiner notes this amounts to receiving or transmitting data over a network (MPEP 2106.05(d)(II)(i)). According to MPEP 2106.05(d)(II)(i), "The courts have recognized the following computer functions as well-understood, routine, and conventional functions when they are claimed in a merely generic manner". As such, the insignificant extra-solution activities are considered well-understood, routine, conventional activities. Therefore, the claim is not patent eligible.

Regarding Claim 2

The claim is directed to a process. The claim recites the following limitations: "…and determining the target language model based on the search result." Under Step 2A Prong 1, these limitations describe a step performed in the mind. The claim sets no limits on the neural architecture search and includes an architecture search based entirely on decisions made in the human mind.

Step 2A Prong Two Analysis: The judicial exception is not integrated into a practical application. In particular, the claims recite the additional element(s) "inputting features of the task into the neural architecture search to obtain a search result", which amounts to mere instructions to apply a computer technology to an abstract idea, see MPEP 2106.05(f). In this case, the training is applied in order to determine a search result. Accordingly, the recited additional elements, when taken alone or in combination, do not integrate the abstract idea into a practical application because they do not impose any meaningful limits on practicing the abstract idea, nor do they amount to significantly more than the judicial exception.
Regarding Claims 5-8

The claims are directed to a process. Claim 5 recites the following limitations: "determining prompt information based on the first knowledge loss and the second knowledge loss; searching for a model indicated by the prompt information in an architecture search space corresponding to the neural architecture search; and determining the model indicated by the prompt information as the target language model." Similarly, claim 6 recites the following limitations: "establishing cross-task relationships based on the first knowledge loss and the second knowledge loss in a knowledge aggregator, wherein the cross-task relationships are used to indicate relationships among multiple tasks; and determining the prompt information based on the cross-task relationships." Similarly, claim 7 recites the following limitations: "recording a first knowledge loss sequence of the original language model and a second knowledge loss sequence of the first language model in the knowledge aggregator, wherein the first knowledge loss sequence includes a knowledge loss of the original language model at at least one moment of training, the second knowledge loss sequence includes a second knowledge loss of the first language model at the at least one moment of training; clustering multiple tasks to obtain at least one meta-task group based on the first knowledge loss sequence of the original language model and the second knowledge loss sequence of the first language model, wherein the meta-task group includes at least two tasks whose similarity degree is greater than a first threshold; performing normalization based on a target value of the meta-task group to obtain a weight of the meta-task group, wherein the target value is used to indicate an average classification performance of the meta-task group; and establishing the cross-task relationships based on the weight of the meta-task group." Similarly, claim 8 recites the following limitations: "extracting the common knowledge
in the original language model as the first knowledge loss in a knowledge decomposer; and extracting the knowledge corresponding to the task in the first language model as the second knowledge loss including extracting the knowledge corresponding to the task in the first language model as the second knowledge loss in the knowledge decomposer." Under Step 2A Prong 1, these limitations describe a step performed in the mind. The claim sets no limits on the neural architecture search and includes an architecture search based entirely on decisions made in the human mind. Furthermore, under Step 2A Prong 2 and 2B, the claims do not recite additional elements to consider other than those considered in the independent claim. Accordingly, the recited additional elements do not integrate the abstract idea into a practical application because they do not impose any meaningful limits on practicing the abstract idea, nor do they amount to significantly more than the judicial exception.

Regarding Claim 9

The claim is directed to a process. Each of the limitations described in the claim, under Step 2A Prong 1, does not recite any additional abstract ideas beyond those described in the independent claim. Furthermore, under Step 2A Prong 2 and 2B, the judicial exception is not integrated into a practical application, nor do the claims provide significantly more. In particular, the claims recite the additional element(s) "wherein the knowledge decomposer comprises a set of probe classifiers obtained by training the original language model and the first language model.", which generally links the use of the judicial exception to a particular technological environment or field of use, see MPEP 2106.05(h). The limitation merely limits the language model to classification tasks.
Accordingly, the recited additional elements do not integrate the abstract idea into a practical application because they do not impose any meaningful limits on practicing the abstract idea, nor do they amount to significantly more than the judicial exception.

Regarding Claim 10

The claim is directed to a process. Each of the limitations described in the claim, under Step 2A Prong 1, does not recite any additional abstract ideas beyond those described in the independent claim. Furthermore, under Step 2A Prong 2 and 2B, the judicial exception is not integrated into a practical application, nor do the claims provide significantly more. In particular, the claims recite the additional element(s) "adding target task parameters of the task to the original language model; and training the target task parameters on a newly added corpus of the task to obtain the first language model.", which generally links the use of the judicial exception to a particular technological environment or field of use, see MPEP 2106.05(h). The limitation merely limits the language model to classification tasks of a particular corpus. Accordingly, the recited additional elements do not integrate the abstract idea into a practical application because they do not impose any meaningful limits on practicing the abstract idea, nor do they amount to significantly more than the judicial exception.
Regarding Claim 11

The claim is directed to a process. Each of the limitations described in the claim, under Step 2A Prong 1, does not recite any additional abstract ideas beyond those described in the independent claim. Furthermore, under Step 2A Prong 2 and 2B, the judicial exception is not integrated into a practical application, nor do the claims provide significantly more. In particular, the claims recite the additional element(s) "wherein parameters of the original language model remain unchanged when training the target task parameters on the newly added corpus of the task", which generally links the use of the judicial exception to a particular technological environment or field of use, see MPEP 2106.05(h). The limitation merely limits the language model to training particular parameters at a time. Accordingly, the recited additional elements do not integrate the abstract idea into a practical application because they do not impose any meaningful limits on practicing the abstract idea, nor do they amount to significantly more than the judicial exception.
Regarding Claim 12

Under step 1, the claim is directed to "One or more computer readable media storing executable instructions that, when executed by one or more processors, cause the one or more processors to perform acts," which is directed to an article of manufacture, one of the statutory categories. Under Step 2A Prong 1, the claim recites the following limitations, which are considered mental evaluations: "determining a task corresponding to the textual information… extracting common knowledge in the original language model as a first knowledge loss, and extracting knowledge corresponding to the task in the first language model as a second knowledge loss; and performing a search in a neural architecture search based at least in part on the first knowledge loss and the second knowledge loss to obtain the target language model;". As previously noted, such steps can be performed in the mind for the reasons described in the rejection of claim 1.

Step 2A Prong Two Analysis: The judicial exception is not integrated into a practical application. The claims recite the additional element(s) "wherein the task is processed by an original language model the original language model being a pre-trained context characterization encoder and a target language model is obtained by… processing the textual information based on the target language model to obtain a textual processing result… training the original language model based at least in part on features of the task to obtain a first language model, the first language model being a fine-tuned version of the original language model;", which describe an application at a high degree of generality which makes use of the recited exception, see MPEP 2106.05(f). In addition, the claim recites additional element(s) "obtaining textual information uploaded to a target platform… and outputting the textual processing result to the target platform." that amount to adding insignificant extra-solution activity to the judicial exception.
See MPEP 2106.05(g). Accordingly, the additional elements do not integrate the abstract idea into a practical application because they do not impose any meaningful limits on practicing the abstract idea. The claim is directed to an abstract idea.

Step 2B Analysis: The claim does not include additional elements that are sufficient to amount to significantly more than the judicial exception. Further, the additional elements, obtaining and outputting information, are insignificant extra-solution activities that are considered well-understood, routine, conventional activities, for the following reasons. Examiner notes this amounts to receiving or transmitting data over a network (MPEP 2106.05(d)(II)(i)). According to MPEP 2106.05(d)(II)(i), "The courts have recognized the following computer functions as well-understood, routine, and conventional functions when they are claimed in a merely generic manner". As such, the insignificant extra-solution activities are considered well-understood, routine, conventional activities. Therefore, the claim is not patent eligible.

Regarding Claims 13-14

The claims are directed to an article of manufacture. Claim 13 recites the following limitations: "wherein the textual information comprises textual transaction information that is uploaded to a transaction platform when the target platform is the transaction platform". Similarly, claim 14 recites: "wherein the textual transaction information comprises at least one of: textual query information for querying a transaction object; textual information associated with a transaction operation performed by the transaction object; textual evaluation information for evaluating the transaction object; and textual search information for querying an associated object related to the transaction object." Under Step 2A Prong 1, these limitations only serve to describe the abstract idea addressed in the independent claim.
Furthermore, under Step 2A Prong 2 and 2B, the claims do not recite additional elements to consider other than those considered in the independent claim. Accordingly, the claims do not provide a practical application and are not considered to be significantly more.

Regarding Claim 18

The claim is directed to an article of manufacture. Each of the limitations described in the claim, under Step 2A Prong 1, does not recite any additional abstract ideas beyond those described in the independent claim. The claim is rejected for the reasons set forth in the rejection of claim 10.

Regarding Claim 19

Under step 1, the claim is directed to an apparatus for using a target language model deployed in a real-time application that has strict limitations on computing resources and inference times, which is directed to a machine, one of the statutory categories. Under Step 2A Prong 1, the claim recites the following limitations, which are considered mental evaluations: "determining a task corresponding to the textual input information … extracting common knowledge in the original language model as a first knowledge loss, and extracting knowledge corresponding to the task in the first language model as a second knowledge loss; and performing a search in a neural architecture search based at least in part on the first knowledge loss and the second knowledge loss to obtain the target language model;". As previously noted, such steps can be performed in the mind for the reasons described in the rejection of claim 1.

Step 2A Prong Two Analysis: The judicial exception is not integrated into a practical application. The claims recite the additional element(s) "and memory storing executable instructions that, when executed by the one or more processors, cause the one or more processors to perform acts comprising:", which amounts to mere instructions to apply a computer technology to an abstract idea, see MPEP 2106.05(f).
In addition, the limitations "the target language model is obtained by…processing the textual input information based on the target language model that is read to obtain a textual processing result;… wherein the task is processed by an original language model… the original language model being a pre-trained context characterization encoder … training the original language model based at least in part on features of the task to obtain a first language model, the first language model being a fine-tuned version of the original language model;" describe an application at a high degree of generality which makes use of the recited exception, see MPEP 2106.05(f). In addition, the claim recites additional element(s) "receiving textual input information, wherein the textual input information is collected based on at least one text collector associated with a textual processing system;… and reading a target language model… and outputting the textual processing result." that amount to adding insignificant extra-solution activity to the judicial exception. See MPEP 2106.05(g). Accordingly, the additional elements do not integrate the abstract idea into a practical application because they do not impose any meaningful limits on practicing the abstract idea. The claim is directed to an abstract idea.

Step 2B Analysis: The claim does not include additional elements that are sufficient to amount to significantly more than the judicial exception. Further, the additional elements, obtaining and outputting information, are insignificant extra-solution activities that are considered well-understood, routine, conventional activities, for the following reasons. Examiner notes this amounts to receiving or transmitting data over a network (MPEP 2106.05(d)(II)(i)). According to MPEP 2106.05(d)(II)(i), "The courts have recognized the following computer functions as well-understood, routine, and conventional functions when they are claimed in a merely generic manner".
As such, the insignificant extra-solution activities are considered well-understood, routine, conventional activities. Therefore, the claim is not patent eligible.

Claim Rejections - 35 U.S.C. § 103

In the event the determination of the status of the application as subject to AIA 35 U.S.C. §§ 102 and 103 (or as subject to pre-AIA 35 U.S.C. §§ 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.

The following is a quotation of 35 U.S.C. § 103, which forms the basis for all obviousness rejections set forth in this Office action:

A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102 of this title, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

Claim(s) 1, 2, 5-6, 8-14, 18 and 19 are rejected under 35 U.S.C. § 103 as being unpatentable over Stickland further in view of Hou, "DynaBERT: Dynamic BERT with Adaptive Width and Depth".

Regarding Claim 1

Stickland teaches: A method implemented by a computing device, the method comprising obtaining an original language model, the original language model being a pre-trained context characterization encoder; (Introduction, pg 1: "We use the BERT model (Bidirectional Encoder Representations … as our base pre-trained model.
Pre-trained BERT representations can be fine-tuned"; Section 3, pg 3: "The BERT model we are adapting is a multi-layer bidirectional Transformer encoder based on the original model". The BERT model is the original; it is a pre-trained encoder for characterizing context.)

determining a task that needs to be processed by the original language model. (Section 3.1, pg 3: "The entire BERT model is simply a stack of 12 BERT layers, followed by (in our case) a transformation to take us to the output space for a NLU task." The NLU task converts the original BERT model into a target language model for processing a determined NLU task via a transformation.)

training the original language model based at least in part on features of the task to obtain a first language model, the first language model being a fine-tuned version of the original language model (Section 3.2: "The simplest way to add parameters to a model is to add them at the 'top' of the model… [equation image] …where TS(·) is a task-specific function that can potentially operate on a single vector, but depends on the entire sequence when it contains attention layers…To avoid transformations requiring O(d2m) parameters, we propose using task-specific functions of the form…"; pg 3, Section 2.2: "This work also keeps the BERT model fixed while training adapter modules. We concentrated on jointly finetuning the entire BERT model on all tasks"; Section 4.1: "A simple way to train a model on several tasks is to select a batch of training examples from each task, cycling through them in a fixed order. We refer to this as 'round-robin' sampling… where E is the total number of epochs. Since we used multiple datasets we chose a somewhat arbitrary 'epoch' of 2400 training steps". The original language model is trained based on features of the task, resulting in a fine-tuned version of the model, which is the first language model.)
performing a search in a neural architecture search based at least in part on the first knowledge loss and the second knowledge loss to obtain the target language model; ("We based our experiments on the PyTorch implementation of BERT and open-source our code. No matter how we sampled tasks, we (unless stated otherwise) trained for 60,000 steps, with a minibatch size of 32, and a maximum sequence length of 128 tokens, choosing the best model from within that training time based on average development set score…We experimented briefly with freezing the BERT-base parameters and fine-tuning only the PALs and alternatives, but concentrated on training all of the parameters, finding it took less parameters to approach matching fine-tuned BERT". The model architecture or "best model" is searched for via training, which is based on the outputs from the model, and those outputs include the common knowledge in the original model and the knowledge of the first language model.)

obtaining the target language model… the target language model being a compressed version of the original language model with a fewer number of parameters (Section 3.2: "The simplest way to add parameters to a model is to add them at the 'top' of the model, i.e. just before the classification layer…. where TS(·) is a task-specific function that can potentially operate on a single vector… adding an extra BERT layer for each task, results in approximately a 1.67× increase in number of parameters, or 73 million new parameters…. A one or two layer feed-forward network followed by a residual connection and layer-norm, such that it has the same number of parameters as the previous form". It is noted that these various layer additions are all exemplary target models derived from compression knowledge or information from the original BERT model. These layers are based on features of the task, as they are trained to be task-specific functions.
It is noted that the total set of parameters is necessarily larger because it includes both the additional layers and the BERT model itself. As an example, the additional target layers in one variant account for 0.67× the parameters, which is a fewer number of parameters than the 1× included in BERT itself.)

Stickland does not explicitly teach: extracting common knowledge in the original language model as a first knowledge loss, and extracting knowledge corresponding to the task in the first language model as a second knowledge loss; …obtaining a target language model to be deployed in a real-time application that has strict limitations on computing resources and inference times.

Hou, however, when addressing distillation of BERT models into an adaptive model, teaches: extracting common knowledge in the original language model as a first knowledge loss; (pg 6, Figure 3 caption: "Using knowledge distillation (dashed lines) to transfer the knowledge from a fixed teacher model to student sub-networks at different widths in DynaBERTW. Distillation loss is computed by matching the logits". The first loss is extracted from the teacher model, or original language model.)

extracting knowledge corresponding to the task in the first language model as a second knowledge loss (pg 7, Section 3.2: "Using knowledge distillation (dashed lines) to transfer the knowledge from a fixed width-adaptive teacher model with the maximum depth to student sub-networks at different depths in the both width- and depth-adaptive model.
Distillation loss is computed by matching the logits, embedding, and hidden states between the teacher and student” additional loss terms are extracted from the first language model, which is fixed) obtaining a target language model to be deployed in a real-time application that has strict limitations on computing resources and inference times (pg 21 Section E “To make subnetworks of DynaBERT the same size as those small models, for width, we also adapt the hidden state size H = 128, 256, 512, 768 besides attention heads and intermediate layer neurons… After pre-training DynaBERT, we fine-tune each separate sub-network with the original task-specific data on MNLI-m and report the development set results in Table 15. We compare with separately pre-trained small models in Google BERT repository. As can be seen, sub-networks of the pre-trained DynaBERT outperform separately pre-trained small networks.” Abstract pg 1 “They can not fully satisfy the requirements of different edge devices with various hardware performances. In this paper, we propose a novel dynamic BERT model (abbreviated as DynaBERT), which can flexibly adjust the size and latency by selecting adaptive width and depth” pg 5 Section 3 “We evaluate the efficacy of our proposed DynaBERT and DynaRoBERTa under different efficiency constraints, including #parameters, FLOPs, the latency on Nvidia K40 GPU and Kirin 810 A76 ARM CPU” Small models, or target models, were deployed with the limitation of being the same size as other small models, i.e., a strict limitation on computing resources. The inference time is a function of latency, which is adjusted via the depth to suit edge devices.) Accordingly, it would have been obvious to a person of ordinary skill in the art before the effective filing date of the claimed invention to modify the augmented BERT model in Stickland to comprise the dynamic width and depth BERT models described by Hou.
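The two distillation losses the rejection maps to the claimed first and second knowledge losses — matching logits against a fixed teacher, and additionally matching hidden states — can be illustrated with a common instantiation (soft cross-entropy on logits, mean-squared error on hidden states). The functional forms here are a sketch of standard practice, not necessarily Hou's exact equations.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def logit_matching_loss(teacher_logits, student_logits):
    # "First knowledge loss": cross-entropy between the fixed teacher's
    # soft distribution and the student's distribution (logit matching).
    p, q = softmax(teacher_logits), softmax(student_logits)
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q))

def hidden_state_loss(teacher_hidden, student_hidden):
    # "Second knowledge loss": MSE between teacher and student hidden states.
    diffs = [(t - s) ** 2 for t, s in zip(teacher_hidden, student_hidden)]
    return sum(diffs) / len(diffs)
```

A student that matches the teacher exactly incurs zero hidden-state loss and the minimum possible logit-matching loss (the teacher's own entropy), which is what "matching" means in the quoted captions.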
One would have been motivated to make such a combination because Stickland and Hou describe methods for modifying the original BERT model to better suit specific tasks. Further, Hou notes “we offer flexibility in both width and depth of BERT to enable a significantly larger number of architectural configurations. This also enables better exploration of the balance between model accuracy and model size.” (pg 2 Hou)

Regarding Claim 2 Stickland/Hou teaches claim 1. Further, Stickland teaches, inputting features of the task into a neural architecture search to obtain a search result; and determining the target language model based on the search result. (Section 3.2 “The simplest way to add parameters to a model is to add them at the ‘top’ of the model… [equation image omitted] …where TS(·) is a task-specific function that can potentially operate on a single vector, but depends on the entire sequence when it contains attention layers…To avoid transformations requiring O(d2m) parameters, we propose using task-specific functions of the form…” pg 3 Section 2.2 “This work also keeps the BERT model fixed while training adapter modules. We concentrated on jointly finetuning the entire BERT model on all tasks” fine-tuning is a neural architecture search for the “best” parameters for the adapter modules, which are for the tasks. The fine-tuned model is the target language model. Section 4.1 “A simple way to train a model on several tasks is to select a batch of training examples from each task, cycling through them in a fixed order. We refer to this as ‘round-robin’ sampling”… where E is the total number of epochs. Since we used multiple datasets we chose a somewhat arbitrary ‘epoch’ of 2400 training steps” training examples are features input to the training or “neural architecture search”; the results of the training are the search result.
In machine learning technology, it is understood that at the end of training a final set of model parameters is determined.)

Regarding Claim 5 Stickland/Hou teaches claim 1. Further, Hou teaches, determining prompt information based on the first knowledge loss and the second knowledge loss; searching for a model indicated by the prompt information in an architecture search space corresponding to the neural architecture search; and determining the model indicated by the prompt information as the target language model. (pg 8 “After the unsupervised training, one can further fine-tune the network using the cross-entropy loss between the predicted labels and the ground-truth labels. This step improves the performance on some data sets empirically (details can be found in the Section 4.3). In this paper, we use the model with the higher average validation accuracy over all widths and depths between with and without fine-tuning.” The labels correspond to the prompt information; the predicted labels are based in part on the calculated losses during unsupervised training addressed above. The final model with the highest accuracy is the determined target language model, selected over the space of all widths and depths, which corresponds to the search space.)

Regarding Claim 6 Stickland/Hou teaches claim 5. Further, Hou teaches, establishing cross-task relationships based on the first knowledge loss and the second knowledge loss in a knowledge aggregator, wherein the cross-task relationships are used to indicate relationships among multiple tasks; (pg 2 “In this paper, we propose a novel dynamic BERT, or DynaBERT for short, which can be executed at different widths and depths for specific tasks” pg 10 “We also show the test set results in Table 2. Again, the proposed DynaBERT achieves comparable accuracy than BERTBASE with the same size.
Interestingly, the proposed DynaRoBERTa outperforms RoBERTaBASE on seven out of eight tasks” the improvement of the fine-tuned models for each task after training (i.e., based on the loss) establishes a relationship of the models across multiple tasks.) and determining the prompt information based on the cross-task relationships. (pg 10 “A possible reason is that allowing adaptive width and depth increases the training difficulty and acts as regularization, and so contributes positively to the performance” the shared increase in training difficulty across tasks, i.e., a cross-task relationship, in part impacts the determined labels which characterize the performance, and is thus based on the cross-task relationship.)

Regarding Claim 8 Stickland/Hou teaches claim 1. Further, Hou teaches, extracting the common knowledge in the original language model as the first knowledge loss in a knowledge decomposer; (pg 6 Figure 3 caption “Using knowledge distillation (dashed lines) to transfer the knowledge from a fixed teacher model to student sub-networks at different widths in DynaBERTW. Distillation loss is computed by matching the logits” the first loss is extracted from the teacher model or original language model, nominally a knowledge decomposer) and extracting the knowledge corresponding to the task in the first language model as the second knowledge loss including extracting the knowledge corresponding to the task in the first language model as the second knowledge loss in the knowledge decomposer. (pg 7 Section 3.2: “Using knowledge distillation (dashed lines) to transfer the knowledge from a fixed width adaptive teacher model with the maximum depth to student sub-networks at different depths in the both width- and depth-adaptive model. Distillation loss is computed by matching the logits, embedding, and hidden states between the teacher and student” additional loss terms are extracted from the first language model, which is fixed; nominally a knowledge decomposer.)
Regarding Claim 9 Stickland/Hou teaches claim 8. Further, Hou teaches, wherein the knowledge decomposer comprises a set of probe classifiers obtained by training the original language model and the first language model. (Section 3.1.3 pg 6 and Figure 4 “Specifically, we use the rewired BERT model as the fixed teacher network, and to initialize DynaBERTW. Then we distill the knowledge from the fixed teacher model to student sub-networks at different widths in DynaBERTW (Figure 3).” Here the probe networks of different widths are the set of probe classifiers. Figure 4 depicts these classifiers.)

Regarding Claim 10 Stickland/Hou teaches claim 3. Further, Stickland teaches, adding target task parameters of the task to the original language model; (pg 4 Section 3.3 Adding Parameters within BERT “Instead of adding parameters to the top of the model, we may want to modify the BERT(·) function…Specifically, we wish to add task-specific parameters to each layer of the BERT model”) and training the target task parameters on a newly added corpus of the task to obtain the first language model. (pg 6 Section 4.2 “We …concentrated on training all of the parameters, finding it took less parameters to approach matching fine-tuned BERT...” Section 4.1 “A simple way to train a model on several tasks is to select a batch of training examples from each task, cycling through them in a fixed order. We refer to this as ‘round-robin’ sampling.” Section 4.2 pg 6 “No matter how we sampled tasks, we (unless stated otherwise) trained for 60,000 steps” training all parameters on sampled tasks from a corpus amounts to training the target task parameters on a newly added corpus. Sampling involves sampling from a plurality of tasks of the model.)

Regarding Claim 11 Stickland/Hou teaches claim 10. Further, Hou teaches, wherein parameters of the original language model remain unchanged when training the target task parameters on the newly added corpus of the task.
(pg 2 Section 1 “In this paper, we propose a novel dynamic BERT, or DynaBERT for short, which can be executed at different widths and depths for specific tasks. The training process of DynaBERT includes first training a width-adaptive BERT…Then we distill knowledge from a fixed teacher network to student sub-networks at equal or smaller widths in DynaBERTw” pg 8 “After the unsupervised training, one can further fine-tune the network using the cross-entropy loss between the predicted labels and the ground-truth labels” the teacher or original network is fixed or unchanged while training or distilling the task parameters. Ground-truth labels are the newly added corpus.)

Regarding Claim 12 Stickland teaches, One or more computer readable media storing executable instructions that, when executed by one or more processors, cause the one or more processors to perform acts comprising: obtaining textual information uploaded to a target platform; (pg 1 Introduction “We experiment on a set of eight NLU tasks from the GLUE benchmark … which include question answering, sentiment analysis, and textual entailment. The number of training examples varies widely across the tasks,” pg 7 Table 2 caption “Table 2. GLUE Test results, scored by the GLUE evaluation server.” the eight tasks are descriptions of textual information which necessarily had been uploaded in order to be used in an experiment.)
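The fixed-teacher arrangement relied on for claims 10 and 11 — newly added task parameters are trained while the original model's parameters remain unchanged — amounts to a simple parameter partition. The single SGD-style update below is an illustrative stand-in, not either reference's training code; the names and values are hypothetical.

```python
def train_task_params(original_params, task_params, task_grads, lr=0.1):
    """One illustrative update step: only the newly added task-specific
    parameters move; the original language model's parameters stay fixed,
    as in the claim-11 mapping to Hou's fixed teacher network."""
    updated_task = [p - lr * g for p, g in zip(task_params, task_grads)]
    return list(original_params), updated_task

original = [1.0, 2.0, 3.0]   # pre-trained (teacher) weights, held fixed
adapters = [0.5, -0.5]       # hypothetical task-specific parameters
frozen, tuned = train_task_params(original, adapters, [1.0, 1.0])
```

After any number of such steps, `frozen` is bitwise identical to `original`, which is the property the claim language ("parameters of the original language model remain unchanged") turns on.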
training the original language model based at least in part on features of the task to obtain a first language model, the first language model being a fine-tuned version of the original language model (Section 3.2 “The simplest way to add parameters to a model is to add them at the ‘top’ of the model… [equation image omitted] …where TS(·) is a task-specific function that can potentially operate on a single vector, but depends on the entire sequence when it contains attention layers…To avoid transformations requiring O(d2m) parameters, we propose using task-specific functions of the form…” pg 3 Section 2.2 “This work also keeps the BERT model fixed while training adapter modules. We concentrated on jointly finetuning the entire BERT model on all tasks” Section 4.1 “A simple way to train a model on several tasks is to select a batch of training examples from each task, cycling through them in a fixed order. We refer to this as ‘round-robin’ sampling”… where E is the total number of epochs. Since we used multiple datasets we chose a somewhat arbitrary ‘epoch’ of 2400 training steps” the original language model is trained based on features of the task, resulting in a fine-tuned version of the model, which is the first language model.) performing a search in a neural architecture search based at least in part on the first knowledge loss and the second knowledge loss to obtain the target language model; (“We based our experiments on the PyTorch implementation of BERT 3 and open-source our code4 .
No matter how we sampled tasks, we (unless stated otherwise) trained for 60,000 steps, with a minibatch size of 32, and a maximum sequence length of 128 tokens, choosing the best model from within that training time based on average development set score…We experimented briefly with freezing the BERT-base parameters and fine-tuning only the PALs and alternatives, but concentrated on training all of the parameters, finding it took less parameters to approach matching fine-tuned BERT” the model architecture or “best model” is searched for via training, which is based on the outputs from the model, including the common knowledge in the original model and the knowledge of the first language model.) determining a task corresponding to the textual information wherein the task is processed by an original language model, the original language model being a pre-trained context characterization encoder (Section 4.1 pg 5 “A simple way to train a model on several tasks is to select a batch of training examples from each task, cycling through them in a fixed order… Concretely, we select a batch of examples from task i with probability pi at each training step” task i is determined according to the current batch of examples selected. As noted previously, training involves processing the training examples with the original BERT model, which is a pre-trained context characterization encoder.) and a target language model is obtained by; processing the textual information based on the target language model to obtain a textual processing result; and outputting the textual processing result to the target platform. (Section 3.3 “We can add a task-specific function ‘in parallel’ with each BERT layer as follows: [equation image omitted] … where l indexes the layer. This means we recover the original BERT model if TS(·) outputs a zero vector.
Alternatively, we can add a ‘serial’ connection where we transform the output of a BERT layer:” the modified task-specific function is part of the target model, which processes the example or textual information. The outputs are the processing result. Section 5 “Table 2 lists our results on GLUE for our best-performing PAL mode” the Table depicts the results acquired from any such target platform which produces the result.) Stickland does not explicitly teach, extracting common knowledge in the original language model as a first knowledge loss, and extracting knowledge corresponding to the task in the first language model as a second knowledge loss; Hou, however, when addressing distillation of BERT models into an adaptive model, teaches, extracting common knowledge in the original language model as a first knowledge loss; (pg 6 Figure 3 caption “Using knowledge distillation (dashed lines) to transfer the knowledge from a fixed teacher model to student sub-networks at different widths in DynaBERTW. Distillation loss is computed by matching the logits” the first loss is extracted from the teacher model or original language model.) extracting knowledge corresponding to the task in the first language model as a second knowledge loss (pg 7 Section 3.2: “Using knowledge distillation (dashed lines) to transfer the knowledge from a fixed width adaptive teacher model with the maximum depth to student sub-networks at different depths in the both width- and depth-adaptive model. Distillation loss is computed by matching the logits, embedding, and hidden states between the teacher and student” additional loss terms are extracted from the first language model, which is fixed) Accordingly, it would have been obvious to a person of ordinary skill in the art before the effective filing date of the claimed invention to modify the augmented BERT model in Stickland to comprise the dynamic width and depth BERT models described by Hou.
One would have been motivated to make such a combination because Stickland and Hou describe methods for modifying the original BERT model to better suit specific tasks. Further, Hou notes “we offer flexibility in both width and depth of BERT to enable a significantly larger number of architectural configurations. This also enables better exploration of the balance between model accuracy and model size.” (pg 2 Hou)

Regarding Claim 13 Stickland/Hou teaches claim 12. Stickland teaches, wherein the textual information comprises textual transaction information that is uploaded to a transaction platform when the target platform is the transaction platform. (Introduction pg 1 “We experiment on a set of eight NLU tasks from the GLUE benchmark…which include question answering, sentiment analysis, and textual entailment.” A question-answering data set includes question-answering textual information of an NLU or “natural language understanding” task. Questions and answers are transaction information because they describe an exchange of information. For the described experiment to be performed, such information is received via upload. Nominally, a system which processes such information is a transaction platform.)

Regarding Claim 14 Stickland/Hou teaches claim 13. Stickland teaches, wherein the textual transaction information comprises at least one of: textual query information for querying a transaction object; textual information associated with a transaction operation performed by the transaction object; textual evaluation information for evaluating the transaction object; and textual search information for querying an associated object related to the transaction object. (Introduction pg 1 “We experiment on a set of eight NLU tasks from the GLUE benchmark…which include question answering, sentiment analysis, and textual entailment.” A question-answering data set includes question-answering textual information of an NLU or “natural language understanding” task.
A question is a textual query of an answer, which is nominally the transaction object.)

Regarding Claim 18 Stickland/Hou teaches claim 12. Further, Stickland teaches, adding target task parameters of the task to the original language model; (pg 4 Section 3.3 Adding Parameters within BERT “Instead of adding parameters to the top of the model, we may want to modify the BERT(·) function…Specifically, we wish to add task-specific parameters to each layer of the BERT model”) and training the target task parameters on a newly added corpus of the task to obtain the first language model. (pg 6 Section 4.2 “We …concentrated on training all of the parameters, finding it took less parameters to approach matching fine-tuned BERT...” Section 4.1 “A simple way to train a model on several tasks is to select a batch of training examples from each task, cycling through them in a fixed order. We refer to this as ‘round-robin’ sampling.” Section 4.2 pg 6 “No matter how we sampled tasks, we (unless stated otherwise) trained for 60,000 steps” training all parameters on sampled tasks from a corpus amounts to training the target task parameters on a newly added corpus. Sampling involves sampling from a plurality of tasks of the model.)

Regarding Claim 19 Stickland teaches, An apparatus comprising: one or more processors; and memory storing executable instructions that, when executed by the one or more processors, cause the one or more processors to perform acts comprising … receiving textual input information, wherein the textual input information is collected based on at least one text collector associated with a textual processing system; (pg 1 Introduction “We experiment on a set of eight NLU tasks from the GLUE benchmark … which include question answering, sentiment analysis, and textual entailment. The number of training examples varies widely across the tasks,” pg 7 Table 2 caption “Table 2.
GLUE Test results, scored by the GLUE evaluation server.” the eight tasks are descriptions of textual information which necessarily had been uploaded in order to be used in an experiment. The GLUE benchmark and test data are the textual input collected based on a text collector.) determining a task corresponding to the textual input information, and reading the target language model, wherein the task is processed by an original language model, the original language model being a pre-trained context characterization encoder, (Section 4.1 pg 5 “A simple way to train a model on several tasks is to select a batch of training examples from each task, cycling through them in a fixed order… Concretely, we select a batch of examples from task i with probability pi at each training step” task i is determined according to the current batch of examples selected. As noted previously, training involves processing the training examples with the original BERT model, which is a pre-trained context characterization encoder.) training the original language model based at least in part on features of the task to obtain a first language model, the first language model being a fine-tuned version of the original language model (Section 3.2 “The simplest way to add parameters to a model is to add them at the ‘top’ of the model… [equation image omitted] …where TS(·) is a task-specific function that can potentially operate on a single vector, but depends on the entire sequence when it contains attention layers…To avoid transformations requiring O(d2m) parameters, we propose using task-specific functions of the form…” pg 3 Section 2.2 “This work also keeps the BERT model fixed while training adapter modules. We concentrated on jointly finetuning the entire BERT model on all tasks” Section 4.1 “A simple way to train a model on several tasks is to select a batch of training examples from each task, cycling through them in a fixed order.
We refer to this as ‘round-robin’ sampling”… where E is the total number of epochs. Since we used multiple datasets we chose a somewhat arbitrary ‘epoch’ of 2400 training steps” the original language model is trained based on features of the task, resulting in a fine-tuned version of the model, which is the first language model.) performing a search in a neural architecture search based at least in part on the first knowledge loss and the second knowledge loss to obtain the target language model; (“We based our experiments on the PyTorch implementation of BERT 3 and open-source our code4 . No matter how we sampled tasks, we (unless stated otherwise) trained for 60,000 steps, with a minibatch size of 32, and a maximum sequence length of 128 tokens, choosing the best model from within that training time based on average development set score…We experimented briefly with freezing the BERT-base parameters and fine-tuning only the PALs and alternatives, but concentrated on training all of the parameters, finding it took less parameters to approach matching fine-tuned BERT” the model architecture or “best model” is searched for via training, which is based on the outputs from the model, including the common knowledge in the original model and the knowledge of the first language model.) and the target language model is obtained by processing the textual input information based on the target language model that is read to obtain a textual processing result; (Section 3.3 pg 4 “Instead of adding parameters to the top of the model, we may want to modify the BERT(·) function itself… We can add a task-specific function ‘in parallel’ with each BERT layer as follows” Section 3.2 “The simplest way to add parameters to a model is to add them at the ‘top’ of the model, i.e. just before the classification layer….
where TS(·) is a task-specific function that can potentially operate on a single vector… adding an extra BERT layer for each task, results in approximately a 1.67× increase in number of parameters, or 73 million new parameters…. A one or two layer feed-forward network followed by a residual connection and layer-norm, such that it has the same number of parameters as the previous form” it is noted that these various layer additions are all exemplary target models derived from compression knowledge or information from the original BERT model. These layers are based on features of the task, as they are trained to be task-specific functions. It is noted that the total set of parameters is necessarily larger because it includes both the additional layers and the BERT model itself. As an example, the additional target layers in one variant account for 0.67× the parameters, which is a fewer number of parameters than the 1× attributable to the BERT model itself. Section 3.3 “We can add a task-specific function ‘in parallel’ with each BERT layer as follows: [equation image omitted] … where l indexes the layer. This means we recover the original BERT model if TS(·) outputs a zero vector. Alternatively, we can add a ‘serial’ connection where we transform the output of a BERT layer:” the modified task-specific function is part of the target model, which processes the example or textual information. The outputs are the processing result.) and outputting the textual processing result. (Section 5 “Table 2 lists our results on GLUE for our best-performing PAL mode” the Table depicts the results acquired from any such target platform which produces the result.)
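Stickland's "parallel" connection quoted above — a task-specific function TS(·) added alongside each BERT layer, with the original model recovered when TS(·) outputs a zero vector — reduces to a simple additive form. The toy layer below is a linear stand-in for a real BERT layer, used only to make the zero-recovery property concrete; none of it is the reference's code.

```python
def parallel_connection(layer, task_fn, x):
    """'Parallel' task-specific connection: output = layer(x) + TS(x).
    If TS(x) is all zeros, the original layer output is recovered."""
    h, t = layer(x), task_fn(x)
    return [hi + ti for hi, ti in zip(h, t)]

toy_layer = lambda x: [2.0 * v for v in x]   # stand-in for a BERT layer
zero_ts = lambda x: [0.0] * len(x)           # TS(·) outputting a zero vector
small_ts = lambda x: [0.1 for _ in x]        # an adapter contributing a shift

x = [1.0, -1.0, 0.5]
unchanged = parallel_connection(toy_layer, zero_ts, x)   # == toy_layer(x)
adapted = parallel_connection(toy_layer, small_ts, x)
```

The zero-TS case is exactly the "recover the original BERT model" remark in the quoted Section 3.3; a "serial" connection would instead compose the task function after the layer.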
Stickland does not explicitly teach, extracting common knowledge in the original language model as a first knowledge loss, and extracting knowledge corresponding to the task in the first language model as a second knowledge loss …for using a target language model to be deployed in a real-time application that has strict limitations on computing resources and inference times Hou, however, when addressing distillation of BERT models into an adaptive model, teaches, extracting common knowledge in the original language model as a first knowledge loss; (pg 6 Figure 3 caption “Using knowledge distillation (dashed lines) to transfer the knowledge from a fixed teacher model to student sub-networks at different widths in DynaBERTW. Distillation loss is computed by matching the logits” the first loss is extracted from the teacher model or original language model.) extracting knowledge corresponding to the task in the first language model as a second knowledge loss (pg 7 Section 3.2: “Using knowledge distillation (dashed lines) to transfer the knowledge from a fixed width adaptive teacher model with the maximum depth to student sub-networks at different depths in the both width- and depth-adaptive model. Distillation loss is computed by matching the logits, embedding, and hidden states between the teacher and student” additional loss terms are extracted from the first language model, which is fixed) for using a target language model to be deployed in a real-time application that has strict limitations on computing resources and inference times (pg 21 Section E “To make subnetworks of DynaBERT the same size as those small models, for width, we also adapt the hidden state size H = 128, 256, 512, 768 besides attention heads and intermediate layer neurons… After pre-training DynaBERT, we fine-tune each separate sub-network with the original task-specific data on MNLI-m and report the development set results in Table 15.
We compare with separately pre-trained small models in Google BERT repository. As can be seen, sub-networks of the pre-trained DynaBERT outperform separately pre-trained small networks.” Abstract pg 1 “They can not fully satisfy the requirements of different edge devices with various hardware performances. In this paper, we propose a novel dynamic BERT model (abbreviated as DynaBERT), which can flexibly adjust the size and latency by selecting adaptive width and depth” pg 5 Section 3 “We evaluate the efficacy of our proposed DynaBERT and DynaRoBERTa under different efficiency constraints, including #parameters, FLOPs, the latency on Nvidia K40 GPU and Kirin 810 A76 ARM CPU” Small models, or target models, were deployed with the limitation of being the same size as other small models, i.e., a strict limitation on computing resources. The inference time is a function of latency, which is adjusted via the depth to suit edge devices. Examiner notes that this limitation is understood to describe intended use.) Accordingly, it would have been obvious to a person of ordinary skill in the art before the effective filing date of the claimed invention to modify the augmented BERT model in Stickland to comprise the dynamic width and depth BERT models described by Hou. One would have been motivated to make such a combination because Stickland and Hou describe methods for modifying the original BERT model to better suit specific tasks. Further, Hou notes “we offer flexibility in both width and depth of BERT to enable a significantly larger number of architectural configurations. This also enables better exploration of the balance between model accuracy and model size.” (pg 2 Hou)

Claim(s) 7 are rejected under 35 U.S.C. § 103 as being unpatentable over Stickland/Hou further in view of Misra “Cross-stitch Networks for Multi-task Learning”.
Regarding Claim 7 Stickland/Hou teaches claim 6. Hou further teaches, recording a first knowledge loss sequence of the original language model and a second knowledge loss sequence of the first language model in the knowledge aggregator, wherein the first knowledge loss sequence includes a knowledge loss of the original language model at least one moment of training, the second knowledge loss sequence includes a second knowledge loss of the first language model at the at least one moment of training; (pg 5-6 Section 3.1.3 “After the connections of the BERT model are rewired according to Algorithm 1, we use knowledge distillation to train DynaBERTW…we transfer the knowledge…of all L Transformer layers from the teacher model to student sub-networks… “Combining (4) - (6), the objective is… [equation image omitted] ” here the first three loss terms are the loss sequence of the original model in the first stage of training, which transfers knowledge from teacher to student. pg 7-8 Section 3.2 “After the training of DynaBERTW using Algorithm 2, we further use knowledge distillation to train DynaBERT with both adaptive width and depth….Thus the loss can be written as… [equation image omitted] ” after the first training phase, the second knowledge loss sequence is recorded/extracted; the three terms of the second sequence are shown in the equation.)
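The claim-7 mapping treats per-step distillation losses as recorded sequences, one per model, available at each training moment. A minimal "knowledge aggregator" of that shape might look like the following; the class, its method names, and the loss values are invented for illustration and are not drawn from Hou.

```python
from collections import defaultdict

class KnowledgeAggregator:
    """Records one loss sequence per model so that, at any training
    moment t, the original model's (first) and the first language
    model's (second) knowledge losses are both available."""

    def __init__(self):
        self.sequences = defaultdict(list)

    def record(self, model_name, loss):
        # Append this moment's loss to the named model's sequence.
        self.sequences[model_name].append(loss)

    def at_moment(self, t):
        # Loss of every tracked model at training moment t.
        return {name: seq[t] for name, seq in self.sequences.items()}

agg = KnowledgeAggregator()
for first_loss, second_loss in [(0.9, 1.2), (0.7, 1.0), (0.5, 0.8)]:
    agg.record("original", first_loss)   # first knowledge-loss sequence
    agg.record("first_lm", second_loss)  # second knowledge-loss sequence
```

Pairing the two sequences at the same training moment is what would let downstream logic (the claimed cross-task clustering) compare tasks by their loss trajectories.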
Stickland and Hou are combined for the reasons set forth in the rejection of claim 4. Stickland/Hou does not explicitly teach, clustering multiple tasks to obtain at least one meta-task group based on the first knowledge loss sequence of the original language model and the second knowledge loss sequence of the first language model…wherein the meta-task group includes at least two tasks whose similarity degree is greater than a first threshold…performing normalization based on a target value of the meta-task group to obtain a weight of the meta-task group…wherein the target value is used to indicate an average classification performance of the meta-task group…and establishing the cross-task relationships based on the weight of the meta-task group. Misra, however, when addressing neural networks for multi-task learning, teaches, clustering multiple tasks to obtain at least one meta-task group based on the first knowledge loss sequence of the original language model and the second knowledge loss sequence of the first language model, (pg 3 Figure 3 “We model shared representations by learning a linear combination of input activation maps. At each layer of the network, we learn such a linear combination of the activation maps from both the tasks.” The learned linear combination describes a process of clustering by learning the relationship between abstractions of the tasks in a meta-task group, including at least two tasks in the art. Training as in X/Y is based on the two loss sequences claimed.) wherein the meta-task group includes at least two tasks whose similarity degree is greater than a first threshold; (pg 2 “We first pair semantic segmentation (SemSeg) and surface normal prediction (SN). We believe the two tasks are closely related to each other since segmentation boundaries also correspond to surface normal boundaries.
For this pair of tasks” the pair of tasks in the meta-task group a closely related, therefore the similarity is greater than some threshold as claimed.) performing normalization based on a target value of the meta-task group to obtain a weight of the meta-task group, ( pg 4 “The α values of a cross-stitch unit model linear combinations of feature maps. Their initialization in the range [0, 1] is important for stable learning, as it ensures that values in the output activation map (after cross-stitch unit) are of the same order of magnitude as the input values before linear combination” the weights which describe the relationship between tasks are normalized to the range [0,1], normalize is interpreted as meaning to put into a normal or standardized condition. This normalization is based on the stabilizing the output activation map which includes the target value.) wherein the target value is used to indicate an average classification performance of the meta-task group; (pg 4 figure 4 caption “Cross-stitch units can model shared representations as a linear combination of input activation maps. This network tries to learn representations that can help with both tasks A and B.” the learned output activations are indicative of the average classification performance of the two tasks in the group. Figure 5 pg 7 caption “Change in performance for attribute categories over the baseline is indicated by blue bars” the collection of activation maps of the neural network indicates the average activation for classification which is a measure of performance of the tasks.) and establishing the cross-task relationships based on the weight of the meta-task group. (Section 3.3 pg 3 “We refer to this the cross-stitch operation, and the unit that models it for each layer l as the cross-stitch unit. 
The network can decide to make certain layers task specific by setting αAB or αBA to zero, or choose a more shared representation by assigning a higher value to them” the α terms are weights which establish the cross task relationships.) Accordingly, it would have been obvious to a person of ordinary skill in the art before the effective filing date of the claimed invention to modify augmented BERT model in Stickland/Hou to comprise the cross-stich architecture as described by Misra. One would have been motivated to make such a combination because Stickland/Hou and Misra describe methods for modifying the original BERT model to better suit specific tasks. Further, Misra notes “our performance is better than the best Split architecture network found using brute force search. This shows that the cross-stitch units can effectively search for optimal amount of sharing in multi-task networks.” (Section 6.2 pg 7 Misra) Conclusion Prior art: Tang et al. “Distilling Task-Specific Knowledge from BERT into Simple Neural Networks” addresses task specific learning using a BERT representation model. Which has been compressed from the larger original model. Any inquiry concerning this communication or earlier communications from the examiner should be directed to JOHNATHAN R GERMICK whose telephone number is (571)272-8363. The examiner can normally be reached M-F 7:30-4:30. Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice. If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Kakali Chaki can be reached on 571-272-3719. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300. 
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.

/J.R.G./
Examiner, Art Unit 2122

/KAKALI CHAKI/
Supervisory Patent Examiner, Art Unit 2122
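For reference, the cross-stitch operation the examiner quotes from Misra (each task's activation becomes a learned linear combination of both tasks' activations, with the α off-diagonals controlling sharing) can be sketched as follows. The numpy framing and array shapes are illustrative assumptions, not Misra's implementation.

```python
import numpy as np

def cross_stitch(x_a, x_b, alpha):
    """Cross-stitch unit in the spirit of Misra et al.: each task's output
    activation map is a linear combination of both tasks' input activation
    maps. `alpha` is a 2x2 matrix [[aAA, aAB], [aBA, aBB]]; the off-diagonal
    entries (aAB, aBA) control how much each task borrows from the other."""
    out_a = alpha[0, 0] * x_a + alpha[0, 1] * x_b
    out_b = alpha[1, 0] * x_a + alpha[1, 1] * x_b
    return out_a, out_b
```

Setting the off-diagonal α terms to zero makes a layer task-specific (each task sees only its own activations), while nonzero off-diagonals establish the cross-task sharing the rejection maps to the claimed "cross-task relationships."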

Prosecution Timeline

May 06, 2021
Application Filed
May 07, 2025
Non-Final Rejection — §101, §103
Aug 05, 2025
Response Filed
Aug 15, 2025
Final Rejection — §101, §103
Oct 23, 2025
Request for Continued Examination
Oct 25, 2025
Response after Non-Final Action
Jan 13, 2026
Non-Final Rejection — §101, §103 (current)

Precedent Cases

Applications granted by this same examiner with similar technology

Patent 12566962
DITHERED QUANTIZATION OF PARAMETERS DURING TRAINING WITH A MACHINE LEARNING TOOL
2y 5m to grant Granted Mar 03, 2026
Patent 12566983
MACHINE LEARNING CLASSIFIERS PREDICTION CONFIDENCE AND EXPLANATION
2y 5m to grant Granted Mar 03, 2026
Patent 12554977
DEEP NEURAL NETWORK FOR MATCHING ENTITIES IN SEMI-STRUCTURED DATA
2y 5m to grant Granted Feb 17, 2026
Patent 12443829
NEURAL NETWORK PROCESSING METHOD AND APPARATUS BASED ON NESTED BIT REPRESENTATION
2y 5m to grant Granted Oct 14, 2025
Patent 12443868
QUANTUM ERROR MITIGATION USING HARDWARE-FRIENDLY PROBABILISTIC ERROR CORRECTION
2y 5m to grant Granted Oct 14, 2025


Prosecution Projections

3-4
Expected OA Rounds
47%
Grant Probability
79%
With Interview (+32.1%)
4y 2m
Median Time to Grant
High
PTA Risk
Based on 91 resolved cases by this examiner. Grant probability derived from career allow rate.
