Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first-inventor-to-file provisions of the AIA.
Detailed Action
This action is in response to the claims filed 04/28/2023:
Claims 1-20 are pending.
Claims 1, 13, and 19 are independent.
Drawings
The drawings filed 4/28/2023 and 7/21/2023 are objected to because FIGs. 3-5, 7, and 8 are low quality scans with illegible elements. Corrected drawing sheets in compliance with 37 CFR 1.121(d) are required in reply to the Office action to avoid abandonment of the application. Any amended replacement drawing sheet should include all of the figures appearing on the immediate prior version of the sheet, even if only one figure is being amended. The figure or figure number of an amended drawing should not be labeled as “amended.” If a drawing figure is to be canceled, the appropriate figure must be removed from the replacement sheet, and where necessary, the remaining figures must be renumbered and appropriate changes made to the brief description of the several views of the drawings for consistency. Additional replacement sheets may be necessary to show the renumbering of the remaining figures. Each drawing sheet submitted after the filing date of an application must be labeled in the top margin as either “Replacement Sheet” or “New Sheet” pursuant to 37 CFR 1.121(d). If the changes are not accepted by the examiner, the applicant will be notified and informed of any required corrective action in the next Office action. The objection to the drawings will not be held in abeyance.
Claim Rejections - 35 USC § 101
101 Rejection
35 U.S.C. 101 reads as follows:
Whoever invents or discovers any new and useful process, machine, manufacture, or composition of matter, or any new and useful improvement thereof, may obtain a patent therefor, subject to the conditions and requirements of this title.
Claims 1-20 are rejected under 35 U.S.C. § 101 because the claimed invention is directed to non-statutory subject matter.
Regarding Claim 1: Claim 1 is rejected under 35 U.S.C. 101 because the claimed invention is directed to an abstract idea without significantly more.
Step 1 Analysis: Claim 1 is directed to a method, which is a process, one of the statutory categories of invention.
Step 2A Prong One Analysis: Under its broadest reasonable interpretation, claim 1 recites a series of mental processes and mathematical concepts. For example, but for the recitation of generic computer components, the following limitations, in the context of this claim, encompass machine learning processing:
generating, […], a plurality of teacher models corresponding to the plurality of predictive categories based on the plurality of training datasets, wherein each teacher model is trained by optimizing a triplet loss for a particular training dataset of the plurality of training datasets (mathematical calculations and relationships: the claim explicitly recites optimizing a triplet loss, which is a mathematical calculation; a representative formulation is reproduced after this analysis for reference. This claim limitation is highly analogous to Example 47 of the 2024 Subject Matter Eligibility guidance); and
generating, […], a multi-headed composite model based on a respective plurality of trained parameters for each of the plurality of teacher models, wherein the multi-headed composite model comprises a plurality of model heads that correspond to the plurality of teacher models (observation, evaluation, and judgment).
Therefore, claim 1 recites an abstract idea which is a judicial exception.
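For the record, the following is a representative, generic triplet loss formulation of the type characterized above as mathematical calculations and relationships. It is offered only to illustrate the examiner's characterization; the applicant's particular formulation is defined by the claims and specification, and the symbols below are the examiner's own notation.

```latex
% Generic triplet loss, for illustration only (examiner's notation):
% a = anchor sample, p = positive sample, n = negative sample,
% d(\cdot,\cdot) = a distance measure (e.g., a cosine-similarity-based distance),
% \alpha = margin hyperparameter.
\[
  \mathcal{L}_{\mathrm{triplet}}(a, p, n) \;=\;
  \max\!\bigl( d(a, p) - d(a, n) + \alpha,\; 0 \bigr)
\]
% Optimizing (minimizing) this loss drives d(a, p) down and d(a, n) up, i.e., it
% minimizes the anchor-positive distance and maximizes the anchor-negative distance.
```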
Step 2A Prong Two Analysis: Claim 1 recites the additional element “by one or more processors”. However, this additional element is a computer component recited at a high level of generality, such that it amounts to no more than mere instructions to apply the judicial exception using a generic computer component. An additional element that merely recites the words “apply it” (or an equivalent) with the judicial exception, merely includes instructions to implement an abstract idea on a computer, or merely uses a computer as a tool to perform an abstract idea does not integrate the judicial exception into a practical application (see MPEP 2106.05(f)). Claim 1 also recites the additional element “receiving, by one or more processors, a plurality of training datasets corresponding to a plurality of predictive categories”, which amounts to gathering data, an insignificant extra-solution activity (see MPEP 2106.05(g)). Therefore, claim 1 is directed to a judicial exception.
Step 2B Analysis: Claim 1 does not include additional elements that are sufficient to amount to significantly more than the judicial exception. As discussed above with respect to the lack of integration of the abstract idea into a practical application, the additional elements recited in claim 1 amount to no more than mere instructions to apply the judicial exception using a generic computer component and insignificant extra-solution activity. The gathering and outputting of data is considered well-understood, routine, and conventional in the art (see MPEP 2106.05(d)(II)(i)).
For the reasons above, claim 1 is rejected as being directed to non-statutory subject matter under 35 U.S.C. § 101. This rejection applies equally to independent claims 13 and 19, which recite a system and a computer program product, respectively, as well as to dependent claims 2-12, 14-18, and 20.
Independent claim 13 recites the additional elements “A system comprising memory and one or more processors communicatively coupled to the memory, the one or more processors configured to”, which amount to mere instructions to apply the judicial exception using generic computer components and do not integrate the judicial exception into a practical application (see MPEP 2106.05(f)).
Independent claim 19 likewise recites the additional elements “One or more non-transitory computer-readable storage media including instructions that, when executed by one or more processors, cause the one or more processors to”, which amount to mere instructions to apply the judicial exception using generic computer components and do not integrate the judicial exception into a practical application (see MPEP 2106.05(f)).
The additional limitations of the dependent claims are addressed briefly below:
Dependent claim 2 recites additional insignificant extra-solution activity of gathering and outputting data: “the plurality of training datasets comprise a respective training dataset for each predictive category of the plurality of predictive categories, and the plurality of teacher models comprise a respective teacher model for each predictive category of the plurality of predictive categories”, which amounts to selection of a data type (see MPEP 2106.05(g)) that is well-understood, routine, and conventional in the art (see MPEP 2106.05(d)(II)(iii)).
Dependent claim 3 recites additional observation, evaluation, and judgment: “wherein a predictive category is indicative of an ontology category for a prediction domain, and wherein a training dataset for the predictive category comprises a plurality of mapped text sequences for the ontology category”.
Dependent claim 4 recites additional observation, evaluation, and judgment: “each of the plurality of mapped text sequences comprises a text sequence and a training label corresponding to the text sequence”.
Dependent claim 5 recites additional observation, evaluation, and judgment: “the plurality of training datasets are previously generated based on a semantic similarity between a third-party category and a predictive category”.
Dependent claim 6 recites additional instructions to apply the judicial exception using generic computer components: “each of the plurality of teacher models is a deep neural network comprising a plurality of attention layers” (see MPEP 2106.05(f)).
Dependent claim 7 recites additional insignificant extra-solution activity: “the particular training dataset for a teacher model of the plurality of teacher models comprises a plurality of text sequences and a plurality of training labels”, which amounts to selection of a data type (see MPEP 2106.05(g)) that is well-understood, routine, and conventional in the art (see MPEP 2106.05(d)(II)(iii)). Claim 7 also recites additional mathematical calculations and relationships: “the triplet loss is based on (i) a first distance between an anchor text sequence of the plurality of text sequences and a positive training label and (ii) a second distance between the anchor text sequence and a negative training label”.
Dependent claim 8 recites additional mathematical calculations and relationships: “optimizing the triplet loss comprises minimizing the first distance and maximizing the second distance”.
Dependent claim 9 recites additional observation, evaluation, and judgment: “generating, using a machine learning encoder model, a plurality of text embeddings for the plurality of text sequences and the plurality of training labels”, as well as additional mathematical calculations and relationships: “generating the first distance based on a first cosine similarity distance between an anchor embedding corresponding to the anchor text sequence and a positive embedding corresponding to the positive training label” and “generating the second distance based on a second cosine similarity distance between the anchor embedding and a negative embedding corresponding to the negative training label”.
Dependent claim 10 recites additional observation, evaluation, and judgment: “identifying a teacher model from the plurality of teacher models based on the plurality of training datasets, wherein the teacher model is identified based on a number of mapped text sequences in a training dataset that corresponds to the teacher model; and generating the model body based on a plurality of trained parameters for the teacher model”.
Dependent claim 11 recites additional observation, evaluation, and judgment: “generating a model head of the plurality of model heads based on the plurality of trained parameters for the teacher model”.
Dependent claim 12 recites additional observation, evaluation, and judgment: “generating, using the teacher model, a first output embedding for a mapped text sequence of the training dataset; generating, using the multi-headed composite model, a second output embedding for the mapped text sequence; and updating one or more parameters of the multi-headed composite model based on a comparison between the first output embedding and the second output embedding”.
Therefore, when considered separately and in combination, the additional elements do not amount to significantly more than the judicial exception. Accordingly, claims 1-20 are rejected under 35 U.S.C. § 101.
Claim Rejections - 35 USC § 102
In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
The following is a quotation of the appropriate paragraphs of 35 U.S.C. 102 that form the basis for the rejections under this section made in this Office action:
A person shall be entitled to a patent unless –
(a)(1) the claimed invention was patented, described in a printed publication, or in public use, on sale, or otherwise available to the public before the effective filing date of the claimed invention.
Claims 1, 6, 13, 18, and 19 are rejected under 35 U.S.C. § 102(a)(1) as being anticipated by Wu (“Person Re-Identification by Context-Aware Part Attention and Multi-Head Collaborative Learning”, 2021).
Regarding claim 1, Wu teaches A computer-implemented method, the computer-implemented method comprising:([p. 9] "Efficiency Analysis. We adopt the floating-point operations (FLOPs) in number of multiply-adds and the required GPU memory during training to measure the computational cost of CNN model")
receiving, by one or more processors, a plurality of training datasets corresponding to a plurality of predictive categories;([p. 6 §IV] "we thoroughly analyze the effectiveness of our method on four challenging video person re-ID datasets, including PRID2011, iLIDS-VID, MARS and Duke VideoReID, as shown in Fig. 5" [p. 5] "Nid is the total category number of person identities")
generating, by the one or more processors, a plurality of teacher models corresponding to the plurality of predictive categories based on the plurality of training datasets, ([p. 2] "each head can transfer its knowledge to each other, which collaboratively improves the accuracy but without extra model architecture design" [p. 5] "we propose to use multiple supervision heads rather than single head to guide the feature learning. In our multi-head collaborative learning framework, each head has the same design and supervision but with different parameters. Each head consists of two fully connected layers named embedding layer and classification layer respectively. The multi-head collaborative learning scheme enables diversity predictions" [p. 5] "the embedding feature fh is fed into the classification layer to obtain classification prediction logit zh, which is supervised by an identity classification loss. The identity loss for head h is denoted by [Eqn. 6]" each head subnetwork (model backbone + respective head) in Wu interpreted as a teacher, each with its own parameters, trained on the datasets to produce identity-category predictions corresponding to the plurality of predictive categories Ni.)
wherein each teacher model is trained by optimizing a triplet loss for a particular training dataset of the plurality of training datasets; and([p. 5] "the learning objective of multi-head collaborative learning contains two main parts: 1) baseline loss: triplet loss(Ltri) […] the loss function for each head h is formulated as follows [Eqn. 5]" [p. 6] "The total loss L for our multi-head collaborative learning objective framework is defined by L=(∑(h=1,...,H)(Ltrih+...)" See also FIG. 4. Wu explicitly trains each head with its own triplet loss term and includes it in the summed objective across heads. Table VI shows the results of training the teachers for a particular dataset.)
generating, by the one or more processors, a multi-headed composite model based on a respective plurality of trained parameters for each of the plurality of teacher models, ([p. 5] "each head has the same design and supervision but with different parameters" [p. 6] "during inference, we concatenate the fh from all the heads together as video feature descriptors")
wherein the multi-headed composite model comprises a plurality of model heads that correspond to the plurality of teacher models.([p. 5] "FIG. 4. Illustration of multi-head consistency loss. The prediction logits z from all other learning heads are averaged to obtained a soft label" See FIG. 3 and 4.).
[media_image1.png: FIG. 3 of Wu]
[media_image2.png: Markup showing interpreted teacher of Wu]
[media_image3.png: Markup showing interpreted teacher of Wu]
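To further clarify the record, below is a minimal sketch, in Python (PyTorch-style), of how the examiner is interpreting Wu's multi-head collaborative learning framework as mapping onto the claimed teacher models and multi-headed composite model: each interpreted teacher is the shared backbone together with one supervision head (an embedding layer followed by a classification layer, each head having its own parameters and its own triplet loss term), and the composite model is the backbone with all heads, whose embedding features are concatenated at inference (Wu, pp. 5-6). The class and variable names (Head, MultiHeadComposite, num_heads, etc.) are the examiner's own illustrative labels, not code from Wu or language from the claims.

```python
# Illustrative sketch of the examiner's claim interpretation of Wu (not Wu's actual code).
import torch
import torch.nn as nn

class Head(nn.Module):
    """One supervision head: embedding layer followed by classification layer (Wu, p. 5)."""
    def __init__(self, feat_dim: int, embed_dim: int, num_ids: int):
        super().__init__()
        self.embedding = nn.Linear(feat_dim, embed_dim)
        self.classifier = nn.Linear(embed_dim, num_ids)

    def forward(self, x):
        f = self.embedding(x)      # embedding feature f_h
        z = self.classifier(f)     # classification prediction logit z_h
        return f, z

class MultiHeadComposite(nn.Module):
    """Shared backbone plus all heads; each backbone+head pair is an interpreted teacher."""
    def __init__(self, backbone: nn.Module, feat_dim: int, embed_dim: int,
                 num_ids: int, num_heads: int):
        super().__init__()
        self.backbone = backbone
        self.heads = nn.ModuleList(
            [Head(feat_dim, embed_dim, num_ids) for _ in range(num_heads)]
        )

    def forward(self, x):
        feat = self.backbone(x)
        outs = [head(feat) for head in self.heads]   # per-teacher (f_h, z_h) outputs
        fs = [f for f, _ in outs]
        # At inference, Wu concatenates the per-head embedding features as the descriptor.
        return torch.cat(fs, dim=-1), outs
```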
Regarding claim 6, Wu teaches The computer-implemented method of claim 1, wherein each of the plurality of teacher models is a deep neural network comprising a plurality of attention layers.(Wu [p. 8] "Multi-scale Spatial-Temporal Attention (MSTA) [74]. Our multi-level CPA model performs significantly better than other attention mechanisms. The experimental results demonstrate our multi-level CPA has stronger capability to model spatial-temporal information and learn discriminative features" [p. 2] "Moreover, the informative low-level features (the output of shallow convolutional layers) [22] are ignored in their attention module, but it has been shown that different layers capture different kinds of discriminative features in Fig. 1. Therefore, it motivates us to investigate a solution to simultaneously capture the robust and discriminative multi-level part attention cues in different layers" Wu shows the model path of each teacher subnetwork being a deep neural network comprising multiple layers in FIG. 2 and 3 especially).
Regarding claim 13, claim 13 is directed towards a system for performing the method of claim 1. Therefore the rejection applied to claim 1 also applies to claim 13. Claim 13 also recites additional elements A system comprising memory and one or more processors communicatively coupled to the memory, the one or more processors configured to (Wu [p. 9] "Efficiency Analysis. We adopt the floating-point operations (FLOPs) in number of multiply-adds and the required GPU memory during training to measure the computational cost of CNN model").
Similarly, regarding claim 18, claim 18 is directed towards a system for performing the method of claim 6. Therefore, the rejection applied to claim 6 also applies to claim 18.
Regarding claim 19, claim 19 is directed towards non-transitory computer-readable storage media for performing the method of claim 1. Therefore, the rejection applied to claim 1 also applies to claim 19. Claim 19 also recites additional elements One or more non-transitory computer-readable storage media including instructions that, when executed by one or more processors, cause the one or more processors to (Wu [p. 9] "Efficiency Analysis. We adopt the floating-point operations (FLOPs) in number of multiply-adds and the required GPU memory during training to measure the computational cost of CNN model").
Claim Rejections - 35 USC § 103
In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.
The factual inquiries for establishing a background for determining obviousness under 35 U.S.C. 103 are summarized as follows:
1. Determining the scope and contents of the prior art.
2. Ascertaining the differences between the prior art and the claims at issue.
3. Resolving the level of ordinary skill in the pertinent art.
4. Considering objective evidence present in the application indicating obviousness or nonobviousness.
Claims 2, 5, 14, and 17 are rejected under 35 U.S.C. § 103 as being unpatentable over the combination of Wu and Lambert (“MSeg: A Composite Dataset for Multi-domain Semantic Segmentation”, 2020).
Regarding claim 2, Wu teaches The computer-implemented method of claim 1.
However, Wu does not explicitly teach wherein: the plurality of training datasets comprise a respective training dataset for each predictive category of the plurality of predictive categories, and the plurality of teacher models comprise a respective teacher model for each predictive category of the plurality of predictive categories.
Lambert, in the same field of endeavor, teaches The computer-implemented method of claim 1, wherein: the plurality of training datasets comprise a respective training dataset for each predictive category of the plurality of predictive categories, and the plurality of teacher models comprise a respective teacher model for each predictive category of the plurality of predictive categories.([p. 3] "A computer vision professional will likely resort to multiple models, each trained on a different dataset." Each dataset specific domain interpreted as a predictive category corresponding to a respective dataset, each model explicitly trained for the respective domain).
Wu and Lambert are both directed toward machine learning and are therefore analogous art reasonably pertinent to the claimed invention. It would have been obvious before the effective filing date of the claimed invention to combine the teachings of Wu with the teachings of Lambert by using the model in Wu as the model architecture for the multiple models trained on different domains/datasets in Lambert. Lambert provides additional motivation to combine ([p. 3] "A computer vision professional will likely resort to multiple models, each trained on a different dataset."). This motivation to combine also applies to the remaining claims that depend on this combination.
Regarding claim 5, Wu teaches The computer-implemented method of claim 1.
However, Wu does not explicitly teach wherein the plurality of training datasets are previously generated based on a semantic similarity between a third-party category and a predictive category.
Lambert, in the same field of endeavor, teaches the plurality of training datasets are previously generated based on a semantic similarity between a third-party category and a predictive category. ([p. 8] "The ‘Naive merge’ baseline is a model trained on a composite dataset that uses a naively merged taxonomy in which the classes are a union of all training classes, and each test class is only mapped to an universal class if they share the same name. The ‘MSeg (w/o relabeling)’ base line uses the unified MSeg taxonomy").
Wu and Lambert are both directed toward machine learning and are therefore analogous art reasonably pertinent to the claimed invention. It would have been obvious before the effective filing date of the claimed invention to combine the teachings of Wu with the teachings of Lambert by using the model in Wu as the model architecture for the multiple models trained on different domains/datasets in Lambert. Lambert provides additional motivation to combine ([p. 3] "A computer vision professional will likely resort to multiple models, each trained on a different dataset."). This motivation to combine also applies to the remaining claims that depend on this combination.
Regarding claims 14 and 17, claims 14 and 17 are directed towards a system for performing the methods of claims 2 and 5, respectively. Therefore, the rejections applied to claims 2 and 5 also apply to claims 14 and 17.
Claims 3, 4, 15, and 16 are rejected under 35 U.S.C. § 103 as being unpatentable over the combination of Wu, Lambert, and He (“BERTMap: a BERT-based ontology alignment system”, 2022).
Regarding claim 3, the combination of Wu and Lambert teaches The computer-implemented method of claim 2, wherein a predictive category is indicative of an ontology category for a prediction domain,(Lambert [p. 2] "different datasets have different taxonomies: that is, they have different definitions of what constitutes a ‘category’ or ‘class’ of visual entities" [p. 2] "we reconcile the taxonomies, merging and splitting classes to arrive at a unified taxonomy with 194 categories. To bring the pixel-level annotations in conformance with the unified taxonomy" unified taxonomy interpreted as an ontology of categories for the domain).
However, the combination of Wu and Lambert does not explicitly teach and wherein a training dataset for the predictive category comprises a plurality of mapped text sequences for the ontology category.
He, in the same field of endeavor, teaches and wherein a training dataset for the predictive category comprises a plurality of mapped text sequences for the ontology category.([p. 2] "The corpora for BERT fine tuning are composed of pairs of such synonymous labels (i.e., “synonyms”) and pairs of such non-synonymous labels (i.e.,“non-synonyms”) […] For each named class c in an input ontology, we derive all its synonyms which are pairs").
Wu, Lambert, and He are all directed toward machine learning and are therefore analogous art reasonably pertinent to the claimed invention. It would have been obvious before the effective filing date of the claimed invention to combine the teachings of the combination of Wu and Lambert with the teachings of He by using the domain-specific (corpus-specific) text sequences and mappings in He as the input data for the architecture in Wu, trained as separate models for a plurality of dataset-specific domains as suggested by Lambert. He provides additional motivation to combine ([p. 7] “The mapping extension and repair modules further improve the recall and precision”).
Regarding claim 4, the combination of Wu, Lambert, and He teaches The computer-implemented method of claim 3, wherein each of the plurality of mapped text sequences comprises a text sequence and a training label corresponding to the text sequence.(He [p. 2] "we denote a label after preprocessing3 by ω, and denote the set of all the preprocessed labels of a class c […] For each named class c in an input ontology, we derive all its synonyms which are pairs (ω1,ω2)").
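For clarity, the following Python sketch illustrates the examiner's interpretation of He's corpus construction as mapped text sequences for an ontology category (claims 3-4): for each named ontology class, pairs of synonymous labels form mapped text sequences with a positive training label, and cross-class pairs provide negative training labels. The function name, data layout, and example category names below are the examiner's own illustration, not code or data from He.

```python
# Illustrative only: building "mapped text sequences" from an ontology, per the
# examiner's interpretation of He (BERTMap). Synonym pairs within a class receive a
# positive training label (1); pairs drawn across classes receive a negative label (0).
from itertools import combinations, product

def build_mapped_text_sequences(ontology: dict) -> list:
    """ontology: {ontology category -> list of preprocessed class labels (text sequences)}."""
    dataset = []
    for cls, labels in ontology.items():
        for w1, w2 in combinations(labels, 2):                 # synonyms within a category
            dataset.append((w1, w2, 1))                        # text sequence pair + positive label
        for other, other_labels in ontology.items():
            if other != cls:
                for w1, w2 in product(labels, other_labels):   # non-synonyms across categories
                    dataset.append((w1, w2, 0))                # negative training label
    return dataset

# Hypothetical example with two ontology categories (illustrative names only):
corpus = build_mapped_text_sequences({
    "heart_disease": ["heart disease", "cardiac disease"],
    "renal_failure": ["renal failure", "kidney failure"],
})
```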
Regarding claims 15 and 16, claims 15 and 16 are directed towards a system for performing the methods of claims 3 and 4, respectively. Therefore, the rejections applied to claims 3 and 4 also apply to claims 15 and 16.
Claims 7, 8, 9, and 20 are rejected under 35 U.S.C. § 103 as being unpatentable over the combination of Wu and Fisch (“StarSpace: Embed All The Things!”, 2018).
Regarding claim 7, Wu teaches The computer-implemented method of claim 1, wherein: the particular training dataset for a teacher model of the plurality of teacher models comprises a plurality of [text] sequences and a plurality of training labels, and(Wu [p. 5] "Illustration of multi-head consistency loss. The prediction logits z from all other learning heads are averaged to obtained a soft label, which presents the prediction consensus of other heads […] the identity loss for head h [...] where yi is the one-hot ground truth label [...] Similar to the identity classification loss, we use the soft label as the supervision by computing the softmax cross entropy between soft label yhi and the identity prediction" [p. 6] "All the identities in dataset are randomly split into 50% for training and 50% for testing" Wu FIG. 3 shows the input is a sequence of frames and explicitly states that the output is soft labels compared against training labels)
the triplet loss is based on (i) a first distance between an anchor text sequence of the plurality of [text] sequences and a positive training label and (ii) a second distance between the anchor [text] sequence and a negative training label.(Wu [p. 5] "typically, the loss function for each head is formulated as follows: [See Eqn. 5] where […] a is the margin between positive and negative pairs […] the feature embeddings of the anchor, positive and negative samples, respectively" Wu eqn. 5 shows an anchor-positive distance term and an anchor-negative distance term (hardest positive minus hardest negative with margin)).
However, Wu does not explicitly teach The particular training dataset for a teacher model of the plurality of teacher models comprises a plurality of [text] sequences and a plurality of training labels.
Fisch, in the same field of endeavor, teaches The particular training dataset for a teacher model of the plurality of teacher models comprises a plurality of [text] sequences and a plurality of training labels([p. 2] "Multiclass Classification (e.g. Text Classification) The positive pair generator comes directly from a training set of labeled data specifying (a,b) pairs where a are documents (bags-of-words) and b are labels (singleton features). Negative entities b− are sampled from the set of possible labels").
Wu and Fisch are both directed toward machine learning for calculating pairwise similarity and are therefore analogous art reasonably pertinent to the claimed invention. It would have been obvious before the effective filing date of the claimed invention to combine the teachings of Wu with the teachings of Fisch by using the text similarity data in Fisch as the input for the model in Wu. Fisch provides additional motivation to combine ([p. 3] “The intuition is that semantic similarity between sentences is shared within a document (one can also only select sentences within a certain distance of each other if documents are very long). Further, the embeddings will automatically be optimized for sets of words of sentence length”).
Regarding claim 8, the combination of Wu and Fisch teaches The computer-implemented method of claim 7, wherein optimizing the triplet loss comprises minimizing the first distance and maximizing the second distance.(Wu [p. 5] "typically, the loss function for each head is formulated as follows: [See Eqn. 5] where […] a is the margin between positive and negative pairs […] the feature embeddings of the anchor, positive and negative samples, respectively" In optimizing the triplet loss of Eqn. 5, Wu minimizes the anchor-positive (hardest positive) distance, corresponding to the first distance, and maximizes the anchor-negative (hardest negative) distance, corresponding to the second distance).
Regarding claim 9, the combination of Wu and Fisch teaches The computer-implemented method of claim 7 further comprising: generating, using a machine learning encoder model, a plurality of text embeddings for the plurality of text sequences and the plurality of training labels;(Fisch [p. 1] "We present StarSpace, a general-purpose neural embedding model that can solve a wide variety of problems: labeling tasks such as text classification, ranking tasks such as information retrieval/web search, collaborative filtering-based or content-based recommendation, embedding of multi-relational graphs, and learning word, sentence or document level embeddings" [p. 2] "An entity such as a document or a sentence can be described by a bag of words […] a user entity can be compared with an item entity (recommendation), or a document entity with label entities (text classification)")
generating the first distance based on a first cosine similarity distance between an anchor embedding corresponding to the anchor text sequence and a positive embedding corresponding to the positive training label; and(Fisch [p. 2] "The similarity function sim(·,·). In our system, we have implemented both cosine similarity and inner product […] compares the positive pair (a, b) with the negative pairs (a,b− i)" Starspace explicitly uses the cosine similarity as the similarity metric between embedded entities. The first distance is the cosine similarity for positive embedding similarities.)
generating the second distance based on a second cosine similarity distance between the anchor embedding and a negative embedding corresponding to the negative training label. (Fisch [p. 2] "The similarity function sim(·,·). In our system, we have implemented both cosine similarity and inner product […] compares the positive pair (a, b) with the negative pairs (a,b− i)" The second distance is the cosine similarity for negative embedding similarities.).
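For clarity of record, the following is a minimal Python sketch of the examiner's interpretation of the claimed cosine-based triplet distances (claims 7-9): a cosine similarity between an anchor embedding and a positive or negative label embedding is converted to a distance, and the triplet loss penalizes the first (anchor-positive) distance exceeding the second (anchor-negative) distance by more than a margin. The function names and the margin value are the examiner's own illustrative choices, not code from Wu or Fisch.

```python
# Illustrative sketch only: cosine-similarity-based first and second distances
# used in a margin-based triplet loss.
import torch
import torch.nn.functional as F

def cosine_distance(u: torch.Tensor, v: torch.Tensor) -> torch.Tensor:
    """Cosine-similarity-based distance: 1 - cos(u, v)."""
    return 1.0 - F.cosine_similarity(u, v, dim=-1)

def triplet_loss(anchor_emb, positive_emb, negative_emb, margin: float = 0.2):
    first_distance = cosine_distance(anchor_emb, positive_emb)    # anchor vs. positive label
    second_distance = cosine_distance(anchor_emb, negative_emb)   # anchor vs. negative label
    # Optimizing this loss minimizes the first distance and maximizes the second (claim 8).
    return torch.clamp(first_distance - second_distance + margin, min=0.0).mean()
```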
Regarding claim 20, claim 20 is directed towards a non-transitory computer readable media for performing the method of claim 7. Therefore, the rejection applied to claim 7 also applies to claim 20.
Claims 10, 11, and 12 are rejected under 35 U.S.C. § 103 as being unpatentable over the combination of Wu and Wei (“A Flexible Multi-Task Model for BERT Serving”, 2022).
Regarding claim 10, Wu teaches The computer-implemented method of claim 1, wherein the multi-headed composite model comprises a model body and the plurality of model heads, and wherein generating the multi-headed composite model comprises:(Wu [p. 4] "Fig. 3. The overview of our approach. It is mainly comprised of two parts: multi-level context-aware part attention feature network and multi-head collaborative learning scheme. The CPA module is seamlessly plugged into different stages of the backbone network to learn multi-level context-aware part attention. SAP represents the spatial average pooling and TAP represents the temporal average pooling. Several supervision heads are applied to the video-level feature simultaneously to provide more robust supervision" FIG. 3 shows model body and plurality of model heads).
However, Wu does not explicitly teach identifying a teacher model from the plurality of teacher models based on the plurality of training datasets,
wherein the teacher model is identified based on a number of mapped text sequences in a training dataset that corresponds to the teacher model; and
generating the model body based on a plurality of trained parameters for the teacher model.
Wei, in the same field of endeavor, teaches identifying a teacher model from the plurality of teacher models based on the plurality of training datasets, ([Abstract] "For each task, we train independently a single-task (ST) model using partial fine-tuning")
wherein the teacher model is identified based on a number of mapped text sequences in a training dataset that corresponds to the teacher model; and([p. 2] "We propose to experiment for each task with different value of L within range Nmin < L < Nmax, and select the one that gives the best validation performance")
generating the model body based on a plurality of trained parameters for the teacher model.([Abstract] "we compress the task specific layers in each ST model using knowledge distillation. Those compressed ST models are finally merged into one MT model so that the frozen layers of the former are shared across the tasks").
Wu and Wei are both directed toward attention-based knowledge distillation and are therefore analogous art in the same field of endeavor. It would have been obvious before the effective filing date of the claimed invention to combine the teachings of Wu with the teachings of Wei by using the model in Wu as the architecture for the merging/knowledge distillation in Wei, enabling multiple levels of knowledge distillation (where Wu enables intra-model distillation and Wei enables inter-model distillation). Wei provides additional motivation to combine ([p. 8] “In the box plots of Figure 2 above we report the performance of the student models initialized from pre-trained BERT and from the teacher. It can be clearly seen that the latter initialization scheme generally outperforms the former” [p. 4 §4.3] "The results are summarized in Table 2. From the table it can be seen that the proposed method Ours (mixed) outperforms all KD methods while being more efficient.”). This motivation to combine also applies to the remaining claims that depend on this combination.
Regarding claim 11, the combination of Wu and Wei teaches The computer-implemented method of claim 10 further comprising: generating a model head of the plurality of model heads based on the plurality of trained parameters for the teacher model.(Wei [p. 1] "Our method is based on the idea of partial fine-tuning, i.e. only fine-tuning some topmost layers of BERT depending on the task and keeping the remaining bottom layers frozen" [p. 3] "Figure 1: Pipeline of the proposed method. (a) For each task we train separately a task-specific model with partial fine-tuning, i.e. only the weights from some topmost layers (blue and red blocks) of the pre-trained model are updated while the rest are kept frozen (gray blocks)" BERT is a multi-head attention model, where each layer has multi-head attention used for particular tasks.).
Regarding claim 12, the combination of Wu and Wei teaches The computer-implemented method of claim 11, wherein generating the model head of the multi-headed composite model comprises: generating, using the teacher model, a first output embedding for a mapped text sequence of the training dataset;(Wei [p. 8] "STS-B (The Semantic Textual Similarity Benchmark). A regression task where the goal is to predict whether two sentences are similar in terms of semantic meaning as measured by a score from 1 to 5" See FIG. 1. Each layer of each model outputs an output embedding for a mapped text sequence (the input). Wei explicitly states that the model is used for STS-B, which is explicitly a text sequence mapping problem. See also Tables 2 and 3 for results on STS-B)
generating, using the multi-headed composite model, a second output embedding for the mapped text sequence; and(Wei [p. 8] "STS-B (The Semantic Textual Similarity Benchmark). A regression task where the goal is to predict whether two sentences are similar in terms of semantic meaning as measured by a score from 1 to 5" See FIG. 1. Each layer of each model outputs an output embedding for a mapped text sequence (the input). Wei explicitly states that the model is used for STS-B which is explicitly a text sequence mapping problem. See also Table 2 and 3 for results on STS-B)
updating one or more parameters of the multi-headed composite model based on a comparison between the first output embedding and the second output embedding.(Wei [p. 4 §4.3] "The results are summarized in Table 2. From the table it can be seen that the proposed method Ours (mixed) outperforms all KD methods while being more efficient. Compared to the single-task fine-tuning baseline, our method reduces up to around two thirds of the total overhead while achieves 99.6% of its performance" See FIG. 1 which explicitly shows merging two models outputting first and second output embeddings).
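To clarify the examiner's interpretation of the claim 12 limitations, the following is a minimal Python sketch of a distillation-style update in which a first output embedding from a teacher model and a second output embedding from the multi-headed composite model are compared, and the composite model's parameters are updated based on that comparison. The model objects, function name, and use of a mean-squared-error comparison are the examiner's own illustrative choices, not code from Wu or Wei.

```python
# Illustrative sketch only: updating composite-model parameters by comparing its
# output embedding against a teacher's output embedding for the same input.
import torch
import torch.nn.functional as F

def distillation_step(teacher: torch.nn.Module,
                      composite: torch.nn.Module,
                      optimizer: torch.optim.Optimizer,
                      mapped_text_sequence: torch.Tensor) -> float:
    with torch.no_grad():
        first_embedding = teacher(mapped_text_sequence)    # first output embedding (teacher)
    second_embedding = composite(mapped_text_sequence)     # second output embedding (composite)
    # Comparison between the two embeddings drives the parameter update.
    loss = F.mse_loss(second_embedding, first_embedding)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()                                        # updates composite-model parameters
    return loss.item()
```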
Conclusion
Any inquiry concerning this communication or earlier communications from the examiner should be directed to SIDNEY VINCENT BOSTWICK whose telephone number is (571)272-4720. The examiner can normally be reached M-F 7:30am-5:00pm EST.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Miranda Huang can be reached on (571)270-7092. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.
/SIDNEY VINCENT BOSTWICK/Examiner, Art Unit 2124
/MIRANDA M HUANG/Supervisory Patent Examiner, Art Unit 2124