DETAILED ACTION
This action is responsive to claims filed on 21 April 2023.
Claims 1-20 are pending for examination.
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Claim Objections
Claim 1 and analogous claim 6 are objected to because of the following informality: “its backbone” in line 4 should be “a backbone of the first machine learning model”. Appropriate correction is required.
Claim 14 and analogous claim 19 are objected to because of the following informality: “its backbone” in line 2 should be “the backbone of the first machine learning model”. Appropriate correction is required.
Claim Rejections - 35 USC § 112
The following is a quotation of 35 U.S.C. 112(b):
(b) CONCLUSION.—The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the inventor or a joint inventor regards as the invention.
The following is a quotation of 35 U.S.C. 112 (pre-AIA), second paragraph:
The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the applicant regards as his invention.
Claims 5, 10, 14-15, and 19-20 are rejected under 35 U.S.C. 112(b) or 35 U.S.C. 112 (pre-AIA), second paragraph, as being indefinite for failing to particularly point out and distinctly claim the subject matter which the inventor or a joint inventor (or, for applications subject to pre-AIA 35 U.S.C. 112, the applicant) regards as the invention.
Claim 14 and analogous claim 19 recite the limitation "the backbone of the second machine learning model" in line 4. There is insufficient antecedent basis for this limitation in the claims. For examination purposes, the term "the backbone of the second machine learning model" has been construed as “a backbone of the second machine learning model”. Claim 15 and analogous claim 20, which depend from claims 14 and 19, respectively, are similarly rejected.
The term “similar” in line 3 of claim 5 and analogous claim 10 is a relative term which renders the claim indefinite. The term “similar” is not defined by the claim, the specification does not provide a standard for ascertaining the requisite degree, and one of ordinary skill in the art would not be reasonably apprised of the scope of the invention. As claimed, any degree of likeness can be considered "similar". For examination purposes, the term "similar" has been construed as any degree of likeness between a plurality of environments.
The term “dissimilar” in line 7 of claim 5 and analogous claim 10 is a relative term which renders the claim indefinite. The term “dissimilar” is not defined by the claim, the specification does not provide a standard for ascertaining the requisite degree, and one of ordinary skill in the art would not be reasonably apprised of the scope of the invention. As claimed, any degree of difference can be considered "dissimilar". For examination purposes, the term "dissimilar" has been construed as any degree of difference between a plurality of environments.
Claim Rejections - 35 USC § 101
35 U.S.C. 101 reads as follows:
Whoever invents or discovers any new and useful process, machine, manufacture, or composition of matter, or any new and useful improvement thereof, may obtain a patent therefor, subject to the conditions and requirements of this title.
Claims 1-10, 14-15, and 19-20 are rejected under 35 U.S.C. 101 because the claimed invention is directed to a judicial exception (an abstract idea) without significantly more.
Step 1: This part of the eligibility analysis evaluates whether the claim(s) falls within any statutory
category. MPEP 2106.03:
According to the first part of the Alice analysis, the claims in the instant case fall within the statutory categories: a method/process (claims 1-5 and 11-15) and a machine/apparatus (claims 6-10 and 16-20). Because the claims fall within one of the four statutory categories (i.e., process, machine, manufacture, or composition of matter) (Step 1), it must next be determined whether the claims are directed to a judicial exception (i.e., a law of nature, a natural phenomenon, or an abstract idea).
Step 2A Prong One: This part of the eligibility analysis evaluates whether the claim(s) recites a
judicial exception.
Regarding independent claims 1 and 6, the claims recite a judicial exception (i.e., an abstract idea enumerated in the 2019 PEG) without significantly more (Step 2A, Prong One). Under the broadest reasonable interpretation, the claim limitations cover activities classified as mental processes, i.e., concepts performed in the human mind (including an observation, evaluation, judgment, or opinion) (see MPEP § 2106.04(a)(2), subsection III, and the 2019 PEG). As evaluated below:
Claims 1, 6:
“selecting a first of the machine learning models to retain its backbone” (mental process of judgment)
If the identified limitation(s) falls within at least one of the groupings of abstract ideas, it is
reasonable to conclude that the claim(s) recites an abstract idea in Step 2A Prong One.
Step 2A Prong Two: This part of the eligibility analysis evaluates whether the claim(s) as a whole integrates the recited judicial exception into a practical application of the exception. As evaluated below:
“obtaining multiple machine learning models, each machine learning model comprising a backbone and a head”
“back-propagating error terms for synthetic activation data through at least a portion of the backbone of a second of the machine learning models to generate an inception basis set”
These recitations are deemed insufficient to transform the judicial exception to a patentable invention because they are directed to instructions for mere data gathering or data output; see MPEP 2106.05(g).
“configuring a bridge using the inception basis set, the bridge configured to translate features generated by the backbone of the first machine learning model into features for use by the head of the second machine learning model”
This recitation is deemed insufficient to transform the judicial exception to a patentable invention because it is directed to instructions merely indicating a field of use or technological environment in which to apply a judicial exception; see MPEP 2106.05(h).
Accordingly, these additional elements do not integrate the abstract idea into a practical application because they do not impose any meaningful limits on practicing the abstract idea when considered as an ordered combination and as a whole.
Step 2B: This part of the eligibility analysis evaluates whether the claim, as a whole, amounts to
significantly more than the recited exception, i.e., whether any additional element, or combination of
additional elements, adds an inventive concept to the claim. MPEP 2106.05.
First, the additional elements considered as part of the preamble and the additional elements directed to the use of computer technology are deemed insufficient to transform the judicial exception to a patentable invention because they generally link the judicial exception to the technological environment; see MPEP 2106.05(h).
Second, the additional elements directed to mere application of the abstract idea or mere instructions to implement an abstract idea on a computer are deemed insufficient to transform the judicial exception to a patentable invention because the limitations generally apply a generic computer and/or process to the judicial exception; see MPEP 2106.05(f).
Third, the claims recite instructions merely indicating a field of use or technological environment in which to apply a judicial exception. The courts have found these types of limitations insufficient to transform the judicial exception to a patentable invention; see MPEP 2106.05(h).
Lastly, the claim limitations directed to data-gathering activity, as noted above, are deemed insignificant extra-solution activity. The courts have found these types of limitations insufficient to qualify as "significantly more"; see MPEP 2106.05(g).
Furthermore, evidence has been considered in view of Berkheimer v. HP, Inc., 881 F.3d 1360, 1368, 125 USPQ2d 1649, 1654 (Fed. Cir. 2018); see the USPTO Berkheimer Memorandum (April 2018). The examiner relies on Berkheimer Option 2: a citation to one or more of the court decisions discussed in MPEP § 2106.05(d)(II) noting the well-understood, routine, and conventional nature of the additional element(s) (e.g., limitations directed to mere data gathering):
The courts have recognized such computer functions as well-understood, routine, and conventional functions when they are claimed in a merely generic manner (e.g., at a high level of generality) or as insignificant extra-solution activity; see MPEP 2106.05(d).
The additional limitations, as analyzed above, fail to integrate the judicial exception into a practical application at Step 2A and do not provide an inventive concept at Step 2B. Thus, considering the additional elements individually, in combination, and as part of the claims as a whole, the additional elements do not provide significantly more than the abstract idea, and the claims are not patent eligible. Therefore, examining the elements recited by the limitations individually and as an ordered combination, claims 1 and 6 do not recite what the courts have identified as "significantly more".
Furthermore, regarding dependent claims 2-5, which depend from claim 1, and claims 7-10, which depend from claim 6, the claims are directed to a judicial exception (i.e., an abstract idea enumerated in the 2019 PEG, a law of nature, or a natural phenomenon) without significantly more, as shown below by evaluating the claim limitations under Steps 2A and 2B:
Claims 2, 7:
Incorporates the rejections of claims 1, 6, respectively.
“collecting the backbone and the head of the first machine learning model, the bridge, and the head of the second machine learning model into a machine learning architecture while omitting the backbone of the second machine learning model from the machine learning architecture”
This recitation is deemed insufficient to transform the judicial exception to a patentable invention because it is directed to instructions for mere data gathering or data output; see MPEP 2106.05(g).
“storing, outputting, or using the machine learning architecture”
The recitation is directed to mere instructions to implement an abstract idea on a computer, or merely uses a computer as a tool to perform an abstract idea, and is considered to add the words "apply it" (or an equivalent) to the judicial exception; see MPEP 2106.05(f).
Limitations directed to instructions for mere data gathering or data output, or mere instructions to implement an abstract idea on a computer or merely use a computer as a tool to perform an abstract idea (i.e., adding the words "apply it" (or an equivalent) to the judicial exception), cannot integrate a judicial exception into a practical application at Step 2A or provide an inventive concept at Step 2B.
Claims 3, 8:
Incorporates the rejections of claims 2, 7, respectively.
“back-propagating additional error terms for additional synthetic activation data through at least a portion of the backbone of a third of the machine learning models to generate an additional inception basis set”
This recitation is deemed insufficient to transform the judicial exception to a patentable invention because it is directed to instructions for mere data gathering or data output; see MPEP 2106.05(g).
“configuring a second bridge using the additional inception basis set”
“the second bridge configured to translate the features generated by the backbone of the first machine learning model into features for use by the head of the third machine learning model”
“wherein the machine learning architecture further includes the second bridge and the head of the third machine learning model while omitting the backbone of the third machine learning model”
These recitations are deemed insufficient to transform the judicial exception to a patentable invention because they are directed to instructions merely indicating a field of use or technological environment in which to apply a judicial exception; see MPEP 2106.05(h).
Limitations directed to instructions for mere data gathering or data output, or instructions merely indicating a field of use or technological environment in which to apply a judicial exception, cannot integrate a judicial exception into a practical application at Step 2A or provide an inventive concept at Step 2B.
Claims 4, 9:
Incorporates the rejections of claims 1, 6, respectively.
“configuring the bridge comprises training the bridge to translate the features generated by the backbone of the first machine learning model into the features for use by the head of the second machine learning model”
This recitation is deemed insufficient to transform the judicial exception to a patentable invention because it is directed to instructions merely indicating a field of use or technological environment in which to apply a judicial exception; see MPEP 2106.05(h).
“the bridge is trained without using training data associated with training of the second machine learning model”
The recitation is directed to mere instructions to implement an abstract idea on a computer, or merely uses a computer as a tool to perform an abstract idea, and is considered to add the words "apply it" (or an equivalent) to the judicial exception; see MPEP 2106.05(f).
Limitations directed to instructions merely indicating a field of use or technological environment in which to apply a judicial exception, or mere instructions to implement an abstract idea on a computer or merely use a computer as a tool to perform an abstract idea (i.e., adding the words "apply it" (or an equivalent) to the judicial exception), cannot integrate a judicial exception into a practical application at Step 2A or provide an inventive concept at Step 2B.
Claims 5, 10:
Incorporates the rejections of claims 1, 6, respectively.
“further comprising one of when the first and second machine learning models were trained using training data from similar environments, back-propagating error terms for target outputs of the inception basis set through a lesser number of layers of the backbone of the first machine learning model”
“when the first and second machine learning models were trained using training data from dissimilar environments, back-propagating the error terms for the target outputs of the inception basis set through a greater number of layers of the backbone of the first machine learning model”
These recitations are deemed insufficient to transform the judicial exception to a patentable invention because they are directed to instructions merely indicating a field of use or technological environment in which to apply a judicial exception; see MPEP 2106.05(h).
Limitations directed to mere instructions indicating a field of use or technological environment in which to apply a judicial exception cannot integrate a judicial exception into a practical application at Step 2A or provide an inventive concept in Step 2B.
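For context only, the depth-dependent back-propagation recited in claims 5 and 10 can be pictured with a short sketch. The similarity flag, the layer counts, and the linear layers below are assumptions chosen for illustration; they are not taken from the claims or the specification.

import numpy as np

rng = np.random.default_rng(0)
# Toy backbone of the first machine learning model: six 16x16 linear layers.
layers = [rng.normal(size=(16, 16)) for _ in range(6)]

def backprop_through_last_k(error, k):
    # Propagate an error vector back through only the last k linear layers (e <- e @ W.T).
    for W in reversed(layers[-k:]):
        error = error @ W.T
    return error

# Error terms for the target outputs of the inception basis set (toy values).
error_terms = rng.normal(size=(8, 16))

environments_similar = True           # placeholder flag; the claim leaves the criterion open
k = 2 if environments_similar else 5  # fewer layers when environments are similar, more otherwise
propagated = backprop_through_last_k(error_terms, k)
print(propagated.shape)               # (8, 16)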
The dependent claims, as analyzed above, do not recite limitations that integrate the judicial exception into a practical application, and they do not include additional elements that are sufficient to amount to significantly more than the judicial exception (Step 2B). Considered individually and as an ordered combination, the additional elements of the dependent claims do not amount to significantly more than the identified abstract idea; see MPEP 2106.05. Therefore, claims 2-5 and 7-10 are rejected under 35 U.S.C. 101 because the claimed invention is directed to a judicial exception and does not recite, when the claim elements are examined individually and as a whole, elements that the courts have identified as "significantly more" than the recited judicial exception.
Claims 14, 19:
Incorporates the limitations of claims 11, 16, respectively.
“selecting the first machine learning model to retain its backbone” (mental process of judgment)
If the identified limitation(s) falls within at least one of the groupings of abstract ideas, it is
reasonable to conclude that the claim(s) recites an abstract idea in Step 2A Prong One.
Step 2A Prong Two: This part of the eligibility analysis evaluates whether the claim(s) as a whole integrates the recited judicial exception into a practical application of the exception. As evaluated below:
“back-propagating error terms for synthetic activation data through at least a portion of the backbone of the second machine learning model to generate an inception basis set”
“obtaining a machine learning architecture comprising a backbone and a head of a first machine learning model, a bridge, and a head of a second machine learning model, the machine learning architecture lacking a backbone of the second machine learning model”
“providing input data to the backbone of the first machine learning model”
“generating extracted features based on the input data using the backbone of the first machine learning model”
These recitations are deemed insufficient to transform the judicial exception to a patentable invention because they are directed to instructions for mere data gathering or data output; see MPEP 2106.05(g).
“configuring the bridge based on the inception basis set”
“processing the extracted features using the head of the first machine learning model”
“translating the extracted features using the bridge to generate translated features”
“processing the translated features using the head of the second machine learning model”
These recitations are deemed insufficient to transform the judicial exception to a patentable invention because they are directed to instructions merely indicating a field of use or technological environment in which to apply a judicial exception; see MPEP 2106.05(h).
Accordingly, these additional elements do not integrate the abstract idea into a practical application because they do not impose any meaningful limits on practicing the abstract idea when considered as an ordered combination and as a whole.
Step 2B: This part of the eligibility analysis evaluates whether the claim, as a whole, amounts to
significantly more than the recited exception, i.e., whether any additional element, or combination of
additional elements, adds an inventive concept to the claim. MPEP 2106.05.
First, the additional elements considered as part of the preamble and the additional elements directed to the use of computer technology are deemed insufficient to transform the judicial exception to a patentable invention because they generally link the judicial exception to the technological environment; see MPEP 2106.05(h).
Second, the additional elements directed to mere application of the abstract idea or mere instructions to implement an abstract idea on a computer are deemed insufficient to transform the judicial exception to a patentable invention because the limitations generally apply a generic computer and/or process to the judicial exception; see MPEP 2106.05(f).
Third, the claims recite instructions merely indicating a field of use or technological environment in which to apply a judicial exception. The courts have found these types of limitations insufficient to transform the judicial exception to a patentable invention; see MPEP 2106.05(h).
Lastly, the claim limitations directed to data-gathering activity, as noted above, are deemed insignificant extra-solution activity. The courts have found these types of limitations insufficient to qualify as "significantly more"; see MPEP 2106.05(g).
Furthermore, evidence has been considered in view of Berkheimer v. HP, Inc., 881 F.3d 1360, 1368, 125 USPQ2d 1649, 1654 (Fed. Cir. 2018); see the USPTO Berkheimer Memorandum (April 2018). The examiner relies on Berkheimer Option 2: a citation to one or more of the court decisions discussed in MPEP § 2106.05(d)(II) noting the well-understood, routine, and conventional nature of the additional element(s) (e.g., limitations directed to mere data gathering):
The courts have recognized such computer functions as well-understood, routine, and conventional functions when they are claimed in a merely generic manner (e.g., at a high level of generality) or as insignificant extra-solution activity; see MPEP 2106.05(d).
The additional limitations, as analyzed above, fail to integrate the judicial exception into a practical application at Step 2A and do not provide an inventive concept at Step 2B. Thus, considering the additional elements individually, in combination, and as part of the claims as a whole, the additional elements do not provide significantly more than the abstract idea, and the claims are not patent eligible. Therefore, examining the elements recited by the limitations individually and as an ordered combination, claims 14 and 19 do not recite what the courts have identified as "significantly more".
Furthermore, regarding dependent claim 15, which depends from claim 14, and claim 20, which depends from claim 19, the claims are directed to a judicial exception (i.e., an abstract idea enumerated in the 2019 PEG, a law of nature, or a natural phenomenon) without significantly more, as shown below by evaluating the claim limitations under Steps 2A and 2B:
Claims 15, 20:
Incorporates the rejections of claims 14, 19, respectively.
“configuring the bridge comprises training the bridge to translate the extracted features and generate the translated features”
This recitation is deemed insufficient to transform the judicial exception to a patentable invention because it is directed to instructions merely indicating a field of use or technological environment in which to apply a judicial exception; see MPEP 2106.05(h).
“the bridge is trained without using training data associated with training of the second machine learning model”
The recitation is directed to mere instructions to implement an abstract idea on a computer, or merely uses a computer as a tool to perform an abstract idea, and is considered to add the words "apply it" (or an equivalent) to the judicial exception; see MPEP 2106.05(f).
Limitations directed to instructions merely indicating a field of use or technological environment in which to apply a judicial exception, or mere instructions to implement an abstract idea on a computer or merely use a computer as a tool to perform an abstract idea (i.e., adding the words "apply it" (or an equivalent) to the judicial exception), cannot integrate a judicial exception into a practical application at Step 2A or provide an inventive concept at Step 2B.
The dependent claims, as analyzed above, do not recite limitations that integrate the judicial exception into a practical application, and they do not include additional elements that are sufficient to amount to significantly more than the judicial exception (Step 2B). Considered individually and as an ordered combination, the additional elements of the dependent claims do not amount to significantly more than the identified abstract idea; see MPEP 2106.05. Therefore, claims 15 and 20 are rejected under 35 U.S.C. 101 because the claimed invention is directed to a judicial exception and does not recite, when the claim elements are examined individually and as a whole, elements that the courts have identified as "significantly more" than the recited judicial exception.
Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.
The factual inquiries for establishing a background for determining obviousness under 35 U.S.C. 103 are summarized as follows:
1. Determining the scope and contents of the prior art.
2. Ascertaining the differences between the prior art and the claims at issue.
3. Resolving the level of ordinary skill in the pertinent art.
4. Considering objective evidence present in the application indicating obviousness or nonobviousness.
This application currently names joint inventors. In considering patentability of the claims the examiner presumes that the subject matter of the various claims was commonly owned as of the effective filing date of the claimed invention(s) absent any evidence to the contrary. Applicant is advised of the obligation under 37 CFR 1.56 to point out the inventor and effective filing dates of each claim that was not commonly owned as of the effective filing date of the later invention in order for the examiner to consider the applicability of 35 U.S.C. 102(b)(2)(C) for any potential 35 U.S.C. 102(a)(2) prior art against the later invention.
Claims 1-4, 6-9, 14-15, and 19-20 are rejected under 35 U.S.C. 103 as being unpatentable over Heller et al. (NPL: "Grafting Heterogeneous Neural Networks for a Hierarchical Object Classification", hereinafter "Heller"), in view of Peng et al. (NPL: "Reverse Graph Learning for Graph Neural Network", hereinafter "Peng").
Regarding claim 1 and analogous claim 6, Heller teaches A method comprising: obtaining multiple machine learning models, each machine learning model comprising a backbone and a head ([A. FUSION OF HETEROGENEOUS ARCHITECTURES, pg. 12931] The obtaining multiple machine learning models, each machine learning model comprising a backbone and a head global operation of the module is presented in Fig. 1. The first model is responsible for predicting one class among C1, C2 and C3. If class C3 is predicted and the information is considered sufficient, the inference is stopped. However, if the network predicts a superclass, for example C1 or C2, second-level networks are applied to provide finer predictions by proposing a classification among the classes of C1, i.e., C11, C12, etc., and of C2, i.e., C21, C22, etc.);
selecting a first of the machine learning models to retain its backbone ([A. FUSION OF HETEROGENEOUS ARCHITECTURES, pg. 12931] The grafting should therefore be applied early in networks, before feature maps become too specific to the problem at hand, typically in the first layers of convolution. Thus, we extract the output of one of the first convolutional layers of a first-level network to benefit from relatively generic information. The selecting a first of the machine learning models to retain its backbone selected layer corresponds to a ‘‘grafting node (GN)’’ of the first-level network. To avoid comprehension problems, we use the term ‘‘branch node (BN)’’ for the layer of the level 2 network in which we reinject the output of the GN.);
configuring a bridge using the inception basis set, the bridge configured to translate features generated by the backbone of the first machine learning model into features for use by the head of the second machine learning model ([A. FUSION OF HETEROGENEOUS ARCHITECTURES, pg. 12931-12932] Unlike the SS-HCNN, which reuses the same base for the different levels, here, we seek to combine very different architectures. However, it is possible that the feature maps extracted at the GN representing the N th layer of the level 1 network cannot be directly used as input of the BN representing the Mth layer of the level 2 network. It is necessary to propose a configuring a bridge using the inception basis set merging solution to respond to these problems of consistency between the models, but two major difficulties must be handled. First, the number of feature maps extracted from the first network may not match the number of maps expected by the second network. Second, the size of each of these maps may not be that expected at the input of the second network.; [E. DISCUSSION, pg. 12938] Through experiments conducted on different datasets, we have illustrated the appeal of our grafting solution whether close or highly heterogeneous networks are considered. We were able to manage both the differences in size and number between the networks while being faster and more accurate than the consecutive use of several complete networks. We saved time by removing the first layers of the second level networks and directly using the features extracted by the first-level network.; [1) NUMBER MANAGEMENT, pg.12932] Among the problems of the correspondence between the number of maps, we can again distinguish two subproblems. Indeed, depending on whether we need to increase or reduce the number of maps, the problem must be treated differently, even if we use quite similar solutions. Let us start with the following situation: we want to the bridge configured to translate features generated by the backbone of the first machine learning model into features for use by the head of the second machine learning model use the feature maps extracted by the N th convolutional layer of the level 1 network as input to our second network at the Mth layer of this level 2 network since all of its layers up to the Mth layer have been discarded. Let us assume that the number, N1, of maps extracted from the first model is greater than the number, N2, of maps expected by the second model (N1> N2). We therefore need to reduce the number of maps while providing as much information as possible. For this, one solution is to use unsupervised clustering techniques, such as K-means, or dimensionality reduction techniques, such as PCA. To apply these techniques, each of the N1 maps is first recoded into a vector of size L, and then K-means or PCA is performed on the N1*L matrix to obtain N2 vectors of size L. These vectors are then recoded into matrices so that they can be used as input to the second network. Both solutions are unsupervised and do not require an annotation phase on the learning set. For reasons of execution speed, we have favored the use of PCA, although the use of clustering remains viable (we will return to this aspect in section IV).).
Heller fails to teach back-propagating error terms for synthetic activation data through at least a portion of the backbone of a second of the machine learning models to generate an inception basis set; and
Peng teaches back-propagating error terms for synthetic activation data through at least a portion of the backbone of a second of the machine learning models to generate an inception basis set ([B. Reverse Graph Learning, pg.4532-4533] We denote X = {x1, x2,..., xn} ∈ Rn×D as the node feature matrix on the input space X where the ith node vi has a D-dimensional representation. We also denote the new representation of the ith node vi as zi ∈ R1×d in the embedding space. We further investigate the back-propagating error terms for synthetic activation data through at least a portion of the backbone of a second of the machine learning models to generate an inception basis set reverse graph embedding to preserve the local structure of the data in the intrinsic space by meeting two kinds of consistency, i.e., semantic consistency and structure consistency. First, semantic consistency is preserving original semantic information (e.g., principal components) in X while learning the reverse graph embedding in the intrinsic space. Specifically, the reverse graph embedding ZP should contain mainly semantic information in X. Second, structure consistency preserves the local structure of data points in the input space X. Specifically, the reverse graph embedding ziP and z jP should preserve the similarity between the ith node vi and the jth node v j. To do this, we assume that there exists an intrinsic space, where the new representation of X is represented by ZP via the transformation matrix P. As a result, a function fθ ∈ F can be found to map the embedding space Z to the intrinsic space of the data, and (5) can be transferred to the following objective functions: LRGL: (6a) (6b) where the matrix of PT is the transpose of P and I is an identity matrix. Equation (6a) achieves structure consistency by guiding the graph learning process, where a large distance ziP − z jP2 2 between vi and v j leads to a small value of si j. Equation (6b) achieves semantic consistency by training the reverse mapping function from Z to X. Note that X − ZP is similar to the Auto-Encoder method as they learn the new representation ZP for X. Moreover, the new representation is adjusted by the updated parameters until the value of the objective function becomes small. Using (6b) can lead to new representation ZP being focused on abstract latent factors (intrinsic properties), rather than details and noise. In this case, we say that it contains less noise than X.); and
Heller and Peng are considered to be analogous to the claimed invention because they are in the same field of machine learning. In view of the teachings of Heller, it would have been obvious for a person of ordinary skill in the art to apply the teachings of Peng to Heller before the effective filing date of the claimed invention in order to improve the quality of feature learning, applying the new method of out-of-sample extension for reverse GNN method in order to conduct supervised learning and semi-supervised learning (cf. Peng, [Abstract, pg. 4530] Graph neural networks (GNNs) conduct feature learning by taking into account the local structure preservation of the data to produce discriminative features, but need to address the following issues, i.e., 1) the initial graph containing faulty and missing edges often affect feature learning and 2) most GNN methods suffer from the issue of out-of-example since their training processes do not directly generate a prediction model to predict unseen data points. In this work, we propose a reverse GNN model to learn the graph from the intrinsic space of the original data points as well as to investigate a new out-of-sample extension method. As a result, the proposed method can output a high-quality graph to improve the quality of feature learning, while the new method of out-of-sample extension makes our reverse GNN method available for conducting supervised learning and semi-supervised learning. Experimental results on real-world datasets show that our method outputs competitive classification performance, compared to state-of-the-art methods, in terms of semi-supervised node classification, out-of-sample extension, random edge attack, link prediction, and image retrieval.).
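For context only, the arrangement mapped above, in which a retained first backbone feeds its own head and, through a bridge, the head of a second model whose backbone is discarded, can be pictured with a toy sketch. The shapes, the linear stand-ins for the backbones and heads, and the least-squares construction of the bridge from an "inception basis set" are assumptions made for illustration; they are not taken from Heller, Peng, or the application.

import numpy as np

rng = np.random.default_rng(0)

# Two pretrained models, each split into a backbone and a head (toy linear weights).
W_backbone_1 = rng.normal(size=(32, 16))   # model 1 backbone: 32-dim input -> 16-dim features
W_head_1     = rng.normal(size=(16, 4))    # model 1 head
W_backbone_2 = rng.normal(size=(32, 24))   # model 2 backbone: 32-dim input -> 24-dim features
W_head_2     = rng.normal(size=(24, 6))    # model 2 head

# Back-propagate synthetic activation-space error terms through the second backbone:
# for a linear map y = x @ W, an error e in the output propagates back as e @ W.T.
synthetic_errors = rng.normal(size=(100, 24))
inception_basis = synthetic_errors @ W_backbone_2.T     # (100, 32) input-space directions

# Configure the bridge from the basis set: a least-squares map from model 1's feature
# space to the feature space that model 2's head expects, fit on the basis directions.
feats_1 = inception_basis @ W_backbone_1                # features backbone 1 produces
feats_2 = inception_basis @ W_backbone_2                # features head 2 expects
W_bridge, *_ = np.linalg.lstsq(feats_1, feats_2, rcond=None)

# Assembled architecture: backbone 1 feeds its own head and, via the bridge, head 2;
# backbone 2 itself is no longer needed at inference time.
x = rng.normal(size=(1, 32))
features = x @ W_backbone_1
out_1 = features @ W_head_1
out_2 = (features @ W_bridge) @ W_head_2
print(out_1.shape, out_2.shape)   # (1, 4) (1, 6)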
Regarding claim 2 and analogous claim 7, Heller, as modified by Peng, teaches The method of claim 1 and The apparatus of claim 6, respectively.
Heller teaches further comprising: collecting the backbone and the head of the first machine learning model, the bridge, and the head of the second machine learning model into a machine learning architecture while omitting the backbone of the second machine learning model from the machine learning architecture; and storing, outputting, or using the machine learning architecture ([A. FUSION OF HETEROGENEOUS ARCHITECTURES, pg. 12931] Rather than using a single kind of architecture, we seek to graft different architectures deprived of their first layers at strategic locations of the same level 1 network. Here, we mainly assume that the low-level features extracted by the level 1 network and those that would have been extracted by the second-level networks using their first layers are roughly the same. With this assumption, the features can therefore be used in level 2 networks without a significant impact on the final accuracy and with an improved inference time. We call the particular architectures used for level 2 networks ‘‘cut_models’’. The grafting should therefore be applied early in networks, before feature maps become too specific to the problem at hand, typically in the first layers of convolution. Thus, we collecting the backbone and the head of the first machine learning model, the bridge, and the head of the second machine learning model into a machine learning architecture while omitting the backbone of the second machine learning model from the machine learning architecture extract the output of one of the first convolutional layers of a first-level network to benefit from relatively generic information. The selected layer corresponds to a ‘‘grafting node (GN)’’ of the first-level network. To avoid comprehension problems, we use the term ‘‘branch node (BN)’’ for the layer of the level 2 network in which we reinject the output of the GN. The selected branch node must be used to follow a convolution layer in a traditional architecture, typically a batch normalization layer. To select the BN level in network 2, to reduce the transformation complexity and to keep computation time low, the information used should be of a level of detail close to what is usually available at this level of depth. In other words, this BN of the level 2 network will be located at a depth close to that of the GN on the level 1 network in their respective architectures.; [4) DISCUSSION ON TRAINING STRATEGIES, pg. 12934] Otherwise, we prefer to train the cut_models directly from the feature maps. In all cases, we have implemented a checkpoint system, and if the rate of prediction errors on the validation data decreases from the best saved configuration, it storing the machine learning architecture keeps the settings in a specific file. The next section will be devoted to the evaluation of the learning strategies as well as the proposed fusion module for different architectures on three public datasets to validate the performances in several different situations.; [B. MNIST RESULTS, pg. 12935] As we already explained, the using the machine learning architecture use of cut_models brings a speed up of the inference, so this first test allows us to validate our concept for very heterogeneous architectures where we reduce the number of feature maps while increasing their size. The grafting process was efficient with different grafting nodes and branch nodes.).
Heller and Peng are combinable for the same rationale as set forth above with respect to claim 1.
Regarding claim 3 and analogous claim 8, Heller, as modified by Peng, teaches The method of claim 2 and The apparatus of claim 7, respectively.
Peng teaches further comprising: back-propagating additional error terms for additional synthetic activation data through at least a portion of the backbone of a third of the machine learning models to generate an additional inception basis set ([B. Reverse Graph Learning, pg.4532-4533] We denote X = {x1, x2,..., xn} ∈ Rn×D as the node feature matrix on the input space X where the ith node vi has a D-dimensional representation. We also denote the new representation of the ith node vi as zi ∈ R1×d in the embedding space. We further investigate the back-propagating additional error terms for additional synthetic activation data through at least a portion of the backbone of a third of the machine learning models to generate an additional inception basis set reverse graph embedding to preserve the local structure of the data in the intrinsic space by meeting two kinds of consistency, i.e., semantic consistency and structure consistency. First, semantic consistency is preserving original semantic information (e.g., principal components) in X while learning the reverse graph embedding in the intrinsic space. Specifically, the reverse graph embedding ZP should contain mainly semantic information in X. Second, structure consistency preserves the local structure of data points in the input space X. Specifically, the reverse graph embedding ziP and z jP should preserve the similarity between the ith node vi and the jth node v j . To do this, we assume that there exists an intrinsic space, where the new representation of X is represented by ZP via the transformation matrix P. As a result, a function fθ ∈ F can be found to map the embedding space Z to the intrinsic space of the data, and (5) can be transferred to the following objective functions: LRGL: (6a) (6b) where the matrix of PT is the transpose of P and I is an identity matrix. Equation (6a) achieves structure consistency by guiding the graph learning process, where a large distance ziP − z jP2 2 between vi and v j leads to a small value of si j. Equation (6b) achieves semantic consistency by training the reverse mapping function from Z to X. Note that X − ZP is similar to the Auto-Encoder method as they learn the new representation ZP for X. Moreover, the new representation is adjusted by the updated parameters until the value of the objective function becomes small. Using (6b) can lead to new representation ZP being focused on abstract latent factors (intrinsic properties), rather than details and noise. In this case, we say that it contains less noise than X.); and
Heller teaches configuring a second bridge using the additional inception basis set, the second bridge configured to translate the features generated by the backbone of the first machine learning model into features for use by the head of the third machine learning model;
wherein the machine learning architecture further includes the second bridge and the head of the third machine learning model while omitting the backbone of the third machine learning model ([A. FUSION OF HETEROGENEOUS ARCHITECTURES, pg. 12931-12932] Unlike the SS-HCNN, which reuses the same base for the different levels, here, we seek to combine very different architectures. However, it is possible that the feature maps extracted at the GN representing the N th layer of the level 1 network cannot be directly used as input of the BN representing the Mth layer of the level 2 network. It is necessary to propose a configuring a bridge using the inception basis set merging solution to respond to these problems of consistency between the models, but two major difficulties must be handled. First, the number of feature maps extracted from the first network may not match the number of maps expected by the second network. Second, the size of each of these maps may not be that expected at the input of the second network.; [1) NUMBER MANAGEMENT, pg.12932] Among the problems of the correspondence between the number of maps, we can again distinguish two subproblems. Indeed, depending on whether we need to increase or reduce the number of maps, the problem must be treated differently, even if we use quite similar solutions. Let us start with the following situation: we want to configuring a second bridge using the additional inception basis set, the second bridge configured to translate the features generated by the backbone of the first machine learning model into features for use by the head of the third machine learning model use the feature maps extracted by the N th convolutional layer of the level 1 network as input to our second network at the Mth layer of this level 2 network since all of its layers up to the Mth layer have been discarded. Let us assume that the number, N1, of maps extracted from the first model is greater than the number, N2, of maps expected by the second model (N1> N2). We therefore need to reduce the number of maps while providing as much information as possible. For this, one solution is to use unsupervised clustering techniques, such as K-means, or dimensionality reduction techniques, such as PCA. To apply these techniques, each of the N1 maps is first recoded into a vector of size L, and then K-means or PCA is performed on the N1*L matrix to obtain N2 vectors of size L. These vectors are then recoded into matrices so that they can be used as input to the second network. Both solutions are unsupervised and do not require an annotation phase on the learning set. For reasons of execution speed, we have favored the use of PCA, although the use of clustering remains viable (we will return to this aspect in section IV).; [3) MULTITERM LOSS FUNCTION, pg. 12933-12934] To define a third strategy that avoids sequential learning, we proposed a multiterm loss function weighted for each level of the hierarchy (Eq. 1). We use cross entropy for each of our Li loss functions (binary or categorical depending on the problem). This solution allows us to train each network at the same time and can be wherein the machine learning architecture further includes the second bridge and the head of the third machine learning model while omitting the backbone of the third machine learning model extended to multiple networks, adding as many terms as there are networks. 
The use of this multiterm loss function is similar to the techniques used for single architectures, mainly by B-CNN, which, according to the advancement in learning, gives different importance to the terms used. Lf = Σi αi Li (1).).
Heller and Peng are combinable for the same rationale as set forth above with respect to claim 1.
Regarding claim 4 and analogous claim 9, Heller, as modified by Peng, teaches The method of claim 1 and The apparatus of claim 6, respectively.
Heller teaches wherein: configuring the bridge comprises training the bridge to translate the features generated by the backbone of the first machine learning model into the features for use by the head of the second machine learning model; and the bridge is trained without using training data associated with training of the second machine learning model ([1) NUMBER MANAGEMENT, pg.12932] Among the problems of the correspondence between the number of maps, we can again distinguish two subproblems. Indeed, depending on whether we need to increase or reduce the number of maps, the problem must be treated differently, even if we use quite similar solutions. Let us start with the following situation: we want to configuring the bridge comprises training the bridge to translate the features generated by the backbone of the first machine learning model into the features for use by the head of the second machine learning model use the feature maps extracted by the N th convolutional layer of the level 1 network as input to our second network at the Mth layer of this level 2 network since all of its layers up to the Mth layer have been discarded. Let us assume that the number, N1, of maps extracted from the first model is greater than the number, N2, of maps expected by the second model (N1> N2). We therefore need to reduce the number of maps while providing as much information as possible. For this, one solution is to use unsupervised clustering techniques, such as K-means, or dimensionality reduction techniques, such as PCA. To apply these techniques, each of the N1 maps is first recoded into a vector of size L, and then K-means or PCA is performed on the N1*L matrix to obtain N2 vectors of size L. These vectors are then recoded into matrices so that they can be used as input to the second network. Both solutions are unsupervised and do not require an annotation phase on the learning set. For reasons of execution speed, we have favored the use of PCA, although the use of clustering remains viable (we will return to this aspect in section IV).; [2) DIRECT TRAINING ON THE FEATURE MAPS, pg. 12933] A second strategy used was to train the reduced version of the second network (called cut_model) directly on the feature maps generated by the first network. The data augmentation options are applied here on the inputs of the first network and not directly on the feature maps to bypass the difficulties linked to the correlation between the different maps. This strategy, similar to the SS-HCNN learning strategy, allows networks at different levels to focus on properties directly related to the problem they are addressing. This strategy the bridge is trained without using training data associated with training of the second machine learning model avoids training the first layers of level 2 networks that will not be used later. Since we can choose the location of the GN and BN, we have considerable flexibility until the start of the learning process.).
Heller and Peng are combinable for the same rationale as set forth above with respect to claim 1.
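For context only, the PCA-based "number management" step quoted from Heller above (recoding each of the N1 feature maps into a vector of length L, performing PCA on the N1-by-L matrix, and keeping N2 vectors of length L) can be sketched as follows. The array sizes and the use of scikit-learn's PCA are assumptions made for illustration.

import numpy as np
from sklearn.decomposition import PCA

def reduce_map_count(feature_maps, n_target):
    # feature_maps: array of shape (N1, H, W); returns an array of shape (n_target, H, W).
    n1, h, w = feature_maps.shape
    flat = feature_maps.reshape(n1, h * w)     # recode each map into a vector of length L = H*W
    pca = PCA(n_components=n_target).fit(flat)
    reduced = pca.components_                  # (n_target, L) principal directions summarizing the N1 maps
    return reduced.reshape(n_target, h, w)     # recode the vectors back into maps for the level-2 input

# Example with assumed sizes: 64 maps of 28x28 from the level-1 network reduced to the
# 16 maps expected at the branch node of the level-2 network.
maps = np.random.rand(64, 28, 28).astype(np.float32)
adapted = reduce_map_count(maps, 16)
print(adapted.shape)   # (16, 28, 28)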
Regarding claim 14 and analogous claim 19, Heller teaches The method of claim 11 and The apparatus of claim 16, respectively.
Heller teaches further comprising: selecting the first machine learning model to retain its backbone ([A. FUSION OF HETEROGENEOUS ARCHITECTURES, pg. 12931] The grafting should therefore be applied early in networks, before feature maps become too specific to the problem at hand, typically in the first layers of convolution. Thus, we extract the output of one of the first convolutional layers of a first-level network to benefit from relatively generic information. The selecting the first machine learning model to retain its backbone selected layer corresponds to a ‘‘grafting node (GN)’’ of the first-level network. To avoid comprehension problems, we use the term ‘‘branch node (BN)’’ for the layer of the level 2 network in which we reinject the output of the GN.);
configuring the bridge based on the inception basis set ([A. FUSION OF HETEROGENEOUS ARCHITECTURES, pg. 12931-12932] Unlike the SS-HCNN, which reuses the same base for the different levels, here, we seek to combine very different architectures. However, it is possible that the feature maps extracted at the GN representing the N th layer of the level 1 network cannot be directly used as input of the BN representing the Mth layer of the level 2 network. It is necessary to propose a configuring the bridge based on the inception basis set merging solution to respond to these problems of consistency between the models, but two major difficulties must be handled. First, the number of feature maps extracted from the first network may not match the number of maps expected by the second network. Second, the size of each of these maps may not be that expected at the input of the second network.; [E. DISCUSSION, pg. 12938] Through experiments conducted on different datasets, we have illustrated the appeal of our grafting solution whether close or highly heterogeneous networks are considered. We were able to manage both the differences in size and number between the networks while being faster and more accurate than the consecutive use of several complete networks. We saved time by removing the first layers of the secondlevel networks and directly using the features extracted by the first-level network.).
Heller fails to teach back-propagating error terms for synthetic activation data through at least a portion of the backbone of the second machine learning model to generate an inception basis set; and
Peng teaches back-propagating error terms for synthetic activation data through at least a portion of the backbone of the second machine learning model to generate an inception basis set ([B. Reverse Graph Learning, pg.4532-4533] We denote X = {x1, x2,..., xn} ∈ Rn×D as the node feature matrix on the input space X where the ith node vi has a D-dimensional representation. We also denote the new representation of the ith node vi as zi ∈ R1×d in the embedding space. We further investigate back-propagating error terms for synthetic activation data through at least a portion of the backbone of the second machine learning model to generate an inception basis set the reverse graph embedding to preserve the local structure of the data in the intrinsic space by meeting two kinds of consistency, i.e., semantic consistency and structure consistency. First, semantic consistency is preserving original semantic information (e.g., principal components) in X while learning the reverse graph embedding in the intrinsic space. Specifically, the reverse graph embedding ZP should contain mainly semantic information in X. Second, structure consistency preserves the local structure of data points in the input space X. Specifically, the reverse graph embedding ziP and z jP should preserve the similarity between the ith node vi and the jth node v j . To do this, we assume that there exists an intrinsic space, where the new representation of X is represented by ZP via the transformation matrix P. As a result, a function fθ ∈ F can be found to map the embedding space Z to the intrinsic space of the data, and (5) can be transferred to the following objective functions: LRGL: (6a) (6b) where the matrix of PT is the transpose of P and I is an identity matrix. Equation (6a) achieves structure consistency by guiding the graph learning process, where a large distance ziP − z jP2 2 between vi and v j leads to a small value of si j. Equation (6b) achieves semantic consistency by training the reverse mapping function from Z to X. Note that X − ZP is similar to the Auto-Encoder method as they learn the new representation ZP for X. Moreover, the new representation is adjusted by the updated parameters until the value of the objective function becomes small. Using (6b) can lead to new representation ZP being focused on abstract latent factors (intrinsic properties), rather than details and noise. In this case, we say that it contains less noise than X.); and
Heller and Peng are combinable for the same rationale as set forth above with respect to claim 1.
Regarding claim 15 and analogous claim 20, Heller, as modified by Peng, teaches The method of claim 14 and The apparatus of claim 19, respectively.
Heller teaches wherein: configuring the bridge comprises training the bridge to translate the extracted features and generate the translated features; and the bridge is trained without using training data associated with training of the second machine learning model ([1) NUMBER MANAGEMENT, pg.12932] Among the problems of the correspondence between the number of maps, we can again distinguish two subproblems. Indeed, depending on whether we need to increase or reduce the number of maps, the problem must be treated differently, even if we use quite similar solutions. Let us start with the following situation: we want to configuring the bridge comprises training the bridge to translate the extracted features and generate the translated features use the feature maps extracted by the N th convolutional layer of the level 1 network as input to our second network at the Mth layer of this level 2 network since all of its layers up to the Mth layer have been discarded. Let us assume that the number, N1, of maps extracted from the first model is greater than the number, N2, of maps expected by the second model (N1> N2). We therefore need to reduce the number of maps while providing as much information as possible. For this, one solution is to use unsupervised clustering techniques, such as K-means, or dimensionality reduction techniques, such as PCA. To apply these techniques, each of the N1 maps is first recoded into a vector of size L, and then K-means or PCA is performed on the N1*L matrix to obtain N2 vectors of size L. These vectors are then recoded into matrices so that they can be used as input to the second network. Both solutions are unsupervised and do not require an annotation phase on the learning set. For reasons of execution speed, we have favored the use of PCA, although the use of clustering remains viable (we will return to this aspect in section IV).; [2) DIRECT TRAINING ON THE FEATURE MAPS, pg. 12933] A second strategy used was to train the reduced version of the second network (called cut_model) directly on the feature maps generated by the first network. The data augmentation options are applied here on the inputs of the first network and not directly on the feature maps to bypass the difficulties linked to the correlation between the different maps. This strategy, similar to the SS-HCNN learning strategy, allows networks at different levels to focus on properties directly related to the problem they are addressing. This strategy the bridge is trained without using training data associated with training of the second machine learning model avoids training the first layers of level 2 networks that will not be used later. Since we can choose the location of the GN and BN, we have considerable flexibility until the start of the learning process.).
Heller and Peng are combinable for the same rationale as set forth above with respect to claim 1.
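The map-number reduction mapped above to the bridge of claims 15 and 20 is described by Heller as an unsupervised step: each of the N1 feature maps is recoded into a vector of size L, PCA is run on the resulting N1*L matrix, and N2 vectors of size L are kept and recoded into maps. The following Python sketch is offered purely as an illustration of that description; the function name, the use of scikit-learn, and the example shapes are assumptions and are not asserted to be Heller's implementation.

import numpy as np
from sklearn.decomposition import PCA

def reduce_feature_maps(maps, n2):
    # maps: (N1, H, W) feature maps taken at the grafting node (GN) of the level 1 network
    # n2:   number of maps expected at the branch node (BN) of the level 2 network
    n1, h, w = maps.shape
    flat = maps.reshape(n1, h * w)            # each map recoded into a vector of size L = H*W
    pca = PCA(n_components=n2).fit(flat)      # unsupervised: no annotation of the learning set is required
    return pca.components_.reshape(n2, h, w)  # N2 vectors of size L, recoded back into maps

# Hypothetical example: 64 maps of size 8x8 reduced to the 32 maps expected by the cut_model
bn_input = reduce_feature_maps(np.random.rand(64, 8, 8), 32)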
Claims 11-13, 16-18 are rejected under 35 U.S.C. 103 as being unpatentable over Heller. Although the invention is not identically disclosed or described as set forth in 35 U.S.C. 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains, the invention is not patentable.
Regarding claim 11 and analogous claim 16, Heller teaches A method comprising:
obtaining a machine learning architecture comprising a backbone and a head of a first machine learning model, a bridge, and a head of a second machine learning model, the machine learning architecture lacking a backbone of the second machine learning model; providing input data to the backbone of the first machine learning model; generating extracted features based on the input data using the backbone of the first machine learning model; processing the extracted features using the head of the first machine learning model; translating the extracted features using the bridge to generate translated features; and processing the translated features using the head of the second machine learning model ([A. FUSION OF HETEROGENEOUS ARCHITECTURES, pg. 12931] Rather than using a single kind of architecture, we seek to obtaining a machine learning architecture comprising a backbone and a head of a first machine learning model, a bridge, and a head of a second machine learning model, the machine learning architecture lacking a backbone of the second machine learning model graft different architectures deprived of their first layers at strategic locations of the same level 1 network. Here, we mainly assume that the low-level providing input data to the backbone of the first machine learning model features generating extracted features based on the input data using the backbone of the first machine learning model extracted by the level 1 network and those that would have been extracted by the second-level networks using their first layers are roughly the same. With this assumption, the features can therefore be used in level 2 networks without a significant impact on the final accuracy and with an improved inference time. We call the particular architectures used for level 2 networks ‘‘cut_models’’. The grafting should therefore be applied early in networks, before feature maps become too specific to the problem at hand, typically in the first layers of convolution. Thus, we processing the extracted features using the head of the first machine learning model extract the output of one of the first convolutional layers of a first-level network to benefit from relatively generic information. The selected layer corresponds to a ‘‘grafting node (GN)’’ of the first-level network. To avoid comprehension problems, we use the term ‘‘branch node (BN)’’ for the layer of the level 2 network in which we reinject the output of the GN. The selected branch node must be used to follow a convolution layer in a traditional architecture, typically a batch normalization layer. To select the BN level in network 2, to reduce the transformation complexity and to keep computation time low, the information used should be of a level of detail close to what is usually available at this level of depth. In other words, this BN of the level 2 network will be located at a depth close to that of the GN on the level 1 network in their respective architectures.; [1) NUMBER MANAGEMENT, pg.12932] Among the problems of the correspondence between the number of maps, we can again distinguish two subproblems. Indeed, depending on whether we need to increase or reduce the number of maps, the problem must be treated differently, even if we use quite similar solutions. 
Let us start with the following situation: we want to translating the extracted features using the bridge to generate translated features use the feature maps extracted by the N th convolutional layer of the level 1 network as input to our second network at the Mth layer of this level 2 network since all of its layers up to the Mth layer have been discarded. Let us assume that the number, N1, of maps extracted from the first model is greater than the number, N2, of maps expected by the second model (N1> N2). We therefore need to reduce the number of maps while providing as much information as possible. For this, one solution is to use unsupervised clustering techniques, such as K-means, or dimensionality reduction techniques, such as PCA. To apply these techniques, each of the N1 maps is first recoded into a vector of size L, and then K-means or PCA is performed on the N1*L matrix to obtain N2 vectors of size L. These vectors are then recoded into matrices so that they can be used as input to the second network. Both solutions are unsupervised and do not require an annotation phase on the learning set. For reasons of execution speed, we have favored the use of PCA, although the use of clustering remains viable (we will return to this aspect in section IV).; [3) CONDITIONS OF VALIDITY, pg. 12933] Although it is necessary to finish the inference of the first model to know whether to execute the following models, the processing the translated features using the head of the second machine learning model information we want to use as input to the level 2 network can be used before the end of the inference of the level 1 network. The transformation can be performed as soon as we have passed the GN layer concerned by the fusion, parallel to the end of the inference of the first network. If the first network provides enough information, these calculations will not be used subsequently. If not, we can theoretically save 100% of Time 3. Indeed, we can have the inputs of BN even before the end of the first network when the transformation time is less than the time needed to finish the inference of the first network. In practice, parallelization will affect performance, and we will not achieve 100% gain, but the time saved will still be significant. In this case, the time not to be exceeded by the transformation module (Time 5) is no longer the recalculation time of the first layers but the end time of the first network inference + the recalculation time (Time 2 + Time 3). The second-level inference can then be started as soon as the transformation and the first inference are both completed, resulting in a time equal to the max (Time 2, Time 5). We refer to this approach as PG, for parallel grafting. The transformation for number and size is only slightly longer than the completion of the first inference, so the time savings are much larger if we do so.).
In view of the teachings of Heller, it would have been obvious for a person of ordinary skill in the art to apply the teachings of Heller to Heller before the effective filing date of the claimed invention in order to merge heterogeneous CNNs by following a hierarchical approach in which the information extracted by first-level networks can be fed back at any location into second-level networks, eliminating the computational redundancy induced by the recalculation of low-level features and reducing the inference time of the second network without impacting the accuracy (cf. Heller, [Abstract, pg. 12927] Convolutional neural networks (CNNs) are deep learning architectures used for image classification that have been improved in recent years to increase their accuracies and reduce their computation times. Hierarchical approaches are based on a step-by-step strategy and aim to optimize performance on difficult tasks by solving successive subtasks. The gain provided by these solutions must be relativized with the explosion in the number of parameters they imply, which makes their implementation on embedded systems difficult. New constraints also appear in the choice of the architectures of the branches when one seeks to have a global network providing predictions at different levels. We propose a strategy that allows the merging of heterogeneous CNNs by following a hierarchical approach in which the information extracted by first-level networks can be fed back at any location into second-level networks. Despite the differences in the number and size of the feature maps, such grafting can be done by using clustering, dimension reduction, and interpolation techniques. This strategy eliminates the computational redundancy induced by the recalculation of low-level features. The proposed grafting approach significantly reduces the inference time of the second network without impacting the accuracy. Tests performed on MNIST, CIFAR-10, and PlantVillage datasets with several CNNs illustrate the possibility of implementation in various situations. Our solution allows us to consider in an innovative way the implementation of hierarchical solutions on devices with limited capacities.).
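As an illustration of the grafted arrangement mapped above to claims 11 and 16 (a single level 1 backbone whose grafting-node output feeds its own head and, through a translation bridge, the head of a cut_model that lacks its own backbone), a hedged PyTorch-style sketch follows. The layer sizes, the 1x1 convolution used as the bridge, and all module names are illustrative assumptions, not Heller's architecture.

import torch
import torch.nn as nn

backbone1 = nn.Sequential(nn.Conv2d(3, 16, 3, padding=1), nn.ReLU())  # level 1 layers up to the grafting node (GN)
head1 = nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(16, 10))
bridge = nn.Conv2d(16, 32, kernel_size=1)  # translates GN output into the form expected at the branch node (BN)
head2 = nn.Sequential(nn.Conv2d(32, 32, 3, padding=1), nn.ReLU(),
                      nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(32, 5))  # cut_model: level 2 network without its first layers

x = torch.randn(1, 3, 32, 32)     # input data provided to the first backbone
features = backbone1(x)           # extracted features at the GN
out1 = head1(features)            # processed by the head of the first model
out2 = head2(bridge(features))    # translated features processed by the head of the second model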
Regarding claim 12 and analogous claim 17, Heller teaches The method of claim 11 and The apparatus of claim 16, respectively.
Heller teaches wherein: the machine learning architecture further comprises a second bridge and a head of a third machine learning model, the machine learning architecture lacking a backbone of the third machine learning model ([3) MULTITERM LOSS FUNCTION, pg. 12933-12934] To define a third strategy that avoids sequential learning, we proposed a multiterm loss function weighted for each level of the hierarchy (Eq. 1). We use cross entropy for each of our Li loss functions (binary or categorical depending on the problem). This solution allows us to train each network at the same time and can be the machine learning architecture further comprises a second bridge and a head of a third machine learning model, the machine learning architecture lacking a backbone of the third machine learning model extended to multiple networks, adding as many terms as there are networks. The use of this multiterm loss function is similar to the techniques used for single architectures, mainly by B-CNN, which, according to the advancement in learning, gives different importance to the terms used. Lf = Σi αiLi (1).); and
the method further comprises: translating the extracted features using the second bridge to generate second translated features ([A. FUSION OF HETEROGENEOUS ARCHITECTURES, pg. 12931-12932] Unlike the SS-HCNN, which reuses the same base for the different levels, here, we seek to combine very different architectures. However, it is possible that the feature maps extracted at the GN representing the N th layer of the level 1 network cannot be directly used as input of the BN representing the Mth layer of the level 2 network. It is necessary to the second bridge to generate second translated features propose a merging solution to respond to these problems of consistency between the models, but two major difficulties must be handled. First, the number of feature maps extracted from the first network may not match the number of maps expected by the second network. Second, the size of each of these maps may not be that expected at the input of the second network.; [1) NUMBER MANAGEMENT, pg.12932] Among the problems of the correspondence between the number of maps, we can again distinguish two subproblems. Indeed, depending on whether we need to increase or reduce the number of maps, the problem must be treated differently, even if we use quite similar solutions. Let us start with the following situation: we want to translating the extracted features using the second bridge to generate second translated features use the feature maps extracted by the N th convolutional layer of the level 1 network as input to our second network at the Mth layer of this level 2 network since all of its layers up to the Mth layer have been discarded. Let us assume that the number, N1, of maps extracted from the first model is greater than the number, N2, of maps expected by the second model (N1> N2). We therefore need to reduce the number of maps while providing as much information as possible. For this, one solution is to use unsupervised clustering techniques, such as K-means, or dimensionality reduction techniques, such as PCA. To apply these techniques, each of the N1 maps is first recoded into a vector of size L, and then K-means or PCA is performed on the N1*L matrix to obtain N2 vectors of size L. These vectors are then recoded into matrices so that they can be used as input to the second network. Both solutions are unsupervised and do not require an annotation phase on the learning set. For reasons of execution speed, we have favored the use of PCA, although the use of clustering remains viable (we will return to this aspect in section IV).); and
processing the second translated features using the head of the third machine learning model ([A. FUSION OF HETEROGENEOUS ARCHITECTURES, pg. 12931] Rather than using a single kind of architecture, we seek to graft different architectures deprived of their first layers at strategic locations of the same level 1 network. Here, we mainly assume that the low-level features extracted by the level 1 network and those that would have been extracted by the second-level networks using their first layers are roughly the same. With this assumption, the processing the second translated features using the head of the third machine learning model features can therefore be used in level 2 networks without a significant impact on the final accuracy and with an improved inference time. We call the particular architectures used for level 2 networks ‘‘cut_models’’. The grafting should therefore be applied early in networks, before feature maps become too specific to the problem at hand, typically in the first layers of convolution. Thus, we extract the output of one of the first convolutional layers of a first-level network to benefit from relatively generic information. The selected layer corresponds to a ‘‘grafting node (GN)’’ of the first-level network. To avoid comprehension problems, we use the term ‘‘branch node (BN)’’ for the layer of the level 2 network in which we reinject the output of the GN. The selected branch node must be used to follow a convolution layer in a traditional architecture, typically a batch normalization layer. To select the BN level in network 2, to reduce the transformation complexity and to keep computation time low, the information used should be of a level of detail close to what is usually available at this level of depth. In other words, this BN of the level 2 network will be located at a depth close to that of the GN on the level 1 network in their respective architectures.).
Heller is combinable for the same rationale as set forth above with respect to claim 11.
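Heller's multiterm loss of equation (1), Lf = Σi αiLi with cross entropy for each Li, is what permits the shared backbone and the several heads mapped above to claims 12 and 17 (including the third head) to be trained jointly. A minimal hedged sketch of such a weighted loss follows; the batch sizes, class counts, and weights are illustrative assumptions only.

import torch
import torch.nn.functional as F

def multiterm_loss(outputs, targets, alphas):
    # Lf = sum_i alpha_i * L_i, cross entropy for each level of the hierarchy
    return sum(a * F.cross_entropy(o, t) for a, o, t in zip(alphas, outputs, targets))

# Hypothetical example with three heads (10, 5, and 3 classes) over a batch of 4
outputs = [torch.randn(4, 10, requires_grad=True),
           torch.randn(4, 5, requires_grad=True),
           torch.randn(4, 3, requires_grad=True)]
targets = [torch.randint(0, 10, (4,)), torch.randint(0, 5, (4,)), torch.randint(0, 3, (4,))]
loss = multiterm_loss(outputs, targets, alphas=[1.0, 0.5, 0.5])
loss.backward()  # in a full model, gradients would reach every head and the shared backbone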
Regarding claim 13 and analogous claim 18, Heller teaches The method of claim 11 and The apparatus of claim 16, respectively.
Heller teaches wherein the bridge is configured to translate between a first feature space associated with the first machine learning model and a second feature space associated with the second machine learning model, the second feature space representing a transformed version of the first feature space ([A. FUSION OF HETEROGENEOUS ARCHITECTURES, pg. 12931-12932] Unlike the SS-HCNN, which reuses the same base for the different levels, here, we seek to combine very different architectures. However, it is possible that the feature maps extracted at the GN representing the N th layer of the level 1 network cannot be directly used as input of the BN representing the Mth layer of the level 2 network. It is necessary to propose a wherein the bridge merging solution to respond to these problems of consistency between the models, but two major difficulties must be handled. First, the number of feature maps extracted from the first network may not match the number of maps expected by the second network. Second, the size of each of these maps may not be that expected at the input of the second network.; [1) NUMBER MANAGEMENT, pg.12932] Among the problems of the correspondence between the number of maps, we can again distinguish two subproblems. Indeed, depending on whether we need to increase or reduce the number of maps, the problem must be treated differently, even if we use quite similar solutions. Let us start with the following situation: we want to is configured to translate between a first feature space associated with the first machine learning model and a second feature space associated with the second machine learning model use the feature maps extracted by the N th convolutional layer of the level 1 network as input to our second network at the Mth layer of this level 2 network since all of its layers up to the Mth layer have been discarded. Let us assume that the number, N1, of maps extracted from the first model is greater than the number, N2, of maps expected by the second model (N1> N2). We therefore need to reduce the number of maps while providing as much information as possible. For this, one solution is to use unsupervised clustering techniques, such as K-means, or dimensionality reduction techniques, such as PCA. To apply these techniques, each of the N1 maps is first recoded into a vector of size L, and then K-means or PCA is performed on the N1*L matrix to the second feature space representing a transformed version of the first feature space obtain N2 vectors of size L. These vectors are then recoded into matrices so that they can be used as input to the second network. Both solutions are unsupervised and do not require an annotation phase on the learning set. For reasons of execution speed, we have favored the use of PCA, although the use of clustering remains viable (we will return to this aspect in section IV).).
Heller is combinable for the same rationale as set forth above with respect to claim 11.
Claims 5, 10 are rejected under 35 U.S.C. 103 as being unpatentable over Heller, in view of Peng, and further in view of Yosinski et al. (NPL: "How transferable are features in deep neural networks?", hereinafter 'Yosinski').
Regarding claim 5 and analogous claim 10, Heller, as modified by Peng, teaches The method of claim 1 and The apparatus of claim 6, respectively.
Heller, as modified by Peng, fails to teach further comprising one of when the first and second machine learning models were trained using training data from similar environments,
back-propagating error terms for target outputs of the inception basis set through a lesser number of layers of the backbone of the first machine learning model; and when the first and second machine learning models were trained using training data from dissimilar environments, back-propagating the error terms for the target outputs of the inception basis set through a greater number of layers of the backbone of the first machine learning model.
Yosinski teaches further comprising one of when the first and second machine learning models were trained using training data from similar environments, back-propagating error terms for target outputs of the inception basis set through a lesser number of layers of the backbone of the first machine learning model; and when the first and second machine learning models were trained using training data from dissimilar environments, back-propagating the error terms for the target outputs of the inception basis set through a greater number of layers of the backbone of the first machine learning model ([1 Introduction, pg. 2] The usual transfer learning approach is to train a base network and then copy its first n layers to the first n layers of a target network. The remaining layers of the target network are then randomly initialized and trained toward the target task. back-propagating error terms for target outputs of the inception basis set through a lesser number of layers of the backbone of the first machine learning model One can choose to backpropagate the errors from the new task into the base (copied) features to fine-tune them to the new task, or the transferred feature layers can be left frozen, meaning that they do not change during training on the new task. The choice of back-propagating the error terms for the target outputs of the inception basis set through a greater number of layers of the backbone of the first machine learning model whether or not to fine-tune the first n layers of the target network depends on the size of the target dataset and the number of parameters in the first n layers.; When generalizing to the other dataset, we would expect that the new high-level felid detectors trained on top of old low-level felid detectors would work well. comprising one of when the first and second machine learning models were trained using training data from similar environments Thus A and B are similar when created by randomly assigning classes to each, and we expect that transferred features will perform better when the first and second machine learning models were trained using training data from dissimilar environments than when A and B are less similar. Fortunately, in ImageNet we are also provided with a hierarchy of parent classes. This information allowed us to create a special split of the dataset into two halves that are as semantically different from each other as possible: with dataset A containing only man-made entities and B containing natural entities. The split is not quite even, with 551 classes in the man-made group and 449 in the natural group. Further details of this split and the classes in each half are given in the supplementary material. In Section 4.2 we will show that features transfer more poorly (i.e. they are more specific) when the datasets are less similar.).
Heller, Peng, and Yosinski are considered to be analogous to the claimed invention because they are in the same field of machine learning. In view of the teachings of Heller and Peng, it would have been obvious for a person of ordinary skill in the art to apply the teachings of Yosinski to Heller before the effective filing date of the claimed invention in order to quantify the degree to which a particular layer is general or specific, namely, how well features at that layer transfer from one task to another (cf. Yosinski, [1 Introduction, pg. 2] 1. We define a way to quantify the degree to which a particular layer is general or specific, namely, how well features at that layer transfer from one task to another (Section 2). We then train pairs of convolutional neural networks on the ImageNet dataset and characterize the layer-by-layer transition from general to specific (Section 4), which yields the following four results.).
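The choice Yosinski describes, copying the first n layers and then either freezing them or back-propagating errors into them, with more fine-tuning expected to help when the source and target data are less similar, can be illustrated with the short Python sketch below. Equating "similar environments" with a smaller fine-tuned depth and "dissimilar environments" with a larger one, and the layer counts used, are assumptions made only for illustration.

import torch.nn as nn

def set_finetune_depth(layers, n_finetune):
    # Back-propagate only into the last n_finetune copied layers; freeze the rest
    for i, layer in enumerate(layers):
        trainable = i >= len(layers) - n_finetune
        for p in layer.parameters():
            p.requires_grad = trainable

copied_layers = nn.ModuleList([nn.Linear(32, 32) for _ in range(6)])  # stand-in for the first n copied layers
set_finetune_depth(copied_layers, n_finetune=1)  # similar datasets: back-propagate through fewer layers
set_finetune_depth(copied_layers, n_finetune=5)  # dissimilar datasets: back-propagate through more layers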
Conclusion
The prior art made of record and not relied upon is considered pertinent to applicant's disclosure.
Hassan et al. (U.S. Pre-Grant Publication No. 20220398405) teaches a method comprising inputting a plurality of labeled examples into a multi-task network, the multi-task network comprising: a backbone network, the backbone network generating one or more feature vectors corresponding to each of the labeled examples, and a plurality of prediction heads coupled to the backbone network; minimizing a joint loss based on outputs of the plurality of prediction heads, the minimizing the joint loss causing a change in parameters of the backbone network; and storing a distraction classification model after minimizing the joint loss, the distraction classification model comprising the parameters of the backbone network and parameters of at least one of the prediction heads.
Any inquiry concerning this communication or earlier communications from the examiner should be directed to MAGGIE MAIDO whose telephone number is (703) 756-1953. The examiner can normally be reached M-Th: 6am - 4pm.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Michael Huntley can be reached on (303) 297-4307. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.
/MM/Examiner, Art Unit 2129
/MICHAEL J HUNTLEY/Supervisory Patent Examiner, Art Unit 2129