Prosecution Insights
Last updated: April 19, 2026
Application No. 17/942,992

INFORMATION PROCESSING APPARATUS FOR GENERATING MACHINE LEARNING MODEL BY COUPLING ONE FEATURE EXTRACTOR TO A PLURALITY OF PREDICTORS, TRAINING MODEL, AND EXTRACTING FEATURE EXTRACTOR WITHOUT PREDICTORS AS TRAINED MODEL OPERABLE FOR INFERENCE, AND METHOD AND NON-TRANSITORY COMPUTER READABLE MEDIUM HAVING PROGRAM STORED THEREON FOR SAME

Final Rejection — §102, §103
Filed
Sep 12, 2022
Examiner
BALAKRISHNAN, VIJAY MURALI
Art Unit
2143
Tech Center
2100 — Computer Architecture & Software
Assignee
Kabushiki Kaisha Toshiba
OA Round
2 (Final)
43% Grant Probability (Moderate)
3-4 OA Rounds
3y 12m To Grant
99% With Interview

Examiner Intelligence

Career Allow Rate: 43% (grants 43% of resolved cases; 6 granted / 14 resolved; -12.1% vs TC avg)
Interview Lift: strong (+85.7%, resolved cases with an interview vs. without)
Typical Timeline: 3y 12m avg prosecution; 26 applications currently pending
Career History: 40 total applications across all art units

Statute-Specific Performance

§101: 26.4% (-13.6% vs TC avg)
§102: 13.2% (-26.8% vs TC avg)
§103: 31.5% (-8.5% vs TC avg)
§112: 24.3% (-15.7% vs TC avg)
Tech Center average figures are estimates • Based on career data from 14 resolved cases

Office Action

§102 §103
DETAILED ACTION

This final action is in response to the amendment and remarks filed on 10/30/2025 for application 17/942,992. Claims 1, 5, 8, and 10-12 have been amended. Claim 6 is cancelled. Claims 13-18 are newly added claims. Claims 1-5 and 7-18 thereby remain pending in the application. Claims 1, 11, and 12 are the pending independent claims.

Notice of Pre-AIA or AIA Status

The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Response to Amendment

The amendment filed 10/30/2025 has been entered. Applicant's amendment to the specification with respect to resolving title and specification objections has been considered, and the objections set forth in the office action mailed 07/30/2025 are consequently withdrawn. Applicant's amendment to the claims with respect to resolving claim objections and indefiniteness rejections under 35 U.S.C. 112(b) has been considered, and the objections and 112(b) rejections set forth in the office action mailed 07/30/2025 are consequently withdrawn.

Claim Interpretation

As recited in MPEP § 2111, during patent examination, "the pending claims must be given their broadest reasonable interpretation consistent with the specification". Under a broadest reasonable interpretation (BRI), claim terms must be given their plain and ordinary meaning (i.e., the meaning that the term would have to a person of ordinary skill in the art), unless applicant sets forth a special definition of a claim term within the specification. The plain and ordinary meaning of a term "may be evidenced by a variety of sources, including the words of the claims themselves, the specification, drawings, and prior art".

Independent claim 1 (and corresponding claims 11 and 12) recites the limitations "wherein the feature extractor comprises a plurality of convolutional layers configured to generate a shared feature map from input data" and "whereby the trained model is constituted by the extracted feature extractor and excludes the plurality of predictors, the trained model being operable to execute inference on new input data to perform the specific task". Dependent claim 13 (and corresponding claims 14 and 15) recites the limitation "wherein the specific task comprises classifying input image data into one of a plurality of categories". Dependent claim 16 (and corresponding claims 17 and 18) recites the limitation "wherein the specific task comprises performing object detection on input image data".

As best understood in light of the instant specification, the examiner has interpreted the transitional term "constituted by" to be open-ended (i.e., synonymous with "comprises", "includes", or "contains" (see MPEP § 2111.03)), such that the "trained model" may be inclusive of components in addition to the "extracted feature extractor", and is only expressly exclusive of "the plurality of predictors". The examiner further notes that, per a plain understanding of a feature extractor comprising convolutional layers, such an architecture would not be ordinarily capable of performing the recited tasks of "classifying input image data" or "performing object detection on input image data" on its own. A feature extractor is typically understood in the art to extract and output a compressed feature representation from provided input data, and the specification [¶ 0014] and claims ("generate a shared feature map from input data") appear to be in concordance with such an interpretation.
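To illustrate the distinction drawn here, the following is a minimal, hypothetical PyTorch-style sketch (illustrative only; it is not code from the application or from any cited reference, and the layer sizes are arbitrary). The convolutional stack by itself emits a feature map; a pooling step and a fully connected (softmax) head must be appended before a classification output is produced.

    import torch
    import torch.nn as nn

    # Hypothetical convolutional feature extractor: outputs a feature map, not class scores.
    feature_extractor = nn.Sequential(
        nn.Conv2d(3, 16, kernel_size=3, padding=1), nn.ReLU(),
        nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
    )

    # A subsequent classifier head is required to map extracted features to categories.
    classifier_head = nn.Sequential(
        nn.AdaptiveAvgPool2d(1),          # pool the feature map down to a feature vector
        nn.Flatten(),
        nn.Linear(32, 10),                # fully connected layer producing 10 class logits
    )

    x = torch.randn(1, 3, 224, 224)       # dummy input image
    features = feature_extractor(x)       # shape (1, 32, 224, 224): a feature map only
    logits = classifier_head(features)    # shape (1, 10): class scores require the added head
    probs = torch.softmax(logits, dim=1)  # classification output over the 10 categories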
It would be thereby understood by one of ordinary skill in the art that, at minimum, a subsequent classifier (e.g., SVM, softmax) and/or fully connected layer taking extracted features as input would be necessary in order to satisfy the tasks of image classification (see also Yamashita et al., "Convolutional neural networks: an overview and application in radiology" [page 612 What is CNN: the big picture and page 613 Fig. 1 – An overview of a convolutional neural network (CNN) architecture and the training process]) or object detection (see also Zhao et al., "Object Detection with Deep Learning: A Review" [page 1 Introduction]) as recited in the dependent claims. Because applicant's specification also does not provide any further description explaining the technical implementation of these downstream procedures (and at best, solely recites "The extracted features extractor can be used in downstream tasks such as classification and object detection" [¶ 0019]), the scope of the claimed implementation of these procedures in the dependent claims is limited to what is well-known in the art, and ultimately necessitates an open-ended interpretation of the recited trained model as comprising more than the feature extractor itself to be capable of performing the recited tasks.

Claim Rejections - 35 USC § 103

In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis (i.e., changing from AIA to pre-AIA) for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.

The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:

A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

Claims 1-3, 7, and 11-18 are rejected under 35 U.S.C. 103 as being unpatentable over Chen et al. ("One-Shot Neural Ensemble Architecture Search by Diversity-Guided Search Space Shrinking", published 2021), hereinafter Chen, further in view of Babagholami Mohamadabadi et al. (Pub No. US 20220301296 A1, "Multi-Expert Adversarial Regularization for Robust and Data-Efficient Deep Supervised Learning", effectively filed 03/12/2021), hereinafter Mohamadabadi, and Yamashita et al. ("Convolutional neural networks: an overview and application in radiology", published 22 June 2018), hereinafter Yamashita.

Regarding claim 1, Chen teaches An information processing apparatus comprising a processor ("To further build models with better generalization capability and performance, model ensemble is usually adopted and performs better than stand-alone models. Inspired by the merits of model ensemble, we propose to search for multiple diverse models simultaneously as an alternative way to find powerful models. Searching for ensembles is non-trivial and has two key challenges: enlarged search space and potentially more complexity for the searched model.
In this paper, we propose a one-shot neural ensemble architecture search (NEAS) solution that addresses the two challenges. For the first challenge, we introduce a novel diversity-based metric to guide search space shrinking, considering both the potentiality and diversity of candidate operators. For the second challenge, we enable a new search dimension to learn layer sharing among different models for efficiency purposes" [Chen Abstract]; "To further evaluate the generalization ability of the architectures found by NEAS, we transfer the architectures to the downstream COCO [22] object detection task. We use the NEAS-S (pre-trained 500 epochs on ImageNet) as a drop-in replacement for the backbone feature extractor in RetinaNet [21] and compare it with other backbone networks. We perform training on the train2017 set (around 118k images) and evaluation on the val2017 set (5k images) with 32 batch sizes using 8 V100 GPUs" [Chen page 8 Generalization Ability and Robustness]; The one-shot neural ensemble architecture search (NEAS) method can process image data and is executable on GPU processors)

configured to: generate a machine learning model by coupling one feature extractor to each of a plurality of predictors, each predictor being coupled to the same feature extractor, the feature extractor being configured to extract a feature amount of data; ("…we consider to share the shallow layers of different ensemble components. We propose to search for diverse ensemble components with shared shallow layers and different deep layers to reduce the computation cost. To automatically find which layers should be shared, we design a new search dimension called split point. The split point defines where the ensemble model will have heterogeneous architectures" [Chen page 5 Layer Sharing Among Ensemble Components]; see Figure 3 including 3(b) Model Searched by NEAS – 3(b) includes shallow layers DS 4 3x3 SE, MB 6 7x7 SE, and MB 4 3x3 SE above Split Point [Chen page 5]; "Search Space. Consistent with previous NAS methods [10, 6, 37], our search space includes a stack of mobile inverted bottleneck residual blocks (MBConv)… For details, there are 7 basic operators for each layer, including MBConv" [Chen page 6 Implementation Details]; see MBConv/SkipConnect and Depthwise Separable Conv rows in Table 7, The structure of the supernet [Chen page 12 A-II: Supernet Structure and Search Space]; "…selected paths make predictions independently, and our ensemble network's output is the average of predictions from all paths" [Chen page 5 Neural Ensemble Architecture Search]; As shown in 3(b), the search space of NEAS is comprised of convolution operations (MB/MBConv, DS/Depthwise Separable Conv) which perform feature extraction; the convolution operations are part of the shallow layers architecture (i.e., feature extractor) that is shared among the different deep layers (i.e., predictors), wherein the deep layers form the different paths that make predictions)

and train the machine learning model for a specific task using a result of ensembling a plurality of outputs from the predictors, ("Given the search space Ω of single deep neural networks, denote A = {φ_k ∈ Ω : k = 1, ..., K} as a set of K architectures with corresponding parameters W = {ω_k : k = 1, ..., K}, Φ(·; A, W) as the ensemble model, and S = Ω^K as the search space of ensemble models…In our work we specify Φ(·; A, W) as: [equation 1] We then formulate NEAS as a two-stage optimization problem like other one-shot methods (e.g., [10]). The first-stage is to optimize the weight of the supernet by: [equation 2] where L_train is the loss function on the training set, W(A) means architectures in A inherit weights from W. This step is done by uniformly sampling an ensemble architecture Φ from S and performing backpropagation to update the weight of the corresponding blocks in the supernet for each iteration" [Chen page 3 NEAS Formulation]; "The loss L_i of each path φ_i is computed independently while the backpropagation is performed using the combined loss L = Σ_{i}^{K} L_i to update the weights of corresponding blocks in the supernet. Following this updating process, the whole network is still trained in an end-to-end style…selected paths make predictions independently, and our ensemble network's output is the average of predictions from all paths" [Chen page 5 Neural Ensemble Architecture Search]; "To further evaluate the generalization ability of the architectures found by NEAS, we transfer the architectures to the downstream COCO [22] object detection task. We use the NEAS-S (pre-trained 500 epochs on ImageNet) as a drop-in replacement for the backbone feature extractor in RetinaNet [21] and compare it with other backbone network" [Chen page 8 Generalization Ability and Robustness]; In each training iteration, end-to-end backpropagation is performed based on loss function L_train that takes Φ(·; A, W) as a parameter [see equation 2], wherein the output of Φ(·; A, W) is an average (i.e., ensembled) result of path (i.e., predictor) outputs [see equation 1]. The resulting trained architecture may then be applied for a given downstream transfer learning task (e.g., object detection)).

wherein the feature extractor comprises a plurality of convolutional layers configured to generate a shared feature map from input data, and each of the plurality of predictors comprises at least one convolutional layer configured to receive the shared feature map and output a respective prediction, (see Figure 3 including 3(b) Model Searched by NEAS – 3(b) details shared shallow convolutional layers (i.e., feature extractor), e.g., layers DS 4 3x3 SE, MB 6 7x7 SE, and MB 4 3x3 SE in this instance, above Split Point, as well as each path (i.e., predictor) below Split Point, each having at least one convolutional layer (e.g., MB, Conv), receiving output (i.e., shared feature map) from the last shared layer above Split Point (MB 4 3x3 SE in this instance) [Chen page 5]; "…selected paths make predictions independently, and our ensemble network's output is the average of predictions from all paths" [Chen page 5 Neural Ensemble Architecture Search]; As shown in 3(b), the search space of NEAS is comprised of convolution operations (MB/MBConv, DS/Depthwise Separable Conv) which perform feature extraction; the convolution operations are part of the shallow layers architecture (i.e., feature extractor) that is shared among the different deep layers (i.e., predictors), wherein the deep layers form the different paths that make individual predictions)

wherein the processor is configured to ensemble the plurality of predictions to generate a training signal used to update parameters of the feature extractor ("The loss L_i of each path φ_i is computed independently while the backpropagation is performed using the combined loss L = Σ_{i}^{K} L_i to update the weights of corresponding blocks in the supernet. Following this updating process, the whole network is still trained in an end-to-end style…selected paths make predictions independently, and our ensemble network's output is the average of predictions from all paths" [Chen page 5 Neural Ensemble Architecture Search]; In each training iteration, end-to-end backpropagation (i.e., updating entire model parameters, including shared layers) is performed based on loss function L_train that takes Φ(·; A, W) as a parameter (i.e., signal) [see equation 2], wherein the output of Φ(·; A, W) is an average (i.e., ensembled) result of path (i.e., predictor) outputs [see equation 1])),

wherein the processor is further configured to extract, upon completion of training of the machine learning model, a feature extractor included in the machine learning model as a trained model ("In specific, we randomly sample the split point s, the architecture of sharing layers A_sharing = {o_1, o_2, ..., o_s}, and the operator combinations A_split = {h_{s+1}, h_{s+2}, ..., h_d} for the rest of layers from the shrunk search space…Following this updating process, the whole network is still trained in an end-to-end style…After obtaining the trained supernet, we perform evolution search on it to obtain an optimal ensemble model. At the beginning of the evolution search, we pick N_seed random architecture as seeds. The top k architectures are picked as parents to generate the next generation by crossover and mutation. In one crossover, two randomly selected candidates are picked and crossed to produce a new one during each generation…In one mutation, a candidate mutates its split point with a probability P_s. If the split point increases, the number of sharing layers increases with the same number. We randomly pick one path and move its corresponding architectures to the sharing architecture. Otherwise, if the split point decreases, we cut the sharing architecture and add it to each path's architecture" [Chen page 5 Neural Ensemble Architecture Search]; "The split point space is set to range (9, 20) to handle different complexity constrains" [Chen page 6 Implementation Details]; The split point of a trained ensemble model is mutable within a wide range and the shared layer count can be adjusted at will, i.e., it is suggested the processor is capable of extracting the shared layers architecture A_sharing (i.e., layers of feature extractor) above the split point from the overall trained network).
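As a schematic illustration of the claimed arrangement discussed above (a hypothetical PyTorch-style sketch, not Chen's NEAS implementation; the two-layer backbone, the three heads, and the SGD settings are assumptions), one feature extractor can be coupled to several predictor heads, the averaged head predictions can supply the training signal that updates the shared layers end to end, and only the feature extractor is retained once training completes:

    import torch
    import torch.nn as nn

    # Hypothetical shared convolutional feature extractor (the shared shallow layers).
    backbone = nn.Sequential(
        nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
        nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(),
    )

    # Several predictor heads, each with at least one convolutional layer,
    # all consuming the same shared feature map.
    def make_head():
        return nn.Sequential(
            nn.Conv2d(32, 32, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(32, 10),
        )

    heads = nn.ModuleList([make_head() for _ in range(3)])
    optimizer = torch.optim.SGD(list(backbone.parameters()) + list(heads.parameters()), lr=0.01)
    criterion = nn.CrossEntropyLoss()

    x = torch.randn(8, 3, 32, 32)             # dummy training batch
    y = torch.randint(0, 10, (8,))            # dummy labels

    shared_map = backbone(x)                  # shared feature map
    predictions = [head(shared_map) for head in heads]
    ensembled = torch.stack(predictions).mean(dim=0)   # ensemble of the head outputs

    # The ensembled result drives the training signal; backpropagation updates
    # the shared feature extractor together with every head (end-to-end training).
    loss = criterion(ensembled, y)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    # After training, only the feature extractor is kept as the trained model;
    # the predictor heads are not carried over.
    torch.save(backbone.state_dict(), "trained_feature_extractor.pt")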
However, Chen does not expressly teach separate utilization of the feature extractor, and therefore does not expressly teach or suggest the processor [being] further configured to extract the feature extractor included in the machine learning model as a trained model without extracting the plurality of predictors, whereby the trained model is constituted by the extracted feature extractor and excludes the plurality of predictors, the trained model being operable to execute inference on new input data to perform the specific task, wherein the processor is further configured to execute inference using the trained model on the new input data to thereby perform the specific task.

In the same field of endeavor, Mohamadabadi teaches a system of training an ensemble architecture comprising a shared feature extractor ("The subject matter disclosed herein provides a deep learning model, referred to herein as a Multi-Expert Adversarial Regularization (MEAR) learning model, that is an effective tool for two important computer-vision tasks, image classification and semantic segmentation. The MEAR learning model involves a limited computational overhead and thereby improves generalization and robustness of deep-supervised learning models. In one embodiment, the MEAR learning model may include a single feature extractor and multiple classifier heads (experts). The MEAR learning model aims to learn the extractor in an adversarial manner by leveraging complementary information from the multiple classifier heads and ensemble to be more robust for an unseen test domain." [Mohamadabadi ¶ 0029]; "To reduce computations and promote information sharing, all experts share the same CNN feature extractor denoted by f(⋅) with parameter θ_f, the neural network weights of the feature extractor" [Mohamadabadi ¶ 0035]) that is configured for extract[ing] the feature extractor included in the machine learning model as a trained model without extracting the plurality of predictors, whereby the trained model is constituted by the extracted feature extractor and excludes the plurality of predictors, the trained model being operable to execute inference on new input data ("The MEAR learning model aims to learn the extractor in an adversarial manner by leveraging complementary information from the multiple classifier heads and ensemble to be more robust for an unseen test domain." [Mohamadabadi ¶ 0029]; "The MEAR learning model 100 may improve the generalization and robustness of the ensemble by using the knowledge of each expert to teach the feature extractor and other experts in an adversarial fashion" [Mohamadabadi ¶ 0041]; "In order to ensure the ensemble prediction for strongly augmented samples is consistent with the target labels, a multi-expert consensus loss may be used in which f(⋅) may be encouraged to generate features so that the ensemble predictions for strongly augmented examples may be close to the target annotations/labels. Such a design may explicitly teach the feature extractor how to handle data from unseen domains (mimicked by strong augmentation), thereby improving robustness to new domains" [Mohamadabadi ¶ 0041]; The disclosed MEAR ensemble model expressly discloses training the feature extractor itself, via leveraging knowledge from the expert heads through a multi-expert consensus loss in an adversarial framework, to handle data from unseen test domains, and thereby suggests the feature extractor being usable as its own trained model, separate from the expert heads, capable of generating robust feature representations (i.e., executing inference) from test data of different domains (i.e., new input data)).

It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have incorporated extract[ing] the feature extractor included in the machine learning model as a trained model without extracting the plurality of predictors, whereby the trained model is constituted by the extracted feature extractor and excludes the plurality of predictors, the trained model being operable to execute inference on new input data as taught by Mohamadabadi into Chen because they are both directed towards training an ensemble architecture comprising a shared feature extractor.
Incorporating the adversarial regularization training techniques taught by Mohamadabadi into a modification of the NEAS framework of Chen would improve the overall robustness and generalization ability of the final architecture to downstream tasks [Mohamadabadi ¶ 0029-0030, 0032], which was already established by Chen as being a desirable feature [Chen page 8 Generalization Ability and Robustness]. However, although Mohamadabadi suggests separate utilization of the feature extractor for unseen test data, the combination does not expressly teach the trained model being operable to execute inference on new input data to perform [a] specific task, wherein the processor is further configured to execute inference using the trained model on the new input data to thereby perform the specific task. In the same field of endeavor, Yamashita teaches a review of training convolutional neural network (CNN) architectures comprising feature extraction layers followed by classifiers (“This article focuses on the basic concepts of CNN and their application to various radiology tasks, and discusses its challenges and future directions” [Yamashita Abstract]; “CNN is a mathematical construct that is typically composed of three types of layers (or building blocks): convolution, pooling, and fully connected layers. The first two, convolution and pooling layers, perform feature extraction, whereas the third, a fully connected layer, maps the extracted features into final output, such as classification” [Yamashita page 612 What is CNN: the big picture (Fig. 1)]) wherein [a] trained model (i.e., pre-trained feature extractor) is operable to execute inference on new input data to perform a specific task, wherein the processor is further configured to execute inference using the trained model on the new input data to thereby perform the specific task (see Fig. 10 including Pretrained convolutional base and Pretrained FC layers in Pretrained network (left column), wherein the Pretrained FC layers are then substituted with New classifier in Fixed feature extraction method (middle column) – “Transfer learning is a common and effective strategy to train a network on a small dataset, where a network is pretrained on an extremely large dataset, such as ImageNet, then reused and applied to the given task of interest. A fixed feature extraction method is a process to remove FC layers from a pretrained network and while maintaining the remaining network, which consists of a series of convolution and pooling layers, referred to as the convolutional base, as a fixed feature extractor. In this scenario, any machine learning classifier, such as random forests and support vector machines, as well as the usual FC layers, can be added on top of the fixed feature extractor, resulting in training limited to the added classifier on a given dataset of interest” [Yamashita page 11]; The reference discloses the fixed feature extraction method as a commonly known transfer learning technique, wherein a pre-trained convolutional feature extractor is operable, via substitution of a subsequent classifier (see also Claim Interpretation above), to execute inference on a given task of interest (i.e., specific task) for a given dataset of interest (i.e., new input data)). 
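For illustration, a minimal hypothetical sketch of the fixed feature extraction method Yamashita describes (PyTorch-style code; the layer sizes, the 5-class task of interest, and the optimizer are assumptions): the pretrained convolutional base is frozen, a new classifier is added on top, training is limited to that added classifier, and the resulting model then performs inference on the new data of interest.

    import torch
    import torch.nn as nn

    # Hypothetical pretrained convolutional base, reused as a fixed feature extractor.
    pretrained_base = nn.Sequential(
        nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
        nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(),
    )
    # pretrained_base.load_state_dict(torch.load("trained_feature_extractor.pt"))  # weights from prior training

    for p in pretrained_base.parameters():    # freeze the extractor; it is not updated
        p.requires_grad = False

    # New classifier added on top of the fixed feature extractor.
    new_classifier = nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(32, 5))
    optimizer = torch.optim.Adam(new_classifier.parameters(), lr=1e-3)
    criterion = nn.CrossEntropyLoss()

    x = torch.randn(4, 3, 32, 32)             # small dataset of interest (dummy)
    y = torch.randint(0, 5, (4,))

    with torch.no_grad():                     # features come from the frozen extractor
        feats = pretrained_base(x)
    loss = criterion(new_classifier(feats), y)   # training is limited to the added classifier
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    # Inference on new input data for the specific task (here, 5-way classification).
    prediction = new_classifier(pretrained_base(torch.randn(1, 3, 32, 32))).argmax(dim=1)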
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have incorporated the trained model being operable to execute inference on new input data to perform [a] specific task, wherein the processor is further configured to execute inference using the trained model on the new input data to thereby perform the specific task as taught by Yamashita into the combination of Chen and Mohamadabadi because they are all directed towards training convolutional neural network (CNN) architectures comprising feature extraction layers followed by classifiers. Given that the combination of Chen and Mohamadabadi already expressly suggests separate utility of the robust and generalizable pre-trained feature extractor, and that the fixed feature extraction method, as explained in Yamashita, is a well-known and efficient transfer learning technique for leveraging the power of a pre-trained model, a person of ordinary skill in the art would recognize the value of incorporating the trained feature extractor of the final model into downstream learning tasks, thereby reducing amount of training data and time necessary during inference for specific domains (“The underlying assumption of transfer learning is that generic features learned on a large enough dataset can be shared among seemingly disparate datasets. This portability of learned generic features is a unique advantage of deep learning that makes itself useful in various domain tasks with small datasets” [Yamashita page 620 Training on a small dataset]). Regarding claim 2, the combination of Chen, Mohamadabadi, and Yamashita teaches the limitations of parent claim 1, and Chen further teaches wherein the plurality of predictors differ in configuration (see 3(b) Model Searched by NEAS in Figure 3 – the paths of deep layers (i.e., predictors) below Split Point vary in configuration [Chen page 5]) Regarding claim 3, the combination of Chen, Mohamadabadi, and Yamashita teaches the limitations of parent claim 1, and Chen further teaches wherein the plurality of predictors differ in at least one of weight coefficient, number of layers, number of nodes, or network structure. (see 3(b) Model Searched by NEAS in Figure 3 – the paths of deep layers (i.e., predictors) below Split Point vary in number of layers [Chen page 5]) Regarding claim 7, the combination of Chen, Mohamadabadi, and Yamashita teaches the limitations of parent claim 1, and Chen further teaches wherein the processor trains the machine learning model based on a loss function using an additive average or a weighted average of the outputs of the plurality of predictors ([Chen page 3 NEAS Formulation] and [Chen page 5 Neural Ensemble Architecture Search] as detailed above; In each training iteration, backpropagation is performed based on loss function Ltrain that takes Φ(·; A, W) as a parameter [see equation 2], wherein the output of Φ(·; A, W) is an average of path (i.e, predictor) outputs [see equation 1]). Regarding claim 11, it is a method claim that corresponds to the apparatus of claim 1. Consequently, claim 11 is rejected for the same reasons as claim 1. Regarding claim 12, it is a product claim that corresponds to the apparatus of claim 1. 
Chen further teaches A non-transitory computer readable medium including computer executable instructions, wherein the instructions, when executed by a processor, cause the processor to perform a method comprising: the claimed functions (“To further evaluate the generalization ability of the architectures found by NEAS, we transfer the architectures to the downstream COCO [22] object detection task. We use the NEAS-S (pre-trained 500 epochs on ImageNet) as a drop-in replacement for the backbone feature extractor in RetinaNet [21] and compare it with other backbone networks. We perform training on the train2017 set (around 118k images) and evaluation on the val2017 set (5k images) with 32 batch sizes using 8 V100 GPUs” [Chen page 8 Generalization Ability and Robustness]; The one-shot neural ensemble architecture search (NEAS) method is executable on GPU processors). Consequently, claim 12 is rejected for the same reasons as claim 1. Regarding claim 13, the combination of Chen, Mohamadabadi, and Yamashita teaches the limitations of parent claim 1, and Yamashita further teaches the specific task comprising classifying input image data into one of a plurality of categories (see Fig. 10 – “a network is pretrained on an extremely large dataset, such as ImageNet, then reused and applied to the given task of interest” [Yamashita page 621]; “The most established algorithm among various deep learning models is convolutional neural network (CNN), a class of artificial neural networks that has been a dominant method in computer vision tasks since the astonishing results were shared on the object recognition competition known as the ImageNet Large Scale Visual Recognition Competition (ILSVRC) in 2012” [Yamashita page 612 Introduction]; “Although deep learning has become a dominant method in a variety of complex tasks such as image classification and object detection, it is not a panacea.” [Yamashita page 627 Conclusion]; It is well understood in the art that the disclosed fixed feature extraction technique (via substitution of a subsequent classifier – also see rejection of claim 1 and Claim Interpretation above) may be applied for performing a variety of downstream tasks of interest, including computer vision tasks such as image classification) Regarding claims 14 and 15, they are method and product claims that correspond to the apparatus of claim 13. Consequently, they are rejected for the same reasons as claim 13. Regarding claim 16, the combination of Chen, Mohamadabadi, and Yamashita teaches the limitations of parent claim 1, and Yamashita further teaches wherein the specific task comprises performing object detection on input image data (see Fig. 
10 – “a network is pretrained on an extremely large dataset, such as ImageNet, then reused and applied to the given task of interest” [Yamashita page 621]; “The most established algorithm among various deep learning models is convolutional neural network (CNN), a class of artificial neural networks that has been a dominant method in computer vision tasks since the astonishing results were shared on the object recognition competition known as the ImageNet Large Scale Visual Recognition Competition (ILSVRC) in 2012” [Yamashita page 612 Introduction]; “Although deep learning has become a dominant method in a variety of complex tasks such as image classification and object detection, it is not a panacea.” [Yamashita page 627 Conclusion]; It is well understood in the art that the disclosed fixed feature extraction technique (via substitution of a subsequent classifier – also see rejection of claim 1 and Claim Interpretation above) may be applied for performing a variety of downstream tasks of interest, including computer vision tasks such as object detection). Regarding claims 17 and 18, they are method and product claims that correspond to the apparatus of claim 16. Consequently, they are rejected for the same reasons as claim 16. Claims 4-5 are rejected under 35 U.S.C. 103 as being unpatentable over the combination of Chen, Mohamadabadi, and Yamashita, as applied to claim 1 above, further in view of Herron et al., ("Ensembles of Networks Produced from Neural Architecture Search", published 2020), hereinafter Herron. Regarding claim 4, the combination of Chen, Mohamadabadi, and Yamashita teaches the limitations of parent claim 1, and Chen further teaches wherein the plurality of predictors include dropouts (“We retrain the discovered architectures for 350 epochs on ImageNet using similar settings as EfficientNet [37]: RMSProp optimizer with momentum 0.9 and decay 0.9, weight decay 1e-5, dropout ratio 0.2, initial learning rate 0.064 with a warmup in the first 10 epochs and a cosine annealing” [Chen page 6 Implementation Details]). However, the combination does not explicitly teach wherein the plurality of predictors include dropouts so as to differ in network structure when training, or differ in at least one of number of dropouts, dropout position, or regularization value. In the same field of endeavor, Herron teaches a system that applies neural architecture search (NAS) techniques to ensemble learning (“Much of the focus has been on how to produce a single best network to solve a machine learning problem, but as NAS methods produce many networks that work very well, this affords the opportunity to ensemble these networks to produce an improved result” [Herron Abstract]; “In this work, we will study the results of ensembling networks produced by one such NAS method. The NAS method used is Multi-node Evolutionary Neural Networks for Deep Learning (MENNDL). It produces a variety of deep learning networks that perform well on the given dataset… In this report, we create and evaluate the performances of ensembles of the best performing networks produced by one or more runs of MENNDL” [Herron page 2 Introduction]) wherein the plurality of predictors include dropouts so as to differ in network structure when training, or differ in at least one of number of dropouts, dropout position, or regularization value (“We created ensembles of the top networks across one or more runs of MENNDL against two different datasets: MNIST and CIFAR-10…. 
Each of the chosen networks were evaluated on the test set, producing softmax outputs that were averaged to obtain the final predictions” [Herron page 5 Ensembles of MENNDL Generated Networks]; see Fig. 1. Top network architectures including Dropout layers at different positions [Herron page 6]) It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have incorporated wherein the plurality of predictors include dropouts so as to differ in network structure when training, or differ in at least one of number of dropouts, dropout position, or regularization value as taught by Herron into the combination of Chen, Mohamadabadi, and Yamashita because both Chen and Herron are directed towards systems that apply neural architecture search (NAS) techniques to ensemble learning. Given that Chen already uses dropout and data augmentation techniques (“We retrain the discovered architectures for 350 epochs on ImageNet using…dropout ratio 0.2…AutoAugment [7] and exponential moving average are also used for training” [Chen page 6 Implementation Details] and prioritizes diverse model architecture (“Different from the above methods, we perform explicit ensemble without separate training and search for diverse model architectures to build ensemble models with great feature expression ability” [Chen page 2 Related works]), one of ordinary skill in the art would recognize the value of incorporating the teachings of Herron to further boost model architecture diversity within the NEAS system of Chen (“Additionally, the diversity of network structures produced by NAS drives a natural bias towards diversity of predictions produced by the individual networks. This results in an improved ensemble over simply creating an ensemble that contains duplicates of the best network architecture retrained to have unique weights” [Herron Abstract]; “Figure 1 illustrates the architectures of the top networks produced by eight separate runs of MENNDL against the CIFAR-10 image dataset. Note that the architectures of the best performing networks produced by each run are diverse, yet each network performs comparably on the validation sets [Herron page 5 MENDDL]). Regarding claim 5, the combination of Chen, Mohamadabadi, and Yamashita teaches the limitations of parent claim 1. However, the combination does not expressly teach wherein each predictor includes a convolutional layer, and the plurality of predictors differ in position of a pooling layer. In the same field of endeavor, Herron teaches a system that applies neural architecture search (NAS) techniques to ensemble learning ([Herron page 2 Introduction], as detailed above) wherein each predictor includes a convolutional layer, and the plurality of predictors differ in position of a pooling layer ([Herron page 5 Ensembles of MENNDL Generated Networks], as detailed above; “Multi-node Evolutionary Neural Networks for Deep Learning (MENNDL) is a software framework that implements an evolutionary algorithm for optimizing neural network topology and hyperparameters. More specifically, it can optimize the number of layers, layer type for each layer, and the corresponding layer hyperparameters” [Herron page 5 MENDL]; see Fig. 1. Top network architectures including convolution (e.g., Conv2D) and pooling (e.g., AvgPool2D) layers at different positions [Herron page 6]; MENDDL performs optimization by manipulating layer type (including convolution and pooling), number of layers, order of layers, etc. 
for any network architecture (i.e., predictor) within an ensemble).

It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have incorporated wherein if the plurality of predictors each include a convolutional layer, the plurality of predictors differ in position of a pooling layer as taught by Herron into the combination of Chen, Mohamadabadi, and Yamashita because both Chen and Herron are directed towards systems that apply neural architecture search (NAS) techniques to ensemble learning. Given that Chen already prioritizes diverse model architecture ([Chen page 2 Related works], as detailed above), one of ordinary skill in the art would recognize the value of incorporating the teachings of Herron to further boost model architecture diversity within the NEAS system of Chen ([Herron Abstract] and [Herron page 5 MENNDL], as detailed above).
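As an illustration of this kind of head-level structural diversity (hypothetical layer choices; these are not the architectures actually produced by MENNDL or searched by Chen), predictor heads that consume the same shared feature map can differ in the number and position of dropout layers and in the position of a pooling layer:

    import torch.nn as nn

    # Three hypothetical predictor heads over a shared 32-channel feature map.
    head_a = nn.Sequential(                     # one dropout layer, placed early
        nn.Conv2d(32, 32, 3, padding=1), nn.ReLU(),
        nn.Dropout(0.2),
        nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(32, 10),
    )
    head_b = nn.Sequential(                     # dropout placed after pooling, different rate
        nn.Conv2d(32, 32, 3, padding=1), nn.ReLU(),
        nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        nn.Dropout(0.5), nn.Linear(32, 10),
    )
    head_c = nn.Sequential(                     # pooling before the convolution, two dropout layers
        nn.AvgPool2d(2),
        nn.Conv2d(32, 32, 3, padding=1), nn.ReLU(),
        nn.Dropout(0.2),
        nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        nn.Dropout(0.2), nn.Linear(32, 10),
    )
    heads = nn.ModuleList([head_a, head_b, head_c])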
Claim 8 is rejected under 35 U.S.C. 103 as being unpatentable over the combination of Chen, Mohamadabadi, and Yamashita, as applied to claim 1 above, further in view of Narayanan et al. ("Multi-headed Neural Ensemble Search", available arXiv 09 July 2021, cited in IDS dated 12/18/2024), hereinafter Narayanan.

Regarding claim 8, the combination of Chen, Mohamadabadi, and Yamashita teaches the limitations of parent claim 1, and Chen further teaches wherein the processor trains the machine learning model so as to increase diversity between predictors ([Chen page 2 Related works], as detailed above; see Diversity Guided Shrinking in Figure 2 [Chen page 3]). However, the combination does not expressly teach increasing diversity between predictors by increas[ing] a distance between an output of each predictor and an average output of the plurality of predictors.

In the same field of endeavor, Narayanan teaches a system that applies neural architecture search (NAS) techniques to ensemble learning ("While multi-headed models have been studied before, to the best of our knowledge, there is no prior work that studies the performance impact of searching head architectures. In this work, we employ neural architecture search (NAS) (Elsken et al., 2019) to search for the heads' architecture in the multi-headed models. Unlike Deep Ensembles, multi-headed models are trained end-to-end, which allows one-shot NAS methods (Bender et al., 2018; Liu et al., 2019) to optimize an ensemble objective" [Narayanan page 1 Introduction]) that increases diversity between predictors by increas[ing] a distance between each predictor and an average output of the plurality of predictors ("Given a classification task, let D_train = {(x_n, y_n) : n = 1, ..., N} be the training set, where x_n ∈ R^D is the D-dimensional input and y_n ∈ {1, ..., C} is the label, assumed to be one of the C classes. We denote a base learner, in our case neural networks, by f_θ, where θ represents the network parameters. A network takes the input x_i and outputs a vector of probabilistic class posteriors over the classes as f_θ(x) ∈ R^C. We construct an ensemble F of M members f_θ1, ..., f_θM by averaging the output of the networks" [Narayanan page 2 Multi-headed Ensembles]; "During the search phase of DARTS, there is no guarantee that the head architectures learnt via gradient descent would be diverse. To this end, we introduce an additional diversity term only in the loss function for the architecture weights, L_val. This ensures that the diversity in the one-shot model predictions originates from the architecture weights and not from the network weights. We use the Jensen-Shannon Divergence (JSD) to measure the diversity between the individual head predictions and maximize it in the validation objective. Unlike KL divergence, JSD is symmetric and bounded (Lin, 1991), which allows for direct maximization without the loss exploding. Given M heads, the diversity-encouraging loss term is: [loss equation] where λ_jsd is the weight of the JSD loss" [Narayanan page 3 Diversity Encouraging Loss for Differentiable Search]; The training objective maximizes the diversity-encouraging term L_jsd, which is based on KL divergence between the output of each network f_θi(x) (i.e., predictor) and an average output F(x) (see loss equation), thereby maximizing diversity by increasing divergence (i.e., distance) of individual network outputs from a shared average).

It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have incorporated increas[ing] a distance between an output of each of the predictors and an average output of the plurality of predictors as taught by Narayanan into the combination of Chen, Mohamadabadi, and Yamashita because both Chen and Narayanan are directed towards systems that apply neural architecture search (NAS) techniques to ensemble learning. Given that Chen already prioritizes diversity of predictors, as established above, and Narayanan explicitly teaches a variety of known techniques to improve diversity of predictions of ensemble learners ("Some methods induce diversity by training with specialized losses (Lee et al., 2015; Zhou et al., 2018), building ensemble members with different topologies (Zaidi et al., 2020) or different training hyperparameters (Wenzel et al., 2021) to improve ensemble performance. Multi-head networks trained under a unified objective can produce robust ensembles too, by utilizing diversity encouraging specialized losses (Lee et al., 2015), co-distillation (Lan et al., 2018) or both (Dvornik et al., 2019)" [Narayanan page 2 Related work]), one of ordinary skill in the art would recognize the value of incorporating the teachings of Narayanan, such as a diversity-encouraging term in the loss function, to further optimize diversity of network predictions.
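A minimal, hypothetical sketch of a diversity-encouraging term of the kind Narayanan describes (the epsilon smoothing and the way the term would be weighted into an overall objective are assumptions, and this is not Narayanan's code):

    import torch
    import torch.nn.functional as F

    def jsd_diversity(head_logits):
        # head_logits: list of (batch, classes) tensors, one per predictor head.
        # Returns the Jensen-Shannon divergence between the per-head predictive
        # distributions and their average; maximizing it pushes each head's
        # prediction away from the shared average, encouraging diverse predictors.
        probs = [F.softmax(logits_i, dim=1) for logits_i in head_logits]
        mixture = torch.stack(probs).mean(dim=0)          # average (ensemble) distribution
        eps = 1e-8
        kl_terms = [(p * ((p + eps).log() - (mixture + eps).log())).sum(dim=1).mean()
                    for p in probs]                       # KL(p_i || mixture) for each head
        return torch.stack(kl_terms).mean()

    # Hypothetical use: subtract the weighted term so that minimizing the loss maximizes diversity.
    # loss = task_loss - lambda_jsd * jsd_diversity(head_logits)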
Claim 9 is rejected under 35 U.S.C. 103 as being unpatentable over the combination of Chen, Mohamadabadi, and Yamashita, as applied to claim 1 above, further in view of Alhamdoosh et al. ("Fast decorrelated neural network ensembles with random weights", published April 2014), hereinafter Alhamdoosh.

Regarding claim 9, the combination of Chen, Mohamadabadi, and Yamashita teaches the limitations of parent claim 1. However, the combination does not expressly teach wherein the processor trains the machine learning model such that the outputs of the plurality of predictors are uncorrelated.

In the same field of endeavor, Alhamdoosh teaches a system of building diverse neural network ensembles ("Negative correlation learning (NCL) aims to produce ensembles with sound generalization capability through controlling the disagreement among base learners' outputs…To achieve a better solution, this paper employs the random vector functional link (RVFL) networks as base components, and incorporates with the NCL strategy for building neural network ensembles" [Alhamdoosh Abstract]) wherein the processor trains the machine learning model such that the outputs of the plurality of predictors are uncorrelated ("In order to reduce the impact of these factors on the generalization error of a learning system, a cluster of RVFL networks are combined together to produce efficient predictions [5,22]. In this paper, we propose a new ensemble learning approach that uses RVFL networks as ensemble components and it is fitted in negative correlation learning framework" [Alhamdoosh page 5 Decorrelated neural-net ensembles with random weights]; "Negative correlation learning (NCL) was proposed to reduce the covariance among ensemble individuals while the variance and bias terms are not increased. Unlike traditional ensemble learning approaches, NCL was introduced to train base models simultaneously in a cooperative manner that decorrelates individual errors E_i [14,30]. Mathematically, the learning error of the ith base model, given in Eq. (8), was modified to include a decorrelation penalty term p_i as follows [equation 12] where λ ∈ [0,1] is a regularizing factor. The penalty term p_i can be designed in different ways depending on whether the ensemble networks are trained sequentially or parallelly. For instance, it could decorrelate the current learning network with all previously learned networks…[equation 14] Notice that the penalty term in Eq. (14) reduces the correlation mutually among all ensemble individuals by using the actual ensemble output f(x_n) instead of the target function y_n" [Alhamdoosh pages 5-6 Review of negative correlation learning]).

It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have incorporated wherein the processor trains the machine learning model such that the outputs of the plurality of predictors are uncorrelated as taught by Alhamdoosh into the combination of Chen, Mohamadabadi, and Yamashita because both Chen and Alhamdoosh are directed towards building diverse neural network ensembles. Given that Chen already prioritizes diversity of predictors ([Chen page 2 Related works], as detailed above) and negative correlation learning is a known technique in the art ([Alhamdoosh pages 5-6 Review of negative correlation learning], as detailed above), a person of ordinary skill in the art would recognize the value of incorporating the teachings of Alhamdoosh into Chen, such as a decorrelation penalty term, to further optimize diversity of network predictions.
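For illustration, a hypothetical sketch of a negative-correlation-style penalty (the simplified penalty p_i = -(f_i - f_ens)^2 and the lambda value are assumptions; Alhamdoosh reviews several alternative penalty designs and applies the idea to RVFL networks rather than the architecture at issue here):

    import torch

    def ncl_loss(head_outputs, target, lam=0.5):
        # head_outputs: list of per-predictor output tensors of shape (batch, dims).
        # Each predictor is penalized on its own squared error plus a decorrelation
        # term that rewards disagreement with the ensemble output, so that the
        # individual errors become negatively correlated rather than redundant.
        ensemble = torch.stack(head_outputs).mean(dim=0)   # actual ensemble output
        total = 0.0
        for f_i in head_outputs:
            err = ((f_i - target) ** 2).mean()             # individual error
            penalty = -((f_i - ensemble) ** 2).mean()      # decorrelation penalty p_i
            total = total + err + lam * penalty
        return total / len(head_outputs)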
Claim 10 is rejected under 35 U.S.C. 103 as being unpatentable over the combination of Chen, Mohamadabadi, and Yamashita, as applied to claim 1 above, further in view of Bachman et al. ("Learning with Pseudo-Ensembles", published 08 December 2014), hereinafter Bachman.

Regarding claim 10, the combination of Chen, Mohamadabadi, and Yamashita teaches the limitations of parent claim 1. However, the combination does not expressly teach wherein the machine learning model includes a configuration in which noise is added to an output from the feature extractor to be input to each of the predictors.

In the same field of endeavor, Bachman teaches a system of building neural network ensemble frameworks ("In this paper, we formalize the notion of a pseudo-ensemble, which is a collection of child models spawned from a parent model by perturbing it with some noise process. Sec. 2 defines pseudo-ensembles, after which Sec. 3 discusses the relationships between pseudo-ensembles and standard ensemble methods, as well as existing notions of robustness. Once the pseudo-ensemble framework is defined, it can be leveraged to create new algorithms" [Bachman page 1 Introduction]) wherein the machine learning model includes a configuration in which noise is added to an output from a parent model to be input to each of the predictors ("A pseudo-ensemble is a collection of ξ-perturbed child models f_θ(x; ξ), where ξ comes from a noise process p_ξ…The goal of learning with pseudo-ensembles is to produce models robust to perturbation. To formalize this, the general pseudo-ensemble objective for supervised learning can be written as follows: [equation 1] where (x, y) ∼ p_xy is an (observation, label) pair drawn from the data distribution, ξ ∼ p_ξ is a noise realization, f_θ(x; ξ) represents the output of a child model spawned from the parent model f_θ via ξ-perturbation, y is the true label for x, and L(ŷ, y) is the loss for predicting ŷ instead of y" [Bachman page 2 What is a pseudo-ensemble?]).

It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have incorporated wherein the machine learning model includes a configuration in which noise is added to an output from a parent model (e.g., the feature extractor of Chen) to be input to each of the predictors as taught by Bachman into the combination because both of these systems are directed towards building neural network ensemble frameworks. Given that Chen prioritizes generalization ability of the NEAS system ("The architecture discovered by NEAS transfers well to downstream object detection task, suggesting the generalization ability of the searched models" [Chen page 2 Introduction]) and already uses dropout techniques ([Chen page 6 Implementation Details], as detailed above), which are a form of noise, and given the general understanding in the art that training models to be robust to noise improves generalization ability ("The definition and use of pseudo-ensembles are strongly motivated by the intuition that models trained to be robust to noise should generalize better than models that are (overly) sensitive to small perturbations" [Bachman page 2 Related work]), a person of ordinary skill in the art would recognize the value of incorporating the teachings of Bachman to optimize generalization ability of the NEAS architecture.
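As a sketch of such a configuration (hypothetical code; the Gaussian noise model and its 0.1 scale are assumptions rather than anything taught by Bachman or Chen), independent noise can be added to the shared feature extractor's output before it is fed to each predictor head, with the head outputs ensembled as before:

    import torch
    import torch.nn as nn

    backbone = nn.Sequential(nn.Conv2d(3, 16, 3, padding=1), nn.ReLU())   # shared feature extractor
    heads = nn.ModuleList([
        nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(16, 10))
        for _ in range(3)
    ])

    x = torch.randn(8, 3, 32, 32)
    shared_map = backbone(x)

    outputs = []
    for head in heads:
        noisy_map = shared_map + 0.1 * torch.randn_like(shared_map)   # fresh noise per predictor
        outputs.append(head(noisy_map))

    ensembled = torch.stack(outputs).mean(dim=0)   # predictions remain ensembled as before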
Response to Arguments

The remarks filed 10/30/2025 have been fully considered. Applicant's remarks [Remarks pages 9-17] traversing the non-eligible subject matter rejections under 35 U.S.C. 101 set forth in the office action mailed 07/30/2025, in view of claims 1-5 and 7-18 as amended, have been considered and are persuasive. In concordance with the discussion of previous 35 U.S.C. 101 issues held in the telephonic interview conducted 10/23/2025 (see Examiner Interview Summary Record mailed 10/28/2025) and applicant's remarks (see [Remarks page 16, para. 1]), the examiner further acknowledges that the amended claims now recite a specific, technical procedure of training a specialized ensemble learning framework (comprising a plurality of predictors each coupled to the same feature extractor), extracting a shared feature extractor from the trained ensemble framework, and including the extracted feature extractor in a separate model that is applied for downstream inference. When considered as a whole, the claims ultimately go beyond generic invocation of an ensemble learning framework to instead recite a specific, unconventional manner of implementation that adequately reflects improvement of the operation of a machine learning model, as detailed in the specification ([¶ 0054] – reducing memory and computational costs during training via the shared feature extractor, and also during inference via extraction and deployment of the shared feature extractor), and thereby are no longer directed towards an abstract concept of merely observing data to make predictions. Consequently, the previous rejections under 35 U.S.C. 101 are withdrawn.

Applicant's remarks [Remarks pages 17-19] traversing the prior art rejections under 35 U.S.C. 102 and 35 U.S.C. 103 set forth in the office action mailed 07/30/2025, in view of claims 1-5 and 7-18 as amended, have been considered, but are moot because the new grounds of rejection set forth above do not rely on the reference(s) applied in the prior rejection of record for the subject matter being specifically challenged in applicant's arguments.

Conclusion

The prior art made of record and not relied upon is considered pertinent to applicant's disclosure. Zhao et al. ("Object Detection with Deep Learning", available arXiv 16 Apr 2019) discloses a review of deep learning based object detection frameworks (namely CNN-based architectures).

Applicant's amendment necessitated the new ground(s) of rejection presented in this Office action. Accordingly, THIS ACTION IS MADE FINAL. See MPEP § 706.07(a). Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a). A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action. In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any nonprovisional extension fee (37 CFR 1.17(a)) pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action. In no event, however, will the statutory period for reply expire later than SIX MONTHS from the mailing date of this final action.

Any inquiry concerning this communication or earlier communications from the examiner should be directed to VIJAY M BALAKRISHNAN whose telephone number is (571) 272-0455. The examiner can normally be reached 10am-5pm EST Mon-Thurs. Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice. If attempts to reach the examiner by telephone are unsuccessful, the examiner's supervisor, JENNIFER WELCH, can be reached on (571) 272-7212. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000. /V.M.B./ Examiner, Art Unit 2143 /JENNIFER N WELCH/Supervisory Patent Examiner, Art Unit 2143

Prosecution Timeline

Sep 12, 2022
Application Filed
Jul 26, 2025
Non-Final Rejection — §102, §103
Oct 23, 2025
Applicant Interview (Telephonic)
Oct 23, 2025
Examiner Interview Summary
Oct 30, 2025
Response Filed
Feb 07, 2026
Final Rejection — §102, §103 (current)

Precedent Cases

Applications granted by this same examiner with similar technology

Patent 12585912
GATED LINEAR CONTEXTUAL BANDITS
2y 5m to grant Granted Mar 24, 2026
Patent 12468967
METHOD AND SYSTEM FOR GENERATING A SOCIO-TECHNICAL DECISION IN RESPONSE TO AN EVENT
2y 5m to grant Granted Nov 11, 2025
Study what changed to get past this examiner. Based on 2 most recent grants.


Prosecution Projections

3-4
Expected OA Rounds
43%
Grant Probability
99%
With Interview (+85.7%)
3y 12m
Median Time to Grant
Moderate
PTA Risk
Based on 14 resolved cases by this examiner. Grant probability derived from career allow rate.
