DETAILED ACTION
This action is responsive to the application filed on 01/27/2023. Claims 1-20 are pending and have been examined.
This action is Non-final.
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Response to Arguments:
Argument 1: For 35 U.S.C. 101, the applicant argues that the pending claims are patent eligible because they are directed to an improvement in the functioning of a machine learning model itself, not merely to an abstract idea. The applicant relies on the Desjardins memo and follow up memo to argue that claims directed to improvements in machine learning model operation are eligible, and contends that the present claims are similar because the specification describes technical processes that reduce latent bias and improve inference quality. In particular, the applicant points to paragraphs 0013 through 0017 of the original specification, as well as Figures 3A through 4C, as describing modified split training, a divisional process, selection of divisional points, and obtaining and training a multipath inference model to reduce latent bias in inference models. The applicant then argues that original claim 1, and even more clearly amended claim 1, already reflect those technological improvements in the claim language itself, especially the claimed steps involving mutual information, divisional points, neural architecture search, multipath inference generation paths, and use of a revised inference generation path to provide inferences used for computer implemented services. Based on that, the applicant asserts that the claims, like the claims in Desjardins, Enfish, and McRO, are directed to improvements in technology or computer functionality and therefore integrate any alleged judicial exception into a practical application under Step 2A Prong Two.
Examiner Response to Argument 1: The examiner has considered the argument set forth above; however, the argument is not persuasive. The applicant argues that the claims are directed to an improvement in the functioning of a machine learning model itself and therefore integrate any alleged judicial exception into a practical application, citing Desjardins, Enfish, and McRO. However, the limitations recited in the claims do not reflect a specific technological improvement to computer functionality or to the operation of machine learning systems themselves, but instead recite generalized data analysis and model management steps that fall within abstract data processing. For example, the claim limitations including “obtaining a magnitude of mutual information between labels and a bias feature,” “selecting…a provisional divisional point and a provisional number of hidden layers,” and “performing a neural architecture search using the provisional divisional point…to obtain a final divisional point and a final number of hidden layers” merely involve evaluating information, selecting parameters, and performing architecture search based on evaluation criteria. As explained in the rejection above, these operations correspond to mental processes or mathematical concepts because they involve evaluating relationships between variables, selecting values based on those relationships, and applying those selections to guide model configuration. Additionally, other recited steps such as “obtaining…a body portion and a first head portion,” “performing a training procedure using the multipath inference model,” and “using the revised first inference generation path to provide second inferences…used to provide computer implemented services” merely amount to instructions to apply the abstract idea on a generic computer implementation of a machine learning model. These steps represent generic data gathering, model training, and output generation activities that are well-understood, routine, and conventional in the field of machine learning. Unlike the claims in Enfish or McRO, the present claims do not recite a specific improvement to computer architecture, data structures, or algorithmic processing that changes how the computer itself operates; rather, they recite the use of conventional machine learning techniques to analyze bias-related information and adjust model parameters. Accordingly, the claims do not integrate the abstract idea into a practical application under Step 2A Prong Two and do not provide significantly more than the judicial exception under Step 2B. Therefore, the rejection of claims 1-20 under 35 U.S.C. 101 is maintained.
Argument 2: The applicant argues that the cited prior art does not teach or suggest obtaining a bias feature from inferences generated by a trained inference model, as recited in claim 1 (and analogous claims). The applicant asserts that the references relied upon by the examiner operate on training data or explicit attributes, whereas the amended claimed invention now requires the bias feature to be derived from the model’s own inferences, which the applicant contends is not disclosed or suggested by the prior art.
Examiner Response to Argument 2: The examiner has considered the argument set forth above but maintains that the rejection of claim 1 and analogous claims is proper and that the applicant’s argument is not persuasive. Adler describes a framework for auditing trained predictive models by analyzing their outputs to determine how particular attributes influence the model’s predictions. In this approach, the model is treated as a black box that produces prediction results, and the influence of an attribute can be determined even when that attribute is not explicitly included in the model inputs. Adler further evaluates this influence by executing the trained classifier and comparing prediction outcomes under different conditions to measure how the presence or absence of a feature affects the model’s behavior. This process derives bias-related information from the model’s prediction outputs rather than from the original training data itself. Accordingly, Adler teaches determining bias-related feature influence based on model inferences, and therefore the applicant’s argument that the prior art does not obtain bias information from model inferences is not persuasive.
Argument 3: The applicant argues that the cited prior art does not teach or suggest the newly added amended limitation of the independent claims requiring that the bias feature is not part of first training data on which the trained inference model is trained and is obtained from first inferences generated by the trained inference model. In particular, the applicant argues at page 12 of the Remarks that Zheng, Kang, Chu, and Ganin “fail to disclose or suggest at least these limitations of the amended independent claims,” and further argues that the cited references are “silent” with regard to the above-emphasized amended portions of the independent claims and instead teach bias features that are included in the initial training data.
Examiner Response to Argument 3: The examiner has considered the argument set forth above; however, the argument is not persuasive. The examiner asserts that the rejection of claim 1 is proper because the cited references collectively teach the amended limitation when considered together. In particular, Adler teaches that a bias-related attribute need not be expressly present in the model inputs, stating that “we can find attribute influences even in cases where, upon further direct examination of the model, the attribute is not referred to by the model at all,” which the examiner interprets to be the same as the claimed bias feature not being part of first training data on which the trained inference model is trained, because both are directed to a bias or sensitive attribute that is absent from the model’s explicit inputs yet still influences the model through indirect means. Adler further teaches that trained models are used as black boxes that “output a prediction or score,” and teaches measuring indirect influence by running the already-trained classifier and comparing results without retraining, which the examiner interprets to be the same as the claimed bias feature being obtained from first inferences generated by the trained inference model, because both are directed to deriving bias-related information from outputs generated by an already-trained model rather than from the original training data itself. Accordingly, the examiner maintains that the cited combination teaches the amended limitation, and the applicant’s argument is therefore not persuasive.
Claim Rejections - 35 USC § 101
35 U.S.C. 101 reads as follows:
Whoever invents or discovers any new and useful process, machine, manufacture, or composition
of matter, or any new and useful improvement thereof, may obtain a patent therefor, subject to the
conditions and requirements of this title.
Claims 1-20 are rejected under 35 U.S.C. 101 because the claimed invention is directed to an abstract idea without significantly more.
Regarding claim 1,
Step 1: The claim is directed to a method, which falls under the category of a process. The claim satisfies step 1.
Step 2A Prong 1:
“performing a neural architecture search using the provisional divisional point, the provisional number of hidden layers, a predictive capability goal, and a neural architecture size goal to obtain a final divisional point and a final number of hidden layers” -- The limitation is directed to performing a neural architecture search using information from the provisional divisional point, the provisional number of hidden layers, and other criteria to obtain a final divisional point and a final number of hidden layers. The limitation is directed to a process that can be performed in the human mind using evaluation, observation, and judgment, and thus the limitation is directed to a mental process.
Step 2A Prong 2 and Step 2B:
“obtaining a magnitude of mutual information between labels and a bias feature;… obtaining, based on the final divisional point and the final number of hidden layers, a body portion and a first head portion; obtaining, using the inference model and based on the body portion and the first head portion,…” -- The limitation recites obtaining mutual information between the labels and the bias feature and obtaining model portions based on gathered data. The limitation is directed to an insignificant, extra-solution activity, which cannot be integrated into a practical application (see MPEP 2106.05(g)). Furthermore, under Step 2B, the act of obtaining information based on gathered data is a well-understood, routine, and conventional activity (WURC) that cannot provide significantly more than the judicial exception (see MPEP 2106.05(d)(II)).
“the bias feature not being part of first training data on which the trained inference model is trained and is obtained from first inferences of the inferences generated by the trained inference model;” -- The limitation recites that the bias feature is not part of the first training data on which the trained model is trained and is obtained from the first inferences generated by the model. The limitation is generically recited and amounts to no more than merely limiting the abstract idea to a field of use/environment, and thus it cannot be integrated into a practical application, nor provide significantly more than the judicial exception (see MPEP 2106.05(h)).
“selecting, based on the magnitude and the trained inference model, a provisional divisional point and a provisional number of hidden layers;” -- The limitation is directed to selecting a divisional point and a number of hidden layers based on a magnitude value and an inference model. The limitation merely selects parameters based on gathered data using a computerized model, which is considered an insignificant, extra-solution activity that cannot be integrated into a practical application (see MPEP 2106.05(g)). Furthermore, under Step 2B, the limitation is also directed to a well-understood, routine, and conventional activity (WURC) that cannot provide significantly more than the judicial exception (see MPEP 2106.05(d)(II)).
“performing a training procedure using the multipath inference model, the training procedure providing a revised second inference generation path and a revised first inference generation path;” -- The limitation recites performing a training procedure using an inference model. The limitation is directed to mere instructions to apply the exception on a computer, and thus it cannot be integrated into a practical application, nor does it provide significantly more than the judicial exception (see MPEP 2106.05(f)).
“A method for managing a trained inference model that may exhibit latent bias in inferences generated by the trained inference model, the method comprising…using the revised first inference generation path to provide second inferences of the inferences, the second inferences being used to provide computer implemented services…a multipath inference model comprising a first inference generation path trained using, in part, the labels and a second inference generation path trained using, in part, the bias feature;” -- The limitation recites a method for managing an inference model that comprises using generation paths to provide (transmit) services. The limitation is directed to an insignificant, extra-solution activity that cannot be integrated into a practical application (see MPEP 2106.05(g)). Furthermore, the act of transmitting data over a network is a well-understood, routine, and conventional activity that cannot provide significantly more than the judicial exception (see MPEP 2106.05(d)(II)).
Thus, claim 1 is not patent eligible. Claims 10 and 16 are analogous to claim 1 (aside from being CRM and system claims), and thus face the same rejection as set forth above.
Regarding claim 2,
Step 1: The claim is directed to a method, which falls under the category of a process. The claim satisfies step 1.
There are no elements to be evaluated under Step 2A Prong 1.
Step 2A Prong 2 and Step 2B:
“The method of claim 1, wherein the first training data comprises features and the labels, and the second inference generation path being trained using second training data comprising the features and the bias feature.” -- The limitation recites that the first training data comprises features and labels and that the second inference generation path is trained using second training data comprising the features and the bias feature. The limitation amounts to no more than further limiting the abstract idea to a field of use/environment, and thus cannot be integrated into a practical application, nor provide significantly more than the judicial exception (see MPEP 2106.05(h)).
Thus, claim 2 is not patent eligible. Claims 11 and 17 are analogous to claim 2 (aside from being CRM and system claims), and thus face the same rejection as set forth above.
Regarding claim 3,
Step 1: The claim is directed to a method, which falls under the category of a process. The claim satisfies step 1.
There are no elements to be evaluated under Step 2A Prong 1.
Step 2A Prong 2 and Step 2B:
“The method of claim 1, wherein the provisional divisional point divides hidden layers of the trained inference model into two groups, a first group of the two groups comprising a majority of the hidden layers when the magnitude exceeds a first threshold, a second group of the two groups comprising the majority of the hidden layers when the magnitude is below a second threshold, and the first group and the second group comprising a similar number of the hidden layers when the magnitude is between the first threshold and the second threshold.” -- The limitation recites that the provisional divisional point introduced in claim 1 divides the hidden layers into two groups, with the group that comprises the majority of the hidden layers depending on how the magnitude compares to the recited thresholds. The limitation merely limits the divisional point to a field of use/environment, and does not integrate the abstract idea into a practical application, nor does it provide significantly more than the judicial exception (see MPEP 2106.05(h)).
Thus, claim 3 is not patent eligible. Claims 12 and 18 are analogous to claim 3 (aside from being CRM and system claims), and thus face the same rejection as set forth above.
Regarding claim 4,
Step 1: The claim is directed to a method, which falls under the category of a process. The claim satisfies step 1.
There are no elements to be evaluated under Step 2A Prong 1.
Step 2A Prong 2 and Step 2B:
“The method of claim 3, wherein the provisional divisional point is a starting point for the neural architecture search.” -- The limitation recites that the provisional divisional point disclosed in the earlier claims is further used as the starting point for the neural architecture search. The limitation amounts to no more than merely further limiting the abstract idea to a field of use/environment, and does not integrate it into a practical application, nor does it provide significantly more than the judicial exception (see MPEP 2106.05(h)).
Thus, claim 4 is not patent eligible. Claims 13 and 19 are analogous to claim 4 (aside from being CRM and system claims), and thus face the same rejection as set forth above.
Regarding claim 5,
Step 1: The claim is directed to a method, which falls under the category of a process. The claim satisfies step 1.
There are no elements to be evaluated under Step 2A Prong 1.
Step 2A Prong 2 and Step 2B:
“The method of claim 1, wherein the provisional divisional point divides hidden layers of the inference model into two groups, hidden layer membership in a first group of the two groups scales proportionally to the magnitude, and hidden layer membership in the second group of the two groups scales inversely proportionally to the magnitude.” -- The limitation recites merely splitting the hidden layers of the model into two groups at the provisional divisional point, with group membership scaling proportionally or inversely proportionally to a magnitude. The limitation amounts to no more than mere further limiting to a field of use/environment; thus it cannot be integrated into a practical application, nor does it provide significantly more than the judicial exception (see MPEP 2106.05(h)).
Thus, claim 5 is not patent eligible. Claims 14 and 20 are analogous to claim 5 (aside from being CRM and system claims), and thus face the same rejection as set forth above.
Regarding claim 6,
Step 1: The claim is directed to a method, which falls under the category of a process. The claim satisfies step 1.
Step 2A Prong 1:
“The method of claim 5, wherein the magnitude is normalized to a range where at a first end of the range all of the hidden layers are members of the first group and at a second end of the range all of the hidden layers are members of the second group.” -- The limitation is directed to normalizing the magnitude to a range whose ends correspond to all of the hidden layers being members of one group or the other. The limitation is directed to the use of a mathematical calculation/operation (see [0128]-[0129]), and thus the limitation is directed to a mathematical concept.
There are no elements to be evaluated under Step 2A Prong 2 and Step 2B.
Thus, claim 6 is not patent eligible. Claim 15 is analogous to claim 6 (aside from being a CRM claim), and thus faces the same rejection as set forth above.
Regarding claim 7,
Step 1: The claim is directed to a method, which falls under the category of a process. The claim satisfies step 1.
There are no elements to be evaluated under Step 2A Prong 1.
Step 2A Prong 2 and Step 2B:
“The method of claim 1, wherein the neural architecture size goal defines a range for the hidden layers over which the neural architecture search is conducted.” -- The limitation recites that the neural architecture size goal defines a range for the hidden layers over which the architecture search is conducted. The limitation merely further limits the abstract idea to a field of use/environment, and thus it does not integrate the abstract idea into a practical application, nor does it provide significantly more than the judicial exception (see MPEP 2106.05(h)).
Thus, claim 7 is not patent eligible.
Regarding claim 8,
Step 1: The claim is directed to a method, which falls under the category of a process. The claim satisfies step 1.
There are no elements to be evaluated under Step 2A Prong 1.
Step 2A Prong 2 and Step 2B:
“The method of claim 5, wherein the predictive capability goal indicates a minimum acceptable level of accuracy for the second inferences.” -- The limitation recites that the predictive capability goal recited in claim 1 indicates a minimum (threshold) level of accuracy for the second inferences. The limitation amounts to no more than mere further limiting to a field of use/environment, and it does not integrate the abstract idea into a practical application, nor does it provide significantly more than the judicial exception (see MPEP 2106.05(h)).
Thus, claim 8 is not patent eligible.
Regarding claim 9,
Step 1: The claim is directed to a method, which falls under the category of a process. The claim satisfies step 1.
There are no elements to be evaluated under Step 2A Prong 1.
Step 2A Prong 2 and Step 2B:
“The method of claim 1, wherein the latent bias is caused by the bias feature and the bias feature being indicated in the first inferences generated by the inference model” -- The limitation merely applies the abstract idea in the context of a trained inference model. The limitation does not recite a technological improvement or other meaningful limitation, and thus amounts to mere instructions to apply the exception, which cannot be integrated into a practical application and does not provide significantly more than the judicial exception. (see MPEP 2106.05(f)).
Thus, claim 9 is not patent eligible.
Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this
Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not
identically disclosed as set forth in section 102, if the differences between the claimed invention and the
prior art are such that the claimed invention as a whole would have been obvious before the effective filing
date of the claimed invention to a person having ordinary skill in the art to which the claimed invention
pertains. Patentability shall not be negated by the manner in which the invention was made.
The factual inquiries for establishing a background for determining obviousness under 35 U.S.C. 103 are
summarized as follows:
1. Determining the scope and contents of the prior art.
2. Ascertaining the differences between the prior art and the claims at issue.
3. Resolving the level of ordinary skill in the pertinent art.
4. Considering objective evidence present in the application indicating obviousness or nonobviousness.
Claims 1-20 are rejected under 35 U.S.C. §103 as being unpatentable over NPL reference “Auditing black-box models for indirect influence” by Adler et al. (referred to herein as Adler) in view of NPL reference “Neural architecture search with representation mutual information” by Zheng et al. (referred to herein as Zheng) in view of NPL reference “Neurosurgeon: Collaborative intelligence between the cloud and mobile edge” by Kang et al. (referred to herein as Kang) in view of NPL reference “FairNAS: Rethinking evaluation fairness of weight sharing neural architecture search” by Chu et al. (referred to herein as Chu) further in view of NPL reference “Domain-adversarial training of neural networks” by Ganin et al. (referred to herein as Ganin).
Regarding claim 1, Adler teaches:
A method for managing a trained inference model to prevent latent bias in inferences generated by the trained inference model, the method comprising: ([Adler, page 1], “Data-trained predictive models see widespread use, but for the most part they are used as black boxes which output a prediction or score ...we present a technique for auditing black-box models” and [Adler, page 1], “Our work focuses on the problem of indirect influence: how some features might indirectly influence outcomes via other, related features.”, wherein the examiner interprets auditing black-box predictive models for indirect influence to be the same as managing a trained inference model to prevent latent bias in inferences generated by the trained inference model, because they are both directed to identifying and addressing bias-related influence in the outputs of an already-trained predictive model.)
the bias feature not being part of first training data on which the trained inference model is trained ([Adler, page 1], “Our work focuses on the problem of indirect influence: how some features might indirectly influence outcomes via other, related features. As a result, we can find attribute influences even in cases where, upon further direct examination of the model, the attribute is not referred to by the model at all.”, wherein the examiner interprets finding attribute influences even when the attribute is not referred to by the model at all to be the same as the bias feature not being part of first training data on which the trained inference model is trained, because they are both directed to a scenario where the bias/sensitive attribute is absent from the model's explicit inputs yet still exerts influence through indirect means.)
and is obtained from first inferences of the inferences generated by the trained inference model; ([Adler, page 1], “Data-trained predictive models see widespread use, but for the most part they are used as black boxes which output a prediction or score ...we present a technique for auditing black-box models, … Our approach does not require the black-box model to be retrained. This is important if (for example) the model is only accessible via an API, and contrasts our work with other methods that investigate feature influence like feature selection.” and [Adler, page 3], “The indirect influence II(i) of a feature i on a classifier f applied to data (X, Y) is the difference in accuracy when f is run on X versus when it is run on X\Xi: II(i) = acc(X, Y, f) - acc(X\Xi, Y, f) (note that f is not retrained on X\Xi)”, wherein the examiner interprets computing the indirect influence by running the already-trained classifier f on data and measuring the accuracy difference (without retraining) to be the same as the bias feature being obtained from first inferences of the inferences generated by the trained inference model, because they are both directed to deriving bias-related information by executing (i.e., generating inferences from) the trained model on data, rather than from the training data itself.)
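For illustration only, the following is a minimal Python sketch (not Adler's exact obscuring procedure and not the applicant's implementation) of how the quoted indirect-influence quantity can be computed purely from the outputs of an already-trained, non-retrained model; the mean-substitution used to stand in for X\Xi is an assumption made for brevity.

    # Illustrative sketch only: approximates Adler's II(i) = acc(X, Y, f) - acc(X\Xi, Y, f)
    # by obscuring feature i (here crudely, via mean substitution) and re-running the
    # already-trained black-box model f without retraining it.
    import numpy as np
    from sklearn.metrics import accuracy_score

    def indirect_influence(f, X, y, i):
        baseline = accuracy_score(y, f.predict(X))            # inferences on the original data
        X_obscured = np.array(X, dtype=float, copy=True)
        X_obscured[:, i] = X_obscured[:, i].mean()            # stand-in for removing feature i
        obscured = accuracy_score(y, f.predict(X_obscured))   # inferences with feature i obscured
        return baseline - obscured                            # larger drop -> more indirect influence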
Adler does not teach obtaining a magnitude of mutual information between labels and a bias feature; selecting, based on the magnitude and the trained inference model, a provisional divisional point and a provisional number of hidden layers; performing a neural architecture search using the provisional divisional point, the provisional number of hidden layers, a predictive capability goal, and a neural architecture size goal to obtain a final divisional point and a final number of hidden layers; obtaining, based on the final divisional point and the final number of hidden layers, a body portion and a first head portion; obtaining, using the trained inference model and based on the body portion and the first head portion, a multipath inference model comprising a first inference generation path trained using, in part, the labels and a second inference generation path trained using, in part, the bias feature; performing a training procedure using the multipath inference model, the training procedure providing a revised second inference generation path and a revised first inference generation path; and using the revised first inference generation path to provide second inferences of the inferences, the second inferences being used to provide computer implemented services.
Zheng teaches obtaining a magnitude of mutual information between labels and a bias feature, [Zheng, page 11914] “Rather than estimating architectures by using laborious training methods, we propose Representation Mutual Information (RMI) to achieve effective and efficient performance estimation. In particular, given an arbitrary network A+ that has a decent accuracy, we use X1+, …, XL+ to represent the random variables of feature maps in each layer. For any architecture that is sampled from the search space, we formally define the RMI score as
[Equation defining the RMI score reproduced as an image in the original]
In general, an architecture that has a high RMI score tends to be a good architecture, as the proposed RMI score is robust, effective and efficient”, wherein the examiner interprets the RMI score as a computed mutual-information value between network representations to be the same as a magnitude of mutual information because they are both numeric MI quantities obtained to guide model or architecture decisions.)
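As a hedged illustration only (a generic estimator, not Zheng's RMI score and not the procedure of the instant application), a magnitude of mutual information between labels and a discrete bias feature could be obtained as follows; the example values are hypothetical.

    # Generic mutual-information magnitude between labels and a discrete bias feature.
    from sklearn.metrics import mutual_info_score

    labels       = [1, 0, 1, 1, 0, 0, 1, 0]   # task labels
    bias_feature = [1, 0, 1, 0, 0, 0, 1, 1]   # bias attribute derived from model inferences

    magnitude = mutual_info_score(labels, bias_feature)  # in nats; higher means stronger dependence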
Adler and Zheng do not teach selecting, based on the magnitude and the trained inference model, a provisional divisional point and a provisional number of hidden layers; performing a neural architecture search using the provisional divisional point, the provisional number of hidden layers, a predictive capability goal, and a neural architecture size goal to obtain a final divisional point and a final number of hidden layers; obtaining, based on the final divisional point and the final number of hidden layers, a body portion and a first head portion; obtaining, using the trained inference model and based on the body portion and the first head portion, a multipath inference model comprising a first inference generation path trained using, in part, the labels and a second inference generation path trained using, in part, the bias feature; performing a training procedure using the multipath inference model, the training procedure providing a revised second inference generation path and a revised first inference generation path; and using the revised first inference generation path to provide second inferences of the inferences, the second inferences being used to provide computer implemented services.
Kang teaches selecting, based on the magnitude and the trained inference model, a provisional divisional point and a provisional number of hidden layers; [Kang, page 623], “Partition Point Selection - “Neurosurgeon” then selects the best partition point. The candidate points are after each layer. Lines 16 and 18 evaluate the performance when partitioning at each candidate point and select the point for either best end-to-end latency or best mobile energy consumption”, wherein the examiner interprets selecting the best partition point after a specific layer (with evaluation at each candidate layer) to be the same as selecting a provisional divisional point and a provisional number of hidden layers because they are both directed to choosing a split layer in the inference model that fixes how many hidden layers fall into the portions on each side of the split.)
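The following Python sketch (illustrative only; the cost function is a hypothetical stand-in for Kang's latency/energy models) shows the Neurosurgeon-style selection described above, where every layer boundary is a candidate split and the best-scoring boundary becomes the provisional divisional point.

    # Neurosurgeon-style partition point selection (illustrative only): score every
    # candidate split located after a hidden layer and keep the best one.
    def select_provisional_split(num_hidden_layers, cost):
        candidates = range(1, num_hidden_layers)          # a split is allowed after each hidden layer
        best_point = min(candidates, key=cost)            # lowest cost (e.g., latency or energy) wins
        return best_point, num_hidden_layers - best_point # layers before / after the provisional split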
Adler, Zheng and Kang do not teach performing a neural architecture search using the provisional divisional point, the provisional number of hidden layers, a predictive capability goal, and a neural architecture size goal to obtain a final divisional point and a final number of hidden layers; obtaining, based on the final divisional point and the final number of hidden layers, a body portion and a first head portion; obtaining, using the trained inference model and based on the body portion and the first head portion, a multipath inference model comprising a first inference generation path trained using, in part, the labels and a second inference generation path trained using, in part, the bias feature; performing a training procedure using the multipath inference model, the training procedure providing a revised second inference generation path and a revised first inference generation path; and using the revised first inference generation path to provide second inferences of the inferences, the second inferences being used to provide computer implemented services.
Chu teaches performing a neural architecture search using the provisional divisional point, the provisional number of hidden layers, a predictive capability goal, and a neural architecture size goal to obtain a final divisional point and a final number of hidden layers; ([Chu, page 8] “For the second-stage, we adopt multi-objective optimization where three objectives are considered: accuracies, multiply-adds, and the number of parameters.” AND [Chu, page 6], “Figure 8 exhibits the resulting FairNAS-A, B and C models, which are sampled from our Pareto front to meet different hardware constraints… shown in Table 1. Notably, FairNAS-A obtains a highly competitive result 75.3% top-accuracy for ImageNet classification, which surpasses MnasNet-92 (+0.5%) and Single-Path-NAS(+0.3%). FairNAS-B matches Proxyless-GPU with much fewer parameters and multiply-adds. Besides, it surpasses Proxyless-R Mobile(+0.5%) with a comparable amount of multiply-adds.” wherein the examiner interprets the use of multi-objective optimization over accuracies, multiply-adds, and number of parameters together with selecting models from a Pareto front to meet hardware constraints to be the same as using a predictive capability goal and a neural architecture size goal to obtain a final model configuration (including a final divisional point and a final number of hidden layers) because they are both directed to architecture search that trades off accuracy with resource/size limits and results in a finalized architecture choice that fixes the network partition and layer counts.)
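For illustration only, a minimal sketch (not Chu's FairNAS sampling strategy) of a search constrained by a predictive capability goal and a neural architecture size goal that returns a final divisional point and final number of hidden layers; the evaluate function and candidate list are hypothetical stand-ins for candidate training and evaluation.

    # Illustrative multi-objective filtering: keep candidates that meet the accuracy
    # floor (predictive capability goal) and the parameter ceiling (size goal), then
    # pick the most accurate survivor as the final configuration.
    def architecture_search(candidates, evaluate, min_accuracy, max_params):
        feasible = []
        for split_point, num_hidden in candidates:
            accuracy, num_params = evaluate(split_point, num_hidden)
            if accuracy >= min_accuracy and num_params <= max_params:
                feasible.append((accuracy, split_point, num_hidden))
        _, final_split, final_hidden = max(feasible)      # best accuracy among feasible candidates
        return final_split, final_hidden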
Adler, Zheng, Kang, and Chu do not teach obtaining, based on the final divisional point and the final number of hidden layers, a body portion and a first head portion; obtaining, using the trained inference model and based on the body portion and the first head portion, a multipath inference model comprising a first inference generation path trained using, in part, the labels and a second inference generation path trained using, in part, the bias feature; performing a training procedure using the multipath inference model, the training procedure providing a revised second inference generation path and a revised first inference generation path; and using the revised first inference generation path to provide second inferences of the inferences, the second inferences being used to provide computer implemented services.
Ganin teaches:
obtaining, based on the final divisional point and the final number of hidden layers, a body portion and a first head portion; obtaining, using the trained inference model and based on the body portion and the first head portion, a multipath inference model comprising a first inference generation path trained using, in part, the labels and a second inference generation path trained using, in part, the bias feature; ([Ganin, page 11] “Figure 1: The proposed architecture includes a deep feature extractor (green) and a deep label predictor (blue), which together form a standard feed-forward architecture. Unsupervised domain adaptation is achieved by adding a domain classifier (red) connected to the feature extractor via a gradient reversal layer…[training] minimizes the label prediction loss…and the domain classification loss…
[Figure 1 of Ganin, showing the feature extractor, label predictor, domain classifier, and gradient reversal layer, reproduced as an image in the original]
“, wherein the examiner interprets the deep feature extractor to be the same as a body portion and the deep label predictor to be the same as a first head portion because they are both directed to a shared trunk feeding a label-prediction head. The examiner further interprets adding a domain classifier trained alongside the label predictor to be the same as obtaining a second inference generation path trained using, in part, the bias feature and a first inference generation path trained using, in part, the labels because they are both directed to two heads over the same body; one trained on task labels and another trained on a sensitive/domain signal.)
performing a training procedure using the multipath inference model, the training procedure providing a revised second inference generation path and a revised first inference generation path; and ([Ganin, page 11] “a saddle point…can be found as a stationary point of the following gradient updates: [followed by the update rules for the feature extractor, label predictor, and domain classifier] …We use stochastic estimates of these gradients, by sampling examples from the data set,”, wherein the examiner interprets the simultaneous update of the label predictor and the domain classifier during training to be the same as providing a revised first inference generation path and a revised second inference generation path because they are both directed to iteratively updating both heads (and shared body) in the training procedure.)
using the revised first inference generation path to provide second inferences of the inferences, the second inferences being used to provide computer implemented services. ([Ganin, page 13] “After the learning, the label predictor Gy(Gf (x; θf ); θy) can be used to predict labels for samples from the target domain (as well as from the source domain).”, wherein the examiner interprets using the trained label predictor after learning to predict labels to be the same as using the revised first inference generation path to provide second inferences of the inferences, the second inferences being used to provide computer implemented services, because they are both directed to using model-generated predictions in deployed computer-implemented software functionality.)
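As a hedged sketch only (PyTorch, omitting Ganin's gradient reversal layer and loss weighting), the body/two-head arrangement described above could be organized as a shared trunk feeding a first (label) inference generation path and a second (bias/domain) inference generation path; the layer sizes and class names are assumptions for illustration.

    # Illustrative multipath model: one shared body, two heads (paths).
    import torch.nn as nn

    class MultipathModel(nn.Module):
        def __init__(self, in_dim, hidden_dim, num_labels, num_bias_values):
            super().__init__()
            self.body = nn.Sequential(nn.Linear(in_dim, hidden_dim), nn.ReLU())  # shared body portion
            self.label_head = nn.Linear(hidden_dim, num_labels)       # first inference generation path
            self.bias_head = nn.Linear(hidden_dim, num_bias_values)   # second inference generation path

        def forward(self, x):
            features = self.body(x)
            return self.label_head(features), self.bias_head(features)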
Adler, Zheng, Kang, Chu, Ganin, and the instant application are analogous art because they are all directed to managing inference models by identifying bias-related information.
It would have been obvious to a person of ordinary skill in the art before the effective filing date of the invention to modify the method of claim 1 disclosed by Adler to include the mutual information scoring technique disclosed by Zheng. One would be motivated to do so to efficiently quantify information relationships useful for guiding model or architecture decisions in a trained inference model, as suggested by Zheng (Zheng, [page 11914] “Representation Mutual Information (RMI) to achieve effective and efficient performance estimation.”).
It would have also been obvious to a person of ordinary skill in the art before the effective filing date of the invention to modify the method of claim 1 disclosed by Adler to include the data partitioning technique disclosed by Kang. One would be motivated to do so to efficiently choose a split location in the trained inference model that improves system performance when dividing model processing across portions of the network, as suggested by Kang (Kang, [page 623] “Neurosurgeon then selects the best partition point.”).
It would have also been obvious to a person of ordinary skill in the art before the effective filing date of the invention to modify the method of claim 1 disclosed by Adler to include the multi-objective optimization technique disclosed by Chu. One would be motivated to do so to effectively balance predictive capability against model size and computational cost when determining a finalized network configuration, as suggested by Chu ([Chu, page 8] “multi-objective optimization where three objectives are considered: accuracies, multiply-adds, and the number of parameters.”).
It would have also been obvious to a person of ordinary skill in the art before the effective filing date of the invention to modify the method of claim 1 disclosed by Adler to include the feature extractor disclosed by Ganin. One would be motivated to do so to effectively train multiple paths over a shared representation so as to preserve task-prediction performance while accounting for a secondary bias/domain-related signal, as suggested by Ganin ([Ganin, page 11] “The proposed architecture includes a deep feature extractor ... and a deep label predictor ... [and] a domain classifier.”).
Claims 10 and 16 are analogous to claim 1 (aside from being CRM and system claims), and thus face the same rejection as set forth above.
Regarding claim 2, Adler, Zheng, Kang, Chu, and Ganin teach The method of claim 1, (see rejection of claim 1).
Ganin further teaches wherein the inference model is obtained using first training data comprising features and the labels, and the second inference generation path being trained using second training data comprising the features and the bias feature. ([Ganin, page 1] “The approach implements this idea in the context of neural network architectures that are trained on labeled data from the source domain and unlabeled data from the target domain.” AND [Ganin, page 11] “Otherwise, the training proceeds standardly and minimizes the label prediction loss (for source examples) and the domain classification loss (for all samples)...Ld (Gd( R( Gf (xi; θf )); θd), di)”, wherein the examiner interprets training on labeled source-domain data (using input features xi with class labels yi) and optimizing a domain classification loss that uses domain labels di with the same input features to be the same as using first training data comprising features and the labels, and training a second inference generation path using second training data comprising the features and a bias feature because they are both directed to, respectively, (a) learning the task/label head from feature/label pairs and (b) learning the auxiliary/bias head from feature/non-task attribute pairs that serve as the bias feature.)
Adler, Zheng, Kang, Chu, Ganin, and the instant application are analogous art, because they are all directed to obtaining an inference model using first training data comprising features and the labels while also training a second inference generation path using second training data comprising the features and a bias feature.
It would have been obvious to a person of ordinary skill in the art before the effective filing date of the invention to modify the method of claim 1 disclosed by Adler, Zheng, Kang, Chu, and Ganin to include the “minimizes the label prediction loss (for source examples) and the domain classification loss (for all samples)” disclosed by Ganin. One would be motivated to do so to effectively train, respectively, a label-prediction head on features with labels and an auxiliary head on features with a bias feature to reduce sensitive/domain leakage while preserving task performance, as suggested by Ganin (Ganin, [page 11] “minimizes the label prediction loss (for source examples) and the domain classification loss (for all samples).”) Claims 11 and 17 are analogous to claim 2 (the main difference being the type of claim), and thus will face the same rejection as set forth above.
Regarding claim 3, Adler, Zheng, Kang, Chu, and Ganin teach The method of claim 1, (see rejection of claim 1).
Kang further teaches:
wherein the provisional divisional point divides hidden layers of the trained inference model into two groups, ([Kang, page 622] “Partition Point Selection – Neurosurgeon then selects the best partition point. The candidate points are after each layer.” AND [Kang, page 623] “there is exactly one partition point within the DNN for which information is sent from the mobile device to the cloud.”, wherein the examiner interprets selecting one partition point after a specific layer to be the same as the provisional divisional point dividing the hidden layers into two groups because they are both directed to inserting a single split that separates the preceding hidden layers from the succeeding hidden layers.)
a second group of the two groups comprising the majority of the hidden layers when the magnitude is below a second threshold, ([Kang, page 623] “the left-most bar represents cloud-only processing (i.e., partitioning at the beginning) while the right-most bar represents mobile-only execution (i.e., partitioning at the end).” AND [Kang, page 620] “Figures 8a - 8c show that different CV applications have different partition points for best latency, and Figures 9a - 9c show the different partition points for best energy for these DNNs”, wherein the examiner interprets the beginning/end split cases to be the same as allocating nearly all (a majority) of hidden layers to one group or the other because they are both directed to choosing a partition point that places most layers on one side under limiting conditions. The examiner further interprets “different partition points” to be the same as the second “threshold” because they are both geared to finding the best point to partition the NN.)
and the first group and the second group comprising a similar number of the hidden layers when the magnitude is between the first threshold and the second threshold. ([Kang, page 622] “Computer vision DNNs sometimes have better partition points in the middle of the DNN”, wherein the examiner interprets a middle partition to be the same as the two groups comprising a similar number of hidden layers because they are both directed to splitting near the midpoint so that the preceding and succeeding hidden layers are of comparable count. The examiner further interprets “different partition points” to be the same as the second “threshold” because they are both geared to finding the best point to partition the NN.)
Kang does not teach a first group of the two groups comprising a majority of the hidden layers when the magnitude exceeds a first threshold.
Zheng further teaches a first group of the two groups comprising a majority of the hidden layers when the magnitude exceeds a first threshold, ([Zheng, page 11914] “In general, an architecture that has a high RMI score tends to be a good architecture”, [Zheng, page 11915] “Such classification also generates a loss value threshold T…This architecture will be marked as a good sample if its loss value is below the threshold T” AND [Zheng, page 11913] “any architectures…that have > 85% top classification accuracy is used as an accurate performance indicator.”, wherein the examiner interprets using an RMI-based magnitude together with explicit thresholds (including an example greater-than cutoff) to be the same as using a first threshold on the magnitude because they are both directed to deciding model/topology choices based on whether a measured magnitude crosses a defined boundary.)
Adler, Zheng, Kang, Chu, Ganin, and the instant application are analogous art because they are all directed to threshold-based placement of a divisional point within a neural network that splits hidden layers into earlier and later groups.
It would have been obvious to a person of ordinary skill in the art before the effective filing date of the invention to modify the method of claim 1 disclosed by Adler, Zheng, Kang, Chu, and Ganin to include the mutual information metric disclosed by Zheng. One would be motivated to do so to efficiently estimate performance of architectures, as suggested by Zheng (Zheng, [page 11914] “Rather than estimating architectures by using laborious training methods, we propose Representation Mutual Information (RMI) to achieve effective and efficient performance estimation.”). Claims 12 and 18 are analogous to claim 3 (the main difference being the type of claim), and thus will face the same rejection as set forth above.
Regarding claim 4, Adler, Zheng, Kang, Chu, and Ganin teach The method of claim 3, (see rejection of claim 3).
Kang further teaches wherein the provisional divisional point is a starting point for the neural architecture search. ([Kang, page 622-623] “Partition Point Selection - Neurosurgeon then selects the best partition point. The candidate points are after each layer. Lines 16 and 18 evaluate the performance when partitioning at each candidate point and select the point for either best end-to-end latency or best mobile energy consumption.”, wherein the examiner interprets selecting “the best partition point…after each layer” to be the same as a provisional divisional point and “Uniform initialization for the populations P1 and Q1” to be the same as a starting point for the neural architecture search because they are both directed to first choosing a specific split layer in the network and then using that chosen configuration as the initial condition from which the NAS procedure begins. The examiner further interprets “each candidate” to be the same as a provisional divisional point because they are both initial, layer-level split locations that are explicitly evaluated prior to choosing a single best (final) split.)
Adler, Zheng, Kang, Chu, Ganin, and the instant application are analogous art because they are all directed to using a selected divisional split as the starting configuration for a subsequent neural architecture search that balances predictive performance and resource or size constraints.
It would have been obvious to a person of ordinary skill in the art before the effective filing date of the invention to modify the method of claim 3 disclosed by Adler, Zheng, Kang, Chu, and Ganin to include the multi-objective optimization approach disclosed by Chu. One would be motivated to do so to efficiently guide the search from the chosen split toward architectures that satisfy accuracy and size objectives, as suggested by Chu ([Chu, page 6-7] “we adopt multi-objective optimization where three objectives are considered: accuracies, multiply-adds, and the number of parameters…sampled from our Pareto front to meet different hardware constraints.”). Claims 13 and 19 are analogous to claim 4 (the main difference being the type of claim), and thus will face the same rejection as set forth above.
Regarding claim 5, Adler, Zheng, Kang, Chu, and Ganin teach The method of claim 1, (see rejection of claim 1).
Kang further teaches wherein the provisional divisional point divides hidden layers of the trained inference model into two groups, hidden layer membership in a first group of the two groups scales proportionally to the magnitude, and hidden layer membership in the second group of the two groups scales inversely proportionally to the magnitude. ([Kang, page 623] “Partition Point Selection – Neurosurgeon then selects the best partition point. The candidate points are after each layer. Lines 16 and 18 evaluate the performance when partitioning at each candidate point and select the point for either best end-to-end latency or best mobile energy consumption.” AND [Kang, page 622] “Each bar represents the mobile energy consumption if the DNN is partitioned after each layer, where the left-most bar represents cloud-only processing (i.e., partitioning at the beginning) while the right-most bar represents mobile-only execution (i.e., partitioning at the end). The partition points for best energy are each marked by F…The best way to partition a DNN depends on its topology and constituent layers.”, wherein the examiner interprets evaluating partitioning after each layer and selecting a best partition point with the split moving along the sequence of layers to be the same as the provisional divisional point dividing hidden layers into two groups with hidden layer membership in a first group scaling proportionally to the magnitude and hidden layer membership in the second group scaling inversely proportionally to the magnitude because they are both directed to adjusting a single split across ordered hidden layers such that as the driving magnitude increases the number of layers on one side of the split increases while the number on the other side decreases.)
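Purely as an illustration of the claim-5 mapping (the 0-to-1 normalization of the magnitude is an assumption made for this example, not a teaching of Kang), the proportional/inverse-proportional scaling reduces to simple arithmetic:

    # Toy arithmetic: first-group membership grows with the normalized magnitude,
    # second-group membership shrinks correspondingly.
    def split_by_magnitude(num_hidden_layers, magnitude):
        first_group = round(magnitude * num_hidden_layers)
        second_group = num_hidden_layers - first_group
        return first_group, second_group

    # Example: 10 hidden layers, magnitude 0.8 -> (8, 2); magnitude 0.2 -> (2, 8).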
Adler, Zheng, Kang, Chu, Ganin, and the instant application are analogous art because they are all directed to selecting and adjusting a partition layer across ordered hidden layers.
It would have been obvious to a person of ordinary skill in the art before the effective filing date of the invention to modify the method of claim 1 disclosed by Adler, Zheng, Kang, Chu, and Ganin to include the candidate partition point selection process disclosed by Kang. One would be motivated to do so to effectively adjust the split location across the hidden layers in response to the magnitude so that the number of layers on each side scales appropriately while improving system performance, as suggested by Kang (Kang, [page 623] “Partition Point Selection – Neurosurgeon then selects the best partition point. The candidate points are after each layer…select the point for either best end-to-end latency or best mobile energy consumption”). Claims 14 and 20 are analogous to claim 5 (the main difference being the type of claim), and thus will face the same rejection as set forth above.
Regarding claim 6, Adler, Zheng, Kang, Chu, and Ganin teach The method of claim 5, (see rejection of claim 5).
Zheng further teaches wherein the magnitude is normalized to a range ([Zheng, page 11915] “introduce the normalized Hilbert-Schmidt Independence Criterion (HSIC)…the normalized HSIC [21, 26, 63] is defined as… I(X,Y) ≈ nHSIC_linear(X,Y) = ||YᵀX||²_F / (||XᵀX||_F ||YᵀY||_F)”, wherein the examiner interprets the use of a normalized HSIC definition (i.e., scaling a dependence measure by the product of self-similarities) to be the same as the magnitude is normalized to a range because they are both directed to producing a bounded, scale-invariant score from an underlying dependence quantity so it can be compared or used downstream.)
Zheng does not teach where at a first end of the range all of the hidden layers are members of the first group and at a second end of the range all of the hidden layers are members of the second group.
Kang further teaches where at a first end of the range all of the hidden layers are members of the first group and at a second end of the range all of the hidden layers are members of the second group. ([Kang, page 619] “Each bar represents the end-to-end latency if the DNN is partitioned after each layer, where the left-most bar represents cloud-only processing (i.e., partitioning at the beginning) while the right-most bar represents mobile-only execution (i.e., partitioning at the end).”, wherein the examiner interprets cloud-only processing and mobile-only execution, i.e., the two extremes where the partition is at the very beginning or the very end, to be the same as the first end of the range assigning all hidden layers to the first group and the second end of the range assigning all hidden layers to the second group because they are both directed to extreme partition choices in which every hidden layer lies entirely on one side of the split.)
Adler, Zheng, Kang, Chu, and Ganin, and the instant application are analogous art because they are all directed to normalizing a dependence measure to a bounded range and interpreting the range’s endpoints as split choices.
It would have been obvious to a person of ordinary skill in the art before the effective filing date of the invention to modify the method of claim 5 disclosed by Adler, Zheng, Kang, Chu, and Ganin to include the partition point selection procedure disclosed by Kang. One would be motivated to do so to effectively tie the normalized range to concrete split decisions that optimize deployment objectives, as suggested by Kang ([Kang, page 622] “Partition Point Selection - Neurosurgeon then selects the best partition point. The candidate points are after each layer. Lines 16 and 18 evaluate the performance when partitioning at each candidate point and select the point for either best end-to-end latency or best mobile energy consumption…this evaluation is lightweight and efficient.”). Claim 15 is analogous to claim 6 (the main difference being the type of claim), and thus will face the same rejection as set forth above.
Regarding claim 7, Adler, Zheng, Kang, Chu, and Ganin teach The method of claim 1, (see rejection of claim 1).
Chu further teaches wherein the neural architecture size goal defines a range for the hidden layers over which the neural architecture search is conducted. ([Chu, page 7] “For the second-stage, we adopt multi-objective optimization where three objectives are considered: accuracies, multiply-adds, and the number of parameters.” AND [Chu, page 5] “A search space of 19 layers as ProxylessNAS”, wherein the examiner interprets optimizing with multiply-adds and the number of parameters together with defining a search space of 19 layers to be the same as a neural architecture size goal defining a range for the hidden layers over which the neural architecture search is conducted because they are both directed to constraining the search by model-size measures (size goal) and by a specific depth span (range of hidden layers) within which candidates are explored.)
Adler, Zheng, Kang, Chu, and Ganin, and the instant application are analogous art because they are all directed to constraining a neural architecture search by a neural architecture size goal that defines a range of hidden layers.
It would have been obvious to a person of ordinary skill in the art before the effective filing date of the invention to modify the method of claim 1 disclosed by Adler, Zheng, Kang, Chu, and Ganin to include the three objectives for multi-objective optimization disclosed by Chu. One would be motivated to do so to efficiently constrain the architecture search to a defined layer range while optimizing the size-related objectives, as suggested by Chu ([Chu, page 5] “A search space of 19 layers as ProxylessNAS”).
Regarding claim 8, Adler, Zheng, Kang, Chu, and Ganin teach The method of claim 5 (see rejection of claim 5).
Zheng further teaches wherein the predictive capability goal indicates a minimum acceptable level of accuracy for the inferences. ([Zheng, page 7] “a network architecture with an accuracy greater than 85% is suitable enough to make RMI score a good indicator.” AND [Zheng, page 5] “Such classification also generates a loss value threshold T for the following steps.”, wherein the examiner interprets the paper’s requirement that architectures surpass 85% accuracy together with the use of a threshold T to accept only “good” architectures to be the same as the predictive capability goal indicating a minimum acceptable level of accuracy for the inferences because they are both directed to setting a performance threshold that candidates must meet or exceed before being deemed acceptable.)
Adler, Zheng, Kang, Chu, Ganin, and the instant application are analogous art because they are all directed to managing trained inference models by evaluating model architectures using performance criteria.
It would have been obvious to a person of ordinary skill in the art before the effective filing date of the invention to modify the method of claim 5 disclosed by Adler, Zheng, Kang, Chu, and Ganin to include the minimum accuracy threshold disclosed by Zheng. One would be motivated to do so to effectively ensure that candidate neural network architectures satisfy a minimum predictive performance requirement before being used to guide architecture evaluation and selection, as suggested by Zheng ([Zheng, page 7] “a network architecture with an accuracy greater than 85% is suitable enough to make Representation Mutual Information [RMI] score a good indicator.”).
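For clarity of the record, a predictive capability goal expressed as a minimum acceptable accuracy, together with a loss threshold standing in for the quoted threshold T, can be illustrated by the following minimal sketch (Python; aside from the quoted 85% figure, the numbers are hypothetical values chosen by the examiner and are not Zheng's code).

    # Illustrative sketch only: a candidate architecture is considered acceptable
    # only if it meets a minimum accuracy level (and a loss threshold "T").
    MIN_ACCURACY = 0.85        # minimum acceptable accuracy, per the quoted 85% figure
    LOSS_THRESHOLD_T = 0.40    # hypothetical loss threshold standing in for "T"

    def meets_predictive_capability_goal(accuracy, loss):
        """Accept an architecture only if it clears both performance gates."""
        return accuracy >= MIN_ACCURACY and loss <= LOSS_THRESHOLD_T

    candidates = {"net-small": (0.83, 0.45), "net-base": (0.88, 0.32)}
    accepted = [name for name, (acc, loss) in candidates.items()
                if meets_predictive_capability_goal(acc, loss)]
    print(accepted)            # only "net-base" clears the minimum accuracy level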
Regarding claim 9, Adler, Zheng, Kang, Chu, and Ganin teach The method of claim 1, (see rejection of claim 1).
Ganin further teaches wherein the latent bias is caused by the bias feature ([Ganin, page 7, sec 4] “we ensure that the internal representation of the neural network contains no discriminative information about the origin of the input (source or target), while preserving a low risk on the source (labeled) examples.”, wherein the examiner interprets “origin of the input (source or target)” to be the same as the bias feature and interprets the effort to remove “discriminative information” about that origin from the learned representation to be the same as latent bias being caused by the bias feature, because they are both directed to a nuisance / sensitive attribute causing undesired influence in the learned model representation and resulting predictions.)
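For clarity of the record, the cited adversarial objective, under which the learned representation retains label information while shedding information about the origin of the input (the asserted bias feature), can be illustrated by the following minimal sketch (Python; the probabilities and weighting are hypothetical values chosen by the examiner and are not Ganin's code).

    # Illustrative sketch only: the feature extractor's objective keeps the label
    # loss low while MAXIMIZING the origin (domain) classifier's loss, i.e.,
    # stripping origin information out of the learned representation.
    import numpy as np

    def cross_entropy(p, y):
        """Mean binary cross-entropy of predicted probabilities p against targets y."""
        eps = 1e-9
        return float(-np.mean(y * np.log(p + eps) + (1 - y) * np.log(1 - p + eps)))

    def feature_extractor_objective(label_probs, labels, origin_probs, origins, lam=1.0):
        """Objective minimized by the feature extractor (label loss minus origin loss)."""
        return cross_entropy(label_probs, labels) - lam * cross_entropy(origin_probs, origins)

    labels  = np.array([1, 0, 1, 0])                          # task labels
    origins = np.array([1, 1, 0, 0])                          # origin of each input
    label_probs        = np.array([0.9, 0.1, 0.9, 0.1])       # accurate label predictions
    leaky_origin_probs = np.array([0.95, 0.95, 0.05, 0.05])   # representation reveals origin
    blind_origin_probs = np.full(4, 0.5)                      # representation hides origin

    # The objective is lower (better for the extractor) when the origin classifier
    # is reduced to chance, i.e., the representation no longer encodes the bias feature.
    print(feature_extractor_objective(label_probs, labels, leaky_origin_probs, origins))
    print(feature_extractor_objective(label_probs, labels, blind_origin_probs, origins))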
Ganin does not teach and the bias feature being indicated in the first inferences generated by the trained inference model.
Adler teaches and the bias feature being indicated in the first inferences generated by the trained inference model. ([Adler, page 1] “Data-trained predictive models see widespread use, but for the most part they are used as black boxes which output a prediction or score…Our work focuses on the problem of indirect influence: how some features might indirectly influence outcomes via other, related features. As a result, we can find attribute influences even in cases where, upon further direct examination of the model, the attribute is not referred to by the model at all.” AND [Adler, page 3] “The indirect influence II(i) of a feature i on a classifier f applied to data (X, Y) is the difference in accuracy when f is run on X versus when it is run on X\Xi: II(i) = acc(X, Y, f) - acc(X\Xi, Y, f) (note that f is not retrained on X\Xi)”, wherein the examiner interprets black-box models that “output a prediction or score” and the detection of “attribute influences” by running the trained classifier to be the same as the bias feature being indicated in the first inferences generated by the trained inference model, because they are both directed to identifying the presence or effect of a bias-related feature from the outputs / inferences generated by an already-trained model.)
Adler, Zheng, Kang, Chu, Ganin, and the instant application are analogous art because they are all directed to identifying and mitigating bias-related influence in trained inference models by analyzing the effects of sensitive or nuisance attributes on model outputs.
It would have been obvious to a person of ordinary skill in the art before the effective filing date of the invention to modify the method of claim 1 disclosed by Adler, Zheng, Kang, Chu, and Ganin to include the black box analysis technique disclosed by Adler. One would be motivated to do so to effectively identify bias-related influence reflected in the outputs of a trained inference model even when the bias feature is not explicitly present in the model inputs or training data, as suggested by Adler ([Adler, page 1] “Data-trained predictive models see widespread use, but for the most part they are used as black boxes which output a prediction or score…we can find attribute influences even in cases where… the attribute is not referred to by the model at all.”).
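For clarity of the record, the quoted indirect-influence measure II(i) = acc(X, Y, f) - acc(X\Xi, Y, f) can be illustrated by the following minimal sketch (Python; the toy model, data, and the mean-substitution used to obscure a feature are simplifications chosen by the examiner and are not Adler's code), showing how a bias-related feature's influence is detected from the outputs of an already-trained, fixed model.

    # Illustrative sketch only: drop in accuracy when feature i is obscured,
    # keeping the trained classifier f fixed (f is NOT retrained).
    import numpy as np

    def accuracy(f, X, y):
        return float(np.mean(f(X) == y))

    def indirect_influence(f, X, y, i):
        """II(i): accuracy on X minus accuracy on X with feature i obscured."""
        X_obscured = X.copy()
        X_obscured[:, i] = X[:, i].mean()        # crude stand-in for the obscuring step
        return accuracy(f, X, y) - accuracy(f, X_obscured, y)

    # Toy "trained" model: a fixed rule whose outputs track feature 0.
    f = lambda X: (X[:, 0] + 0.2 * X[:, 1] > 0.5).astype(int)

    rng = np.random.default_rng(0)
    X = rng.random((1000, 2))
    y = (X[:, 0] > 0.5).astype(int)              # labels driven by feature 0

    print(indirect_influence(f, X, y, 0))        # large influence for feature 0
    print(indirect_influence(f, X, y, 1))        # small influence for feature 1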
Conclusion
THIS ACTION IS NON-FINAL. Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a).
A shortened statutory period for reply to this action is set to expire THREE MONTHS from the mailing date of this action. In no event, however, will the statutory period for reply expire later than SIX MONTHS from the mailing date of this action.
Any inquiry concerning this communication or earlier communications from the examiner should be directed to DEVAN KAPOOR whose telephone number is (703)756-1434. The examiner can normally be reached Monday - Friday: 9:00AM - 5:00 PM EST (times may vary).
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, David Yi can be reached at (571) 270-7519. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.
/DEVAN KAPOOR/Examiner, Art Unit 2126
/DAVID YI/Supervisory Patent Examiner, Art Unit 2126