DETAILED ACTION
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
The present application was filed on 10/18/2021. This action is in response to amendments and remarks filed on 10/16/2025. In the current amendments, claims 1-4, 6-9, 11-14 and 16-25 have been amended, claim 10 has been canceled, and no claims have been added. Thus, claims 1-9 and 11-25 are pending and have been examined. Claims 1 and 16 are the independent claims.
Response to Amendment
In response to amendments and remarks filed on 10/16/2025, the objections to the specification and drawings, set forth in the previous Office Action, have been withdrawn due to Applicant’s amendments to the specification filed 10/16/2025. Claims 1, 4, 6, 8-9 and 11 are no longer being interpreted under 35 U.S.C. 112(f) in light of Applicant’s claim amendments and remarks. The rejections of claims 1-11 under 35 U.S.C. 112(b), set forth in the previous Office Action, have been withdrawn due to Applicant’s claim amendments and remarks.
Information Disclosure Statement
The information disclosure statements (IDS) submitted on 07/03/2025, 08/05/2025, 09/26/2025, 11/06/2025 and 12/08/2025 are in compliance with the provisions of 37 CFR 1.97. Accordingly, the information disclosure statements are being considered by the examiner.
Claim Rejections - 35 USC § 101
35 U.S.C. 101 reads as follows:
Whoever invents or discovers any new and useful process, machine, manufacture, or composition of matter, or any new and useful improvement thereof, may obtain a patent therefor, subject to the conditions and requirements of this title.
Claims 1-9 and 11-25 are rejected under 35 U.S.C. 101 because the claimed invention is directed to an abstract idea without significantly more.
Claim 1:
Step 1: Claim 1 is directed to an apparatus for sparse distillation of machine learning (ML) models, which is directed to a machine, one of the statutory categories.
Step 2A Prong 1: The claim recites the limitations:
to distill knowledge of a supernet into a subnet during a single ML training epoch - In the context of the claim limitation, this encompasses a mental process of observing supernet data/knowledge.
during the same single ML training epoch as the distilling, prune one or more parameters from the subnet ML model based on an identified parameter of target hardware, the subnet ML model to operate on the target hardware, the prune to produce a sparse distilled subnet ML model - In the context of the claim limitation, this encompasses a mental process of opinion based on observing params and deciding which ones to prune/delete.
Step 2A Prong 2: This judicial exception is not integrated into a practical application. The claim further recites “interface circuitry”; “at least one programmable circuit to be programmed by the instructions to” – these are mere instructions to implement an abstract idea on a computer or merely use a computer as a tool to perform an abstract idea (i.e., as generic computer components performing generic computer functions). See MPEP 2106.05(f). Accordingly, the additional elements do not integrate the abstract idea into a practical application because they do not impose any meaningful limits on practicing the abstract idea. The claim is directed to an abstract idea.
Step 2B Analysis: The claim does not include additional elements that are sufficient to amount to significantly more than the judicial exception. As discussed above with respect to integration of the abstract idea into a practical application, the additional element is directed to a mere instruction to apply the judicial exception. Mere instruction to apply a judicial exception does not amount to significantly more. See MPEP 2106.05(f). Nothing in the claim provides significantly more than this. As such, the claim is not patent eligible.
Claim 2:
Step 1: Claim 2 is directed to an apparatus for sparse distillation of machine learning (ML) models, which is directed to a machine, one of the statutory categories.
Step 2A Prong 1: The claim recites the limitations:
to distill the knowledge and prune the one or more parameters simultaneously - In the context of the claim limitation, this encompasses a mental process of opinion based on observing params and deciding which ones to prune/delete.
Step 2A Prong 2: This judicial exception is not integrated into a practical application. The claim further recites “wherein one or more of the at least one programmable circuit is to distill” – this is mere instructions to implement an abstract idea on a computer or merely use a computer as a tool to perform an abstract idea (i.e., as generic computer components performing generic computer functions). See MPEP 2106.05(f). Accordingly, the additional elements do not integrate the abstract idea into a practical application because they do not impose any meaningful limits on practicing the abstract idea. The claim is directed to an abstract idea.
Step 2B Analysis: The claim does not include additional elements that are sufficient to amount to significantly more than the judicial exception. As discussed above with respect to integration of the abstract idea into a practical application, the additional element is directed to a mere instruction to apply the judicial exception. Mere instruction to apply a judicial exception does not amount to significantly more. See MPEP 2106.05(f). Nothing in the claim provides significantly more than this. As such, the claim is not patent eligible.
Claim 3:
Step 1: Claim 3 is directed to an apparatus for sparse distillation of machine learning (ML) models, which is directed to a machine, one of the statutory categories.
Step 2A Prong 1: Please see analysis of claim 1.
Step 2A Prong 2: This judicial exception is not integrated into a practical application. The claim recites “wherein one or more of the at least one programmable circuit is to perform a single pass over a training dataset during the single ML training epoch”, which recites insignificant extra-solution activity of mere data gathering and output. See MPEP 2106.05(g). Accordingly, the additional elements do not integrate the abstract idea into a practical application because they do not impose any meaningful limits on practicing the abstract idea. The claim is directed to an abstract idea.
Step 2B Analysis: The claim does not include additional elements that are sufficient to amount to significantly more than the judicial exception. The recitation of “wherein…” is directed to insignificant extra-solution activity that is well known, routine and conventional because the limitation is directed to receiving or transmitting data over a network, e.g., using the Internet to gather data. See MPEP 2106.05(d)(II), OIP Techs., Inc., v. Amazon.com, Inc., 788 F.3d 1359, 1363, 115 USPQ2d 1090, 1093 (Fed. Cir. 2015) (sending messages over a network). Nothing in the claim provides significantly more than this. As such, the claim is not patent eligible.
Claim 4:
Step 1: Claim 4 is directed to an apparatus for sparse distillation of machine learning (ML) models, which is directed to a machine, one of the statutory categories.
Step 2A Prong 1: Please see analysis of claim 1.
Step 2A Prong 2: This judicial exception is not integrated into a practical application. The claim further recites “wherein one or more of the at least one programmable circuit is to”, “train the subnet using the training dataset” – these are mere instructions to implement an abstract idea on a computer or merely use a computer as a tool to perform an abstract idea (i.e., as generic computer components performing generic computer functions). See MPEP 2106.05(f). The claim also recites “operate the supernet to extract the knowledge to be distilled into the subnet”; “using the extracted knowledge to guide the training of the subnet”, which recite insignificant extra-solution activity of mere data gathering and output. See MPEP 2106.05(g). Accordingly, the additional elements do not integrate the abstract idea into a practical application because they do not impose any meaningful limits on practicing the abstract idea. The claim is directed to an abstract idea.
Step 2B Analysis: The claim does not include additional elements that are sufficient to amount to significantly more than the judicial exception. As discussed above with respect to integration of the abstract idea into a practical application, the additional element is directed to a mere instruction to apply the judicial exception. Mere instruction to apply a judicial exception does not amount to significantly more. See MPEP 2106.05(f). Furthermore, the recitations of “operate…”; “using…” are directed to insignificant extra-solution activity that is well known, routine and conventional because the limitations are directed to receiving or transmitting data over a network, e.g., using the Internet to gather data. See MPEP 2106.05(d)(II), OIP Techs., Inc., v. Amazon.com, Inc., 788 F.3d 1359, 1363, 115 USPQ2d 1090, 1093 (Fed. Cir. 2015) (sending messages over a network). Nothing in the claim provides significantly more than this. As such, the claim is not patent eligible.
Claim 5:
Step 1: Claim 5 is directed to an apparatus for sparse distillation of machine learning (ML) models, which is directed to a machine, one of the statutory categories.
Step 2A Prong 1: Please see analysis of claim 4.
Step 2A Prong 2: This judicial exception is not integrated into a practical application. The claim recites “wherein the knowledge includes both logits and feature maps extracted from the supernet”, which recites insignificant extra-solution activity of mere data gathering and output. See MPEP 2106.05(g). Accordingly, the additional elements do not integrate the abstract idea into a practical application because they do not impose any meaningful limits on practicing the abstract idea. The claim is directed to an abstract idea.
Step 2B Analysis: The claim does not include additional elements that are sufficient to amount to significantly more than the judicial exception. The recitation of “wherein…” is directed to insignificant extra-solution activity that is well known, routine and conventional because the limitation is directed to receiving or transmitting data over a network, e.g., using the Internet to gather data. See MPEP 2106.05(d)(II), OIP Techs., Inc., v. Amazon.com, Inc., 788 F.3d 1359, 1363, 115 USPQ2d 1090, 1093 (Fed. Cir. 2015) (sending messages over a network). Nothing in the claim provides significantly more than this. As such, the claim is not patent eligible.
Claim 6:
Step 1: Claim 6 is directed to an apparatus for sparse distillation of machine learning (ML) models, which is directed to a machine, one of the statutory categories.
Step 2A Prong 1: Please see analysis of claim 4.
Step 2A Prong 2: This judicial exception is not integrated into a practical application. The claim further recites “wherein one or more of the at least one programmable circuit is to execute” – this is a mere instruction to implement an abstract idea on a computer or merely use a computer as a tool to perform an abstract idea (i.e., as generic computer components performing generic computer functions). See MPEP 2106.05(f). The claim recites “an attention transfer distillation algorithm to transfer the knowledge from the supernet to the subnet”, which recites insignificant extra-solution activity of mere data gathering and output. See MPEP 2106.05(g). Accordingly, the additional elements do not integrate the abstract idea into a practical application because they do not impose any meaningful limits on practicing the abstract idea. The claim is directed to an abstract idea.
Step 2B Analysis: The claim does not include additional elements that are sufficient to amount to significantly more than the judicial exception. As discussed above with respect to integration of the abstract idea into a practical application, the additional element is directed to a mere instruction to apply the judicial exception. Mere instruction to apply a judicial exception does not amount to significantly more. See MPEP 2106.05(f). Furthermore, the recitation of “use an attention…” is directed to insignificant extra-solution activity that is well known, routine and conventional because the limitation is directed to receiving or transmitting data over a network, e.g., using the Internet to gather data. See MPEP 2106.05(d)(II), OIP Techs., Inc., v. Amazon.com, Inc., 788 F.3d 1359, 1363, 115 USPQ2d 1090, 1093 (Fed. Cir. 2015) (sending messages over a network). Nothing in the claim provides significantly more than this. As such, the claim is not patent eligible.
Claim 7:
Step 1: Claim 7 is directed to an apparatus for sparse distillation of machine learning (ML) models, which is directed to a machine, one of the statutory categories.
Step 2A Prong 1: Please see analysis of claim 1.
Step 2A Prong 2: This judicial exception is not integrated into a practical application. The claim further recites “wherein the supernet ML model is a neural network (NN) including one or more of a deep NN (DNN), feed forward NN (FFN), deep FNN (DFFN), convolutional NN (CNN), deep CNN (DCN), deconvolutional NN (DNN), deep belief NN, perception NN, graph NN, recurrent NN (RNN), Long Short Term Memory (LSTM) algorithm, gated recurrent unit (GRU), echo state network (ESN), spiking NN (SNN), deep stacking network (DSN), Markov chain, perception NN, generative adversarial network (GAN), transformer, self-attention (SA) mechanism, stochastic NN, Bayesian Network (BN), Bayesian belief network (BBN), Bayesian NN (BNN), Deep BNN (DBNN), Dynamic BN (DBN), probabilistic graphical model (PGM), Boltzmann machine, restricted Boltzmann machine (RBM), Hopfield network, convolutional deep belief network (CDBN), Linear Dynamical System (LDS), Switching LDS (SLDS), Optical NN (ONN), and/or an NN for reinforcement learning (RL) and/or deep RL (DRL)” – merely asserting that a judicial exception is to be carried out on a generic computer cannot meaningfully integrate the judicial exception into a practical application; the claim amounts to a generic computer programmed with generically-recited AI algorithms (i.e., generic NN models recited at a high level of generality). See MPEP 2106.05(f). Accordingly, the additional elements do not integrate the abstract idea into a practical application because they do not impose any meaningful limits on practicing the abstract idea. The claim is directed to an abstract idea.
Step 2B Analysis: The claim does not include additional elements that are sufficient to amount to significantly more than the judicial exception. As discussed above with respect to integration of the abstract idea into a practical application, the additional element is directed to a mere instruction to apply the judicial exception. Mere instruction to apply a judicial exception does not amount to significantly more. See MPEP 2106.05(f). Nothing in the claim provides significantly more than this. As such, the claim is not patent eligible.
Claim 8:
Step 1: Claim 8 is directed to an apparatus for sparse distillation of machine learning (ML) models, which is directed to a machine, one of the statutory categories.
Step 2A Prong 1: The claim recites the limitations:
wherein the queries matrix, the values matrix, and the keys matrix include the parameters to be pruned - In the context of the claim limitation, this encompasses a mental process of evaluating and observing matrices to be pruned.
Step 2A Prong 2: This judicial exception is not integrated into a practical application. The claim further recites “wherein one or more of the at least one programmable circuit is to” – merely asserting that a judicial exception is to be carried out on a generic computer cannot meaningfully integrate the judicial exception into a practical application. See MPEP § 2106.05(f). The claim recites “generate, based on input data, a queries matrix, a values matrix, and a keys matrix”; “provide the queries matrix, the values matrix, and the keys matrix to the pruning”, which recite insignificant extra-solution activities of mere data gathering and output. See MPEP 2106.05(g). Accordingly, the additional elements do not integrate the abstract idea into a practical application because they do not impose any meaningful limits on practicing the abstract idea. The claim is directed to an abstract idea.
Step 2B Analysis: The claim does not include additional elements that are sufficient to amount to significantly more than the judicial exception. As discussed above with respect to integration of the abstract idea into a practical application, the additional element is directed to a mere instruction to apply the judicial exception. Mere instruction to apply a judicial exception does not amount to significantly more. See MPEP 2106.05(f). The recitations of “generate…”, “provide…” are directed to insignificant extra-solution activities that are well known, routine and conventional because the limitations are directed to receiving or transmitting data over a network, e.g., using the Internet to gather data. See MPEP 2106.05(d)(II), OIP Techs., Inc., v. Amazon.com, Inc., 788 F.3d 1359, 1363, 115 USPQ2d 1090, 1093 (Fed. Cir. 2015) (sending messages over a network). Nothing in the claim provides significantly more than this. As such, the claim is not patent eligible.
Claim 9:
Step 1: Claim 9 is directed to an apparatus for sparse distillation of machine learning (ML) models, which is directed to a machine, one of the statutory categories.
Step 2A Prong 1: Please see analysis of claim 8.
Step 2A Prong 2: This judicial exception is not integrated into a practical application. The claim further recites “wherein one or more of the at least one programmable circuit is to” – merely asserting that a judicial exception is to be carried out on a generic computer cannot meaningfully integrate the judicial exception into a practical application. See MPEP § 2106.05(f). The claim recites “apply the input data to a parameterized learnable transformation (PLT) to generate the queries matrix, the values matrix, and the keys matrix”, which recites insignificant extra-solution activity of mere data gathering and output. See MPEP 2106.05(g). Accordingly, the additional elements do not integrate the abstract idea into a practical application because they do not impose any meaningful limits on practicing the abstract idea. The claim is directed to an abstract idea.
Step 2B Analysis: The claim does not include additional elements that are sufficient to amount to significantly more than the judicial exception. As discussed above with respect to integration of the abstract idea into a practical application, the additional element is directed to a mere instruction to apply the judicial exception. Mere instruction to apply a judicial exception does not amount to significantly more. See MPEP 2106.05(f). The recitation of “apply the input…” is directed to insignificant extra-solution activity that is well known, routine and conventional because the limitation is directed to receiving or transmitting data over a network, e.g., using the Internet to gather data. See MPEP 2106.05(d)(II), OIP Techs., Inc., v. Amazon.com, Inc., 788 F.3d 1359, 1363, 115 USPQ2d 1090, 1093 (Fed. Cir. 2015) (sending messages over a network). Nothing in the claim provides significantly more than this. As such, the claim is not patent eligible.
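For context only: the following is a minimal Python/PyTorch sketch of one common way a parameterized learnable transformation (PLT) generates the queries, keys, and values matrices from input data, i.e., a trio of learned linear projections. The class and variable names are illustrative assumptions, not language from the claims or the cited art.

import torch
import torch.nn as nn

class QKVProjection(nn.Module):
    # Hypothetical PLT: three learned linear maps produce Q, K, V from the input.
    def __init__(self, embed_dim: int):
        super().__init__()
        self.w_q = nn.Linear(embed_dim, embed_dim)  # learnable query projection
        self.w_k = nn.Linear(embed_dim, embed_dim)  # learnable key projection
        self.w_v = nn.Linear(embed_dim, embed_dim)  # learnable value projection

    def forward(self, x: torch.Tensor):
        # x: (batch, sequence, embed_dim) input data
        return self.w_q(x), self.w_k(x), self.w_v(x)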
Claim 11:
Step 1: Claim 11 is directed to an apparatus for sparse distillation of machine learning (ML) models, which is directed to a machine, one of the statutory categories.
Step 2A Prong 1: The claim recites the limitations:
perform an operation on the queries matrix and the keys matrix, wherein the operation is a matrix multiplication operation or a 1 x 1 convolution on the queries matrix and the keys matrix - In the context of the claim limitation, this encompasses a mathematical concept of a matrix multiplication operation.
apply a softmax function to an output of the operation - In the context of the claim limitation, this encompasses a mathematical concept of calculating a softmax function.
generate a self attention output based on a combination of the values matrix and an output of the softmax function - In the context of the claim limitation, this encompasses a mathematical concept of generating output based on a softmax function.
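For context only: a minimal Python/PyTorch sketch of the recited sequence of operations — a matrix multiplication on the queries and keys matrices, a softmax applied to an output of that operation, and a combination with the values matrix. The 1/sqrt(d_k) scaling shown is a conventional detail assumed for illustration; it is not recited in the claim.

import torch
import torch.nn.functional as F

def self_attention(q: torch.Tensor, k: torch.Tensor, v: torch.Tensor) -> torch.Tensor:
    # operation on the queries matrix and the keys matrix (matrix multiplication)
    scores = q @ k.transpose(-2, -1) / q.size(-1) ** 0.5  # scaling is assumed, not recited
    # apply a softmax function to an output of the operation
    weights = F.softmax(scores, dim=-1)
    # self attention output from a combination of the values matrix and the softmax output
    return weights @ v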
Step 2A Prong 2: This judicial exception is not integrated into a practical application. The claim further recites “wherein one or more of the at least one programmable circuit is to” – merely asserting that a judicial exception is to be carried out on a generic computer cannot meaningfully integrate the judicial exception into a practical application. See MPEP § 2106.05(f). Accordingly, the additional elements do not integrate the abstract idea into a practical application because they do not impose any meaningful limits on practicing the abstract idea. The claim is directed to an abstract idea.
Step 2B Analysis: The claim does not include additional elements that are sufficient to amount to significantly more than the judicial exception. As discussed above with respect to integration of the abstract idea into a practical application, the additional element is directed to a mere instruction to apply the judicial exception. Mere instruction to apply a judicial exception does not amount to significantly more. See MPEP 2106.05(f). Nothing in the claim provides significantly more than this. As such, the claim is not patent eligible.
Claim 12:
Step 1: Claim 12 is directed to an apparatus for sparse distillation of machine learning (ML) models, which is directed to a machine, one of the statutory categories.
Step 2A Prong 1: Please see analysis of claim 8.
Step 2A Prong 2: This judicial exception is not integrated into a practical application. The claim further recites “wherein one or more of the at least one programmable circuit is to replace a convolutional neural network (CNN) with a self attention (SA) mechanism” – this is mere instructions to implement an abstract idea on a computer or merely use a computer as a tool to perform an abstract idea (i.e., as generic computer components performing generic computer functions). See MPEP 2106.05(f). Accordingly, the additional elements do not integrate the abstract idea into a practical application because they do not impose any meaningful limits on practicing the abstract idea. The claim is directed to an abstract idea.
Step 2B Analysis: The claim does not include additional elements that are sufficient to amount to significantly more than the judicial exception. As discussed above with respect to integration of the abstract idea into a practical application, the additional element is directed to a mere instruction to apply the judicial exception. Mere instruction to apply a judicial exception does not amount to significantly more. See MPEP 2106.05(f). Nothing in the claim provides significantly more than this. As such, the claim is not patent eligible.
Claim 13:
Step 1: Claim 13 is directed to an apparatus for sparse distillation of machine learning (ML) models, which is directed to a machine, one of the statutory categories.
Step 2A Prong 1: Please see analysis of claim 8.
Step 2A Prong 2: This judicial exception is not integrated into a practical application. The claim further recites “wherein the supernet is a convolutional neural network (CNN) comprising a plurality of convolutional layers and a plurality of layers that are not convolutional layers” – this is mere instructions to apply the judicial exception using generic computer programmed with generic computer equipment. See MPEP 2106.05(f). Accordingly, the additional elements do not integrate the abstract idea into a practical application because they do not impose any meaningful limits on practicing the abstract idea. The claim is directed to an abstract idea.
Step 2B Analysis: The claim does not include additional elements that are sufficient to amount to significantly more than the judicial exception. As discussed above with respect to integration of the abstract idea into a practical application, the additional element is directed to a mere instruction to apply the judicial exception. Mere instruction to apply a judicial exception does not amount to significantly more. See MPEP 2106.05(f). Nothing in the claim provides significantly more than this. As such, the claim is not patent eligible.
Claim 14:
Step 1: Claim 14 is directed to an apparatus for sparse distillation of machine learning (ML) models, which is directed to a machine, one of the statutory categories.
Step 2A Prong 1: Please see analysis of claim 8.
Step 2A Prong 2: This judicial exception is not integrated into a practical application. The claim further recites “wherein one or more of the at least one programmable circuit is to replace the plurality of convolutional layers in the CNN with a plurality of SA layers” – this is mere instructions to apply the judicial exception using generic computer programmed with generic computer equipment. See MPEP 2106.05(f). Accordingly, the additional elements do not integrate the abstract idea into a practical application because they do not impose any meaningful limits on practicing the abstract idea. The claim is directed to an abstract idea.
Step 2B Analysis: The claim does not include additional elements that are sufficient to amount to significantly more than the judicial exception. As discussed above with respect to integration of the abstract idea into a practical application, the additional element is directed to a mere instruction to apply the judicial exception. Mere instruction to apply a judicial exception does not amount to significantly more. See MPEP 2106.05(f). Nothing in the claim provides significantly more than this. As such, the claim is not patent eligible.
Claim 15:
Step 1: Claim 15 is directed to an apparatus for sparse distillation of machine learning (ML) models, which is directed to a machine, one of the statutory categories.
Step 2A Prong 1: Please see analysis of claim 8.
Step 2A Prong 2: This judicial exception is not integrated into a practical application. The claim further recites “a client device, an application server, an edge computing server of an edge computing framework, or a cloud compute node of a cloud computing service” – this is mere instructions to implement an abstract idea on a computer or merely use a computer as a tool to perform an abstract idea (i.e., as generic computer components performing generic computer functions). See MPEP 2106.05(f). Accordingly, the additional elements do not integrate the abstract idea into a practical application because they do not impose any meaningful limits on practicing the abstract idea. The claim is directed to an abstract idea.
Step 2B Analysis: The claim does not include additional elements that are sufficient to amount to significantly more than the judicial exception. As discussed above with respect to integration of the abstract idea into a practical application, the additional element is directed to a mere instruction to apply the judicial exception. Mere instruction to apply a judicial exception does not amount to significantly more. See MPEP 2106.05(f). Nothing in the claim provides significantly more than this. As such, the claim is not patent eligible.
Claim 16:
Step 1: Claim 16 is directed to a non-transitory machine readable storage medium, which is directed to an article of manufacture, one of the statutory categories.
Step 2A Prong 1: The claim recites the limitations:
to prune one or more parameters from the second ML model during the one pass over the training dataset - In the context of the claim limitation, this encompasses a mental process of opinion based on observing params and deciding which ones to prune/delete.
during the same one pass over the training dataset as the distill, prune one or more parameters from the second ML model based on an identified parameter of target hardware, the second ML model will operate on the target hardware - In the context of the claim limitation, this encompasses a mental process of opinion based on observing params and deciding which ones to prune/delete.
Step 2A Prong 2: This judicial exception is not integrated into a practical application. The claim further recites “to cause at least one programmable circuit to at least” – this is a mere instruction to implement an abstract idea on a computer or merely use a computer as a tool to perform an abstract idea (i.e., as generic computer components performing generic computer functions). See MPEP 2106.05(f). The claim recites “provide a training dataset to a first machine learning (ML) model and a second ML model, the second ML model having fewer parameters than the first ML model”, which recites insignificant extra-solution activity of mere data gathering and output. See MPEP 2106.05(g). Accordingly, the additional elements do not integrate the abstract idea into a practical application because they do not impose any meaningful limits on practicing the abstract idea. The claim is directed to an abstract idea.
Step 2B Analysis: The claim does not include additional elements that are sufficient to amount to significantly more than the judicial exception. As discussed above with respect to integration of the abstract idea into a practical application, the additional element is directed to a mere instruction to apply the judicial exception. Mere instruction to apply a judicial exception does not amount to significantly more. See MPEP 2106.05(f). The recitation of “provide…” is directed to insignificant extra-solution activity that is well known, routine and conventional because the limitation is directed to receiving or transmitting data over a network, e.g., using the Internet to gather data. See MPEP 2106.05(d)(II), OIP Techs., Inc., v. Amazon.com, Inc., 788 F.3d 1359, 1363, 115 USPQ2d 1090, 1093 (Fed. Cir. 2015) (sending messages over a network). Nothing in the claim provides significantly more than this. As such, the claim is not patent eligible.
Claim 17:
Step 1: Claim 17 is directed to a non-transitory machine readable storage medium, which is directed to an article of manufacture, one of the statutory categories.
Step 2A Prong 1: The claim recites the limitations:
the distill of the knowledge and prune of the one or more parameters simultaneously - In the context of the claim limitation, this encompasses a mental process of opinion based on observing params and deciding which ones to prune/delete.
Step 2A Prong 2: This judicial exception is not integrated into a practical application. The claim further recites “wherein the instructions cause one or more of the at least one programmable circuit to” – mere instructions to implement an abstract idea on a computer or merely use a computer as a tool to perform an abstract idea (i.e., as generic computer components performing generic computer functions). See MPEP 2106.05(f). Accordingly, the additional elements do not integrate the abstract idea into a practical application because they do not impose any meaningful limits on practicing the abstract idea. The claim is directed to an abstract idea.
Step 2B Analysis: The claim does not include additional elements that are sufficient to amount to significantly more than the judicial exception. As discussed above with respect to integration of the abstract idea into a practical application, the additional element is directed to a mere instruction to apply the judicial exception. Mere instruction to apply a judicial exception does not amount to significantly more. See MPEP 2106.05(f). Nothing in the claim provides significantly more than this. As such, the claim is not patent eligible.
Claim 18:
Step 1: Claim 18 is directed to a non-transitory machine readable storage medium, which is directed to an article of manufacture, one of the statutory categories.
Step 2A Prong 1: Please see analysis of claim 16.
Step 2A Prong 2: This judicial exception is not integrated into a practical application. The claim further recites “cause one or more of the at least one programmable circuit to”, “train the first ML model using a training dataset”; “train the second ML model using the training dataset” – mere instructions to implement an abstract idea on a computer or merely use a computer as a tool to perform an abstract idea (i.e., as generic computer components performing generic computer functions). See MPEP 2106.05(f). The claim also recites “operate the first ML model to extract the knowledge to be distilled into the second ML model”; “using the extracted knowledge to guide the training of the second ML model”, which recite insignificant extra-solution activity of mere data gathering and output. See MPEP 2106.05(g). Accordingly, the additional elements do not integrate the abstract idea into a practical application because they do not impose any meaningful limits on practicing the abstract idea. The claim is directed to an abstract idea.
Step 2B Analysis: The claim does not include additional elements that are sufficient to amount to significantly more than the judicial exception. As discussed above with respect to integration of the abstract idea into a practical application, the additional element is directed to a mere instruction to apply the judicial exception. Mere instruction to apply a judicial exception does not amount to significantly more. See MPEP 2106.05(f). Furthermore, the recitations of “operate…”; “using…” are directed to insignificant extra-solution activities that are well known, routine and conventional because the limitations are directed to receiving or transmitting data over a network, e.g., using the Internet to gather data. See MPEP 2106.05(d)(II), OIP Techs., Inc., v. Amazon.com, Inc., 788 F.3d 1359, 1363, 115 USPQ2d 1090, 1093 (Fed. Cir. 2015) (sending messages over a network). Nothing in the claim provides significantly more than this. As such, the claim is not patent eligible.
Claim 19:
Step 1: Claim 19 is directed to a non-transitory machine readable storage medium, which is directed to an article of manufacture, one of the statutory categories.
Step 2A Prong 1: Please see analysis of claim 18.
Step 2A Prong 2: This judicial exception is not integrated into a practical application. The claim further recites “the instructions cause one or more of the at least one programmable circuit to” – this is a mere instruction to implement an abstract idea on a computer or merely use a computer as a tool to perform an abstract idea (i.e., as generic computer components performing generic computer functions). See MPEP 2106.05(f). The claim also recites “extract logits and feature maps from the first ML model, wherein the knowledge includes both the extracted logits and the extracted feature maps”, which recites insignificant extra-solution activity of mere data gathering and output. See MPEP 2106.05(g). Accordingly, the additional elements do not integrate the abstract idea into a practical application because they do not impose any meaningful limits on practicing the abstract idea. The claim is directed to an abstract idea.
Step 2B Analysis: The claim does not include additional elements that are sufficient to amount to significantly more than the judicial exception. As discussed above with respect to integration of the abstract idea into a practical application, the additional element is directed to a mere instruction to apply the judicial exception. Mere instruction to apply a judicial exception does not amount to significantly more. See MPEP 2106.05(f). Furthermore, the recitation of “extract…” is directed to insignificant extra-solution activity that is well known, routine and conventional because the limitation is directed to receiving or transmitting data over a network, e.g., using the Internet to gather data. See MPEP 2106.05(d)(II), OIP Techs., Inc., v. Amazon.com, Inc., 788 F.3d 1359, 1363, 115 USPQ2d 1090, 1093 (Fed. Cir. 2015) (sending messages over a network). Nothing in the claim provides significantly more than this. As such, the claim is not patent eligible.
Claim 20:
Step 1: Claim 20 is directed to a non-transitory machine readable storage medium, which is directed to an article of manufacture, one of the statutory categories.
Step 2A Prong 1: Please see analysis of claim 16.
Step 2A Prong 2: This judicial exception is not integrated into a practical application. The claim further recites “wherein the instructions cause one or more of the at least one programmable circuit to” – this is a mere instruction to implement an abstract idea on a computer or merely use a computer as a tool to perform an abstract idea (i.e., as generic computer components performing generic computer functions). See MPEP 2106.05(f). The claim also recites “operate an attention transfer distillation algorithm to transfer the knowledge from the first ML model to the second ML model such that the second ML model includes a spatial attention map that is similar to a spatial attention map of the first ML model”, which recites insignificant extra-solution activity of mere data gathering and output. See MPEP 2106.05(g). Accordingly, the additional elements do not integrate the abstract idea into a practical application because they do not impose any meaningful limits on practicing the abstract idea. The claim is directed to an abstract idea.
Step 2B Analysis: The claim does not include additional elements that are sufficient to amount to significantly more than the judicial exception. As discussed above with respect to integration of the abstract idea into a practical application, the additional element is directed to a mere instruction to apply the judicial exception. Mere instruction to apply a judicial exception does not amount to significantly more. See MPEP 2106.05(f). Furthermore, the recitation of “operate…” is directed to insignificant extra-solution activity that is well known, routine and conventional because the limitation is directed to receiving or transmitting data over a network, e.g., using the Internet to gather data. See MPEP 2106.05(d)(II), OIP Techs., Inc., v. Amazon.com, Inc., 788 F.3d 1359, 1363, 115 USPQ2d 1090, 1093 (Fed. Cir. 2015) (sending messages over a network). Nothing in the claim provides significantly more than this. As such, the claim is not patent eligible.
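For context only: one common formulation of attention transfer distillation (in the style of Zagoruyko & Komodakis; offered as an illustrative assumption, not as Applicant’s disclosed algorithm) penalizes differences between normalized spatial attention maps of the student and teacher so that the student’s map becomes similar to the teacher’s. A minimal Python/PyTorch sketch:

import torch
import torch.nn.functional as F

def spatial_attention_map(feat: torch.Tensor) -> torch.Tensor:
    # feat: (batch, channels, H, W); the channel-wise squared mean is one
    # common definition of a spatial attention map
    a = feat.pow(2).mean(dim=1).flatten(1)  # (batch, H*W)
    return F.normalize(a, dim=1)

def attention_transfer_loss(feat_student: torch.Tensor, feat_teacher: torch.Tensor) -> torch.Tensor:
    # penalize the distance between normalized spatial attention maps
    diff = spatial_attention_map(feat_student) - spatial_attention_map(feat_teacher)
    return diff.pow(2).sum(dim=1).mean()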
Claim 21:
Step 1: Claim 21 is directed to a non-transitory machine readable storage medium, which is directed to an article of manufacture, one of the statutory categories.
Step 2A Prong 1: The claim recites the limitations:
wherein the queries matrix, the values matrix, and the keys matrix include the parameters to be pruned - In the context of the claim limitation, this encompasses a mathematical concept of pruning parameters.
Step 2A Prong 2: This judicial exception is not integrated into a practical application. The claim further recites “wherein the instructions cause one or more of the at least one programmable circuit to” – these are mere instructions to implement an abstract idea on a computer or merely use a computer as a tool to perform an abstract idea (i.e., as generic computer components performing generic computer functions). See MPEP 2106.05(f). The claim also recites “generate, based on the training dataset, a queries matrix, a values matrix, and a keys matrix”; “provide the queries matrix, the values matrix, and the keys matrix”, which recite insignificant extra-solution activity of mere data gathering and output. MPEP 2106.05(g). Accordingly, the additional elements do not integrate the abstract idea into a practical application because they do not impose any meaningful limits on practicing the abstract idea. The claim is directed to an abstract idea.
Step 2B Analysis: The claim does not include additional elements that are sufficient to amount to significantly more than the judicial exception. As discussed above with respect to integration of the abstract idea into a practical application, the additional element is directed to a mere instruction to apply the judicial exception. Mere instruction to apply a judicial exception does not amount to significantly more. See MPEP 2106.05(f). Furthermore, the recitations of “generate…”, “provide…” are directed to insignificant extra-solution activities that are well known, routine and conventional because the limitations are directed to receiving or transmitting data over a network, e.g., using the Internet to gather data. See MPEP 2106.05(d)(II), OIP Techs., Inc., v. Amazon.com, Inc., 788 F.3d 1359, 1363, 115 USPQ2d 1090, 1093 (Fed. Cir. 2015) (sending messages over a network). Nothing in the claim provides significantly more than this. As such, the claim is not patent eligible.
Claim 22:
Step 1: Claim 22 is directed to a non-transitory machine readable storage medium, which is directed to an article of manufacture, one of the statutory categories.
Step 2A Prong 1: The claim recites the limitations:
perform an operation on the queries matrix and the keys matrix, wherein the operation is a matrix multiplication operation or a 1 x 1 convolution on the queries matrix and the keys matrix - In the context of the claim limitation, this encompasses a mathematical concept of a matrix multiplication operation.
apply a softmax function to an output of the operation - In the context of the claim limitation, this encompasses a mathematical concept of calculating a softmax function.
generate an SA output based on a combination of the values matrix and an output of the softmax function - In the context of the claim limitation, this encompasses a mathematical concept of generating output based on a softmax function.
Step 2A Prong 2: This judicial exception is not integrated into a practical application. The claim further recites “wherein the instructions cause one or more of the at least one programmable circuit to” – this is a mere instruction to implement an abstract idea on a computer or merely use a computer as a tool to perform an abstract idea (i.e., as generic computer components performing generic computer functions). See MPEP 2106.05(f). The claim also recites “apply parts of the training dataset to a parameterized learnable transformation (PLT) to generate the queries matrix, the values matrix, and the keys matrix”, which recites insignificant extra-solution activity of mere data gathering and output. See MPEP 2106.05(g). Accordingly, the additional elements do not integrate the abstract idea into a practical application because they do not impose any meaningful limits on practicing the abstract idea. The claim is directed to an abstract idea.
Step 2B Analysis: The claim does not include additional elements that are sufficient to amount to significantly more than the judicial exception. As discussed above with respect to integration of the abstract idea into a practical application, the additional element is directed to a mere instruction to apply the judicial exception. Mere instruction to apply a judicial exception does not amount to significantly more. See MPEP 2106.05(f). Furthermore, the recitation of “apply parts…” is directed to insignificant extra-solution activity that is well known, routine and conventional because the limitation is directed to receiving or transmitting data over a network, e.g., using the Internet to gather data. See MPEP 2106.05(d)(II), OIP Techs., Inc., v. Amazon.com, Inc., 788 F.3d 1359, 1363, 115 USPQ2d 1090, 1093 (Fed. Cir. 2015) (sending messages over a network). Nothing in the claim provides significantly more than this. As such, the claim is not patent eligible.
Claim 23:
Step 1: Claim 23 is directed to a non-transitory machine readable storage medium, which is directed to an article of manufacture, one of the statutory categories.
Step 2A Prong 1: Please see analysis of claim 21.
Step 2A Prong 2: This judicial exception is not integrated into a practical application. The claim further recites “wherein the instructions cause the programmable circuitry to replace a convolutional neural network (CNN) with a self attention (SA) mechanism” – this is a mere instruction to implement an abstract idea on a computer or merely use a computer as a tool to perform an abstract idea (i.e., as generic computer components performing generic computer functions), amounting to a generic computer programmed with generically-recited AI algorithms (i.e., generic NN models recited at a high level of generality). See MPEP 2106.05(f). Accordingly, the additional elements do not integrate the abstract idea into a practical application because they do not impose any meaningful limits on practicing the abstract idea. The claim is directed to an abstract idea.
Step 2B Analysis: The claim does not include additional elements that are sufficient to amount to significantly more than the judicial exception. As discussed above with respect to integration of the abstract idea into a practical application, the additional element is directed to a mere instruction to apply the judicial exception. Mere instruction to apply a judicial exception does not amount to significantly more. See MPEP 2106.05(f). Nothing in the claim provides significantly more than this. As such, the claim is not patent eligible.
Claim 24:
Step 1: Claim 24 is directed to a non-transitory machine readable storage medium, which is directed to an article of manufacture, one of the statutory categories.
Step 2A Prong 1: Please see analysis of claim 21.
Step 2A Prong 2: This judicial exception is not integrated into a practical application. The claim further recites “wherein the first ML model is a convolutional neural network (CNN) comprising a set of convolutional layers and a set of layers that are not convolutional layers; wherein the set of convolutional layers in the CNN are replaced with a set of SA layers, and the set of SA layers form an SA mechanism” – these are mere instructions to implement an abstract idea on a computer or merely use a computer as a tool to perform an abstract idea (i.e., as generic computer components performing generic computer functions), amounting to a generic computer programmed with generically-recited AI algorithms (i.e., generic NN models recited at a high level of generality). See MPEP 2106.05(f). Accordingly, the additional elements do not integrate the abstract idea into a practical application because they do not impose any meaningful limits on practicing the abstract idea. The claim is directed to an abstract idea.
Step 2B Analysis: The claim does not include additional elements that are sufficient to amount to significantly more than the judicial exception. As discussed above with respect to integration of the abstract idea into a practical application, the additional element is directed to a mere instruction to apply the judicial exception. Mere instruction to apply a judicial exception does not amount to significantly more. See MPEP 2106.05(f). Nothing in the claim provides significantly more than this. As such, the claim is not patent eligible.
Claim 25:
Step 1: Claim 25 is directed to a non-transitory machine readable storage medium, which is directed to an article of manufacture, one of the statutory categories.
Step 2A Prong 1: Please see analysis of claim 21.
Step 2A Prong 2: This judicial exception is not integrated into a practical application. The claim further recites “wherein the programmable circuitry is a client device, an application server, an edge computing server of an edge computing framework, or a cloud compute node of a cloud computing service” – these are mere instructions to implement an abstract idea on a computer or merely use a computer as a tool to perform an abstract idea (i.e., as generic computer components performing generic computer functions). See MPEP 2106.05(f). Accordingly, the additional elements do not integrate the abstract idea into a practical application because they do not impose any meaningful limits on practicing the abstract idea. The claim is directed to an abstract idea.
Step 2B Analysis: The claim does not include additional elements that are sufficient to amount to significantly more than the judicial exception. As discussed above with respect to integration of the abstract idea into a practical application, the additional element is directed to a mere instruction to apply the judicial exception. Mere instruction to apply a judicial exception does not amount to significantly more. See MPEP 2106.05(f). Nothing in the claim provides significantly more than this. As such, the claim is not patent eligible.
Claim Rejections - 35 USC § 103
In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.
The factual inquiries for establishing a background for determining obviousness under 35 U.S.C. 103 are summarized as follows:
1. Determining the scope and contents of the prior art.
2. Ascertaining the differences between the prior art and the claims at issue.
3. Resolving the level of ordinary skill in the pertinent art.
4. Considering objective evidence present in the application indicating obviousness or nonobviousness.
Claims 1-4, 7, 13, 15-18 and 25 are rejected under 35 U.S.C. 103 as being unpatentable over O’Neill (“Deep Neural Compression Via Concurrent Pruning And Self-Distillation”) in view of Ming (“Variational Bayesian Sparsification for Distillation Compression”).
Claim 1.
O’Neill teaches an apparatus for sparse distillation of machine learning (ML) models, the apparatus comprising (1 INTRODUCTION & Page 2 “We propose self-distilled pruning, a novel pruning framework that improves the generalization of pruned networks without introducing any additional parameters, using only a set of soft targets” teaches a neural network combining self-distillation and pruning, corresponding to sparse distillation):
distill knowledge of a supernet ML model into a subnet ML model during a single ML training epoch (ABSTRACT & Page 1 “This work proposes a novel self-distillation based pruning strategy, whereby the representational similarity between the pruned and unpruned versions of the same network is maximized. Unlike previous approaches that treat distillation and pruning separately, we use distillation to inform the pruning criteria, without requiring a separate student network as in knowledge distillation” and Page 4 & 3.1 MAXIMIZING CROSS-CORRELATION BETWEEN PRUNED AND UNPRUNED EMBEDDINGS “Additionally, we provide a PyTorch based pseudo-code for SDP-CC in Figure 1 for a single epoch, in the general case” teaches that the model uses self-distillation between an unpruned model and a pruned model of the same network, corresponding to distilling a supernet into a subnet);
and during the same single ML training epoch as the distilling, prune one or more parameters from the subnet ML model based on an identified parameter of target hardware, the subnet ML model to operate on the target hardware, the prune to produce a sparse distilled subnet ML model (1 INTRODUCTION & Page 2 “We provide three insights as to why self-distillation leads to more generalizable pruned networks. Namely, we observe that self-distilled pruning (1) recovers performance faster after pruning steps (i.e., improves convergence)…4. A comprehensive study of iterative pruning for monolingual and cross-lingual pretrained models on GLUE and XGLUE benchmarks. To our knowledge, this is the only work to include an evaluation of pruned model performance in the cross-lingual transfer setting” and 3.3 HOW DOES SELF-DISTILLATION IMPROVE PRUNED MODEL GENERALIZATION ? & Page 5 “The first explanation for why self-distillation leads to better generalization in iterative pruning is that the soft targets bias the optimization and smoothen the loss surface through implicit similarities between the classes encoded in the logits” teaches that the pruned model (subnet) is pruned based on the soft-target bias, which corresponds to the identified parameter of the target hardware).
O’Neill does not explicitly teach interface circuitry; instructions; and at least one programmable circuit to be programmed by the instructions to.
However, in the same field of analogous art, Ming teaches interface circuitry; instructions; and at least one programmable circuit to be programmed by the instructions to (SECTION 4. Experiments & “We mainly validate the effectiveness of our proposed Variational Bayesian Sparsification Distillation on the well-known three datasets, MNIST [26], CIFAR 10 and CIFAR 100 [27]. All the following experiments were run on NVIDIA GeForce GTX 1080 Ti GPUs… For MNIST, we construct a multilayer perception (MLP), and a convolutional network, LeNet-5. For CIFAR 10, we perform with VGG models Table 1” teaches that the NVIDIA GeForce GTX 1080 Ti GPUs correspond to the circuit, and that running VGG models on CIFAR 10 and MNIST comprises an electronic design such as a hardware implementation of a neural network).
O’Neill and Ming are analogous art because they both are directed to a minibatch re-weighting method proposed to dynamically balance the hard and soft knowledge, which can largely boost distillation accuracy.
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to incorporate the limitation(s) above as taught by Ming into the disclosed invention of O’Neill.
One of ordinary skill in the art would have been motivated to make this modification because the experiments were run on “NVIDIA GeForce GTX 1080 Ti GPUs” and used three well-known datasets for sparsification to improve the classification accuracy (Ming, Page 4 SECTION 4. Experiments and Page 5 SECTION 5. Conclusions).
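For context only: the following is a generic, illustrative Python/PyTorch sketch of a training loop that distills soft targets from an unpruned copy of a network while pruning within the same single epoch, in the spirit of O’Neill’s self-distilled pruning. It is not O’Neill’s SDP-CC pseudo-code from Figure 1; the loss weighting, the end-of-epoch magnitude pruning, and all names are assumptions for illustration.

import copy
import torch
import torch.nn.functional as F
import torch.nn.utils.prune as prune

def distill_and_prune_one_epoch(model, loader, optimizer, alpha=0.5, amount=0.2):
    teacher = copy.deepcopy(model).eval()  # unpruned reference (the "supernet")
    for x, y in loader:                    # single pass over the training dataset
        optimizer.zero_grad()
        student_logits = model(x)
        with torch.no_grad():
            teacher_logits = teacher(x)    # soft targets from the unpruned network
        # task loss plus distillation loss against the unpruned network's soft targets
        loss = F.cross_entropy(student_logits, y) + alpha * F.kl_div(
            F.log_softmax(student_logits, dim=-1),
            F.softmax(teacher_logits, dim=-1),
            reduction="batchmean")
        loss.backward()
        optimizer.step()
    # prune within the same epoch, e.g. L1-magnitude pruning of linear layers
    for module in model.modules():
        if isinstance(module, torch.nn.Linear):
            prune.l1_unstructured(module, name="weight", amount=amount)
    return model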
Claim 2.
O’Neill in view of Ming teaches the apparatus of claim 1,
O’Neill further discloses distill of the knowledge and prune of the one or more parameters simultaneously (3.3 HOW DOES SELF-DISTILLATION IMPROVE PRUNED MODEL GENERALIZATION ? & Page 5 “The first explanation for why self-distillation leads to better generalization in iterative pruning is that the soft targets bias the optimization and smoothen the loss surface through implicit similarities between the classes encoded in the logits” teaches concurrent pruning and self-distillation).
Ming further teaches wherein one or more of the at least one programmable circuit is to (SECTION 4. Experiments & “We mainly validate the effectiveness of our proposed Variational Bayesian Sparsification Distillation on the well-known three datasets, MNIST [26], CIFAR 10 and CIFAR 100 [27]. All the following experiments were run on NVIDIA GeForce GTX 1080 Ti GPUs” teaches that the NVIDIA GeForce GTX 1080 Ti GPUs correspond to the circuit).
O’Neill and Ming are analogous art because they both are directed to distillation techniques that employ pruning strategies.
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to incorporate the limitation(s) above as taught by Ming into the disclosed invention of O’Neill.
One of ordinary skill in the art would have been motivated to make this modification because the experiments were run on “NVIDIA GeForce GTX 1080 Ti GPUs” and used three well-known datasets for sparsification to improve the classification accuracy (Ming, Page 4 SECTION 4. Experiments and Page 5 SECTION 5. Conclusions).
Claim 3.
O’Neill in view of Ming teaches the apparatus of claim 1,
O’Neill further teaches perform a single pass over a training dataset during the single ML training epoch (3 PROPOSED METHODOLOGY & Page 3 “We begin by defining a dataset D := {(Xi , yi)} D i=1 with single samples si = (Xi , yi), where each Xi (in the D training samples)” and 3.1 MAXIMIZING CROSS-CORRELATION BETWEEN PRUNED AND UNPRUNED EMBEDDINGS & Page 4 “we provide a PyTorch based pseudo-code for SDP-CC in Figure 1 for a single epoch, in the general case” teaches performing a single epoch over the D training samples).
Ming further teaches wherein one or more of the at least one programmable circuit is to (SECTION 4. Experiments & “We mainly validate the effectiveness of our proposed Variational Bayesian Sparsification Distillation on the well-known three datasets, MNIST [26], CIFAR 10 and CIFAR 100 [27]. All the following experiments were run on NVIDIA GeForce GTX 1080 Ti GPUs” teaches NVIDIA GeForce GTX 1080 Ti GPUs corresponds to the circuit).
O’ Neill and Ming are analogous art because they both are directed to distillation techniques that employ pruning strategies.
It would have been obvious for one of ordinary skill in the arts before the effective filing date of the claimed invention to incorporate the limitation(s) above as taught by Ming into the disclosed invention of O’ Neill.
One of ordinary skill in the arts would have been motivated to make this modification because of the following, the experiments were run on “NVIDIA GeForce GTX 1080 Ti GPUs” and used three large-scale datasets for sparsification to improve the classification accuracy” (Ming, Page 4 SECTION 4. Experiments and Page 5 SECTION 5. Conclusions).
Claim 4.
O’Neill in view of Ming teaches the apparatus of claim 1,
O’Neill further discloses train the supernet using a training dataset (3 PROPOSED METHODOLOGY & Page 3 “We begin by defining a dataset D := {(X_i, y_i)}_{i=1}^D with single samples s_i = (X_i, y_i), where each X_i (in the D training samples)” teaches the D training samples);
operate the supernet to extract the knowledge to be distilled into the subnet (3.1 MAXIMIZING CROSS-CORRELATION BETWEEN PRUNED AND UNPRUNED EMBEDDING & Page 3-4 “To reiterate, z^S is obtained from the pruned version of the network (f_Θp) and z^T is obtained from the unpruned version (f_Θ). Since the learned output representations should be similar if their inputs are similar, we aim to address the problem where a correlation measure may produce representations that are instead proportional to their inputs” teaches maximizing the representational similarity between the pruned and unpruned models);
and train the subnet using the training dataset and using the extracted knowledge to guide the training of the subnet (5 EMPIRICAL RESULTS & Page 8 “We note that the number of training samples used for retraining plays an important role in the rate of performance degradation” teaches training using the training dataset, and 3.1 MAXIMIZING CROSS-CORRELATION BETWEEN PRUNED AND UNPRUNED EMBEDDING & Page 3-4 “To reiterate, z^S is obtained from the pruned version of the network (f_Θp) and z^T is obtained from the unpruned version (f_Θ). Since the learned output representations should be similar if their inputs are similar” teaches training using the extracted representation).
Ming further teaches wherein one or more of the at least one programmable circuit is to (SECTION 4. Experiments & “We mainly validate the effectiveness of our proposed Variational Bayesian Sparsification Distillation on the well-known three datasets, MNIST [26], CIFAR 10 and CIFAR 100 [27]. All the following experiments were run on NVIDIA GeForce GTX 1080 Ti GPUs” teaches that the NVIDIA GeForce GTX 1080 Ti GPUs correspond to the circuit).
O’Neill and Ming are analogous art because they are both directed to distillation techniques that employ pruning strategies.
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to incorporate the limitation(s) above, as taught by Ming, into the disclosed invention of O’Neill.
One of ordinary skill in the art would have been motivated to make this modification because the experiments “were run on NVIDIA GeForce GTX 1080 Ti GPUs” and used three large-scale datasets for sparsification “to improve the classification accuracy” (Ming, Page 4, SECTION 4. Experiments and Page 5, SECTION 5. Conclusions).
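For illustration only, the three operations mapped above for claim 4 (training the supernet, operating it to extract knowledge, and training the subnet under that knowledge) can be sketched as follows. The staged function structure, the soft-target KL term, and the temperature T are hypothetical assumptions; they are a generic stand-in, not O’Neill’s actual SDP-CC objective, which per the quoted passages uses a cross-correlation term between pruned and unpruned embeddings.

```python
import torch
import torch.nn.functional as F

def train_supernet(supernet, loader, optimizer):
    # Step 1: train the larger (unpruned) model on the training dataset.
    supernet.train()
    for x, y in loader:
        loss = F.cross_entropy(supernet(x), y)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

def extract_knowledge(supernet, x):
    # Step 2: operate the supernet to obtain the knowledge to be distilled
    # (here its output logits; intermediate feature maps could also be used).
    supernet.eval()
    with torch.no_grad():
        return supernet(x)

def train_subnet_step(subnet, optimizer, x, y, teacher_logits, T=2.0):
    # Step 3: train the subnet on the same data, guided by the extracted
    # knowledge through a soft-target distillation term.
    student_logits = subnet(x)
    kd = F.kl_div(F.log_softmax(student_logits / T, dim=1),
                  F.softmax(teacher_logits / T, dim=1),
                  reduction="batchmean") * T * T
    loss = F.cross_entropy(student_logits, y) + kd
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```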
Claim 7.
O’Neill in view of Ming teaches the apparatus of claim 1,
O’Neill further discloses wherein the supernet ML model is a neural network (NN) including one or more of a deep NN (DNN), feed forward NN (FFN), deep FNN (DFFN), convolutional NN (CNN), deep CNN (DCN), deconvolutional NN (DNN), deep belief NN, perception NN, graph NN, recurrent NN (RNN), Long Short Term Memory (LSTM) algorithm, gated recurrent unit (GRU), echo state network (ESN), spiking NN (SNN), deep stacking network (DSN), Markov chain, perception NN, generative adversarial network (GAN), transformer, self-attention (SA) mechanism, stochastic NN, Bayesian Network (BN), Bayesian belief network (BBN), Bayesian NN (BNN), Deep BNN (DBNN), Dynamic BN (DBN), probabilistic graphical model (PGM), Boltzmann machine, restricted Boltzmann machine (RBM), Hopfield network, convolutional deep belief network (CDBN), Linear Dynamical System (LDS), Switching LDS (SLDS), Optical NN (ONN), and/or an NN for reinforcement learning (RL) and/or deep RL (DRL) (3.2 A FROBENIUS DISTORTION PERSPECTIVE OF SELF-DISTILLED PRUNING & Page 4 “Frobenius distortions which is a loose approximation for deep networks” teaches a deep neural network).
Claim 13.
O’Neill in view of Ming teaches the apparatus of claim 1,
O’Neill further discloses wherein the supernet is a convolutional neural network (CNN) comprising a plurality of convolutional layers and a plurality of layers that are not convolutional layers (3.2 A FROBENIUS DISTORTION PERSPECTIVE OF SELF-DISTILLED PRUNING & Page 4 “Frobenius distortions which is a loose approximation for deep networks”, Algorithm 1, and 4 EXPERIMENTAL SETUP & Page 6 “For XGLUE tasks, we perform 15 pruning steps on XLMRoBERTABase, one per 15 epochs, while for the GLUE tasks, we perform 32 pruning steps on BERTBase. The compression rate and number of pruning steps is higher for GLUE tasks compared to XGLUE, because GLUE tasks involve evaluation in the supervised classification setting; whereas in XGLUE we report in the more challenging zero-shot cross-lingual transfer setting with only a single language used for training (i.e., English)” teaches that the unpruned model (supernet) architecture contains convolutional layers, because pruning in the context of deep neural networks implies a convolutional architecture).
Claim 15.
O’Neill in view of Ming teaches the apparatus of claim 1,
O’Neill further discloses wherein the apparatus is a client device, an application server, an edge computing server of an edge computing framework, or a cloud compute node of a cloud computing service (4 EXPERIMENTAL SETUP & Page 6 “We perform experiments on monolingual tasks within the GLUE (Wang et al., 2018) benchmark with pretrained BERTBase and multilingual tasks from the XGLUE benchmark (Liang et al., 2020) with pretrained XLMRBase” teaches GLUE/XGLUE benchmark tests, which run in a cloud-style computing environment; 1 INTRODUCTION & Page 2 “we propose the use of a cross-correlation objective for self-distillation pruning that reduces redundancy and encourages sparse solutions, naturally fitting with magnitude-based pruning. This sets state of the art results for magnitude-based pruning” teaches that parameter reduction is primarily intended to enable deployment on devices such as edge servers or application servers; and 1 INTRODUCTION & Page 2 “We propose self-distilled pruning, a novel pruning framework that improves the generalization of pruned networks without introducing any additional parameters, using only a set of soft targets” teaches that compact pruned models are inherently sized for client-side deployment).
Claim 16.
O’Neill teaches provide a training dataset to a first machine learning (ML) model and a second ML model, the second ML model having fewer parameters than the first ML model (3 PROPOSED METHODOLOGY & Page 3 “We begin by defining a dataset D := {(X_i, y_i)}_{i=1}^D with single samples s_i = (X_i, y_i), where each X_i (in the D training samples)” teaches the D training samples, and ABSTRACT & Page 1 “This work proposes a novel self-distillation based pruning strategy, whereby the representational similarity between the pruned and unpruned versions of the same network is maximized. Unlike previous approaches that treat distillation and pruning separately, we use distillation to inform the pruning criteria, without requiring a separate student network as in knowledge distillation” and 1 INTRODUCTION & Page 2 “We propose self-distilled pruning, a novel pruning framework that improves the generalization of pruned networks without introducing any additional parameters, using only a set of soft targets” teach a model using self-distillation, in which the unpruned model and the pruned model correspond to the supernet and the subnet, respectively, and in which the pruned model does not require any additional parameters, corresponding to fewer parameters than the unpruned network);
during training of the first and second ML models, distill knowledge of the first ML model into the second ML model during one pass over the training dataset (ABSTRACT & Page 1 “This work proposes a novel self-distillation based pruning strategy, whereby the representational similarity between the pruned and unpruned versions of the same network is maximized. Unlike previous approaches that treat distillation and pruning separately, we use distillation to inform the pruning criteria, without requiring a separate student network as in knowledge distillation” teaches a model using self-distillation, in which the unpruned model and the pruned model correspond to the supernet and the subnet, respectively);
and during the same one pass over the training dataset as the distill, prune one or more parameters from the second ML model based on an identified parameter of target hardware, the second ML model to operate on the target hardware (1 INTRODUCTION & Page 2 “We provide three insights as to why self-distillation leads to more generalizable pruned networks. Namely, we observe that self-distilled pruning (1) recovers performance faster after pruning steps (i.e., improves convergence)…4. A comprehensive study of iterative pruning for monolingual and cross-lingual pretrained models on GLUE and XGLUE benchmarks. To our knowledge, this is the only work to include an evaluation of pruned model performance in the cross-lingual transfer setting” teaches the pruned model (second ML model) pruning unwanted parameters).
O’Neill does not explicitly teach a non-transitory machine readable storage medium comprising instructions to cause at least one programmable circuit to at least.
However, Ming teaches to cause at least one programmable circuit to at least (SECTION 4. Experiments & “We mainly validate the effectiveness of our proposed Variational Bayesian Sparsification Distillation on the well-known three datasets, MNIST [26], CIFAR 10 and CIFAR 100 [27]. All the following experiments were run on NVIDIA GeForce GTX 1080 Ti GPUs” and 2. RELATED WORK & Page 2 “they required pre-determined codebook and computational efficiency algorithm. The latter ones focused on using fixed-point data to represent the weights of CNNs [14], including Binary weight and ternary weight [7]” teaches that the NVIDIA GeForce GTX 1080 Ti GPUs correspond to a programmable circuit and that the pre-determined codebook corresponds to instructions).
O’Neill and Ming are analogous art because they are both directed to distillation techniques that employ pruning strategies.
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to incorporate the limitation(s) above, as taught by Ming, into the disclosed invention of O’Neill.
One of ordinary skill in the art would have been motivated to make this modification because the experiments “were run on NVIDIA GeForce GTX 1080 Ti GPUs” and used three large-scale datasets for sparsification “to improve the classification accuracy” (Ming, Page 4, SECTION 4. Experiments and Page 5, SECTION 5. Conclusions).
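Purely as an illustrative aside, the limitation “prune one or more parameters … based on an identified parameter of target hardware” admits a simple reading in which a hardware-derived parameter budget drives global magnitude pruning. The sketch below reflects that assumed reading only; the target_param_budget argument and the global-threshold strategy are hypothetical and are not taken from O’Neill or Ming.

```python
import torch

def prune_for_target_hardware(model, target_param_budget):
    """Zero the globally smallest-magnitude weights so that the number of
    nonzero weights fits a budget identified from the target hardware."""
    weights = [p for p in model.parameters() if p.dim() > 1]  # skip biases
    total = sum(p.numel() for p in weights)
    n_remove = max(total - target_param_budget, 0)
    if n_remove == 0:
        return  # model already fits the identified hardware parameter
    # Global threshold: exactly n_remove weights fall at or below it.
    all_magnitudes = torch.cat([p.detach().abs().flatten() for p in weights])
    threshold = all_magnitudes.kthvalue(n_remove).values
    with torch.no_grad():
        for p in weights:
            p.mul_((p.abs() > threshold).float())
```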
Claim 17.
O’Neill in view of Ming teaches the non-transitory machine readable storage medium of claim 16,
O’Neill further discloses distilling the knowledge and pruning the one or more parameters simultaneously (3.3 HOW DOES SELF-DISTILLATION IMPROVE PRUNED MODEL GENERALIZATION? & Page 5 “The first explanation for why self-distillation leads to better generalization in iterative pruning is that the soft targets bias the optimization and smoothen the loss surface through implicit similarities between the classes encoded in the logits” teaches concurrent pruning and self-distillation).
Ming further teaches wherein the instructions cause one or more of the at least one programmable circuit to (SECTION 4. Experiments & “We mainly validate the effectiveness of our proposed Variational Bayesian Sparsification Distillation on the well-known three datasets, MNIST [26], CIFAR 10 and CIFAR 100 [27]. All the following experiments were run on NVIDIA GeForce GTX 1080 Ti GPUs” and 2. RELATED WORK & Page 2 “they required pre-determined codebook and computational efficiency algorithm. The latter ones focused on using fixed-point data to represent the weights of CNNs [14], including Binary weight and ternary weight [7]” teaches that the NVIDIA GeForce GTX 1080 Ti GPUs correspond to the circuit).
O’Neill and Ming are analogous art because they are both directed to distillation techniques that employ pruning strategies.
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to incorporate the limitation(s) above, as taught by Ming, into the disclosed invention of O’Neill.
One of ordinary skill in the art would have been motivated to make this modification because the experiments “were run on NVIDIA GeForce GTX 1080 Ti GPUs” and used three large-scale datasets for sparsification “to improve the classification accuracy” (Ming, Page 4, SECTION 4. Experiments and Page 5, SECTION 5. Conclusions).
Claim 18.
O’Neill in view of Ming teaches the non-transitory machine readable storage medium of claim 16,
O’Neill further teaches train the first ML model using a training dataset (3 PROPOSED METHODOLOGY & Page 3 “We begin by defining a dataset D := {(X_i, y_i)}_{i=1}^D with single samples s_i = (X_i, y_i), where each X_i (in the D training samples)” teaches the D training samples);
and operate the first ML model to extract the knowledge to be distilled into the second ML model (3.1 MAXIMIZING CROSS-CORRELATION BETWEEN PRUNED AND UNPRUNED EMBEDDING & Page 3-4 “To reiterate, z^S is obtained from the pruned version of the network (f_Θp) and z^T is obtained from the unpruned version (f_Θ). Since the learned output representations should be similar if their inputs are similar, we aim to address the problem where a correlation measure may produce representations that are instead proportional to their inputs” teaches maximizing the representational similarity between the pruned and unpruned models),
and train the second ML model using the training dataset and using the extracted knowledge to guide the training of the second ML model (5 EMPIRICAL RESULTS & Page 8 “We note that the number of training samples used for retraining plays an important role in the rate of performance degradation” teaches training using the training dataset, and 3.1 MAXIMIZING CROSS-CORRELATION BETWEEN PRUNED AND UNPRUNED EMBEDDING & Page 3-4 “To reiterate, z^S is obtained from the pruned version of the network (f_Θp) and z^T is obtained from the unpruned version (f_Θ). Since the learned output representations should be similar if their inputs are similar” teaches training using the extracted representation).
Ming further teaches wherein the instructions cause one or more of the at least one programmable circuit to (SECTION 4. Experiments & “We mainly validate the effectiveness of our proposed Variational Bayesian Sparsification Distillation on the well-known three datasets, MNIST [26], CIFAR 10 and CIFAR 100 [27]. All the following experiments were run on NVIDIA GeForce GTX 1080 Ti GPUs” and 2. RELATED WORK & Page 2 “they required pre-determined codebook and computational efficiency algorithm. The latter ones focused on using fixed-point data to represent the weights of CNNs [14], including Binary weight and ternary weight [7]” teaches that the NVIDIA GeForce GTX 1080 Ti GPUs correspond to the circuit).
O’Neill and Ming are analogous art because they are both directed to distillation techniques that employ pruning strategies.
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to incorporate the limitation(s) above, as taught by Ming, into the disclosed invention of O’Neill.
One of ordinary skill in the art would have been motivated to make this modification because the experiments “were run on NVIDIA GeForce GTX 1080 Ti GPUs” and used three large-scale datasets for sparsification “to improve the classification accuracy” (Ming, Page 4, SECTION 4. Experiments and Page 5, SECTION 5. Conclusions).
Claim 25.
O’Neill in view of Ming teaches the non-transitory machine readable storage medium of claim 16,
O’Neill further teaches wherein the programmable circuitry is a client device, an application server, an edge computing server of an edge computing framework, or a cloud compute node of a cloud computing service (4 EXPERIMENTAL SETUP & Page 6 “We perform experiments on monolingual tasks within the GLUE (Wang et al., 2018) benchmark with pretrained BERTBase and multilingual tasks from the XGLUE benchmark (Liang et al., 2020) with pretrained XLMRBase” teaches GLUE/XGLUE benchmark tests, which run in a cloud-style computing environment; 1 INTRODUCTION & Page 2 “we propose the use of a cross-correlation objective for self-distillation pruning that reduces redundancy and encourages sparse solutions, naturally fitting with magnitude-based pruning. This sets state of the art results for magnitude-based pruning” teaches that parameter reduction is primarily intended to enable deployment on devices such as edge servers or application servers; and 1 INTRODUCTION & Page 2 “We propose self-distilled pruning, a novel pruning framework that improves the generalization of pruned networks without introducing any additional parameters, using only a set of soft targets” teaches that compact pruned models are inherently sized for client-side deployment).
Claims 5-6, 8-9, 11-12, 14 and 19-24 are rejected under 35 U.S.C. 103 as being unpatentable over O’Neill (“Deep Neural Compression Via Concurrent Pruning And Self-Distillation”) in view of Ming (“Variational Bayesian Sparsification for Distillation Compression”) and further in view of Ji (“Show, Attend and Distill: Knowledge Distillation via Attention-based Feature Matching”).
Claim 5.
O’Neill in view of Ming teaches the apparatus of claim 4,
O’Neill in view of Ming does not explicitly teach wherein the knowledge includes both logits and feature maps extracted from the supernet.
However, Ji teaches wherein the knowledge includes both logits and feature maps extracted from the supernet (Graph-driven VAEs for different tasks & Page 982 “the Dirichlet distribution can be approximated with a logistic normal and a softmax formulation by the Laplace approximation (Hennig et al. 2012). When the number of topics L is large, the Dirichlet distribution can be approximated with a multivariate logistic normal (Srivastava and Sutton 2017) with the i-th element of its mean μT and diagonal covariance matrix ΣT” teaches a diagonal covariance matrix with a softmax formulation, corresponding to knowledge that includes logits, and Experiments & Page 7948 “Our method that utilizes an attention mechanism to identify similar features between the teacher and student shows the best performance over all experiment settings” teaches that feature maps are outputs of the network, wherein feature maps are matched between the teacher (supernet) and student networks).
O’Neill, Ming and Ji are analogous art because they are all directed to distillation algorithms for neural networks.
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to incorporate the limitation(s) above, as taught by Ji, into the disclosed invention of O’Neill in view of Ming.
One of ordinary skill in the art would have been motivated to make this modification because adjusting the feature level of the student “provides better performance than the baseline methods” and, “[b]ased on knowledge distillation, recent studies have shown significant improvements in model compression” (Ji, Page 7951, Conclusion; Page 7945, Introduction).
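For illustration only, a knowledge signal that “includes both logits and feature maps,” as mapped above for claim 5, can be expressed as a combined distillation loss. The sketch below is a generic combination assumed for illustration, not Ji’s attention-based feature distillation loss; the temperature T, the weight beta, and the assumption that the feature maps have already been projected to a common shape are all hypothetical.

```python
import torch.nn.functional as F

def combined_knowledge_loss(student_logits, teacher_logits,
                            student_feat, teacher_feat, T=2.0, beta=1.0):
    # Logit knowledge: soft-target KL between teacher and student outputs.
    logit_loss = F.kl_div(F.log_softmax(student_logits / T, dim=1),
                          F.softmax(teacher_logits / T, dim=1),
                          reduction="batchmean") * T * T
    # Feature-map knowledge: distance between intermediate activations,
    # assuming both feature maps share a common shape after projection.
    feature_loss = F.mse_loss(student_feat, teacher_feat)
    return logit_loss + beta * feature_loss
```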
Claim 6.
O’Neill in view of Ming and further in view of Ji teaches the apparatus of claim 5,
Ming further teaches wherein one or more of the at least one programmable circuit is to execute (SECTION 4. Experiments & “We mainly validate the effectiveness of our proposed Variational Bayesian Sparsification Distillation on the well-known three datasets, MNIST [26], CIFAR 10 and CIFAR 100 [27]. All the following experiments were run on NVIDIA GeForce GTX 1080 Ti GPUs” teaches that the NVIDIA GeForce GTX 1080 Ti GPUs correspond to the programmable circuit).
O’Neill and Ming are analogous art because they are both directed to distillation techniques that employ pruning strategies.
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to incorporate the limitation(s) above, as taught by Ming, into the disclosed invention of O’Neill.
One of ordinary skill in the art would have been motivated to make this modification because the experiments “were run on NVIDIA GeForce GTX 1080 Ti GPUs” and used three large-scale datasets for sparsification “to improve the classification accuracy” (Ming, Page 4, SECTION 4. Experiments and Page 5, SECTION 5. Conclusions).
Ji further teaches an attention transfer distillation algorithm to transfer the knowledge from the supernet to the subnet (Related Work & Page 7947 “In order to identify the similarity between h_t^T and h_s^S, AFD adopts a query-key concept of the attention mechanism (Xu et al. 2015; Vaswani et al. 2017). Specifically, each teacher feature generates a query, q_t, and each student feature identifies a key, k_s” teaches attention-based feature distillation to regulate the contribution of each feature map to the final distillation).
O’Neill and Ji are analogous art because they are both directed to distillation algorithms for neural networks.
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to incorporate the limitation(s) above, as taught by Ji, into the disclosed invention of O’Neill in view of Ming.
One of ordinary skill in the art would have been motivated to make this modification because adjusting the feature level of the student “provides better performance than the baseline methods” and, “[b]ased on knowledge distillation, recent studies have shown significant improvements in model compression” (Ji, Page 7951, Conclusion; Page 7945, Introduction).
Claim 8.
O’Neill in view of Ming teaches the apparatus of claim 1,
Ming further teaches wherein one or more of the at least one programmable circuit is to (SECTION 4. Experiments & “We mainly validate the effectiveness of our proposed Variational Bayesian Sparsification Distillation on the well-known three datasets, MNIST [26], CIFAR 10 and CIFAR 100 [27]. All the following experiments were run on NVIDIA GeForce GTX 1080 Ti GPUs” teaches that the NVIDIA GeForce GTX 1080 Ti GPUs correspond to the circuit).
O’Neill and Ming are analogous art because they are both directed to distillation techniques that employ pruning strategies.
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to incorporate the limitation(s) above, as taught by Ming, into the disclosed invention of O’Neill.
One of ordinary skill in the art would have been motivated to make this modification because the experiments “were run on NVIDIA GeForce GTX 1080 Ti GPUs” and used three large-scale datasets for sparsification “to improve the classification accuracy” (Ming, Page 4, SECTION 4. Experiments and Page 5, SECTION 5. Conclusions).
O’Neill in view of Ming does not explicitly teach generate, based on input data, a queries matrix, a values matrix, and a keys matrix, wherein the queries matrix, the values matrix, and the keys matrix include the parameters to be pruned; and provide the queries matrix, the values matrix, and the keys matrix to the pruning.
However, Ji teaches generate, based on input data, a queries matrix, a values matrix, and a keys matrix, wherein the queries matrix, the values matrix, and the keys matrix include the parameters to be pruned; and provide the queries matrix, the values matrix, and the keys matrix to the pruning (Related Work & Page 7947 “By utilizing the queries and keys, attention values that represent relations between teacher and student candidates are calculated with a “softmax” function” and Experiments & Page 7950 “we evaluate our model by varying the value of β. β, is used to train the attention map, α, which determines the links of the AFD network, and to decide the degree of how much the student mimic the teacher features” teaches query, key, and value matrices computed from the input using a learned neural projection).
O’Neill, Ming and Ji are analogous art because they are all directed to distillation algorithms for neural networks.
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to incorporate the limitation(s) above, as taught by Ji, into the disclosed invention of O’Neill in view of Ming.
One of ordinary skill in the art would have been motivated to make this modification because adjusting the feature level of the student “provides better performance than the baseline methods” and, “[b]ased on knowledge distillation, recent studies have shown significant improvements in model compression” (Ji, Page 7951, Conclusion; Page 7945, Introduction).
Claim 9.
O’Neill in view of Ming and further in view of Ji teaches the apparatus of claim 8,
Ming further teaches wherein one or more of the at least one programmable circuit is to (SECTION 4. Experiments & “We mainly validate the effectiveness of our proposed Variational Bayesian Sparsification Distillation on the well-known three datasets, MNIST [26], CIFAR 10 and CIFAR 100 [27]. All the following experiments were run on NVIDIA GeForce GTX 1080 Ti GPUs” teaches that the NVIDIA GeForce GTX 1080 Ti GPUs correspond to the circuit).
O’Neill and Ming are analogous art because they are both directed to distillation techniques that employ pruning strategies.
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to incorporate the limitation(s) above, as taught by Ming, into the disclosed invention of O’Neill.
One of ordinary skill in the art would have been motivated to make this modification because the experiments “were run on NVIDIA GeForce GTX 1080 Ti GPUs” and used three large-scale datasets for sparsification “to improve the classification accuracy” (Ming, Page 4, SECTION 4. Experiments and Page 5, SECTION 5. Conclusions).
Ji further teaches apply the input data to a parameterized learnable transformation (PLT) to generate the queries matrix, the values matrix, and the keys matrix (Experiments & Page 7948 “Our method that utilizes an attention mechanism to identify similar features between the teacher and student shows the best performance over all experiment settings. In particular, our method shows an improvement over ATT which uses the same feature distance for distillation” and Attention-based Feature Distillation & Page 7947 “f_Q and f_K are activation function of the query and key. W_t^Q ∈ R^{d×d_T} and W_s^K ∈ R^{d×d_S} are linear transition parameters for the t-th query and the s-th key” teaches using the input feature map to learn projections into query, key, and value, wherein the learned transition from input to query, key, and value corresponds to a PLT).
O’Neill, Ming and Ji are analogous art because they are all directed to distillation algorithms for neural networks.
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to incorporate the limitation(s) above, as taught by Ji, into the disclosed invention of O’Neill in view of Ming.
One of ordinary skill in the art would have been motivated to make this modification because adjusting the feature level of the student “provides better performance than the baseline methods” and, “[b]ased on knowledge distillation, recent studies have shown significant improvements in model compression” (Ji, Page 7951, Conclusion; Page 7945, Introduction).
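As an illustrative aside, a “parameterized learnable transformation” that generates query, key, and value matrices, as mapped above for claim 9, is commonly realized as a set of learned linear projections. The sketch below is a simplified, generic form assumed for illustration; the class name, the single shared input, and the value projection are hypothetical and differ from Ji’s teacher/student query-key construction.

```python
import torch.nn as nn

class QKVProjection(nn.Module):
    """Parameterized learnable transformation (PLT) mapping input features to
    query, key, and value matrices; its weights are themselves prunable."""
    def __init__(self, d_in, d_model):
        super().__init__()
        self.w_q = nn.Linear(d_in, d_model)  # learnable query projection
        self.w_k = nn.Linear(d_in, d_model)  # learnable key projection
        self.w_v = nn.Linear(d_in, d_model)  # learnable value projection

    def forward(self, x):  # x: (batch, tokens, d_in)
        return self.w_q(x), self.w_k(x), self.w_v(x)
```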
Claim 11.
O’Neill in view of Ming and further in view of Ji teaches the apparatus of claim 8,
Ming further teaches wherein one or more of the at least one programmable circuit is to (SECTION 4. Experiments & “We mainly validate the effectiveness of our proposed Variational Bayesian Sparsification Distillation on the well-known three datasets, MNIST [26], CIFAR 10 and CIFAR 100 [27]. All the following experiments were run on NVIDIA GeForce GTX 1080 Ti GPUs” teaches that the NVIDIA GeForce GTX 1080 Ti GPUs correspond to the circuit).
O’Neill and Ming are analogous art because they are both directed to distillation techniques that employ pruning strategies.
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to incorporate the limitation(s) above, as taught by Ming, into the disclosed invention of O’Neill.
One of ordinary skill in the art would have been motivated to make this modification because the experiments “were run on NVIDIA GeForce GTX 1080 Ti GPUs” and used three large-scale datasets for sparsification “to improve the classification accuracy” (Ming, Page 4, SECTION 4. Experiments and Page 5, SECTION 5. Conclusions).
Ji further teaches perform an operation on the queries matrix and the keys matrix, wherein the operation is a matrix multiplication operation or a 1 x 1 convolution on the queries matrix and the keys matrix (Attention-based Feature Distillation & Page 7947 “In order to identify the similarity between h_t^T and h_s^S, AFD adopts a query-key concept of the attention mechanism (Xu et al. 2015; Vaswani et al. 2017). Specifically, each teacher feature generates a query, q_t, and each student feature identifies a key, k_s… By utilizing the queries and keys, attention values that represent relations between teacher and student candidates are calculated with a “softmax” function; α_t = softmax([(q_t^⊤ W_1^{Q−K} k_{t,1} + (p_t^T)^⊤ p_1^S)/√d, · · · , (q_t^⊤ W_S^{Q−K} k_{t,S} + (p_t^T)^⊤ p_S^S)/√d]). (2)” teaches attention-based feature matching, which calculates the similarity between feature maps, corresponding to computing a matrix multiplication of the queries and keys);
apply a softmax function to an output of the operation (Attention-based Feature Distillation & Page 7947 “By utilizing the queries and keys, attention values that represent relations between teacher and student candidates are calculated with a “softmax” function; α_t = softmax([(q_t^⊤ W_1^{Q−K} k_{t,1} + (p_t^T)^⊤ p_1^S)/√d, · · · , (q_t^⊤ W_S^{Q−K} k_{t,S} + (p_t^T)^⊤ p_S^S)/√d]). (2)” teaches applying the softmax function to the output of the operation);
and generate a self attention output based on a combination of the values matrix and an output of the softmax function (Attention-based Feature Distillation & Page 7947 “α_t is the attention vector that capture relation between the t-th teacher feature and whole student features. By utilizing α_t, the teacher feature, h_t^T, enables to transfer its knowledge selectively to student features. The final distillation term forms as [equation reproduced as an image, media_image1.png, in the original record]” teaches that the attention vector obtained from the softmax is multiplied with the student feature term [equation reproduced as an image, media_image2.png, in the original record] corresponding to the values).
O’Neill, Ming and Ji are analogous art because they are all directed to distillation algorithms for neural networks.
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to incorporate the limitation(s) above, as taught by Ji, into the disclosed invention of O’Neill in view of Ming.
One of ordinary skill in the art would have been motivated to make this modification because adjusting the feature level of the student “provides better performance than the baseline methods” and, “[b]ased on knowledge distillation, recent studies have shown significant improvements in model compression” (Ji, Page 7951, Conclusion; Page 7945, Introduction).
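By way of illustration only, the sequence mapped above for claim 11 (matrix multiplication of queries and keys, softmax on the result, combination with the values matrix) corresponds to the textbook scaled dot-product attention formulation (Vaswani et al. 2017). The sketch below shows that standard formulation as an assumed reference point; it is not Ji’s equation (2), which adds teacher/student-specific terms.

```python
import math
import torch.nn.functional as F

def self_attention(q, k, v):
    # Matrix multiplication of the queries and keys, scaled by sqrt(d)...
    scores = q @ k.transpose(-2, -1) / math.sqrt(q.size(-1))
    # ...a softmax applied to the output of that operation...
    attn = F.softmax(scores, dim=-1)
    # ...and the result combined with the values matrix.
    return attn @ v
```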
Claim 12.
O’Neill in view of Ming and further in view of Ji teaches the apparatus of claim 8,
Ming further teaches wherein one or more of the at least one programmable circuit is to (SECTION 4. Experiments & “We mainly validate the effectiveness of our proposed Variational Bayesian Sparsification Distillation on the well-known three datasets, MNIST [26], CIFAR 10 and CIFAR 100 [27]. All the following experiments were run on NVIDIA GeForce GTX 1080 Ti GPUs” teaches that the NVIDIA GeForce GTX 1080 Ti GPUs correspond to the circuit).
O’Neill and Ming are analogous art because they are both directed to distillation techniques that employ pruning strategies.
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to incorporate the limitation(s) above, as taught by Ming, into the disclosed invention of O’Neill.
One of ordinary skill in the art would have been motivated to make this modification because the experiments “were run on NVIDIA GeForce GTX 1080 Ti GPUs” and used three large-scale datasets for sparsification “to improve the classification accuracy” (Ming, Page 4, SECTION 4. Experiments and Page 5, SECTION 5. Conclusions).
Ji further teaches replace a convolutional neural network (CNN) with a self attention (SA) mechanism (Figure 2 teaches an attention meta-network in place of CNN blocks).
O’Neill, Ming and Ji are analogous art because they are all directed to distillation algorithms for neural networks.
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to incorporate the limitation(s) above, as taught by Ji, into the disclosed invention of O’Neill in view of Ming.
One of ordinary skill in the art would have been motivated to make this modification because adjusting the feature level of the student “provides better performance than the baseline methods” and, “[b]ased on knowledge distillation, recent studies have shown significant improvements in model compression” (Ji, Page 7951, Conclusion; Page 7945, Introduction).
Claim 14.
O’Neill in view of Ming teaches the apparatus of claim 13,
Ming further teaches wherein one or more of the at least one programmable circuit is to (SECTION 4. Experiments & “We mainly validate the effectiveness of our proposed Variational Bayesian Sparsification Distillation on the well-known three datasets, MNIST [26], CIFAR 10 and CIFAR 100 [27]. All the following experiments were run on NVIDIA GeForce GTX 1080 Ti GPUs” teaches that the NVIDIA GeForce GTX 1080 Ti GPUs correspond to the circuit).
O’Neill and Ming are analogous art because they are both directed to distillation techniques that employ pruning strategies.
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to incorporate the limitation(s) above, as taught by Ming, into the disclosed invention of O’Neill.
One of ordinary skill in the art would have been motivated to make this modification because the experiments “were run on NVIDIA GeForce GTX 1080 Ti GPUs” and used three large-scale datasets for sparsification “to improve the classification accuracy” (Ming, Page 4, SECTION 4. Experiments and Page 5, SECTION 5. Conclusions).
O’Neill in view of Ming does not explicitly teach replace the plurality of convolutional layers in the CNN with a plurality of SA layers.
However, Ji teaches replace the plurality of convolutional layers in the CNN with a plurality of SA layers (Figure 2 teaches an attention meta-network in place of CNN blocks).
O’Neill, Ming and Ji are analogous art because they are all directed to distillation algorithms for neural networks.
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to incorporate the limitation(s) above, as taught by Ji, into the disclosed invention of O’Neill in view of Ming.
One of ordinary skill in the art would have been motivated to make this modification because adjusting the feature level of the student “provides better performance than the baseline methods” and, “[b]ased on knowledge distillation, recent studies have shown significant improvements in model compression” (Ji, Page 7951, Conclusion; Page 7945, Introduction).
Claim 19.
O’Neill in view of Ming teaches the non-transitory machine readable storage medium of claim 18,
Ming further teaches wherein the instructions cause one or more of the at least one programmable circuit to (SECTION 4. Experiments & “We mainly validate the effectiveness of our proposed Variational Bayesian Sparsification Distillation on the well-known three datasets, MNIST [26], CIFAR 10 and CIFAR 100 [27]. All the following experiments were run on NVIDIA GeForce GTX 1080 Ti GPUs” teaches that the NVIDIA GeForce GTX 1080 Ti GPUs correspond to a programmable circuit).
O’Neill and Ming are analogous art because they are both directed to distillation techniques that employ pruning strategies.
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to incorporate the limitation(s) above, as taught by Ming, into the disclosed invention of O’Neill.
One of ordinary skill in the art would have been motivated to make this modification because the experiments “were run on NVIDIA GeForce GTX 1080 Ti GPUs” and used three large-scale datasets for sparsification “to improve the classification accuracy” (Ming, Page 4, SECTION 4. Experiments and Page 5, SECTION 5. Conclusions).
O’Neill in view of Ming does not explicitly teach extract logits and feature maps from the first ML model, wherein the knowledge includes both the extracted logits and the extracted feature maps.
However, Ji teaches extract logits and feature maps from the first ML model, wherein the knowledge includes both the extracted logits and the extracted feature maps (Graph-driven VAEs for different tasks & Page 982 “the Dirichlet distribution can be approximated with a logistic normal and a softmax formulation by the Laplace approximation (Hennig et al. 2012). When the number of topics L is large, the Dirichlet distribution can be approximated with a multivariate logistic normal (Srivastava and Sutton 2017) with the i-th element of its mean μT and diagonal covariance matrix ΣT” teaches a diagonal covariance matrix with a softmax formulation, corresponding to knowledge that includes logits, and Experiments & Page 7948 “Our method that utilizes an attention mechanism to identify similar features between the teacher and student shows the best performance over all experiment settings” teaches that feature maps are outputs of the network, wherein feature maps are matched between the teacher (first ML model) and student networks).
O’Neill, Ming and Ji are analogous art because they are all directed to distillation algorithms for neural networks.
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to incorporate the limitation(s) above, as taught by Ji, into the disclosed invention of O’Neill in view of Ming.
One of ordinary skill in the art would have been motivated to make this modification because adjusting the feature level of the student “provides better performance than the baseline methods” and, “[b]ased on knowledge distillation, recent studies have shown significant improvements in model compression” (Ji, Page 7951, Conclusion; Page 7945, Introduction).
Claim 20.
O’Neill in view of Ming teaches the non-transitory machine readable storage medium of claim 16,
Ming further teaches wherein the instructions cause one or more of the at least one programmable circuit to (SECTION 4. Experiments & “We mainly validate the effectiveness of our proposed Variational Bayesian Sparsification Distillation on the well-known three datasets, MNIST [26], CIFAR 10 and CIFAR 100 [27]. All the following experiments were run on NVIDIA GeForce GTX 1080 Ti GPUs” teaches that the NVIDIA GeForce GTX 1080 Ti GPUs correspond to a programmable circuit).
O’Neill and Ming are analogous art because they are both directed to distillation techniques that employ pruning strategies.
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to incorporate the limitation(s) above, as taught by Ming, into the disclosed invention of O’Neill.
One of ordinary skill in the art would have been motivated to make this modification because the experiments “were run on NVIDIA GeForce GTX 1080 Ti GPUs” and used three large-scale datasets for sparsification “to improve the classification accuracy” (Ming, Page 4, SECTION 4. Experiments and Page 5, SECTION 5. Conclusions).
O’Neill in view of Ming does not explicitly teach operate an attention transfer distillation algorithm to transfer the knowledge from the first ML model to the second ML model such that the second ML model includes a spatial attention map that is similar to a spatial attention map of the first ML model.
However, Ji teaches operate an attention transfer distillation algorithm to transfer the knowledge from the first ML model to the second ML model such that the second ML model includes a spatial attention map that is similar to a spatial attention map of the first ML model (Related Work & Page 7947 “In order to identify the similarity between h_t^T and h_s^S, AFD adopts a query-key concept of the attention mechanism (Xu et al. 2015; Vaswani et al. 2017). Specifically, each teacher feature generates a query, q_t, and each student feature identifies a key, k_s” teaches attention-based feature distillation to regulate the contribution of each feature map to the final distillation; Related Work & Page 7946 “Let h^T = {h_1^T, ..., h_T^T} be a set of the feature candidates from the teacher and h^S = {h_1^S, ..., h_S^S} be a set of feature candidates from the student where T and S indicate the numbers of the candidates from the teacher and student” teaches knowledge transferred from the teacher to the student; and Experiments & Page 7948 “Our method that utilizes an attention mechanism to identify similar features between the teacher and student shows the best performance over all experiment settings” teaches that feature maps are outputs of the network, wherein feature maps are matched between the teacher (first ML model) and student networks).
O’Neill, Ming and Ji are analogous art because they are all directed to distillation algorithms for neural networks.
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to incorporate the limitation(s) above, as taught by Ji, into the disclosed invention of O’Neill in view of Ming.
One of ordinary skill in the art would have been motivated to make this modification because adjusting the feature level of the student “provides better performance than the baseline methods” and, “[b]ased on knowledge distillation, recent studies have shown significant improvements in model compression” (Ji, Page 7951, Conclusion; Page 7945, Introduction).
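As an illustrative aside, making the student’s spatial attention map similar to the teacher’s, as claim 20 recites, is commonly formulated by penalizing the distance between normalized channel-collapsed activation maps. The sketch below shows that common attention-transfer formulation as an assumption for illustration; it is not Ji’s AFD method, which instead learns attention weights over candidate feature pairs.

```python
import torch.nn.functional as F

def spatial_attention_map(feat):
    # Collapse a (batch, C, H, W) feature map into a normalized (batch, H*W)
    # spatial attention map via the sum of squared activations over channels.
    amap = feat.pow(2).sum(dim=1).flatten(1)
    return F.normalize(amap, dim=1)

def attention_transfer_loss(student_feat, teacher_feat):
    # Penalize the distance between student and teacher spatial attention
    # maps so that the student's map becomes similar to the teacher's.
    diff = spatial_attention_map(student_feat) - spatial_attention_map(teacher_feat)
    return diff.pow(2).mean()
```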
Claim 21.
O’Neill in view of Ming teaches the non-transitory machine readable storage medium of claim 16,
Ming further teaches wherein the instructions cause one or more of the at least one programmable circuit to (SECTION 4. Experiments & “We mainly validate the effectiveness of our proposed Variational Bayesian Sparsification Distillation on the well-known three datasets, MNIST [26], CIFAR 10 and CIFAR 100 [27]. All the following experiments were run on NVIDIA GeForce GTX 1080 Ti GPUs” teaches that the NVIDIA GeForce GTX 1080 Ti GPUs correspond to a programmable circuit).
O’Neill and Ming are analogous art because they are both directed to distillation techniques that employ pruning strategies.
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to incorporate the limitation(s) above, as taught by Ming, into the disclosed invention of O’Neill.
One of ordinary skill in the art would have been motivated to make this modification because the experiments “were run on NVIDIA GeForce GTX 1080 Ti GPUs” and used three large-scale datasets for sparsification “to improve the classification accuracy” (Ming, Page 4, SECTION 4. Experiments and Page 5, SECTION 5. Conclusions).
O’Neill in view of Ming does not explicitly teach generate, based on the training dataset, a queries matrix, a values matrix, and a keys matrix, wherein the queries matrix, the values matrix, and the keys matrix include the parameters to be pruned by the pruning mechanism; and provide the queries matrix, the values matrix, and the keys matrix to the pruning.
However, Ji teaches generate, based on the training dataset, a queries matrix, a values matrix, and a keys matrix, wherein the queries matrix, the values matrix, and the keys matrix include the parameters to be pruned by the pruning mechanism; and provide the queries matrix, the values matrix, and the keys matrix to the pruning (Related Work & Page 7947 “By utilizing the queries and keys, attention values that represent relations between teacher and student candidates are calculated with a “softmax” function” and Experiments & Page 7950 “we evaluate our model by varying the value of β. β, is used to train the attention map, α, which determines the links of the AFD network, and to decide the degree of how much the student mimic the teacher features” teaches query, key, and value matrices computed from the input using a learned neural projection).
O’Neill, Ming and Ji are analogous art because they are all directed to distillation algorithms for neural networks.
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to incorporate the limitation(s) above, as taught by Ji, into the disclosed invention of O’Neill in view of Ming.
One of ordinary skill in the art would have been motivated to make this modification because adjusting the feature level of the student “provides better performance than the baseline methods” and, “[b]ased on knowledge distillation, recent studies have shown significant improvements in model compression” (Ji, Page 7951, Conclusion; Page 7945, Introduction).
Claim 22.
O’Neill in view of Ming and further in view of Ji teaches the non-transitory machine readable storage medium of claim 21,
Ming further teaches wherein the instructions cause one or more of the at least one programmable circuit to (SECTION 4. Experiments & “We mainly validate the effectiveness of our proposed Variational Bayesian Sparsification Distillation on the well-known three datasets, MNIST [26], CIFAR 10 and CIFAR 100 [27]. All the following experiments were run on NVIDIA GeForce GTX 1080 Ti GPUs” teaches that the NVIDIA GeForce GTX 1080 Ti GPUs correspond to a programmable circuit).
O’Neill and Ming are analogous art because they are both directed to distillation techniques that employ pruning strategies.
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to incorporate the limitation(s) above, as taught by Ming, into the disclosed invention of O’Neill.
One of ordinary skill in the art would have been motivated to make this modification because the experiments “were run on NVIDIA GeForce GTX 1080 Ti GPUs” and used three large-scale datasets for sparsification “to improve the classification accuracy” (Ming, Page 4, SECTION 4. Experiments and Page 5, SECTION 5. Conclusions).
Ji further teaches apply parts of the training dataset to a parameterized learnable transformation (PLT) to generate the queries matrix, the values matrix, and the keys matrix (Experiments & Page 7948 “Our method that utilizes an attention mechanism to identify similar features between the teacher and student shows the best performance over all experiment settings. In particular, our method shows an improvement over ATT which uses the same feature distance for distillation” and Attention-based Feature Distillation & Page 7947 “f_Q and f_K are activation function of the query and key. W_t^Q ∈ R^{d×d_t^T} and W_s^K ∈ R^{d×d_s^S} are linear transition parameters for the t-th query and the s-th key” teaches using the input feature map to learn projections into query, key, and value, wherein the learned transition from input to query, key, and value corresponds to a PLT);
perform an operation on the queries matrix and the keys matrix, wherein the operation is a matrix multiplication operation or a 1×1 convolution on the queries matrix and the keys matrix (Attention-based Feature Distillation & Page 7947 “In order to identify the similarity between h_t^T and h_s^S, AFD adopts a query-key concept of the attention mechanism (Xu et al. 2015; Vaswani et al. 2017). Specifically, each teacher feature generates a query, q_t, and each student feature identifies a key, k_s… By utilizing the queries and keys, attention values that represent relations between teacher and student candidates are calculated with a “softmax” function; α_t = softmax([(q_t^⊤ W_1^{Q−K} k_{t,1} + (p_t^T)^⊤ p_1^S)/√d, · · · , (q_t^⊤ W_S^{Q−K} k_{t,S} + (p_t^T)^⊤ p_S^S)/√d]). (2)” teaches attention-based feature matching, which calculates the similarity between feature maps, corresponding to computing a matrix multiplication of the queries and keys);
apply a softmax function to an output of the operation (Attention-based Feature Distillation & Page 7947 “By utilizing the queries and keys, attention values that represent relations between teacher and student candidates are calculated with a “softmax” function; α_t = softmax([(q_t^⊤ W_1^{Q−K} k_{t,1} + (p_t^T)^⊤ p_1^S)/√d, · · · , (q_t^⊤ W_S^{Q−K} k_{t,S} + (p_t^T)^⊤ p_S^S)/√d]). (2)” teaches applying the softmax function to the output of the operation);
and generate an SA output based on a combination of the values matrix and an output of the softmax function (Attention-based Feature Distillation & Page 7947 “α_t is the attention vector that capture relation between the t-th teacher feature and whole student features. By utilizing α_t, the teacher feature, h_t^T, enables to transfer its knowledge selectively to student features. The final distillation term forms as [equation reproduced as an image, media_image1.png, in the original record]” teaches that the attention vector obtained from the softmax is multiplied with the student feature term [equation reproduced as an image, media_image2.png, in the original record] corresponding to the values).
O’Neill, Ming and Ji are analogous art because they are all directed to distillation algorithms for neural networks.
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to incorporate the limitation(s) above, as taught by Ji, into the disclosed invention of O’Neill in view of Ming.
One of ordinary skill in the art would have been motivated to make this modification because adjusting the feature level of the student “provides better performance than the baseline methods” and, “[b]ased on knowledge distillation, recent studies have shown significant improvements in model compression” (Ji, Page 7951, Conclusion; Page 7945, Introduction).
Claim 23.
O’Neill in view of Ming and further in view of Ji teaches the non-transitory machine readable storage medium of claim 21,
Ming further teaches wherein the instructions cause one or more of the at least one programmable circuit to (SECTION 4. Experiments & “We mainly validate the effectiveness of our proposed Variational Bayesian Sparsification Distillation on the well-known three datasets, MNIST [26], CIFAR 10 and CIFAR 100 [27]. All the following experiments were run on NVIDIA GeForce GTX 1080 Ti GPUs” and 2. RELATED WORK & Page 2 “they required pre-determined codebook and computational efficiency algorithm. The latter ones focused on using fixed-point data to represent the weights of CNNs [14], including Binary weight and ternary weight [7]” teaches that the NVIDIA GeForce GTX 1080 Ti GPUs correspond to the circuit).
O’Neill and Ming are analogous art because they are both directed to distillation techniques that employ pruning strategies.
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to incorporate the limitation(s) above, as taught by Ming, into the disclosed invention of O’Neill.
One of ordinary skill in the art would have been motivated to make this modification because the experiments “were run on NVIDIA GeForce GTX 1080 Ti GPUs” and used three large-scale datasets for sparsification “to improve the classification accuracy” (Ming, Page 4, SECTION 4. Experiments and Page 5, SECTION 5. Conclusions).
Ji further teaches replace a convolutional neural network (CNN) with a self attention (SA) mechanism (Figure 2 teaches an attention meta-network in place of CNN blocks).
O’Neill, Ming and Ji are analogous art because they are all directed to distillation algorithms for neural networks.
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to incorporate the limitation(s) above, as taught by Ji, into the disclosed invention of O’Neill in view of Ming.
One of ordinary skill in the art would have been motivated to make this modification because adjusting the feature level of the student “provides better performance than the baseline methods” and, “[b]ased on knowledge distillation, recent studies have shown significant improvements in model compression” (Ji, Page 7951, Conclusion; Page 7945, Introduction).
Claim 24.
O’Neill in view of Ming and further in view of Ji teaches the non-transitory machine readable storage medium of claim 21,
O’Neill further teaches wherein the first ML model is a convolutional neural network (CNN) comprising a set of convolutional layers and a set of layers that are not convolutional layers (3.2 A FROBENIUS DISTORTION PERSPECTIVE OF SELF-DISTILLED PRUNING & Page 4 “Frobenius distortions which is a loose approximation for deep networks”, Algorithm 1, and 4 EXPERIMENTAL SETUP & Page 6 “For XGLUE tasks, we perform 15 pruning steps on XLMRoBERTABase, one per 15 epochs, while for the GLUE tasks, we perform 32 pruning steps on BERTBase. The compression rate and number of pruning steps is higher for GLUE tasks compared to XGLUE, because GLUE tasks involve evaluation in the supervised classification setting; whereas in XGLUE we report in the more challenging zero-shot cross-lingual transfer setting with only a single language used for training (i.e., English)” teaches that the unpruned model (supernet) architecture contains convolutional layers, because pruning in the context of deep neural networks implies a convolutional architecture).
Ji further teaches wherein the set of convolutional layers in the CNN are replaced with a set of SA layers, and the set of SA layers form an SA mechanism (Figure 2 teaches an attention meta-network in place of CNN blocks).
O’Neill, Ming and Ji are analogous art because they are all directed to distillation algorithms for neural networks.
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to incorporate the limitation(s) above, as taught by Ji, into the disclosed invention of O’Neill in view of Ming.
One of ordinary skill in the art would have been motivated to make this modification because adjusting the feature level of the student “provides better performance than the baseline methods” and, “[b]ased on knowledge distillation, recent studies have shown significant improvements in model compression” (Ji, Page 7951, Conclusion; Page 7945, Introduction).
Response to Arguments
Applicant's arguments filed on 10/16/2025 with respect to 35 U.S.C. 101 rejections of claims 1-25 have been fully considered but they are not persuasive.
The cancellation of claim 10 has rendered the rejections of this claim moot. With respect to the 35 U.S.C. 101 rejection of claim 1, applicant asserts, “As recently reinforced by Director Squires in [sic Ex parte] Guillaume Desjardins [sic – et al.], "an improvement to how a machine learning model itself operates" integrates any abstract idea into a patent eligible practical application. (Exparte Guillaume Desjardins et al. page 9). In the instant application, the claims, as amended, set forth during the same single ML training epoch as the distilling, prune one or more parameters from the subnet ML model based on an identified parameter of target hardware, the subnet ML model to operate on the target hardware, the prune to produce a sparse distilled subnet ML model. As such, the claims presented unmistakably recite an "improvement to how a machine learning model itself operates" and are statutory for precisely the reasons explained by Director Squires…As set forth above, the technical improvements described in the specification are tied to the amended claims. Thus, the claims and specification are in agreement in that they recite and disclose improvements in how machine learning operates (e.g., via increased speed and reduced parameters). To the extent any abstract idea is recited in the claims (a fact not conceded), the recited improvements reflect a practical application of any such abstract idea. Accordingly, reconsideration and withdrawal of the rejections under 101 are requested” (Remarks Pg.13-15).
Examiner Response:
The examiner respectfully disagrees with applicant’s assertions regarding the rejections under 35 U.S.C. 101. Regarding applicant’s apparent reliance on the decision of the Appeals Review Panel in Ex parte Desjardins, No. 2024-000567 (P.T.A.B. Sept. 26, 2025), Desjardins is distinguishable from the present claims. In Desjardins, unlike the claims at issue here, the appellants specifically identified claim language that reflected a technological improvement. In particular, the appellants argued that, when evaluating the claim as a whole, independent claim 1 included the following limitation evidencing an improvement: "adjust the first values of the plurality of parameters to optimize performance of the machine learning model on the second machine learning task while protecting performance of the machine learning model on the first machine learning task". The Board was persuaded that this limitation constituted an improvement to the operation of the machine learning model itself, rather than merely an abstract mathematical calculation (Ex parte Desjardins, No. 2024-000567, Page 9). Thus, the appellants in Desjardins expressly alleged, with support, that the claimed subject matter improved machine learning technology itself. By contrast, in the instant application, Applicant does not identify any specific claim language characterizing a comparable technological improvement, nor does Applicant point to claim limitations analogous to those at issue in Desjardins.
Furthermore, applicant’s original specification states, in paragraph [0022], “first ML model optimization approach…an ML model that does not compromise on performance in terms of accuracy, speed, and power” and, in paragraph [0035], “the sparse distillation system…without requiring any custom designs or packages”. The claims themselves merely recite high-level pruning and distillation processes. Such recitations describe desired results or benefits rather than a specific improvement to computer or machine learning technology. Importantly, statements of advantage in the specification, without corresponding claim limitations that reflect a technological improvement, do not establish that the claimed subject matter improves technology itself. Therefore, the rejections under 35 U.S.C. 101 are maintained.
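For illustrative context only, the high-level pruning and distillation processes recited by the claims — distilling supernet knowledge into a subnet and pruning the subnet during the same single training epoch — may be expressed generically as follows. This is a minimal sketch under the examiner’s assumptions (the softened-logit distillation loss, the 50% sparsity level, and the end-of-epoch magnitude-pruning criterion are illustrative and are not drawn from applicant’s specification):

```python
import torch
import torch.nn.functional as F

def distill_and_prune_one_epoch(supernet, subnet, loader, optimizer,
                                sparsity=0.5, temperature=2.0):
    """Single pass over the training data that both distills and prunes (illustrative)."""
    supernet.eval()
    subnet.train()
    for inputs, labels in loader:
        with torch.no_grad():
            teacher_logits = supernet(inputs)          # supernet "knowledge"
        student_logits = subnet(inputs)
        # Distillation loss: match softened teacher outputs plus the hard labels.
        kd = F.kl_div(F.log_softmax(student_logits / temperature, dim=-1),
                      F.softmax(teacher_logits / temperature, dim=-1),
                      reduction="batchmean") * temperature ** 2
        loss = kd + F.cross_entropy(student_logits, labels)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    # Pruning during the same epoch: zero out the smallest-magnitude weights.
    with torch.no_grad():
        for p in subnet.parameters():
            if p.dim() > 1:                            # weight matrices/kernels only
                k = max(1, int(p.numel() * sparsity))
                threshold = p.abs().flatten().kthvalue(k).values
                p.mul_((p.abs() > threshold).to(p.dtype))
```

As the sketch illustrates, distillation and pruning at this level of generality amount to generic operations on model parameters; the claims, as drafted, recite nothing more specific than this generic level.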
Applicant's arguments filed on 10/16/2025 with respect to 35 U.S.C. 103 rejections of claims 1-9 and 11-25 have been fully considered but they are not persuasive.
The cancellation of claim 10 has rendered the rejections of this claim moot. With respect to the 35 U.S.C. 103 rejection of claim 1, applicant asserts, “O'Neill does not teach or suggest prune one or more parameters from the subnet ML model based on an identified parameter of target hardware. While Ji was not used in the rejection of claim 1, it is noted that Ji fails to cure the deficiencies of O'Neill. As such, the O'Neill/Ji combination does not establish a prima facie case for rejecting claim 1. Accordingly, reconsideration and withdrawal of the rejection of claim 1 and all claims depending therefrom are requested” (Remarks Pg.15).
Examiner Response:
The examiner respectfully disagrees. Applicant’s arguments with respect to the rejections of claims 1-4, 7, 13, 15, 16-18 and 25 under 35 U.S.C. 102 in the previous Office Action have been fully considered and are persuasive in part.
However, these claims are now rejected under 35 U.S.C. 103. In particular, as discussed in the section 103 rejections above, the combination of O’Neill in view of the newly-cited Ming reference teaches all the recited limitations of amended independent claims 1 and 16, and dependent claims 2-4, 7, 13, 15, 17-18 and 25.
As also discussed above, the combination of O’Neill in view of Ming and further in view of Ji teaches all the limitations of dependent claims 5-6, 8-9, 11-12, 14 and 19-24.
Applicant’s argument does not explain why O’Neill allegedly fails to teach the claimed limitation. Specifically, Applicant asserts that O’Neill does not teach or suggest “prune one or more parameters from the subnet ML model based on an identified parameter of target hardware”. However, as discussed above, O’Neill teaches this limitation in section 1 INTRODUCTION & Page 2 and section 3.3 HOW DOES SELF-DISTILLATION IMPROVE PRUNED MODEL GENERALIZATION? & Page 5. O’Neill discloses a pruned model (i.e., a subnet) in which parameters are selectively pruned, including soft-targeted bias, in correspondence with target hardware constraints. Accordingly, contrary to applicant’s above-noted assertion, O’Neill teaches pruning parameters of a subnet machine learning model based on identified parameters of target hardware. Furthermore, new prior art has been applied to address the claimed limitations. All limitations of dependent claims 2-9 and 11-15 are taught by O’Neill in view of the additional references, as detailed in the section 103 rejections set forth above. Therefore, the rejections under 35 U.S.C. 103 are maintained.
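For illustrative context only, pruning “based on an identified parameter of target hardware” is conventionally understood to mean selecting what, or how much, to prune from a property of the hardware on which the model will run. The following is a minimal sketch under the examiner’s assumptions (the hw_vector_width parameter, the 50% channel-reduction target, and the L1 channel-scoring heuristic are hypothetical and are not drawn from O’Neill or from applicant’s specification):

```python
import torch

def prune_channels_for_hardware(weight: torch.Tensor, hw_vector_width: int = 8):
    """Structured output-channel pruning driven by a target-hardware parameter."""
    out_channels = weight.shape[0]
    # Score each output channel by the L1 norm of its weights.
    scores = weight.detach().abs().flatten(1).sum(dim=1)
    # Identified hardware parameter: keep a channel count that is a multiple of
    # the target hardware's vector width, so the pruned (sparse distilled) model
    # maps efficiently onto that hardware.
    keep = (out_channels // 2 // hw_vector_width) * hw_vector_width
    keep = min(out_channels, max(hw_vector_width, keep))
    kept = torch.topk(scores, keep).indices
    mask = torch.zeros(out_channels, dtype=weight.dtype, device=weight.device)
    mask[kept] = 1.0
    return weight * mask.view(-1, *([1] * (weight.dim() - 1)))
```

The sketch is offered only as context for the ordinary meaning of the disputed phrase; the claim mapping relies on O’Neill as cited above.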
With respect to the 35 U.S.C. 103 rejection of claim 16, applicant asserts, “Claim 16 sets forth instructions to cause programmable circuitry to at least: during the same one pass over the training dataset as the distill, prune one or more parameters from the second ML model based on an identified parameter of target hardware, the second ML model will operate on the target hardware. The combination of O'Neill and Ji fails to teach or suggest such instructions. Accordingly, reconsideration and withdrawal of the rejection of claim 16 and all claims depending therefrom are requested” (Remarks Pg.16).
Examiner Response:
The examiner respectfully disagrees. Applicant’s argument does not explain why O’Neill allegedly fails to teach the claimed limitation. Specifically, Applicant asserts that O’Neill does not teach or suggest “prune one or more parameters from the subnet ML model based on an identified parameter of target hardware”. However, as discussed above, O’Neill teaches this limitation in section 1 INTRODUCTION & Page 2 and section 3.3 HOW DOES SELF-DISTILLATION IMPROVE PRUNED MODEL GENERALIZATION? & Page 5. O’Neill discloses a pruned model (i.e., a subnet) in which parameters are selectively pruned, including soft-targeted bias, in correspondence with target hardware constraints. Accordingly, O’Neill teaches pruning parameters of a subnet machine learning model based on identified parameters of target hardware. Furthermore, new prior art has been applied to address the claimed limitations. All limitations of dependent claims 17-25 are taught by O’Neill in view of the additional references, as detailed in the section 103 rejections set forth above. Therefore, the rejections under 35 U.S.C. 103 are maintained.
Conclusion
Applicant's amendment necessitated the new ground(s) of rejection presented in this Office action. Accordingly, THIS ACTION IS MADE FINAL. See MPEP § 706.07(a). Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a).
A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action. In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any nonprovisional extension fee (37 CFR 1.17(a)) pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action. In no event, however, will the statutory period for reply expire later than SIX MONTHS from the mailing date of this final action.
Any inquiry concerning this communication or earlier communications from the examiner should be directed to Lokesha Patel whose telephone number is (571)272-6267. The examiner can normally be reached 8 AM - 4 PM.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Kamran Afshar can be reached at (571) 272-7796. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.
/LOKESHA PATEL/Examiner, Art Unit 2125
/KAMRAN AFSHAR/Supervisory Patent Examiner, Art Unit 2125
1 Aside from merely repeating the claim language and providing general examples (see, e.g., paragraph [0029] stating “The sparse distillation system 200 supports both structured column/channel pruning and magnitude pruning to yield models that can increase inference speeds as well offer the benefit of reduced parameters across various target hardware without requiring any custom designs or packages”), applicant’s specification does not explicitly define or provide details of the recited “target hardware”. Therefore, under the broadest reasonable interpretation (BRI) in view of the specification, any target bias is considered to be the recited “target hardware”.
2 Aside from merely repeating the claim language and providing general examples (see, e.g., paragraph [0029] stating “The sparse distillation system 200 supports both structured column/channel pruning and magnitude pruning to yield models that can increase inference speeds as well offer the benefit of reduced parameters across various target hardware without requiring any custom designs or packages”), applicant’s specification does not explicitly define or provide details of the recited “target hardware”. Therefore, under the broadest reasonable interpretation (BRI) in view of the specification, any target bias that is focused on a specific computer or processor is considered to be the recited “target hardware”.
3 Examiner notes that applicant is apparently referring to the decision of the Appeals Review Panel in Ex parte Desjardins, No. 2024-000567 (hereinafter “Desjardins”).