Last updated: May 29, 2026
Application No. 17/722,003
Latency-Aware Neural Network Pruning and Applications Thereof

Non-Final OA §103
Filed: Apr 15, 2022
Examiner: RAMESH, TIRUMALE K
Art Unit: 2121
Tech Center: 2100 (Computer Architecture & Software)
Assignee: Microsoft Technology Licensing, LLC
OA Round: 2 (Non-Final)
Interview: Optional

Interview lift: +2.1%. An interview has already been conducted in this application's prosecution history. This examiner has an 18% grant rate with a +2.1% interview lift. Since an interview has already been tried, a written response with narrowed claims, based on precedent claim evolution patterns, is recommended.
Based on 40 resolved cases, 2023–2026
Examiner Intelligence

Examiner: RAMESH, TIRUMALE K
Career Allowance Rate: 18% (7 granted / 40 resolved), -37.5% vs TC avg
Interview Lift: +2.1% (resolved cases with interview)
Avg Prosecution: 4y 7m
Currently pending: 17
Statute-Specific Performance (based on career data from 40 resolved cases)
§101: 1.2% (-38.8% vs TC avg)
§103: 98.4% (+58.4% vs TC avg)
§102: 0.4% (-39.6% vs TC avg)
Office Action

§103
DETAILED ACTION
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Response to Amendment
 (Submitted 9/26/25)
In regard to 101 rejections
-	On Page 15, the applicant argues that claim 1 (claim 18) is amended to recite “mutating the parent model using a mutating model, to produce a child model, the mutating model including two neural networks that operate in two respective stages” and further recites “generating a reward score for the child model that takes into consideration at least accuracy and latency of the child model”. The applicant has amended claim 13 by bringing in the limitations from claim 17, which was subsequently CANCELLED. The applicant amended claim 14 to recite “mutating the parent model using a mutating model, to produce a child model, the mutating model including two neural networks that operate in two respective stages” and to further recite “generating a reward score for the child model that takes into consideration at least accuracy and latency of the child model”.
After reviewing the amendments, the examiner WITHDRAWS the 101 rejections of claims 1-2, 4-6, 8-16, and 18-20.
The applicant has added two new claims 24 and 25 that depend on claims 1 and 10, respectively.
In regard to 103 rejections
-	The examiner notes that the applicant has amended independent claims 1, 14 and 18, has amended dependent claims 10, 16, 21 and 22, and has added two new claims 24 and 25. The applicant has CANCELED claims 3, 7 and 17. Claims 4, 9, 13, 15 and 18 are noted as “previously presented” in regard to the claims posted on 9/26/25, which were amended on 9/18/25. As a result, the examiner has used new prior art to map to these claims based on the amendments to the claims on 9/18/25.
-	On Page 16, the applicant argues that the current references in combination do not teach the amended limitations. The applicant's amendments now revise the scope to generating the reward score for the child model within the broader context of mutating the parent model to generate the child model, using sparsity levels and with the model being a shared neural network.
	

Examiner’s Response
It appears that the applicant has presented a broader scope of the invention, with significant amendments made specifically to generate the reward score considering accuracy and latency within the context of mutating a parent model when searching for an architecture using a neural architecture search mechanism. As a result, the examiner submits that the applicant's argument is MOOT in view of the new ground of rejection, which includes the reference “LU” that may strongly teach the broader new scope presented by the applicant in the amended claims. The examiner also used a new reference “Yang” to teach new claim 25.
In conclusion, the examiner rejects claims 1-2, 4-6, 8-16, 18-20 and 24-25 under 103 and MOVES the application to FINAL REJECTION.

Claim Rejections - 35 USC § 103
In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

Claims 1, 4, 6, 8-9, 18, and 20 are rejected under 35 U.S.C. 103 as being unpatentable over
BINGQIAN LU et al. (hereinafter LU), “One Proxy Device Is Enough for Hardware-Aware Neural Architecture Search”, Proc. ACM Meas. Anal. Comput. Syst., Vol. 5, No. 3, Article 34, publication date: December 2021,
in view of SHALINA MUKHOPADHYAY et al. (hereinafter MUKH), US 2023/0334330 A1
[Foreign Priority IN202221022177, filed 2022-04-13].

In regard to claim 1: (Currently Amended)
	LU discloses:
-	receiving a specified latency constraint; 
In [3.1, Page 34:6]:
	the inference latency and energy of an architecture on a device are very strongly correlated. That is, an energy constraint can be implicitly mapped to a corresponding latency constraint.
(BRI: within the context of searching, this requires an evaluation stage in which architectures or algorithms are selected from a large set of possibilities, so that the latency constraint is received)
-	using neural architecture search to produce the chosen machine-trained model that satisfies the latency constraint, based on a collection of candidate machine-trained models
In [3, Page 34:6]:
 PROBLEM FORMULATION, INSIGHTS, AND PRACTICAL CONSIDERATION We present the problem formulation for hardware-aware NAS, show the key insights for when we can reduce the latency evaluation cost to O(1), and finally discuss practical considerations
In [3.1 Problem Formulation, Page 34:6]:
The general problem of hardware-aware NAS can be formulated as follows:

    [Equation image not reproduced: media_image1.png]

where x represents the architecture, X is the search space under consideration, w_x is the network weight given architecture x, L_d is the average inference latency constraint, and d ∈ D denotes a device with D being the device set.
-	different candidate machine-trained models in the collection of machine-trained models
specifying different respective ways of removing weights in a shared neural network architecture, on a layer-by-layer basis,
in [2.1.1, Page 34:3]:
Specifically, as illustrated in the left subfigure of Fig. 3, the NAS process is governed by a controller
In [2.1.1, Page 34:4]:
(e.g., a reinforcement learning agent): given each candidate architecture produced by the controller, the model is trained on the training dataset and then evaluated for its performance, based on which the controller produces another candidate architecture. This process repeats until convergence or the maximum search iteration is reached

    [Figure image not reproduced: media_image2.png]

In [2.1 , Page 34:3]:
Overview Neural architecture is a key design hyperparameter that affects the inference accuracy and latency of DNN models. In Fig. 2, we show an example architecture, which is found by searching over the possible layer-wise kernel sizes

    [Figure image not reproduced: media_image3.png]

-	specifying different respective ways of removing weights in a shared neural network architecture, on a layer-by-layer basis, wherein the neural architecture search includes
In [2.1.2 , Page 34:4]:
One-shot NAS. In view of the extremely diverse devices and platforms for model deployment, one-shot NAS and its variants such as few-shot NAS have recently been proposed to reduce the search cost by exploiting the weight sharing mechanism.
In [2.1.2 , Page 34:4]:
 Concretely, as illustrated in the right subfigure of Fig. 3, the key idea of one-shot NAS is to decouple the training process from the search process: pre-train a super large model (called supernet) whose weight is shared among all the candidate architectures, and then use a separate search process to discover optimal architectures that inherit the weights from the supernet. 
In [5.2, Page 34:12]:
Removing non-Pareto-optimal architectures. 
We measure the actual latencies of Pareto-optimal architectures (obtained for either the proxy or adapted proxy device) on the target device, and remove non-Pareto-optimal architectures
(BRI: the process of removing "non-optimal models" or components from a neural network deployed on a target device is fundamentally an application of model pruning, which directly involves the removal or zeroing out of weights)
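
For illustration only, the quoted step of removing non-Pareto-optimal architectures can be sketched as follows. This is a minimal sketch, assuming each architecture is summarized by a hypothetical (name, latency, accuracy) tuple in which lower latency and higher accuracy are preferred; the function name `pareto_front` and the sample values are assumptions and are not taken from LU.

```python
from typing import List, Tuple

# Hypothetical record: (name, latency in ms, accuracy in [0, 1]).
Candidate = Tuple[str, float, float]

def pareto_front(cands: List[Candidate]) -> List[Candidate]:
    """Keep only the candidates that no other candidate dominates."""
    front = []
    for name, lat, acc in cands:
        # Another candidate dominates if it is no worse on both axes
        # and strictly better on at least one of them.
        dominated = any(
            (l2 <= lat and a2 >= acc) and (l2 < lat or a2 > acc)
            for _, l2, a2 in cands
        )
        if not dominated:
            front.append((name, lat, acc))
    return front

candidates = [
    ("a", 10.0, 0.70),
    ("b", 12.0, 0.75),
    ("c", 15.0, 0.74),  # slower and less accurate than "b"
    ("d", 20.0, 0.80),
]
front = pareto_front(candidates)
```

In this toy data, candidate "c" is the only one removed, because "b" is both faster and more accurate.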

-	selecting a parent model from the collection of candidate machine-trained models, the parent model being a neural network having plural layers;
In [A , A.1, SUMMARY OF EVOLUTIONARY SEARCH, Page 34:24]:
For each evolutionary search iteration, we select the fittest individuals as parents for reproduction, which will survive in the next generation and also breed new individuals through crossover. For example, if our population size is 1000 and the parent ratio is 0.25, we have 250 fittest individuals as parents. Then, we randomly select a pair of parents each time for crossover and generate a child. Within the crossover process, each element in the child’s vector is chosen randomly from one of the parents’. 
In [B.4, Page 34:30]:
Search Space. Similar to MobileNet-V2, the FBNet search space is also layer-wise with a fixed macro-architecture, which defines the number of layers and input/output dimensions of each layer and fixes the first and last three layers, with the remaining layers to be searched. 
In [6.1.1 , Page 34:16]:
The search space consists of depth of each stage, kernel size of convolutional layers, and expansion ratio of each block. 
-	mutating the parent model using a mutating model, to produce a child model, the mutating model including two neural networks that operate in two respective stages, wherein a first neural network of the two neural networks has been trained to select a level of the parent model, 
In [6.1.1, Page 34:16]:
NAS Method. We consider one-shot NAS and use the Once-For-All network [9] as a supernet that has the same search space as ours. We run evolutionary search to find optimal architectures
Our parameter settings are: population size is 1000, parent ratio is 0.25, mutation probability is 0.1, mutation ratio is 0.25, and we search for 50 generations given each latency constraint. 
(BRI: neural network supernet (or SuperNet) contains multiple sub-networks  (SubNets) within one large, overparameterized network, allowing efficient exploration of many architectures)
	In [A SUMMARY OF EVOLUTIONARY SEARCH, Page 34:24]:
To run evolutionary search, we first randomly sample the initial population of individuals according to the population size. Next, we evaluate the fitness of each individual in the population, where the fitness function is defined as: 
(t − 1) · accuracy + t · latency (4) 

where t ∈ [0, 1] is the weight parameter to balance the tradeoff between accuracy and latency of each individual model, and accuracy and latency are predicted values given by the accuracy and latency predictors, respectively. 
In [A SUMMARY OF EVOLUTIONARY SEARCH, Page 34:24]:
For each evolutionary search iteration, we select the fittest individuals as parents for reproduction, which will survive in the next generation and also breed new individuals through crossover. For example, if our population size is 1000 and the parent ratio is 0.25, we have 250 fittest individuals as parents. Then, we randomly select a pair of parents each time for crossover and generate a child. Within the crossover process, each element in the child’s vector is chosen randomly from one of the parents’. Also, based on the mutation ratio setting, part of the offsprings will further perform mutation operations. For example, with mutation ratio 0.25 and mutation probability 0.1, 250 out of 750 children have a possibility of 0.1 to mutate. If a child is chosen to mutate, its kernel size, expansion ratio, and depth will be randomly sampled out of all the possible values for exploration. After crossover and mutation, we have a new population consisting of parents, bred children, and mutated children. Next, the fittest individuals are selected as new parents for next iteration. The above crossover and mutation steps will be repeated for the maximum evolutionary search iteration number
-	given sparsity levels of the plural layers of the parent model, to provide a selected level, and wherein a second neural network of the two neural networks has been trained to vary a sparsity level of the selected layer, given the selected layer produced by the first neural network;
In [5.4.2, Page 34:15]:
we consider the proxy device’s latency predictor in a linear form: L_{d_0}(x) = w^T x, where w is the weight and x is the architecture representation (e.g., one-hot encoding of the searchable operators, penultimate layer output in a neural network-based predictor, or encoding of the execution units). We measure the latencies of a small set of sample architectures x ∈ A on the target device, noting that this step is also needed to check the SRCC value and incurs a negligible overhead compared to SOTA approaches (i.e., tens of hours of latency measurement).
In [5.4.2, Page 34:15]:
Then, with the latency measurement samples denoted by (x_i, y_i), we quickly adapt the proxy device’s latency predictor as


    [Equation image not reproduced: media_image4.png]

where I is the identity vector with all the elements being 1, the operator “◦” denotes the element-wise multiplication, and λ ≥ 0 is a hyperparameter controlling the weight for the sparsity regularization term |b| and tuned based on a small validation set of architectures (20 architectures in our experiment) split from the sample architecture set A.
The interpretation of using Eqn. (3) is as follows. First, the scaling factor α reflects our intuition that a more complex operator that is slower on one device is generally also slower on another device. Second, the sparsity term b accounts for the fact that the slow-down factors for an operator on two devices are not necessarily the same.
(BRI: adapting a latency predictor with sparsity regularization is often designed to provide variation of sparsity levels across different layers, rather than a uniform sparsity. The latency predictor, by incorporating real-world hardware feedback (or a learned model of it), identifies which layers benefit most from increased sparsity in terms of actual speedup)
-	generating a reward score for the child model that takes into consideration at least accuracy and latency of the child model, wherein producing the accuracy that is used to generate the reward score 
In [A SUMMARY OF EVOLUTIONARY SEARCH, Page 34:24]:
To run evolutionary search, we first randomly sample the initial population of individuals according to the population size. Next, we evaluate the fitness of each individual in the population, where the fitness function is defined as: 
(t − 1) · accuracy + t · latency (4) 

where t ∈ [0, 1] is the weight parameter to balance the tradeoff between accuracy and latency of each individual model, and accuracy and latency are predicted values given by the accuracy and latency predictors, respectively. 
(BRI: the fitness is the reward that depends on the accuracy and latency (see equation (4))
In [7, Page 34:20]:
fast evaluation of accuracy and inference latency to rank different architectures is crucial for efficient hardware-aware NAS
In [7, Page 34:20]:
Given many diverse devices, scalability of latency evaluation is critically important. A straight forward approach is to build a meta latency predictor that incorporates hardware features as additional input
In [3.3, Page 34:7]:
 To quantify the degree of latency monotonicity in practice, we use the metric of Spearman’s Rank Correlation Coefficient (SRCC), which lies between -1 and 1 and assesses statistical dependence between the rankings of two variables using a monotonic function. The greater the SRCC of CNN latencies on two devices, the better the latency monotonicity. SRCC of 0.9 to 1.0 is usually viewed as strongly dependent in terms of monotonicity [3].
(BRI: Latency monotonicity is the stability of the latency showing a stable upward/downward trend rather than erratic spikes)
In [A SUMMARY OF EVOLUTIONARY SEARCH, Page 34:24]:
For each evolutionary search iteration, we select the fittest individuals as parents for reproduction, which will survive in the next generation and also breed new individuals through crossover.
(BRI: the fittest is the reward to generate the child (new individuals from the selected parent that is going to survive)
In [5.3.2, Page 34:14]:
 Checking latency monotonicity. To check whether strong latency monotonicity is satisfied between the selected proxy device and a target device, we estimate the SRCC based on a small set A of sample architectures and then compare it against a threshold.
In [6.1, Page 34:16]:
As a result, the imperfection in the accuracy predictor explains why a strong, but not perfect, latency monotonicity (e.g., SRCC>0.9) is enough for our one-proxy approach to find Pareto-optimal architectures for a new target device
(BRI: SRCC (Spearman's Rank Correlation Coefficient) is a statistical method to measure how consistently the order of latencies for different tasks or models stays the same across various devices or platforms, indicating latency monotonicity)
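
For illustration only, the latency-monotonicity check quoted above (estimating the SRCC on a small set of sample architectures and comparing it against a threshold) can be sketched as follows. This is a minimal sketch using only the standard library; the helper names, the latency values, and the use of 0.9 as the threshold (taken from the quoted "SRCC>0.9" example) are assumptions rather than LU's implementation.

```python
from typing import List, Sequence

def ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    out = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            out[order[k]] = avg
        i = j + 1
    return out

def srcc(a: Sequence[float], b: Sequence[float]) -> float:
    """Spearman's rank correlation: Pearson correlation of the ranks."""
    ra, rb = ranks(a), ranks(b)
    n = len(ra)
    ma, mb = sum(ra) / n, sum(rb) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(ra, rb))
    va = sum((x - ma) ** 2 for x in ra)
    vb = sum((y - mb) ** 2 for y in rb)
    return cov / (va * vb) ** 0.5

# Hypothetical latencies (ms) of the same six architectures on two devices.
proxy_ms = [10.0, 12.0, 15.0, 20.0, 26.0, 30.0]
target_ms = [31.0, 35.0, 33.0, 50.0, 64.0, 70.0]

rho = srcc(proxy_ms, target_ms)
is_monotonic = rho > 0.9  # threshold is an assumption for this sketch
```

Only the ordering of the latencies matters here: the two devices disagree on the order of a single adjacent pair of architectures, so the coefficient stays close to 1.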
-	 includes pruning weights of the child model, given sparsity levels of the child model
with the latency measurement samples denoted by (x_i, y_i), we quickly adapt the proxy device’s latency predictor as

    [Equation image not reproduced: media_image5.png]


 	tailored to the target device, by solving the following problem:

    [Equation image not reproduced: media_image6.png]

 I is the identity vector with all the elements being 1, the operator “◦” denotes the element-wise multiplication, and λ ≥ 0 is a hyperparameter controlling the weight for the sparsity regularization term |b| and tuned based on a small validation set of architectures (20 architectures in our experiment) split from the sample architecture set A.
(BRI: tuning a hyperparameter that controls the weight for sparsity regularization indirectly represents an approach to achieving pruning, as it aims to learn a sparse model, which is a key goal of pruning)
-	adjusting weights of the mutating model that performs said mutating based on the reward score to increase a likelihood that the mutating model will make decisions that are rewarded by said generating;
In [2.1.2, Page 34:4]:
In view of the extremely diverse devices and platforms for model deployment, one-shot NAS and its variants such as few-shot NAS have recently been proposed to reduce the search cost by exploiting the weight sharing mechanism 
(BRI: the process of discovering optimal architectures via weight-sharing neural architecture search often involves adjusting (fine-tuning) the weights that the optimal architectures inherit from the supernet).
In [A , A.1, SUMMARY OF EVOLUTIONARY SEARCH, Page 34:24]:
For each evolutionary search iteration, we select the fittest individuals as parents for reproduction, which will survive in the next generation and also breed new individuals through crossover. For example, if our population size is 1000 and the parent ratio is 0.25, we have 250 fittest individuals as parents. Then, we randomly select a pair of parents each time for crossover and generate a child. 
In [A, A.1, SUMMARY OF EVOLUTIONARY SEARCH, Page 34:24]:
Also, based on the mutation ratio setting, part of the offsprings will further perform mutation operations. For example, with mutation ratio 0.25 and mutation probability 0.1, 250 out of 750 children have a possibility of 0.1 to mutate
To run evolutionary search, we first randomly sample the initial population of individuals according to the population size. Next, we evaluate the fitness of each individual in the population, where the fitness function is defined as
(t − 1) · accuracy + t · latency (4) 

where t ∈ [0, 1] is the weight parameter to balance the tradeoff between accuracy and latency of each individual model, and accuracy and latency are predicted values given by the accuracy and latency predictors, respectively. 
In [Abstract, Page 34:1]:
In this work, we address the scalability challenge by exploiting latency monotonicity — the architecture latency rankings on different devices are often correlated. 
In [B.1 Latency Monotonicity, Page 34:25]: 
We show the results in Fig. 18, which are in line with our experiments: latency
In [B.1 Latency Monotonicity, Page 34:26]: 
monotonicity among mobile devices is strong (>0.95), while FLOP-latency ranking correlation for mobile devices 
(BRI: the fitness is the reward. Within hardware-aware Neural Architecture Search (NAS), the latency (or latency ranking) of a model is commonly incorporated as a reward score or as part of a multi-objective fitness function)
-	updating the collection of candidate machine-trained models based on the child model; 
In [6.1.1, Page 34:16]:
NAS Method. We consider one-shot NAS and use the Once-For-All network [9] as a supernet that has the same search space
In [7, Page 34:20]:
NAS uses a super net that includes all the weights for candidate architectures
In [6.1.1, Page 34:16]:
Accuracy Predictor. The evolutionary search is assisted with by an accuracy predictor for fast architecture performance evaluation.
In [6.1.1, Page 34:16]:
Our accuracy predictor is a neural network with four fully-connected layers and updated with 176 samples on top of the predictor 
In [6.1.1, Page 34:16]:
The accuracy predictor takes a 128-dimensional feature vector (which is converted from a 21-dimensional architecture configuration within the search space) as input. Fig. 12(a) compares the actual and predicted accuracies 
(BRI: in the context of Neural Architecture Search (NAS), a supernet is essentially a large, overparameterized neural network that contains many different candidate architectures (subnetworks) within it)
(BRI: an "accuracy predictor" that uses updated real-world data is a key component of a process called continuous learning or continuous training, which ultimately results in an updated collection of machine-trained models)
In [2.1.1, Page 34:3]:
 search process is entangled with the model training process.
In [2.1.1, Page 34:3]:
Specifically, as illustrated in the left subfigure of Fig. 3, the NAS process is governed by a controller
In [2.1.1, Page 34:4]:
(e.g., a reinforcement learning agent): given each candidate architecture produced by the controller, the model is trained on the training dataset and then evaluated for its performance, based on which the controller produces another candidate architecture
(BRI: a machine learning model is trained on a training dataset, and its performance, assessed using separate validation and test datasets for adjusting the model's hyperparameters, can inform a "controller" or system, which then triggers the development of an updated or new model architecture; this is continuous training)
-	and repeating the selecting, mutating, generating, adjusting, and updating until a specified objective is achieved, to produce the chosen machine-trained model.
In [2.1.2, Page 34:4]:
a search process based on evolutionary algorithms or reinforcement learning to find an optimal architecture
In [2.1.1, Page 34:4]:
given each candidate architecture produced by the controller, the model is trained on the training dataset and then evaluated for its performance, based on which the controller produces another candidate architecture. This process repeats until convergence or the maximum search iteration is reached. 
In [5.2, Page 34:12]:
Our scalable hardware-aware NAS approach is illustrated in Fig. 10 and described in Algorithm 1.

    [Figure image not reproduced: media_image7.png]

In [5.2, Page 34:12]:

    [Figure image not reproduced: media_image8.png]

(BRI: a controller iteratively proposes candidate architectures, which are then trained and evaluated to reach an objective; the "repeating the selecting, mutating, generating, adjusting, and updating until a specified objective is achieved" describes the general principles of an evolutionary algorithm)
	LU does not explicitly disclose:
-	A computer-implemented method for identifying and applying a chosen machine-trained model, comprising: 
However, MUKH discloses:
A computer-implemented method for identifying and applying a chosen machine-trained model, comprising: 
In [0037]:
 The EA agent of the FAST EA NAS model generates a plurality of child Neural Network (NN) architectures for the fine-grained NAS space from the NAS space based on the multi-objective reward function (R). 
It would have been obvious to one of ordinary skill in the art before the effective filing date of the present application to combine LU and MUKH.
LU teaches using neural architecture search to produce the chosen machine-trained model and selecting the models to meet the latency constraint, model mutations, reward score and sparsity level, and pruning of weights for a given sparsity level of the layers.
MUKH teaches neural architecture search methods and systems.
One of ordinary skill would have been motivated to combine LU and MUKH because the combination provides improved performance in exploring the possible space of candidate architectures (MUKH [0053]).



In regard to claim 4: (Previously Presented)
LU discloses:
-	wherein said selecting operates by selecting the parent model based on latency and accuracy exhibited by the parent model, relative to latency and accuracy exhibited by other candidate machine-trained models.  
In [A, A.1, SUMMARY OF EVOLUTIONARY SEARCH, Page 34:24]:
To run evolutionary search, we first randomly sample the initial population of individuals according to the population size. Next, we evaluate the fitness of each individual in the population, where the fitness function is defined as
(t − 1) · accuracy + t · latency (4) 
where t ∈ [0, 1] is the weight parameter to balance the tradeoff between accuracy and latency of each individual model, and accuracy and latency are predicted values given by the accuracy and latency predictors, respectively. By varying t ∈ [0, 1], we can obtain a set of Pareto-optimal architectures. 
For each evolutionary search iteration, we select the fittest individuals as parents for reproduction, which will survive in the next generation and also breed new individuals through crossover.
(BRI: this is a common strategy in machine learning and system design for balancing the trade-offs between model accuracy and inference latency, often referred to as multi-objective model selection. This is often achieved with Pareto-optimal selection, in which latency cannot be improved without sacrificing some accuracy and accuracy cannot be improved without increasing latency)
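
For illustration only, the evolutionary search loop described in the quoted LU passages (fitness evaluation, selection of the fittest individuals as parents, crossover, and mutation) can be sketched as follows. This is a minimal sketch under stated assumptions: the architecture encoding, the toy accuracy and latency predictors, and all names are hypothetical; the population is much smaller than the quoted 1000; and the quoted expression (4) is treated as a cost to minimize, which the quoted passage does not state explicitly.

```python
import random
from typing import List

Arch = List[int]
CHOICES = [3, 5, 7]  # hypothetical per-gene options (e.g. kernel sizes)
NUM_GENES = 6        # hypothetical length of the architecture vector

def toy_accuracy(x: Arch) -> float:
    # Stand-in for an accuracy predictor: larger options score higher.
    return sum(x) / (max(CHOICES) * len(x))

def toy_latency(x: Arch) -> float:
    # Stand-in for a latency predictor, normalized to [0, 1].
    return sum(v * v for v in x) / (max(CHOICES) ** 2 * len(x))

def cost(x: Arch, t: float) -> float:
    # Quoted expression (4); lower is treated as fitter (assumption).
    return (t - 1) * toy_accuracy(x) + t * toy_latency(x)

def evolve(t: float, pop_size: int = 40, parent_ratio: float = 0.25,
           mutation_ratio: float = 0.25, mutation_prob: float = 0.1,
           generations: int = 20, seed: int = 0) -> Arch:
    rng = random.Random(seed)
    sample = lambda: [rng.choice(CHOICES) for _ in range(NUM_GENES)]
    population = [sample() for _ in range(pop_size)]
    n_parents = int(pop_size * parent_ratio)
    n_mutable = int(pop_size * mutation_ratio)
    for _ in range(generations):
        # The fittest individuals survive and act as parents.
        population.sort(key=lambda x: cost(x, t))
        parents = population[:n_parents]
        children = []
        for i in range(pop_size - n_parents):
            p1, p2 = rng.sample(parents, 2)
            # Crossover: each element comes from one of the two parents.
            child = [rng.choice(pair) for pair in zip(p1, p2)]
            # Part of the offspring may mutate by being resampled.
            if i < n_mutable and rng.random() < mutation_prob:
                child = sample()
            children.append(child)
        population = parents + children
    return min(population, key=lambda x: cost(x, t))

best = evolve(t=0.5)
```

Running `evolve` for several values of t between 0 and 1 would yield a set of trade-off points between the toy accuracy and latency, in the spirit of the quoted statement that varying t produces a set of Pareto-optimal architectures.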
In regard to claim 6: (Previously Presented)
LU  does not explicitly disclose:
-	wherein the latency that is used to generate the reward score is produced using trainable logic that performs prediction.  
However, MUKH discloses:
-	wherein the latency that is used to generate the reward score is produced using trainable logic that performs prediction.  
In [0035]:
The actual latency performance metric required by the multi-objective reward function (R) can be predicted using a prediction function (P), without actually profiling the Neural Network (NN) architectures on a platform to make a NAS search faster.
In [0006]:
the method includes formulating a multi-objective reward function (R) as a function of the plurality of performance metrics, wherein each of the plurality of performance metrics is individually modulated, prioritized and thresholded based on the relative metric weightage assigned to each of the plurality of performance metric in accordance of requirements of a target application to be executed on the platform via the tiny model
It would have been obvious to one of ordinary skill in the art before the effective filing date of the present application to combine LU and MUKH.
LU teaches using neural architecture search to produce the chosen machine-trained model and selecting the models to meet the latency constraint, model mutations, reward score and sparsity level, and pruning of weights for a given sparsity level of the layers.
MUKH teaches neural architecture search methods and systems.
One of ordinary skill would have been motivated to combine LU and MUKH because the combination provides improved performance in exploring the possible space of candidate architectures (MUKH [0053]).
In regard to claim 8: (Previously Presented)
LU does not explicitly disclose:
-	wherein said adjusting involves adjusting weights in the trainable logic that performs said mutating based on a reinforcement learning training objective.  
However, MUKH discloses:
-	wherein said adjusting involves adjusting weights in the trainable logic that performs said mutating based on a reinforcement learning training objective.  
In [0037]: 
the EA agent of the FAST EA NAS model generates a plurality of child Neural Network (NN) architectures for the fine-grained NAS space from the NAS space based on the multi-objective reward function (R). Since traditionally EA is a non-learning approach, approach like RL is added to the traditional EA by incorporating domain knowledge as “learnable mutations” by the EA agent, in the evolution process
in [0053]:
 DQN learning based NAS: The neural architecture search algorithm based on reinforcement learning attempts to design high performance neural network architectures automatically. This is done with the help of an agent, by the process of exploring new architecture designs, evaluating them in terms of accuracy and model size, and then training the agent with those sets of states, actions, and rewards. 
In [0022]:
 The relative weightage assigned to each of the performance metric is tunable, enabling dynamic changing of the multi-objective reward function (R) without requiring rebuilding and retraining of the Fast Evolutionary Algorithm (EA) NAS model and the DQN NAS model to align to changing requirements of the target application to be executed on of the platform. 
It would have been obvious to one of ordinary skill in the art before the effective filing date of the present application to combine LU and MUKH.
LU teaches using neural architecture search to produce the chosen machine-trained model and selecting the models to meet the latency constraint, model mutations, reward score and sparsity level, and pruning of weights for a given sparsity level of the layers.
MUKH teaches neural architecture search methods and systems.
One of ordinary skill would have been motivated to combine LU and MUKH to provide improved performance in exploring the possible space of candidate architectures (MUKH [0053]).
In regard to claim 9: (Previously Presented)
LU discloses:		
wherein said updating involves adding the chosen machine-trained model to the collection of candidate machine-trained models, and removing at least one existing candidate machine-trained model from the collection of candidate machine-trained models.  
in [2.1.1, Page 34:3]:
Specifically, as illustrated in the left subfigure of Fig. 3, the NAS process is governed by a controller
In [2.1.1, Page 34:4]:
(e.g., a reinforcement learning agent): given each candidate architecture produced by the controller, the model is trained on the training dataset and then evaluated for its performance, based on which the controller produces another candidate architecture. This process repeats until convergence or the maximum search iteration is reached

[Image: media_image2.png]

In [2.1 , Page 34:3]:
Overview Neural architecture is a key design hyperparameter that affects the inference accuracy and latency of DNN models. In Fig. 2, we show an example architecture, which is found by searching over the possible layer-wise kernel sizes


[Image: media_image3.png]

In [6.1, Page 34:16]:
The evolutionary search is assisted with by an accuracy predictor for fast architecture performance evaluation 
In [6.1, Page 34:16]:
Our accuracy predictor is a neural network with four fully-connected layers and updated with 176 samples on top of the predictor used 
(BRI: in Neural Architecture Search (NAS), modifying or updating a layer (e.g., changing its type, parameters, or connections) effectively creates a new candidate model within the defined search space)
In [5.2, Page 34:12]:
Removing non-Pareto-optimal architectures. 
We measure the actual latencies of Pareto-optimal architectures (obtained for either the proxy or adapted proxy device) on the target device, and remove non-Pareto-optimal architectures
(BRI: removing non-Pareto-optimal solutions necessarily means removing at least one candidate model (or solution) from the initial set)
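For illustration only, the removal of non-Pareto-optimal candidates discussed above can be sketched as follows. This is a minimal sketch, not LU's implementation: the function name `pareto_front`, the candidate names, and the accuracy and latency figures are all hypothetical, and a candidate is assumed to be dominated when another candidate is at least as accurate and at least as fast, and strictly better in one of the two.

```python
from typing import List, Tuple

Candidate = Tuple[str, float, float]  # (name, accuracy, latency in ms); hypothetical layout


def pareto_front(models: List[Candidate]) -> List[Candidate]:
    """Keep only candidates that no other candidate dominates.

    Higher accuracy is better; lower latency is better.
    """
    front = []
    for name, acc, lat in models:
        dominated = any(
            (a >= acc and l <= lat) and (a > acc or l < lat)
            for _, a, l in models
        )
        if not dominated:
            front.append((name, acc, lat))
    return front


# Hypothetical measurements, for illustration only.
candidates = [
    ("A", 0.76, 30.0),
    ("B", 0.74, 35.0),  # less accurate and slower than A, so it is removed
    ("C", 0.78, 45.0),
    ("D", 0.70, 20.0),
]
kept = pareto_front(candidates)
removed = [m for m in candidates if m not in kept]
```

In this toy collection one candidate is removed, which mirrors the reading that pruning the non-Pareto-optimal set shrinks the collection by at least one model whenever any candidate is dominated.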
In regard to claim 18: (Previously Presented)
LU discloses:
identifying a collection of candidate machine-trained models; 
In [3, Page 34:6]:
 PROBLEM FORMULATION, INSIGHTS, AND PRACTICAL CONSIDERATION We present the problem formulation for hardware-aware NAS, show the key insights for when we can reduce the latency evaluation cost to O(1), and finally discuss practical considerations
3.1 Problem Formulation 
The general problem of hardware-aware NAS can be formulated as follows:

[Equation image: media_image1.png]

where x represents the architecture, X is the search space under consideration, w_x is the network weight given architecture x, L_d is the average inference latency constraint, and d ∈ D denotes a device with D being the device set.
selecting a parent model from the collection of candidate machine-trained models;
In [A , A.1, SUMMARY OF EVOLUTIONARY SEARCH, Page 34:24]:
For each evolutionary search iteration, we select the fittest individuals as parents for reproduction, which will survive in the next generation and also breed new individuals through crossover. For example, if our population size is 1000 and the parent ratio is 0.25, we have 250 fittest individuals as parents. Then, we randomly select a pair of parents each time for crossover and generate a child. Within the crossover process, each element in the child’s vector is chosen randomly from one of the parents’. 
In [B.4, Page 34:30]:
Search Space. Similar to MobileNet-V2, the FBNet search space is also layer-wise with a fixed macro-architecture, which defines the number of layers and input/output dimensions of each layer and fixes the first and last three layers, with the remaining layers to be searched. 
in [6.1.1 , Page 34:16]:
The search space consists of depth of each stage, kernel size of convolutional layers, and expansion ratio of each block. 
-	mutating the parent model using a mutating model, to produce a child model, the mutating model including two neural networks that operate in two respective stages, wherein a first neural network of the two neural networks has been trained to select a level of the parent model, 
in [6.1.1 , Page 34:16]:
NAS Method. We consider one-shot NAS and use the Once-For-All network [9] as a supernet that has the same search space as ours. We run evolutionary search to find optimal architectures
Our parameter settings are: population size is 1000, parent ratio is 0.25, mutation probability is 0.1, mutation ratio is 0.25, and we search for 50 generations given each latency constraint. 
(BRI: neural network supernet (or SuperNet) contains multiple sub-networks (SubNets) within one large, overparameterized network, allowing efficient exploration of many architectures)
In [A SUMMARY OF EVOLUTIONARY SEARCH, Page 34:24]:
To run evolutionary search, we first randomly sample the initial population of individuals according to the population size. Next, we evaluate the fitness of each individual in the population, where the fitness function is defined as: 
(t − 1) · accuracy + t · latency (4) 

where t ∈ [0, 1] is the weight parameter to balance the tradeoff between accuracy and latency of each individual model, and accuracy and latency are predicted values given by the accuracy and latency predictors, respectively. 
In [A SUMMARY OF EVOLUTIONARY SEARCH, Page 34:24]:
For each evolutionary search iteration, we select the fittest individuals as parents for reproduction, which will survive in the next generation and also breed new individuals through crossover. For example, if our population size is 1000 and the parent ratio is 0.25, we have 250 fittest individuals as parents. Then, we randomly select a pair of parents each time for crossover and generate a child. Within the crossover process, each element in the child’s vector is chosen randomly from one of the parents’. Also, based on the mutation ratio setting, part of the offsprings will further perform mutation operations. For example, with mutation ratio 0.25 and mutation probability 0.1, 250 out of 750 children have a possibility of 0.1 to mutate. If a child is chosen to mutate, its kernel size, expansion ratio, and depth will be randomly sampled out of all the possible values for exploration. After crossover and mutation, we have a new population consisting of parents, bred children, and mutated children. Next, the fittest individuals are selected as new parents for next iteration. The above crossover and mutation steps will be repeated for the maximum evolutionary search iteration number
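For illustration only, the evolutionary search steps quoted above (fitness evaluation, parent selection, crossover, and mutation) can be sketched as follows. This is a minimal sketch under stated assumptions, not LU's code: the search-space values, the toy `predicted_accuracy` and `predicted_latency` functions, and the smaller population size are hypothetical. The quoted expression (4) decreases as accuracy rises and increases with latency, so the sketch assumes that a lower value means a fitter individual.

```python
import random
from typing import List, Tuple

Arch = Tuple[int, ...]  # (kernel size, expansion ratio, depth)

# Hypothetical search-space values, chosen only for illustration.
SPACE = [[3, 5, 7], [3, 4, 6], [2, 3, 4]]


def predicted_accuracy(x: Arch) -> float:
    # Toy stand-in for an accuracy predictor (assumption).
    return 0.60 + 0.01 * x[0] + 0.01 * x[1] + 0.02 * x[2]


def predicted_latency(x: Arch) -> float:
    # Toy stand-in for a latency predictor, scaled to (0, 1] (assumption).
    return (x[0] * x[1] * x[2]) / (7 * 6 * 4)


def fitness(x: Arch, t: float) -> float:
    # Expression (4) as quoted; lower is treated as fitter (assumption).
    return (t - 1) * predicted_accuracy(x) + t * predicted_latency(x)


def random_arch(rng: random.Random) -> Arch:
    return tuple(rng.choice(values) for values in SPACE)


def evolve(pop_size: int = 40, parent_ratio: float = 0.25,
           mutation_ratio: float = 0.25, mutation_prob: float = 0.1,
           generations: int = 10, t: float = 0.5, seed: int = 0) -> Arch:
    rng = random.Random(seed)
    population: List[Arch] = [random_arch(rng) for _ in range(pop_size)]
    n_parents = int(pop_size * parent_ratio)
    n_mutable = int(pop_size * mutation_ratio)
    for _ in range(generations):
        # The fittest individuals survive and act as parents.
        parents = sorted(population, key=lambda x: fitness(x, t))[:n_parents]
        children: List[Arch] = []
        for _ in range(pop_size - n_parents):
            p1, p2 = rng.sample(parents, 2)
            # Crossover: each element comes from one parent at random.
            children.append(tuple(rng.choice(pair) for pair in zip(p1, p2)))
        # A subset of the children may mutate, each with probability mutation_prob.
        for i in range(min(n_mutable, len(children))):
            if rng.random() < mutation_prob:
                children[i] = random_arch(rng)
        population = parents + children
    return min(population, key=lambda x: fitness(x, t))


best = evolve()
```

Note that mutation in this sketch resamples values at random, as in the quoted passage; nothing in it is trained.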
-	given sparsity levels of the plural layers of the parent model, to provide a selected level, and wherein a second neural network of the two neural networks has been trained to vary a sparsity level of the selected layer, given the selected layer produced by the first neural network;
in [5.4.2, Page 34:15]:
we consider the proxy device’s latency predictor in a linear form: L_{d_0}(x) = w^T x, where w is the weight and x is the architecture representation (e.g., one-hot encoding of the searchable operators, penultimate layer output in a neural network-based predictor, or encoding of the execution units). We measure the latencies of a small set of sample architectures x ∈ A on the target device, noting that this step is also needed to check the SRCC value and incurs a negligible overhead compared to SOTA approaches (i.e., tens of hours of latency measurement).
In [5.4.2, Page 34:15]:
Then, with the latency measurement samples denoted by (x_i, y_i), we quickly adapt the proxy device’s latency predictor as

[Equation image: media_image4.png]

where I is the identity vector with all the elements being 1, the operator “◦” denotes the element-wise multiplication, and λ ≥ 0 is a hyperparameter controlling the weight for the sparsity regularization term |b| and tuned based on a small validation set of architectures (20 architectures in our experiment) split from the sample architecture set A.
The interpretation of using Eqn. (3) is as follows. First, the scaling factor α reflects our intuition that a more complex operator that is slower on one device is generally also slower on another device. Second, the sparsity term b accounts for the fact that the slow-down factors for an operator on two devices are not necessarily the same.
(BRI: adapting a latency predictor with sparsity regularization is often designed to provide variation of sparsity levels across different layers, rather than a uniform sparsity. The latency predictor, by incorporating real-world hardware feedback (or a learned model of it), identifies which layers benefit most from increased sparsity in terms of actual speedup)
In [5.4.2, Page 34:15]:
we consider the proxy device’s latency predictor in a linear form: L_{d_0}(x) = w^T x, where w is the weight and x is the architecture representation (e.g., one-hot encoding of the searchable operators, penultimate layer output in a neural network-based predictor or encoding of the execution units)
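For illustration only, the linear latency predictor quoted above, L_{d_0}(x) = w^T x with a one-hot architecture representation, can be sketched as follows. The operator names, the per-operator weights, and the function names `one_hot_encode` and `linear_latency` are hypothetical and are not taken from LU.

```python
from typing import List

# Hypothetical searchable operators for each layer.
OPERATORS = ["conv3x3", "conv5x5", "conv7x7"]


def one_hot_encode(layers: List[str]) -> List[int]:
    """Concatenate one one-hot vector per layer into the representation x."""
    x: List[int] = []
    for op in layers:
        x.extend(1 if op == candidate else 0 for candidate in OPERATORS)
    return x


def linear_latency(w: List[float], x: List[int]) -> float:
    """Evaluate w^T x, the predicted latency on the proxy device."""
    if len(w) != len(x):
        raise ValueError("weight and representation lengths differ")
    return sum(wi * xi for wi, xi in zip(w, x))


# Hypothetical per-(layer, operator) latency weights in milliseconds.
w = [1.5, 2.5, 4.0,   # layer 1: conv3x3, conv5x5, conv7x7
     2.0, 3.0, 5.0]   # layer 2: conv3x3, conv5x5, conv7x7
x = one_hot_encode(["conv3x3", "conv7x7"])
latency = linear_latency(w, x)
```

With a one-hot representation, w^T x simply sums the weight of the operator chosen in each layer, so the prediction depends only on which operators were selected.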
-	generating a reward score for the child model that takes into consideration at least accuracy and latency of the child model, wherein producing the accuracy that is used to generate the reward score 
In [A SUMMARY OF EVOLUTIONARY SEARCH, Page 34:24]:
To run evolutionary search, we first randomly sample the initial population of individuals according to the population size. Next, we evaluate the fitness of each individual in the population, where the fitness function is defined as: 
(t − 1) · accuracy + t · latency (4) 
where t ∈ [0, 1] is the weight parameter to balance the tradeoff between accuracy and latency of each individual model, and accuracy and latency are predicted values given by the accuracy and latency predictors, respectively. 
(BRI: the fitness is the reward that depends on the accuracy and latency (see equation (4)))
In [7, Page 34:20]:
fast evaluation of accuracy and inference latency to rank different architectures is crucial for efficient hardware-aware NAS
In [7, Page 34:20]:
Given many diverse devices, scalability of latency evaluation is critically important. A straight forward approach is to build a meta latency predictor that incorporates hardware features as additional input
in [3.3, Page 34:7]:
To quantify the degree of latency monotonicity in practice, we use the metric of Spearman’s Rank Correlation Coefficient (SRCC), which lies between -1 and 1 and assesses statistical dependence between the rankings of two variables using a monotonic function. The greater the SRCC of CNN latencies on two devices, the better the latency monotonicity. SRCC of 0.9 to 1.0 is usually viewed as strongly dependent in terms of monotonicity [3].
(BRI: Latency monotonicity is the stability of the latency showing a stable upward/downward trend rather than erratic spikes)
In [A SUMMARY OF EVOLUTIONARY SEARCH, Page 34:24]:
for each evolutionary search iteration, we select the fittest individuals as parents for reproduction, which will survive in the next generation and also breed new individuals through crossover.
(BRI: fitness serves as the reward used to generate the child, i.e., new individuals bred from the selected parents that survive into the next generation)
In [5.3.2, Page 34:14]:
Checking latency monotonicity. To check whether strong latency monotonicity is satisfied between the selected proxy device and a target device, we estimate the SRCC based on a small set A of sample architectures and then compare it against a threshold.
In [6.1, Page 34:16]:
the imperfection in the accuracy predictor explains why a strong, but not perfect, latency monotonicity (e.g., SRCC>0.9) is enough for our one-proxy approach to find Pareto-optimal architectures for a new target device
(BRI: SRCC (Spearman's Rank Correlation Coefficient) is a statistical method to measure how consistently the order of latencies for different tasks or models stays the same across various devices or platforms, indicating latency monotonicity)
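For illustration only, the latency-monotonicity check discussed above can be sketched by computing Spearman's Rank Correlation Coefficient between latencies measured on two devices and comparing it against a threshold. This is a minimal sketch: the latency values and the helper names `average_ranks` and `srcc` are hypothetical, and the 0.9 threshold is taken from the quoted "SRCC>0.9" example.

```python
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """Return 1-based ranks, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average rank of the tie group
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def srcc(a: Sequence[float], b: Sequence[float]) -> float:
    """Spearman's coefficient: the Pearson correlation of the two rank vectors."""
    ra, rb = average_ranks(a), average_ranks(b)
    n = len(ra)
    mean_a, mean_b = sum(ra) / n, sum(rb) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(ra, rb))
    var_a = sum((x - mean_a) ** 2 for x in ra)
    var_b = sum((y - mean_b) ** 2 for y in rb)
    return cov / (var_a * var_b) ** 0.5


# Hypothetical latencies (ms) of six sample architectures on two devices.
proxy_latency = [12.0, 18.5, 25.1, 31.0, 40.2, 55.0]
target_latency = [30.0, 41.0, 39.5, 70.3, 95.0, 120.0]

rho = srcc(proxy_latency, target_latency)
strong_monotonicity = rho > 0.9  # threshold from the quoted "SRCC>0.9" example
```

Only the ordering of the latencies matters here: the two devices may differ greatly in absolute latency and still yield a coefficient close to 1.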
-	 includes pruning weights of the child model, given sparsity levels of the child model
with the latency measurement samples denoted by (x_i, y_i), we quickly adapt the proxy device’s latency predictor as

[Equation image: media_image5.png]

tailored to the target device, by solving the following problem:

[Equation image: media_image6.png]

 I is the identity vector with all the elements being 1, the operator “◦” denotes the element-wise multiplication, and λ ≥ 0 is a hyperparameter controlling the weight for the sparsity regularization term |b| and tuned based on a small validation set of architectures (20 architectures in our experiment) split from the sample architecture set A.
(BRI: tuning a hyperparameter that controls the weight for sparsity regularization indirectly represents an approach to achieving pruning, as it aims to learn a sparse model, which is a key goal of pruning)
-	adjusting weights of the mutating model based on the reward score to increase a likelihood that the mutating model will make decisions that are rewarded by said generating;
In [2.1.2, Page 34:4]:
In view of the extremely diverse devices and platforms for model deployment, one-shot NAS and its variants such as few-shot NAS have recently been proposed to reduce the search cost by exploiting the weight sharing mechanism 
discover optimal architectures that inherit the weights from the supernet.
(BRI: the process of discovering optimal architectures via weight-sharing Neural Architecture Search (NAS) often involves adjusting (fine-tuning) the inherited weights)
In [A , A.1, SUMMARY OF EVOLUTIONARY SEARCH, Page 34:24]:
For each evolutionary search iteration, we select the fittest individuals as parents for reproduction, which will survive in the next generation and also breed new individuals through crossover. For example, if our population size is 1000 and the parent ratio is 0.25, we have 250 fittest individuals as parents. Then, we randomly select a pair of parents each time for crossover and generate a child. 
In [A, A.1, SUMMARY OF EVOLUTIONARY SEARCH, Page 34:24]:
Also, based on the mutation ratio setting, part of the offsprings will further perform mutation operations. For example, with mutation ratio 0.25 and mutation probability 0.1, 250 out of 750 children have a possibility of 0.1 to mutate
To run evolutionary search, we first randomly sample the initial population of individuals according to the population size. Next, we evaluate the fitness of each individual in the population, where the fitness function is defined as
(t − 1) · accuracy + t · latency (4) 
where t ∈ [0, 1] is the weight parameter to balance the tradeoff between accuracy and latency of each individual model, and accuracy and latency are predicted values given by the accuracy and latency predictors, respectively. 
In [Abstract, Page 34:1]:
In this work, we address the scalability challenge by exploiting latency monotonicity — the architecture latency rankings on different devices are often correlated. 
In [B.1 Latency Monotonicity, Page 34:25]: 
We show the results in Fig. 18, which are in line with our experiments: latency
In [B.1 Latency Monotonicity, Page 34:26]: 
monotonicity among mobile devices is strong (>0.95), while FLOP-latency ranking correlation for mobile devices 
(BRI: fitness is the reward. Within hardware-aware Neural Architecture Search (NAS), the latency (or latency ranking) of a model is commonly incorporated as a reward score or part of a multi-objective fitness function)
-	updating the collection of candidate machine-trained models based on the child model; 
In [6.1.1, Page 34:16]:
NAS Method. We consider one-shot NAS and use the Once-For-All network [9] as a supernet that has the same search space
In [7, Page 34:20]:
NAS uses a super net that includes all the weights for candidate architectures
In [6.1.1, Page 34:16]:
Accuracy Predictor. The evolutionary search is assisted with by an accuracy predictor for fast architecture performance evaluation.
In [6.1.1, Page 34:16]:
Our accuracy predictor is a neural network with four fully-connected layers and updated with 176 samples on top of the predictor 
In [6.1.1, Page 34:16]:
The accuracy predictor takes a 128-dimensional feature vector (which is converted from a 21-dimensional architecture configuration within the search space) as input. Fig. 12(a) compares the actual and predicted accuracies 
(BRI: in the context of Neural Architecture Search (NAS), a supernet is essentially a large, overparameterized neural network that contains many different candidate architectures (subnetworks) within it)
(BRI: an "accuracy predictor" that is updated using real-world data is a key component of a process called continuous learning or continuous training, which ultimately results in an updated collection of machine-trained models)
In [2.1.1, Page 34:3]:
 search process is entangled with the model training process.
In [2.1.1, Page 34:3]:
Specifically, as illustrated in the left subfigure of Fig. 3, the NAS process is governed by a controller
In [2.1.1, Page 34:4]:
(e.g., a reinforcement learning agent): given each candidate architecture produced by the controller, the model is trained on the training dataset and then evaluated for its performance, based on which the controller produces another candidate architecture
(BRI: a machine learning model is trained on a training dataset and its performance is assessed using separate validation and test datasets; this assessment, used for adjusting the model's hyperparameters, can inform a "controller" or system, which then triggers the development of an updated or new model architecture, i.e., continuous training)
and repeating said selecting, mutating, generating, adjusting, and updating until a specified objective is achieved, to produce the chosen machine-trained model.  
In [2.1.2, Page 34:4]:
a search process based on evolutionary algorithms or reinforcement learning to find an optimal architecture
In [2.1.1, Page 34:4]:
given each candidate architecture produced by the controller, the model is trained on the training dataset and then evaluated for its performance, based on which the controller produces another candidate architecture. This process repeats until convergence or the maximum search iteration is reached. 
In [5.2, Page 34:12]:
Our scalable hardware-aware NAS approach is illustrated in Fig. 10 and described in Algorithm 1.


[Image: media_image7.png]

In [5.2, Page 34:12]:

[Image: media_image8.png]

(BRI: a controller that iteratively proposes candidate architectures, which are then trained and evaluated to reach an objective, together with "repeating the selecting, mutating, generating, adjusting, and updating until a specified objective is achieved," describes the general principles of an evolutionary algorithm)
LU does not explicitly disclose:
A computer-readable storage medium for storing computer-readable instructions, one or more hardware processors executing the computer-readable instructions to perform a method that comprises:
However, MUKH discloses:
A computer-readable storage medium for storing computer-readable instructions, one or more hardware processors executing the computer-readable instructions to perform a method that comprises:
In [0008]:
It would have been obvious to one of ordinary skill in the art before the effective filing date of the present application to combine LU and MUKH.
LU teaches using neural architecture search to produce the chosen machine-trained model and selecting the models to meet the latency constraint, model mutations, reward score and sparsity level, and pruning of weights for a given sparsity level of the layers.
MUKH teaches NAS systems comprising processors and storage.
One of ordinary skill would have been motivated to combine LU and MUKH to provide improved performance in exploring the possible space of candidate architectures (MUKH [0053]).
In regard to claim 20: (Previously Presented)
LU does not explicitly disclose:
wherein the latency that is used to generate the reward score is produced using trainable logic that performs prediction.  
However, MUKH discloses:
wherein the latency that is used to generate the reward score is produced using trainable logic that performs prediction.  
In [0035]:
The actual latency performance metric required by the multi-objective reward function (R) can be predicted using a prediction function (P), without actually profiling the Neural Network (NN) architectures on a platform to make a NAS search faster.
In [0006]:
the method includes formulating a multi-objective reward function (R) as a function of the plurality of performance metrics, wherein each of the plurality of performance metrics is individually modulated, prioritized and thresholded based on the relative metric weightage assigned to each of the plurality of performance metric in accordance of requirements of a target application to be executed on the platform via the tiny model



Claim 2 is rejected under 35 U.S.C. 103 as being unpatentable over BINGQIAN LU et al. (hereinafter LU), One Proxy Device Is Enough for Hardware-Aware Neural Architecture Search, Proc. ACM Meas. Anal. Comput. Syst., Vol. 5, No. 3, Article 34, publication date: December 2021,
in view of SHALINA MUKHOPADHYAY et al. (hereinafter MUKH), US 2023/0334330 A1
[Foreign Priority IN202221022177, filed 2022-04-13],
further in view of Andreas Moshovo et al. (hereinafter Mosho), US 2022/0092382 A1.
In regard to claim 2: (Original) 
LU and MUKH do not explicitly disclose:
wherein the different candidate machine-trained models in the collection of candidate machine-trained models include attention layers having different numbers of attention heads and feed-forward neural network layers having different sizes.
However, Mosho discloses:
wherein the different candidate machine-trained models in the collection of candidate machine-trained models include attention layers having different numbers of attention heads and feed-forward neural network layers having different sizes.
In [0049]:
Memory footprint, bandwidth and energy limitations can be most acute for attention-based models in language understanding tasks. Among them, the BERT family of natural language models can deliver best-of-class accuracy. Their footprint, accesses, and execution time are dominated by the parameters (e.g., weights) of their numerous attention layers.
In [0073]:
 Knowledge Distillation: Knowledge distillation can train a smaller model (student) from a larger model (teacher). Based on what the student learns from the teacher there can be three groups of Knowledge distillation approaches for BERT. In the first group, the student learns the behaviour of the encoder layer. The student can have fewer attention heads in each layer or fewer encoder layers.
In [0072]:
 Structured pruning can remove a series of weights that correspond to a component of the model. Attention head pruning and Encoder unit pruning are examples of this approach. 
It would have been obvious to one of ordinary skill in the art before the effective filing date of the present application to combine LU, MUKH and Mosho.
LU teaches using neural architecture search to produce the chosen machine-trained model and selecting the models to meet the latency constraint, model mutations, reward score and sparsity level, and pruning of weights for a given sparsity level of the layers.
MUKH teaches neural architecture search methods and systems.
Mosho teaches reducing the weights.
One of ordinary skill would have been motivated to combine LU, MUKH and Mosho to provide a dataflow of each architecture to improve performance (Mosho [0139]).
Claims 5 and 19 are rejected under 35 U.S.C. 103 as being unpatentable over BINGQIAN LU et al. (hereinafter LU), One Proxy Device Is Enough for Hardware-Aware Neural Architecture Search, Proc. ACM Meas. Anal. Comput. Syst., Vol. 5, No. 3, Article 34, publication date: December 2021,
in view of SHALINA MUKHOPADHYAY et al. (hereinafter MUKH), US 2023/0334330 A1
[Foreign Priority IN202221022177, filed 2022-04-13],
further in view of Wouter Kool et al. (hereinafter Kool), ATTENTION, LEARN TO SOLVE ROUTING PROBLEMS!, published as a conference paper at ICLR 2019, arXiv:1803.08475v3 [stat.ML], 7 Feb 2019,
further in view of Sairam Sundaresan et al. (hereinafter Sundar), US 2022/0036194 A1.
In regard to claim 5: (Currently Amended) 
LU, and MUKH do not explicitly disclose:
-	wherein the mutating model includes a router that routes between an attention layer mutating process and a feed-forward mutating process 
-	depending on whether the selected layer specifies an attention layer or a feed-forward layer neural network layer, 
However, Kool discloses:
-	wherein the mutating model includes a router that routes between an attention layer mutating process and a feed-forward mutating process 
In [3,ATTENTION MODEL,  Page 2]:
We define the Attention Model in terms of the TSP. For other problems, the model is the same but the input, mask and decoder context need to be defined accordingly,
In [3,ATTENTION MODEL,  Page 2]:
We define a problem instance s as a graph with n nodes, where node i ∈ {1, . . . , n} is represented by features x_i,
In [3,ATTENTION MODEL,  Page 3]:
We define a solution (tour) π = (π_1, . . . , π_n) as a permutation of the nodes, so π_t ∈ {1, . . . , n} and π_t ≠ π_{t'} ∀t ≠ t'. Our attention based encoder-decoder model defines a stochastic policy p(π|s) for selecting a solution π given a problem instance s. It is factorized and parameterized by θ as

[Image: media_image9.png]

In [D.3, Page 21]:
default parameters, only adding --op --op-ea4op to indicate that the Genetic Algorithm for the Orienteering Problem should be used. 
In [C, Page 17]:
The Capacitated Vehicle Routing Problem (CVRP) is a generalization of the TSP in which case there is a depot and multiple routes should be created, each starting and ending at the depot. In our graph based formulation, we add a special depot node with index 0 and coordinates x_0. A vehicle (route) has capacity D > 0 and each (regular) node i ∈ {1, . . . n} has a demand 0 < δ_i ≤ D. Each route starts and ends at the depot and the total demand in each route should not exceed the capacity, ∑_{i ∈ R_j} δ_i ≤ D, where R_j is the set of node indices assigned to route j. Without loss of generality, we assume a normalized D̂ = 1 as we can use normalized demands δ̂_i = δ_i/D.
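The capacity constraint quoted above can be illustrated with a minimal Python sketch. The function names, the dictionary layout for demands, and the four-customer instance are hypothetical and are not taken from Kool; the sketch only checks ∑_{i ∈ R_j} δ_i ≤ D per route after dividing each demand by D.

```python
from typing import Dict, List


def normalize_demands(demands: Dict[int, float], capacity: float) -> Dict[int, float]:
    """Return normalized demands delta_i / D, so that the capacity becomes 1."""
    if capacity <= 0:
        raise ValueError("capacity D must be positive")
    return {i: d / capacity for i, d in demands.items()}


def routes_are_feasible(routes: List[List[int]],
                        demands: Dict[int, float],
                        capacity: float = 1.0) -> bool:
    """Check sum_{i in R_j} delta_i <= D for every route R_j.

    Each route lists regular node indices only; the depot (index 0) is
    implicit at the start and end of every route.
    """
    eps = 1e-9  # tolerance for floating-point rounding
    return all(sum(demands[i] for i in route) <= capacity + eps
               for route in routes)


# Hypothetical instance: four customers, vehicle capacity D = 10.
demands = {1: 4.0, 2: 6.0, 3: 5.0, 4: 3.0}
norm = normalize_demands(demands, capacity=10.0)

# Route loads 1.0 and 0.8 both fit within the normalized capacity of 1.
feasible = routes_are_feasible([[1, 2], [3, 4]], norm)
# Route load 1.5 exceeds the normalized capacity of 1.
infeasible = routes_are_feasible([[1, 2, 3], [4]], norm)
```

Working with normalized demands keeps the check independent of the absolute capacity, which is the point of the "without loss of generality" remark in the quoted passage.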
In [E.2, Page 23]:
Encoder Again, we use separate parameters for the depot node embedding. Additionally, we provide the node prize ρ̂_i and the penalty β̂_i as input features:

[Image: media_image10.png]

In [E.2, Page 23]:
Decoder context 
The context for the decoder for the PCTSP at time t is the current/last location π_{t-1} and the remaining prize to collect P_t. Again, we do not need placeholders if t = 1 as the route starts at the depot and we do not need to provide information about the first node as the route should end at the depot. The information about the prizes collected is implicitly provided to the model in the form of P_t and we do not need to provide any information about the penalties as this is irrelevant for the remaining decisions:

[Image: media_image11.png]


-	depending on whether the selected layer specifies an attention layer or a feed-forward layer neural network layer, 
	In [3.1, Page 3]:
The encoder that we use (Figure 1) is similar to the encoder used in the Transformer architecture by Vaswani et al. (2017), but we do not use positional encoding such that the resulting node embeddings are invariant to the input order. 

[Image: media_image12.png]

Figure 1: Attention based encoder. Input nodes are embedded and processed by N sequential layers, each consisting of a multi-head attention (MHA) and node-wise feed-forward (FF) sub-layer. The graph embedding is computed as the mean of node embeddings. Best viewed in color.
in [6, Page 9]:
The multi-head attention mechanism can be seen as a message passing algorithm that allows nodes to communicate relevant information over different channels, such that the node embeddings from the encoder can learn to include valuable information about the node in the context of the graph. This information is important in our setting where decisions relate directly to the nodes in a graph. Being a graph based method, our model has increased scaling potential (compared to LSTMs) as it can be applied on a sparse graph and operate locally.
In [A ATTENTION MODEL DETAILS, Page 13]:
Attention mechanism 
We interpret the attention mechanism by Vaswani et al. (2017) as a weighted message passing algorithm between nodes in a graph. The weight of the message value that a node receives from a neighbor depends on the compatibility of its query with the key of the neighbor, as illustrated in Figure 4. 

[Image: media_image13.png]

Figure 4: Illustration of weighted message passing using a dot-attention mechanism. Only computation of messages received by node 1 are shown for clarity
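The weighted message passing described in the quoted passage can be sketched in a few lines of Python. This is an illustration only: the scaled dot product used for the query/key compatibility, the softmax normalization, and the toy vectors are assumptions in the style of Vaswani et al. (2017), and the helper names are hypothetical rather than taken from Kool.

```python
import math
from typing import List, Tuple


def softmax(scores: List[float]) -> List[float]:
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def message_to_node(query: List[float],
                    keys: List[List[float]],
                    values: List[List[float]]) -> Tuple[List[float], List[float]]:
    """Compute the message one node receives from its neighbors.

    The compatibility of the node's query with each neighbor's key is a
    scaled dot product (assumed); the message is the weighted sum of the
    neighbors' value vectors.
    """
    d_k = len(query)
    compat = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d_k)
              for key in keys]
    weights = softmax(compat)
    dim = len(values[0])
    message = [sum(w * v[j] for w, v in zip(weights, values))
               for j in range(dim)]
    return weights, message


# Toy example: node 1 attends over three neighbors.
query = [1.0, 0.0]
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]
values = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
weights, message = message_to_node(query, keys, values)
```

Neighbors whose keys align with the query (the first and third here) receive equal, larger weights than the orthogonal second neighbor, which is the behavior the quoted passage attributes to the compatibility score.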
In [3.1, Page 3]:
Attention layer
Each attention layer consists of two sublayers: a multi-head attention (MHA) layer that executes message passing between the nodes and a node-wise fully connected feed-forward (FF) layer.
It would have been obvious to one of ordinary skill in the art before the effective filing date of the present application to combine LU, MUKH and Kool.
LU teaches using neural architecture search to produce the chosen machine-trained model and selecting the models to meet the latency constraint, model mutations, a reward score and sparsity level, and pruning of weights for a given sparsity level of the layers.
MUKH teaches neural architecture search methods and systems.
Kool teaches an attention router.
One of ordinary skill would have been motivated to combine LU, MUKH and Kool to provide an attention model with improved performance (Kool [5.2, Page 9]).
LU, MUKH, and Kool do not explicitly disclose:
-	wherein the attention layer mutating process includes selecting a sparsity ratio that defines how many attention heads to remove in the attention layer, and wherein the feed-forward mutating process includes selecting another sparsity ratio that defines a reduction in weights in the feed-forward neural network layer.
However, Sundar discloses: 
-	wherein the attention layer mutating process includes selecting a sparsity ratio that defines how many attention heads to remove in the attention layer,
in [0037]:
 Referring to FIG. 2, an input image 220 (e.g., a still picture or a video frame) is fed to both the supernet 205 (also referred to herein as “teacher 205” or “Φ.sub.T”) and the subnet 201 (also referred to herein as “student 201” or “Φ.sub.S”) of the sparse distillation system 200. The supernet 205 may be the same or similar to the supernet 105 of FIG. 1, and the subnet 201 may be the same or similar to the subnet 101 of FIG. 1. During distillation 202, an output of the supernet 205 is used to train the subnet 201. This is discussed in more detail infra in section 1.1. FIG. 2 also shows an attention layer 210 that is part of the subnet 201. The attention layer 210 may be included as part of the subnet 201, or the subnet 201 may be communicatively coupled with the attention layer 210. The attention layer 210 includes a pruning mechanism 400 and an SA mechanism 300. FIGS. 4, 3, and 5 show the internal attention layer 210 computation mechanics of the student model Φ.sub.S. In particular, FIG. 4 shows the mechanics of the pruning mechanism 400 (discussed infra in section 1.3) and FIGS. 3 and 5 show the mechanics of the SA mechanism 300 (discussed infra in section 1.2).
 -	and wherein the feed-forward mutating process includes selecting another sparsity ratio that defines a reduction in weights in the feed-forward neural network layer.
in [0055]:
To prune the student model Φ.sub.S while simultaneously distilling knowledge from the teacher Φ.sub.T, the Φ.sub.S's total trainable parameters is/are updated, and then a mask is used to forcibly set a fraction of these parameters to zero. Here, the “mask” for individual layers is a binary tensor used to make sure the trainable parameters have a fixed number of non-zeros based on a given pruning budget.
(BRI: the fraction of the parameters set to zero provides the sparsity ratio)
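The masking step quoted from Sundar [0055] can be sketched as follows. The quoted paragraph does not state how the zeroed parameters are chosen, so this sketch assumes magnitude-based selection (smallest absolute values are zeroed first); the function names `budget_mask` and `apply_mask` and the sample weights are hypothetical.

```python
import math
from typing import List


def budget_mask(weights: List[float], sparsity_ratio: float) -> List[int]:
    """Build a binary mask that zeroes a fixed fraction of the weights.

    sparsity_ratio is the fraction of parameters forced to zero, so the
    number of non-zeros is fixed by the pruning budget.
    Assumption: the smallest-magnitude weights are the ones removed.
    """
    if not 0.0 <= sparsity_ratio <= 1.0:
        raise ValueError("sparsity_ratio must lie in [0, 1]")
    n_zero = math.floor(sparsity_ratio * len(weights))
    # Indices ordered by ascending magnitude; the first n_zero are pruned.
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    mask = [1] * len(weights)
    for i in order[:n_zero]:
        mask[i] = 0
    return mask


def apply_mask(weights: List[float], mask: List[int]) -> List[float]:
    """Element-wise product of the weights and the binary mask."""
    return [w * m for w, m in zip(weights, mask)]


# Hypothetical layer with six weights and a 50% pruning budget.
layer_weights = [0.9, -0.1, 0.05, -0.7, 0.3, 0.02]
mask = budget_mask(layer_weights, sparsity_ratio=0.5)
pruned = apply_mask(layer_weights, mask)
```

Under these assumptions the ratio of zeroed entries to total entries equals the chosen sparsity ratio, which is the reading the BRI note relies on.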
In [0238]:
Model parameters are usually required by a model when making predictions, and their values define the skill of the model on a particular problem. Examples of such model parameters/parameters include weights (e.g., in an ANN); 
In [0037]:
The attention layer 210 may be included as part of the subnet 201, or the subnet 201 may be communicatively coupled with the attention layer 210. The attention layer 210 includes a pruning mechanism 400 and an SA mechanism 300. FIGS. 4, 3, and 5 show the internal attention layer 210 computation mechanics of the student model Φ.sub.S,
in [0203]:
 The terms “artificial neural network”, “neural network”, or “NN” refer to an ML technique comprising a collection of connected artificial neurons or nodes 
In [0203]:
the artificial neurons can be aggregated or grouped into one or more layers where different layers may perform different transformations on their inputs. Signals travel from the first layer (the input layer), to the last layer (the output layer), possibly after traversing the layers multiple times. NNs are usually used for supervised learning, but can be used for unsupervised learning as well. Examples of NNs include deep NN (DNN), feed forward NN (FFN). 
selecting another sparsity ratio that defines a reduction in weights in the feed-forward neural network layer.  
in [0055]:
To prune the student model Φ.sub.S while simultaneously distilling knowledge from the teacher Φ.sub.T, the Φ.sub.S's total trainable parameters is/are updated, and then a mask is used to forcibly set a fraction of these parameters to zero. Here, the “mask” for individual layers is a binary tensor used to make sure the trainable parameters have a fixed number of non-zeros based on a given pruning budget.
(BRI: the fraction of the parameters set to zero provides the sparsity ratio and indicates the reduction in weights)
In [0238]:
Model parameters are usually required by a model when making predictions, and their values define the skill of the model on a particular problem. Examples of such model parameters/parameters include weights (e.g., in an ANN); 
in [0203]:
The terms “artificial neural network”, “neural network”, or “NN” refer to an ML technique comprising a collection of connected artificial neurons or nodes 
In [0203]:
the artificial neurons can be aggregated or grouped into one or more layers where different layers may perform different transformations on their inputs. Signals travel from the first layer (the input layer), to the last layer (the output layer), possibly after traversing the layers multiple times. NNs are usually used for supervised learning, but can be used for unsupervised learning as well. Examples of NNs include deep NN (DNN), feed forward NN (FFN). 
It would have been obvious to one of ordinary skill in the art before the effective filing date of the present application to combine LU, MUKH, Kool and Sundar.
LU teaches using neural architecture search to produce the chosen machine-trained model and selecting the models to meet the latency constraint, model mutations, a reward score and sparsity level, and pruning of weights for a given sparsity level of the layers.
MUKH teaches neural architecture search methods and systems.
Kool teaches an attention router.
Sundar teaches parent and student models.
One of ordinary skill would have been motivated to combine LU, MUKH, Kool and Sundar to improve the performance of the student model (Sundar [0039]).
In regard to claim 19: (Previously Presented) 
LU and MUKH do not explicitly disclose:
-	wherein the mutating model includes a router that routes between an attention layer mutating process and a feed-forward mutating process 
-	depending on whether the selected layer specifies an attention layer or a feed-forward layer neural network layer, 
However, Kool discloses:
-	wherein the mutating model includes a router that routes between an attention layer mutating process and a feed-forward mutating process 
In [3,ATTENTION MODEL,  Page 2]:
We define the Attention Model in terms of the TSP. For other problems, the model is the same but the input, mask and decoder context need to be defined accordingly,
In [3,ATTENTION MODEL,  Page 2]:
We define a problem instance s as a graph with n nodes, where node i ∈ {1, . . . , n} is represented by features x_i,
In [3,ATTENTION MODEL,  Page 3]:
We define a solution (tour) π = (π_1, . . . , π_n) as a permutation of the nodes, so π_t ∈ {1, . . . n} and π_t ≠ π_{t'} ∀t ≠ t'. Our attention based encoder-decoder model defines a stochastic policy p(π|s) for selecting a solution π given a problem instance s. It is factorized and parameterized by θ as

[Image: media_image9.png]
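The tour definition quoted above can be made concrete with a short Python sketch. The factorized expression itself appears only as an image and is not reproduced in this record, so the sketch assumes the usual chain-rule form, in which the tour probability is the product of per-step selection probabilities; the function names and the step probabilities are hypothetical.

```python
import math
from typing import Sequence


def is_valid_tour(tour: Sequence[int], n: int) -> bool:
    """A tour is valid when it is a permutation of the nodes 1..n,
    i.e. every pi_t lies in {1, ..., n} and no node repeats."""
    return sorted(tour) == list(range(1, n + 1))


def tour_log_prob(step_probs: Sequence[float]) -> float:
    """Log-probability of a tour under an assumed chain-rule factorization:
    log p(pi | s) = sum over t of log p(pi_t | s, pi_1..pi_{t-1})."""
    if any(p <= 0.0 or p > 1.0 for p in step_probs):
        raise ValueError("each step probability must lie in (0, 1]")
    return sum(math.log(p) for p in step_probs)


# Hypothetical 4-node example.
valid = is_valid_tour([3, 1, 2, 4], n=4)
repeated = is_valid_tour([3, 1, 1, 4], n=4)  # node 1 appears twice

# Made-up per-step probabilities for the four selections.
log_p = tour_log_prob([0.5, 0.5, 0.5, 1.0])
prob = math.exp(log_p)
```

Summing log-probabilities rather than multiplying raw probabilities is a common choice for numerical stability when n is large.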

In [D.3, Page 21]:
default parameters, only adding --op --op-ea4op to indicate that the Genetic Algorithm for the Orienteering Problem should be used. 
In [C, Page 17]:
The Capacitated Vehicle Routing Problem (CVRP) is a generalization of the TSP in which case there is a depot and multiple routes should be created, each starting and ending at the depot. In our graph based formulation, we add a special depot node with index 0 and coordinates x_0. A vehicle (route) has capacity D > 0 and each (regular) node i ∈ {1, . . . n} has a demand 0 < δ_i ≤ D. Each route starts and ends at the depot and the total demand in each route should not exceed the capacity, ∑_{i ∈ R_j} δ_i ≤ D, where R_j is the set of node indices assigned to route j. Without loss of generality, we assume a normalized D̂ = 1 as we can use normalized demands δ̂_i = δ_i/D.
In [E.2, Page 23]:
Encoder Again, we use separate parameters for the depot node embedding. Additionally, we provide the node prize ρ̂_i and the penalty β̂_i as input features:

[Image: media_image10.png]

In [E.2, Page 23]:
Decoder context 
The context for the decoder for the PCTSP at time t is the current/last location π_{t-1} and the remaining prize to collect P_t. Again, we do not need placeholders if t = 1 as the route starts at the depot and we do not need to provide information about the first node as the route should end at the depot. The information about the prizes collected is implicitly provided to the model in the form of P_t and we do not need to provide any information about the penalties as this is irrelevant for the remaining decisions:

[Image: media_image11.png]


-	depending on whether the selected layer specifies an attention layer or a feed-forward layer neural network layer, 
	In [3.1, Page 3]:
The encoder that we use (Figure 1) is similar to the encoder used in the Transformer architecture by Vaswani et al. (2017), but we do not use positional encoding such that the resulting node embeddings are invariant to the input order. 

[Image: media_image12.png]

Figure 1: Attention based encoder. Input nodes are embedded and processed by N sequential layers, each consisting of a multi-head attention (MHA) and node-wise feed-forward (FF) sub-layer. The graph embedding is computed as the mean of node embeddings. Best viewed in color.
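The statement in the caption that the graph embedding is the mean of the node embeddings can be illustrated with a small Python sketch. The function name and the two-dimensional embeddings are hypothetical; the sketch also shows that the mean does not depend on the order in which the nodes are listed.

```python
from typing import List


def graph_embedding(node_embeddings: List[List[float]]) -> List[float]:
    """Mean of the node embeddings, computed per dimension."""
    n = len(node_embeddings)
    if n == 0:
        raise ValueError("at least one node embedding is required")
    dim = len(node_embeddings[0])
    return [sum(h[j] for h in node_embeddings) / n for j in range(dim)]


# Hypothetical final-layer embeddings for three nodes.
nodes = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
h_graph = graph_embedding(nodes)

# Listing the same nodes in a different order gives the same result.
h_graph_reordered = graph_embedding(list(reversed(nodes)))
```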
in [6, Page 9]:
The multi-head attention mechanism can be seen as a message passing algorithm that allows nodes to communicate relevant information over different channels, such that the node embeddings from the encoder can learn to include valuable information about the node in the context of the graph. This information is important in our setting where decisions relate directly to the nodes in a graph. Being a graph based method, our model has increased scaling potential (compared to LSTMs) as it can be applied on a sparse graph and operate locally.
In [A ATTENTION MODEL DETAILS, Page 13]:
Attention mechanism 
We interpret the attention mechanism by Vaswani et al. (2017) as a weighted message passing algorithm between nodes in a graph. The weight of the message value that a node receives from a neighbor depends on the compatibility of its query with the key of the neighbor, as illustrated in Figure 4. 

[Image: media_image13.png]

Figure 4: Illustration of weighted message passing using a dot-attention mechanism. Only computation of messages received by node 1 are shown for clarity
In [3.1, Page 3]:
Attention layer
Each attention layer consists of two sublayers: a multi-head attention (MHA) layer that executes message passing between the nodes and a node-wise fully connected feed-forward (FF) layer.
It would have been obvious to one of ordinary skill in the art before the effective filing date of the present application to combine LU, MUKH and Kool.
LU teaches using neural architecture search to produce the chosen machine-trained model and selecting the models to meet the latency constraint, model mutations, a reward score and sparsity level, and pruning of weights for a given sparsity level of the layers.
MUKH teaches neural architecture search methods and systems.
Kool teaches an attention router.
One of ordinary skill would have been motivated to combine LU, MUKH and Kool to provide an attention model with improved performance (Kool [5.2, Page 9]).
LU, MUKH and Kool do not explicitly disclose:
-	wherein the attention layer mutating process includes selecting a sparsity ratio that defines how many attention heads to remove in the attention layer, and wherein the feed-forward mutating process includes selecting another sparsity ratio that defines a reduction in weights in the feed-forward neural network layer.
However, Sundar discloses: 
-	wherein the attention layer mutating process includes selecting a sparsity ratio that defines how many attention heads to remove in the attention layer,
in [0037]:
 Referring to FIG. 2, an input image 220 (e.g., a still picture or a video frame) is fed to both the supernet 205 (also referred to herein as “teacher 205” or “Φ.sub.T”) and the subnet 201 (also referred to herein as “student 201” or “Φ.sub.S”) of the sparse distillation system 200. The supernet 205 may be the same or similar to the supernet 105 of FIG. 1, and the subnet 201 may be the same or similar to the subnet 101 of FIG. 1. During distillation 202, an output of the supernet 205 is used to train the subnet 201. This is discussed in more detail infra in section 1.1. FIG. 2 also shows an attention layer 210 that is part of the subnet 201. The attention layer 210 may be included as part of the subnet 201, or the subnet 201 may be communicatively coupled with the attention layer 210. The attention layer 210 includes a pruning mechanism 400 and an SA mechanism 300. FIGS. 4, 3, and 5 show the internal attention layer 210 computation mechanics of the student model Φ.sub.S. In particular, FIG. 4 shows the mechanics of the pruning mechanism 400 (discussed infra in section 1.3) and FIGS. 3 and 5 show the mechanics of the SA mechanism 300 (discussed infra in section 1.2).
 -	and wherein the feed-forward mutating process includes selecting another sparsity ratio that defines a reduction in weights in the feed-forward neural network layer.
in [0055]:
To prune the student model Φ.sub.S while simultaneously distilling knowledge from the teacher Φ.sub.T, the Φ.sub.S's total trainable parameters is/are updated, and then a mask is used to forcibly set a fraction of these parameters to zero. Here, the “mask” for individual layers is a binary tensor used to make sure the trainable parameters have a fixed number of non-zeros based on a given pruning budget.
(BRI: the fraction of the parameters set to zero provides the sparsity ratio)
In [0238]:
Model parameters are usually required by a model when making predictions, and their values define the skill of the model on a particular problem. Examples of such model parameters/parameters include weights (e.g., in an ANN); 
In [0037]:
The attention layer 210 may be included as part of the subnet 201, or the subnet 201 may be communicatively coupled with the attention layer 210. The attention layer 210 includes a pruning mechanism 400 and an SA mechanism 300. FIGS. 4, 3, and 5 show the internal attention layer 210 computation mechanics of the student model Φ.sub.S,
in [0203]:
 The terms “artificial neural network”, “neural network”, or “NN” refer to an ML technique comprising a collection of connected artificial neurons or nodes 
In [0203]:
the artificial neurons can be aggregated or grouped into one or more layers where different layers may perform different transformations on their inputs. Signals travel from the first layer (the input layer), to the last layer (the output layer), possibly after traversing the layers multiple times. NNs are usually used for supervised learning, but can be used for unsupervised learning as well. Examples of NNs include deep NN (DNN), feed forward NN (FFN). 
selecting another sparsity ratio that defines a reduction in weights in the feed-forward neural network layer.  
in [0055]:
To prune the student model Φ.sub.S while simultaneously distilling knowledge from the teacher Φ.sub.T, the Φ.sub.S's total trainable parameters is/are updated, and then a mask is used to forcibly set a fraction of these parameters to zero. Here, the “mask” for individual layers is a binary tensor used to make sure the trainable parameters have a fixed number of non-zeros based on a given pruning budget.
(BRI: the fraction of the parameters set to zero provides the sparsity ratio and indicates the reduction in weights)
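For illustration only, the masking described in the quoted [0055] can be sketched as follows. The quoted passage does not state how the zeroed parameters are chosen, so selection by smallest magnitude is an assumption here, and the names magnitude_mask, apply_mask and sparsity_ratio are hypothetical.

```python
def magnitude_mask(weights, sparsity_ratio):
    """Build a binary (0/1) mask that zeroes a fixed fraction of the weights.

    Assumption: the weights with the smallest absolute value are the ones
    zeroed; the quoted passage only requires a fixed number of non-zeros.
    """
    n_zero = int(round(len(weights) * sparsity_ratio))
    # Indices ordered from smallest to largest magnitude.
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    mask = [1] * len(weights)
    for i in order[:n_zero]:
        mask[i] = 0
    return mask


def apply_mask(weights, mask):
    """Forcibly set masked positions to zero."""
    return [w * m for w, m in zip(weights, mask)]


weights = [0.9, -0.05, 0.4, 0.01, -0.7, 0.2]
mask = magnitude_mask(weights, sparsity_ratio=0.5)
pruned = apply_mask(weights, mask)
```

With a sparsity ratio of 0.5 and six weights, three positions are zeroed and three non-zeros remain, which is the sense in which the fraction set to zero fixes the reduction in weights.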
In [0238]:
Model parameters are usually required by a model when making predictions, and their values define the skill of the model on a particular problem. Examples of such model parameters/parameters include weights (e.g., in an ANN); 
in [0203]:
The terms “artificial neural network”, “neural network”, or “NN” refer to an ML technique comprising a collection of connected artificial neurons or nodes 
In [0203]:
the artificial neurons can be aggregated or grouped into one or more layers where different layers may perform different transformations on their inputs. Signals travel from the first layer (the input layer), to the last layer (the output layer), possibly after traversing the layers multiple times. NNs are usually used for supervised learning, but can be used for unsupervised learning as well. Examples of NNs include deep NN (DNN), feed forward NN (FFN). 
It would have been obvious to one of ordinary skill in the art before the effective filing date of the present application to combine LU, MUKH, Kool and Sundar.
LU teaches using neural architecture search to produce the chosen machine-trained model and selecting the models to meet the latency constraint, model mutations, reward score and sparsity level, and pruning of weights for a given sparsity level of the layers.
MUKH teaches neural architecture search methods and systems.
Kool teaches attention router.
Sundar teaches parent and student models.
One of ordinary skill would have motivation to combine LU, MUKH, Kool and Sundar to improve the performance of the student model (Sundar [0039]).
Claims 10-12 are rejected under 35 U.S.C. 103 as being unpatentable over
BINGQIAN LU et al. (hereinafter LU) One Proxy Device Is Enough for Hardware-Aware Neural Architecture Search, Proc. ACM Meas. Anal. Comput. Syst., Vol. 5, No. 3, Article 34. Publication date: December 2021
in view of SHALINA MUKHOPADHYAY et al. (hereinafter MUKH) US 2023/0334330 A1
[Foreign Priority IN202221022177 Filed 2022-04-13],
further in view of Suthee Chaidaroon et al. (hereinafter Chai) US 2022/0245706 A1.
In regard to claim 10: (Currently Amended) 
LU and MUKH do not explicitly disclose:
-	determining that a data store does not yet store an analysis result produced by a preliminary process; 
-	and applying the chosen machine-trained model to perform the application task in response to the determining that the data store does not yet store the analysis result
However, Chai discloses: 
-	determining that a data store does not yet store an analysis result produced by a preliminary process; 
in [0004]:
the methods and apparatuses of the present disclosure deliver improved results over existing systems by delivering items with a higher probability of relevancy to the customer's query.
In [0054]:
process the query information 502 to make the query information more suitable for processing by the query encoder 504,
in [0060]:
Both the query-item list and the position information can be stored in a datastore, such as in database 208.
In [0066]:
 A predetermined engagement criteria can be used to label the query-item pairs. In one example, the query-item pairs can be labelled as follows. An engagement score between 31 to 21 can be assigned to query-item pairs that the engagement data indicates that a customer actually clicked on or selected the item when it was presented to the user in a listing of search results
and applying the chosen machine-trained model to perform the application task in response to the determining that the data store does not yet store the analysis result 
In [0053]:
 The retrieval computing device 202 can also be coupled to the database 208. The retrieval computing device 202 can access various types and quantities of data from the database 208. The database 208 can include query information 410, item information 412 and popularity information 414. In addition, the final search results that are determined by the retrieval computing device 202 can also be stored in the database 208.
In [0058]:
shown in FIG. 7, the embedding model 702 and the blending model (as will be further described below) can be trained offline.  For example, the query-item data pairs and the item information can be collected and used to train the embedding model. The trained model can then be implemented to generate a query-item list. The query-item list can determine items to return as relevant search results when a query is entered by a customer. 
In [0053]:
the final search results that are determined by the retrieval computing device 202 can also be stored in the database 208.
(BRI: the result of the offline-trained model is not stored, and the final search results for the entered query are stored)
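For illustration only, the claim 10 limitation at issue (apply the chosen model when the data store does not yet hold a precomputed analysis result) can be sketched as a lookup with a fallback. This is not drawn from Chai or from the application; relevance_for, data_store and chosen_model are hypothetical names.

```python
def relevance_for(item_id, data_store, chosen_model):
    """Return (score, source) for an item.

    If the data store already holds a result produced by a preliminary
    (offline) process, use it; otherwise apply the chosen model online.
    """
    if item_id in data_store:
        return data_store[item_id], "stored"
    return chosen_model(item_id), "chosen_model"


# Hypothetical store: only item "a" has a precomputed result.
data_store = {"a": 0.82}
# Hypothetical stand-in for the chosen machine-trained model.
chosen_model = lambda item_id: 0.5

stored_result = relevance_for("a", data_store, chosen_model)
fallback_result = relevance_for("b", data_store, chosen_model)
```

The point of the sketch is the conditional: the model is invoked only on the branch where the store has no entry for the item.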
It would have been obvious to one of ordinary skill in the art before the effective filing date of the present application to combine LU and Chai.
LU teaches using neural architecture search to produce the chosen machine-trained model and selecting the models to meet the latency constraint, model mutations, teaching and reward score.
Chai teaches relevancy score and encoding vector representation.
One of ordinary skill would have motivation to combine LU and Chai to provide improved results with high probability of customer’s relevancy (Chai [0004]).
In regard to claim 11: (Previously Presented)
LU does not explicitly disclose:
wherein said applying includes:
receiving a query from a user; 
forming a combination of the query and a first target item; 
and based on the combination, determining a relevance score for the first target item using the chosen machine-trained model, the relevance score measuring a relevance of the query to the first target item.  
However, Chai discloses:
receiving a query from a user; 
in [0056]:
a search server 654 coupled to a search index 652. The legacy retrieval system 650 can receive a query from a customer 602.
forming a combination of the query and a first target item; 
in [0049]:
Query-item pair information can be information that includes the content of the user's query (e.g., the words entered in the search string) and the item that was clicked or purchased after search results were displayed to the user.
In [0012]:
 a method can include obtaining query information characterizing a query initiated by a customer on an ecommerce marketplace and determining embedding-based search results comprising a first list of items.
and based on the combination, determining a relevance score for the first target item using the chosen machine-trained model, the relevance score measuring a relevance of the query to the first target item.  
in [0004]:
The methods and apparatuses of the present disclosure deliver improved results over existing systems by delivering items with a higher probability of relevancy to the customer's query.
It would have been obvious to one of ordinary skill in the art before the effective filing date of the present application to combine LU and Chai.
LU teaches using neural architecture search to produce the chosen machine-trained model and selecting the models to meet the latency constraint, model mutations, teaching and reward score.
Chai teaches relevancy score and encoding vector representation.
One of ordinary skill would have motivation to combine LU and Chai to provide improved results with high probability of customer’s relevancy (Chai [0004]).
In regard to claim 12: (Previously Presented)
LU  does not explicitly disclose:
wherein said applying further includes:
and having been generated in an offline process prior to receipt of the query
and determining a relevance score for the second target item using another machine-trained model, different from the chosen machine-trained model, based on the item encoding vector that is retrieved, 
the relevance score for the second target item measuring a relevance of the query to the second target item, wherein the chosen machine-trained model is used in response to determining that an item encoding vector has not yet been generated for the first target item.  
retrieving an item encoding vector for a second target item, the item encoding vector representing semantic content in the second target item  
However, Chai discloses:
and having been generated in an offline process prior to receipt of the query; 
in [0058]:
 As previously stated, the determination of the second set of search results by the embedding-based product retrieval system 600 can be performed offline or in real time. In one implementation, the processing of the embedding-based product retrieval system 600 can be performed offline. In such an example, shown in FIG. 7, the embedding model 702 and the blending model (as will be further described below) can be trained offline. For example, the query-item data pairs and the item information can be collected and used to train the embedding model. The trained model can then be implemented to generate a query-item list. The query-item list can determine items to return as relevant search results when a query is entered by a customer
the relevance score for the second target item measuring a relevance of the query to the second target item, wherein the chosen machine-trained model is used in response to determining that an item encoding vector has not yet been generated for the first target item.  
in [0027]:
The search string often includes multiple words that are then searched against the catalog of items available on the ecommerce marketplace. The search tool can then return a list of items that are available on the ecommerce marketplace. The user can then select (e.g., click) the items in the search results listing to view or purchase the item.
In [0029]:
The problems associated with many existing search methods and systems is that existing systems can use lexical matching in which the tokens in the user's search query are compared against tokens in item titles to determine a relevance between the query and the item. A token is a separable element in the search query or in the item title such as a word, number, symbol or the like. For example, in a search string that is entered by a user such as “running gear”, the search can include two tokens “running” and “gear”. 
In [0050]:
The feature generator 402 can also operate to tokenize a query and to tokenize a product title. The feature generator 402 can, for example, generate a vector from a search string by separating the search string into the individual words or other tokens. The feature generator 402 can use any suitable tokenization process including tokenizing the query into unigrams, bi-grams, or tri-grams, for example. The feature generator 402 can also partition the obtained data into training data, test data and development data. Each of these sets can be partitioned by a query.
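For illustration only, the unigram, bi-gram and tri-gram tokenization mentioned in the quoted [0050] can be sketched as below. Splitting on whitespace and lowercasing are assumptions, and ngrams is a hypothetical helper, not a function named in Chai.

```python
def ngrams(text, n):
    """Split a search string into word-level n-grams.

    Assumption: tokens are whitespace-separated and lowercased.
    """
    tokens = text.lower().split()
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


query = "running gear"
unigrams = ngrams(query, 1)
bigrams = ngrams(query, 2)
trigrams = ngrams(query, 3)
```

For the two-word example string used in Chai [0029], this yields two unigrams, one bi-gram and no tri-grams.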
In [0011]:
In another aspect, the embedding-based machine learning model can be trained using a training method comprising obtaining query-item pair data comprising a queries and item titles and tokenizing the queries and the item titles in the query-item pair data. 
(BRI: the tokenization method has not used the encoding vector yet. The embedding model does not always use encoder vectors)
In [0030]:
The present disclosure uses methods and apparatuses that can address these problems and instead can include a query encoder and an item encoder that can map the query and the item titles into a semantic space to determine the similarity between the query and the items
(BRI: the problem is now addressed using encoding vectors)
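For illustration only, a relevance score computed from a query vector and item vectors in a shared semantic space can be sketched as below. The quoted passages of Chai do not specify the similarity measure, so cosine similarity is an assumption, and the vectors are made-up values standing in for encoder outputs.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


# Hypothetical encoder outputs (not taken from any reference).
query_vec = [0.8, 0.6, 0.0]
item_vecs = {
    "trail running shoes": [0.6, 0.8, 0.0],
    "kitchen blender": [0.0, 0.0, 1.0],
}

# Score each item against the query and rank by descending relevance.
scores = {title: cosine(query_vec, vec) for title, vec in item_vecs.items()}
ranked = sorted(scores, key=scores.get, reverse=True)
```

Because the item vectors can be computed before any query arrives, only the query needs encoding at search time, which is the distinction between precomputed and not-yet-computed item vectors that the claim 12 mapping turns on.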
retrieving an item encoding vector for a second target item, the item encoding vector representing semantic content in the second target item 
in [0027]:
The search string often includes multiple words that are then searched against the catalog of items available on the ecommerce marketplace. The search tool can then return a list of items that are available on the ecommerce marketplace. The user can then select (e.g., click) the items in the search results listing to view or purchase the item.
In [0055]:
the product retrieval system 500 can also include the item encoder 512 that can determine an item vector 514 based on the item information 510.
In [0030]:
The item encoder 112 can determine an item vector that is projected onto the semantic space 114. As further shown, the query 102 is projected into the semantic space 114 and is represented by query projection 116,
and determining a relevance score for the second target item using another machine-trained model, different from the chosen machine-trained model, based on the item encoding vector that is retrieved, 
in [0027]:
The search string often includes multiple words that are then searched against the catalog of items available on the ecommerce marketplace. The search tool can then return a list of items that are available on the ecommerce marketplace. The user can then select (e.g., click) the items in the search results listing to view or purchase the item.
In [0030]:
 The present disclosure uses methods and apparatuses that can address these problems and instead can include a query encoder and an item encoder that can map the query and the item titles into a semantic space to determine the similarity between the query and the items to better determine a relevancy between the query and the items in the catalog.
In [0004]:
The methods and apparatuses of the present disclosure deliver improved results over existing systems by delivering items with a higher probability of relevancy to the customer's query.
Claim 24 is rejected under 35 U.S.C. 103 as being unpatentable over
BINGQIAN LU et al. (hereinafter LU) One Proxy Device Is Enough for Hardware-Aware Neural Architecture Search, Proc. ACM Meas. Anal. Comput. Syst., Vol. 5, No. 3, Article 34. Publication date: December 2021
in view of SHALINA MUKHOPADHYAY et al. (hereinafter MUKH) US 2023/0334330 A1
further in view of Dong Yang et al. (hereinafter Yang) US 2022/0284582 A1.
In regard to claim 24: (New)
LU, and MUKH do not explicitly disclose:
-	wherein the second neural network also receives a sparsity level of the parent model for the selected layer.
However, Yang discloses:
-	wherein the second neural network also receives a sparsity level of the parent model for the selected layer.
In [0508]:
Neural architecture search (NAS) focuses on designing neural network automatically.
In [0508]:
A search space defines what architectures (e.g., which neural networks) can be searched, which can be further divided into a network topology level and cell level.
In [0515]:
In at least one embodiment, a cell search space comprises a set of candidate operations for a second selection of one or more of a set of candidate operations at each of a set of candidate edges or paths. In at least one embodiment, each of a set of candidate operations of a candidate feature node is to receive an input feature map and provides an output feature map after performing a candidate operation. 
In [0515]:
at least one embodiment, a first candidate feature node of a set of candidate feature nodes comprises: a first candidate edge comprising a downsample operation
(BRI: set of candidate feature nodes can include edges where a downsample operation connects a parent node to a child node)
 In [0336]:
In at least one embodiment, neurons 1902 may be organized into one or more layers
In [0336]:
In at least one embodiment, neuron outputs 1906 of neurons 1902 in a first layer 1910 may be connected to neuron inputs 1904 of neurons 1902 in a second layer 1912. In at least one embodiment, layer 1910 may be referred to as a “feed-forward layer.” In at least one embodiment, each instance of neuron 1902 in an instance of first layer 1910 may fan out to each instance of neuron 1902 in second layer 1912. In at least one embodiment, first layer 1910 may be referred to as a “fully connected feed-forward layer.” In at least one embodiment, each instance of neuron 1902 in an instance of second layer 1912 may fan out to fewer than all instances of neuron 1902 in a third layer 1914. In at least one embodiment, second layer 1912 may be referred to as a “sparsely connected feed-forward layer.” 
In [0336]:
In at least one embodiment, neuromorphic processor 1900 may include, without limitation, any suitable combination of recurrent layers and feed-forward layers, including, without limitation, both sparsely connected feed-forward layers and fully connected feed-forward layers.
In [0074]:
in at least one embodiment, training framework 204 trains untrained neural network 206 until untrained neural network 206 achieves a desired accuracy.
In [0475]:
In at least one embodiment, an application may run on a GPU-accelerated instance generated in cloud 3126, and an inference service may perform inferencing on a GPU.
(BRI: an inference service running on a GPU inherently has latency, as latency is the time delay from the moment an input is received until a prediction is returned). 
It would have been obvious to one of ordinary skill in the art before the effective filing date of the present application to combine LU, MUKH, and Yang.
LU teaches model mutations and reward score.
MUKH teaches neural architecture search methods and systems.
Yang teaches sparsity of layers.
One of ordinary skill would have motivation to combine LU, MUKH, and Yang to provide improved performance for memory accesses (Yang [0435]).
Claim 25 is rejected under 35 U.S.C. 103 as being unpatentable over
BINGQIAN LU et al. (hereinafter LU) One Proxy Device Is Enough for Hardware-Aware Neural Architecture Search, Proc. ACM Meas. Anal. Comput. Syst., Vol. 5, No. 3, Article 34. Publication date: December 2021
in view of SHALINA MUKHOPADHYAY et al. (hereinafter MUKH) US 2023/0334330 A1
[Foreign Priority IN202221022177 Filed 2022-04-13],
further in view of Suthee Chaidaroon et al. (hereinafter Chai) US 2022/0245706 A1,
further in view of Dong Yang et al. (hereinafter Yang) US 2022/0284582 A1.
In regard to claim 25: (New)
LU, and MUKH do not explicitly disclose:
-	a first relevance-assessing process generates relevance scores for a first class of target items using a component having two processing paths, 
-	wherein the first class of target items includes associated encoding vectors stored in the data store, 
-	and wherein the second class of target items do not yet include associated encoding vectors stored in the data store.
However, Chai discloses:
-	wherein the first class of target items includes associated encoding vectors stored in the data store, and wherein the second class of target items do not yet include associated encoding vectors stored in the data store.
In [0055]:
the product retrieval system 500 can also include the item encoder 512 that can determine an item vector 514 based on the item information 510. 
In [0033]:
The marketplace computing device 214 can collect information such as queries that are entered by customers as well as collect information regarding how customers interacted with the search results that are returned by the retrieval computing device 202. The marketplace computing device 214 can store such information and/or send such information for storage in the database 208 or in other components of the product retrieval system 200.
In [0029]:
The problems associated with many existing search methods and systems is that existing systems can use lexical matching in which the tokens in the user's search query are compared against tokens in item titles to determine a relevance between the query and the item. A token is a separable element in the search query or in the item title such as a word, number, symbol or the like. For example, in a search string that is entered by a user such as “running gear”, the search can include two tokens “running” and “gear”. 
In [0011]:
In another aspect, the embedding-based machine learning model can be trained using a training method comprising obtaining query-item pair data comprising a queries and item titles and tokenizing the queries and the item titles in the query-item pair data. 
(BRI: the tokenization method has not used the encoding vector yet. The embedding model does not always use encoder vectors)
It would have been obvious to one of ordinary skill in the art before the effective filing date of the present application to combine LU, MUKH, and Chai.
MUKH teaches using neural architecture search to produce the chosen machine-trained model and selecting the models to meet the latency constraint.
LU teaches model mutations, and reward score.
Chai teaches relevancy score and encoding vector representation.
One of ordinary skill would have motivation to combine LU, MUKH and Chai to provide improved results with high probability of customer’s relevancy (Chai [0004]).
LU, MUKH and Chai do not explicitly disclose:
-	wherein a first relevance-assessing process generates relevance scores for a first class of target items using a component having two processing paths, 
-	and wherein a second relevance-assessing process generates relevance scores for a second class of target items using a component having a single processing path, 
However, Yang discloses:
-	wherein a first relevance-assessing process generates relevance scores for a first class of target items using a component having two processing paths, 
In [0050]:
FIG. 35 is a visual representation of a search space with fully connected edges between adjacent layers for a differentiable NAS method, according to at least one embodiment;
In [0051]:
 FIG. 36 illustrates a multi-path topology, a single-path topology, a multi-path topology with four input resolutions, and a multi-path topology with two input resolutions, according to at least one embodiment;
In [0103]:
an image processing chip that may measure distance from vehicle 400 to target object and use generated information (e.g., metadata) to activate autonomous emergency braking and lane departure warning functions. In at least one embodiment, other types of stereo camera(s) 468 may be used in addition to, or alternatively from, those described herein.
In [0119]:
In at least one embodiment, accelerator(s) 414 could be used for targeted workloads (e.g., perception, convolutional neural networks (“CNNs”), recurrent neural networks (“RNNs”), etc.) that are stable enough to be amenable to acceleration. 
In [0525]:
differentiable NAS method 3700 generates a set of connection patterns, where there is a separate connection pattern for each possible combination of candidate input edges (paths) that can connect two layers. In at least one embodiment, differentiable NAS method 3700 scores each connection pattern, 
(BRI: Neural Architecture Search (NAS) methods, such as DARTS, assign a relevance score (the score is generated by method 3700) to each potential operation/connection pattern within a search space. These methods use architectural parameters)
-	and wherein a second relevance-assessing process generates relevance scores for a second class of target items using a component having a single processing path, 
In [0103]:
an image processing chip that may measure distance from vehicle 400 to target object and use generated information (e.g., metadata) to activate autonomous emergency braking and lane departure warning functions. In at least one embodiment, other types of stereo camera(s) 468 may be used in addition to, or alternatively from, those described herein.
In [0119]:
In at least one embodiment, accelerator(s) 414 could be used for targeted workloads (e.g., perception, convolutional neural networks (“CNNs”), recurrent neural networks (“RNNs”), etc.) that are stable enough to be amenable to acceleration. 
In [0050]:
FIG. 35 is a visual representation of a search space with fully connected edges between adjacent layers for a differentiable NAS method, according to at least one embodiment;
In [0051]:
 FIG. 36 illustrates a multi-path topology, a single-path topology, a multi-path topology with four input resolutions, and a multi-path topology with two input resolutions, according to at least one embodiment;
In [0525]:
In at least one embodiment, instead of determining a score for each candidate input edge at a node and selecting a candidate input edge having a highest score for a single-path topology, differentiable NAS method 3700 considers all candidate input edges in sets of connection patterns between two layers as possibilities and selects an input connection pattern between each set of two layers.
(BRI: the highest score for a single path topology is the “relevance score”)
It would have been obvious to one of ordinary skill in the art before the effective filing date of the present application to combine LU, MUKH, Chai and Yang.
LU teaches model mutations and reward score.
MUKH teaches NAS systems.
Chai teaches relevancy score and encoding vector representation.
Yang teaches single path NAS.
One of ordinary skill would have motivation to combine LU, MUKH, Chai and Yang to provide improved performance for memory accesses (Yang [0435]).
Claims 13, 14 and 16 are rejected under 35 U.S.C. 103 as being unpatentable over
BINGQIAN LU et al. (hereinafter LU) One Proxy Device Is Enough for Hardware-Aware Neural Architecture Search, Proc. ACM Meas. Anal. Comput. Syst., Vol. 5, No. 3, Article 34. Publication date: December 2021
in view of Suthee Chaidaroon et al. (hereinafter Chai) US 2022/0245706 A1.
In regard to claim 13: (Previously Presented) *
Note:*:[ the claim was amended on 9/18/25 and the previously presented cited on 9/26/25 as related to the 9/18/25 amendment] 
LU discloses:
-	A computing system, comprising: a computer-implemented application system having hardware logic circuitry configured to perform an application task, the computer-implemented application system including a chosen machine-trained model, the chosen machine-trained model having been automatically generated by a neural network search (NAS) system, the NAS system including other hardware logic circuitry that is configured to perform operations of:
In [Abstract, Page 34:1]:
Convolutional neural networks (CNNs) are used in numerous real-world applications such as vision-based autonomous driving and video content analysis. To run CNN inference on various target devices, hardware-aware neural architecture search (NAS) is crucial
In [3.1 Problem Formulation, Page 34:6]:
The general problem of hardware-aware NAS can be formulated as follows:

[Equation image: media_image1.png]

where x represents the architecture, X is the search space under consideration, w_x is the network weight given architecture x, L_d is the average inference latency constraint, and d ∈ D denotes a device with D being the device set.
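For illustration only, the latency-constrained selection described in the quoted text (choose an architecture x from the search space X subject to the latency constraint L_d on device d) can be sketched as a filter followed by a minimum. The equation itself is an image and is not reproduced here; the use of a loss value as the objective, the candidate records, and the name select_architecture are all assumptions.

```python
def select_architecture(candidates, latency_limit_ms):
    """Pick the lowest-loss candidate whose latency meets the limit.

    Each candidate is a dict with hypothetical keys "name", "loss" and
    "latency_ms" (latency measured on the target device d).
    Returns None when no candidate satisfies the constraint.
    """
    feasible = [c for c in candidates if c["latency_ms"] <= latency_limit_ms]
    if not feasible:
        return None
    return min(feasible, key=lambda c: c["loss"])


# Made-up candidates standing in for architectures in the search space X.
candidates = [
    {"name": "arch_a", "loss": 0.21, "latency_ms": 48.0},
    {"name": "arch_b", "loss": 0.25, "latency_ms": 29.0},
    {"name": "arch_c", "loss": 0.30, "latency_ms": 17.0},
]
chosen = select_architecture(candidates, latency_limit_ms=30.0)
```

Here the most accurate candidate is excluded because it exceeds the limit, so the best feasible candidate is returned instead.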
In [5.1, Page 34:12]:
performing NAS on two other (proxy) devices — 4790 (Desktop CPU) and T4 (Desktop GPU) 
in [4.3, Page 34:10]:
if CNN models fall into the compute-bound region for two devices, then we can also establish latency monotonicity using a similar logic. For search spaces with models that span across both memory-bound and compute-bound regions, the latency monotonicity may not be strong 
-	receiving a specified latency constraint; 
In [3.1, Page 34:6]:
	the inference latency and energy of an architecture on a device are very strongly correlated. That is, an energy constraint can be implicitly mapped to a corresponding latency constraint.
(BRI: Within the context of searching, this requires an evaluation stage where architectures or algorithms are selected from a large set of possibilities to receive the latency constraint)
-	and using a neural architecture search to produce the chosen machine-trained model that satisfies the latency constraint, based on a collection of candidate machine-trained models,
In [3, Page 34:6]:
 PROBLEM FORMULATION, INSIGHTS, AND PRACTICAL CONSIDERATION We present the problem formulation for hardware-aware NAS, show the key insights for when we can reduce the latency evaluation cost to O(1), and finally discuss practical considerations
In [3.1 Problem Formulation, Page 34:6]:
The general problem of hardware-aware NAS can be formulated as follows:

[Equation image: media_image1.png]

where x represents the architecture, X is the search space under consideration, w_x is the network weight given architecture x, L_d is the average inference latency constraint, and d ∈ D denotes a device with D being the device set.
In [2.1.1, Page 34:3]:
Specifically, as illustrated in the left subfigure of Fig. 3, the NAS process is governed by a controller
In [2.1.1, Page 34:4]:
(e.g., a reinforcement learning agent): given each candidate architecture produced by the controller, the model is trained on the training dataset and then evaluated for its performance, based on which the controller produces another candidate architecture. This process repeats until convergence or the maximum search iteration is reached
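For illustration only, the propose, evaluate and repeat loop described in the quoted section 2.1.1 can be sketched as below. LU describes a controller such as a reinforcement learning agent; this sketch swaps in a simple random local search, and toy_evaluate, controller_propose and search are hypothetical stand-ins (no model is actually trained).

```python
import random


def toy_evaluate(arch):
    """Hypothetical stand-in for 'train, then evaluate performance'."""
    return sum(arch) / (7 * len(arch))


def controller_propose(best_arch, rng, choices=(3, 5, 7)):
    """Hypothetical controller: change one layer's kernel size at random."""
    child = list(best_arch)
    child[rng.randrange(len(child))] = rng.choice(choices)
    return child


def search(num_layers=4, max_iters=50, seed=0):
    """Repeat propose/evaluate until the maximum search iteration."""
    rng = random.Random(seed)
    best = [3] * num_layers
    best_score = toy_evaluate(best)
    for _ in range(max_iters):
        candidate = controller_propose(best, rng)
        score = toy_evaluate(candidate)
        # Feedback step: keep the candidate only if it performs better.
        if score > best_score:
            best, best_score = candidate, score
    return best, best_score


best_arch, best_score = search()
```

The loop mirrors the quoted description in structure only: each evaluation result informs the next proposal, and the search stops at a fixed iteration cap.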

[Image: media_image2.png]

In [2.1 , Page 34:3]:
Overview Neural architecture is a key design hyperparameter that affects the inference accuracy and latency of DNN models. In Fig. 2, we show an example architecture, which is found by searching over the possible layer-wise kernel sizes

[Image: media_image3.png]


-	wherein different candidate machine-trained models in the collection of machine-trained models  specify different respective ways of removing weights in a shared neural network architecture, on a layer-by-layer basis,
In [3, Page 34:6]:
 PROBLEM FORMULATION, INSIGHTS, AND PRACTICAL CONSIDERATION We present the problem formulation for hardware-aware NAS, show the key insights for when we can reduce the latency evaluation cost to O(1), and finally discuss practical considerations
In [3.1 Problem Formulation, Page 34:6]:
The general problem of hardware-aware NAS can be formulated as follows:

[Equation image: media_image1.png]

where x represents the architecture, X is the search space under consideration, $w_x$ is the network weight given architecture x, $L_d$ is the average inference latency constraint, and d ∈ D denotes a device with D being the device set.
In [2.1.1, Page 34:3]:
Specifically, as illustrated in the left subfigure of Fig. 3, the NAS process is governed by a controller
In [2.1.1, Page 34:4]:
(e.g., a reinforcement learning agent): given each candidate architecture produced by the controller, the model is trained on the training dataset and then evaluated for its performance, based on which the controller produces another candidate architecture. This process repeats until convergence or the maximum search iteration is reached

[Image: media_image2.png]

In [2.1 , Page 34:3]:
Overview Neural architecture is a key design hyperparameter that affects the inference accuracy and latency of DNN models. In Fig. 2, we show an example architecture, which is found by searching over the possible layer-wise kernel sizes

[Image: media_image3.png]

In [2.1.2 , Page 34:4]:
One-shot NAS. In view of the extremely diverse devices and platforms for model deployment, one-shot NAS and its variants such as few-shot NAS have recently been proposed to reduce the search cost by exploiting the weight sharing mechanism.
In [2.1.2 , Page 34:4]:
 Concretely, as illustrated in the right subfigure of Fig. 3, the key idea of one-shot NAS is to decouple the training process from the search process: pre-train a super large model (called supernet) whose weight is shared among all the candidate architectures, and then use a separate search process to discover optimal architectures that inherit the weights from the supernet. 
In [5.2, Page 34:12]:
Removing non-Pareto-optimal architectures. 
We measure the actual latencies of Pareto-optimal architectures (obtained for either the proxy or adapted proxy device) on the target device, and remove non-Pareto-optimal architectures
(BRI: process of removing "non-optimal models" or components from a neural network deployed on a target device is fundamentally an application of model pruning, which directly involves the removal or zeroing out of weights)
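For illustration only, the filtering step quoted from LU [5.2] can be sketched as a Pareto filter over measured (latency, accuracy) pairs. The candidate names, the numbers, and the function `pareto_front` are hypothetical and do not come from LU or from the application.

```python
from typing import List, Tuple

# Each candidate is (name, measured_latency_ms, accuracy); all values are hypothetical.
Candidate = Tuple[str, float, float]


def pareto_front(candidates: List[Candidate]) -> List[Candidate]:
    """Keep the candidates that no other candidate dominates.

    A candidate is dominated when another one is at least as fast and at
    least as accurate, and strictly better on one of the two measures.
    """
    front = []
    for name, lat, acc in candidates:
        dominated = any(
            (o_lat <= lat and o_acc >= acc) and (o_lat < lat or o_acc > acc)
            for _, o_lat, o_acc in candidates
        )
        if not dominated:
            front.append((name, lat, acc))
    return front


measured = [
    ("arch_a", 20.0, 0.70),
    ("arch_b", 25.0, 0.68),  # slower and less accurate than arch_a
    ("arch_c", 30.0, 0.75),
    ("arch_d", 30.0, 0.73),  # same latency as arch_c, less accurate
]
kept = pareto_front(measured)
print([name for name, _, _ in kept])  # ['arch_a', 'arch_c']
```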
LU does not explicitly disclose:
wherein the hardware logic circuitry of the application system is configured to perform operations of:
receiving a query from a user; 
forming a combination of the query and a first target item; 
based on the combination, determining a relevance score for the first target item using the chosen machine-trained model, the relevance score measuring a relevance of the query to the first target item; 
retrieving an item encoding vector for a second target item, the item encoding vector representing semantic content in the second target item 
and determining a relevance score for the second target item using another machine-trained model, different from the chosen machine-trained model, based on the item encoding vector that is retrieved, the relevancy score for the second target item measuring a relevance of the query to the second target item
wherein the chosen machine-trained model is used in response to determining that an item encoding vector has not yet been generated for the first target item.
However, Chai discloses:
wherein the hardware logic circuitry of the application system is configured to perform operations of:
in [0005]:
 In accordance with various embodiments, exemplary systems may be implemented in any suitable hardware or hardware and software, such as in any suitable computing device. For example, in some embodiments, an embedding-based retrieval system can include a computing device configured to obtain query information characterizing a query initiated by a customer on an ecommerce marketplace and to determine embedding-based search results comprising a first list of items
receiving a query from a user; 
in [0056]:
a search server 654 coupled to a search index 652. The legacy retrieval system 650 can receive a query from a customer 602.
forming a combination of the query and a first target item; 
in [0027]:
The search string often includes multiple words that are then searched against the catalog of items available on the ecommerce marketplace. The search tool can then return a list of items that are available on the ecommerce marketplace. The user can then select (e.g., click) the items in the search results listing to view or purchase the item.
in [0049]:
Query-item pair information can be information that includes the content of the user's query (e.g., the words entered in the search string) and the item that was clicked or purchased after search results were displayed to the user.
In [0012]:
 a method can include obtaining query information characterizing a query initiated by a customer on an ecommerce marketplace and determining embedding-based search results comprising a first list of items.
based on the combination, determining a relevance score for the first target item using the chosen machine-trained model, the relevance score measuring a relevance of the query to the first target item; 
in [0004]:
The methods and apparatuses of the present disclosure deliver improved results over existing systems by delivering items with a higher probability of relevancy to the customer's query.
retrieving an item encoding vector for a second target item, the item encoding vector representing semantic content in the second target item 
in [0027]:
The search string often includes multiple words that are then searched against the catalog of items available on the ecommerce marketplace. The search tool can then return a list of items that are available on the ecommerce marketplace. The user can then select (e.g., click) the items in the search results listing to view or purchase the item.
in [0005]:
The computing device can also be configured to obtain legacy search results comprising a second list of items and to blend the embedding-based search results with the legacy search results to obtain blended search results,
In [0030]:
 The present disclosure uses methods and apparatuses that can address these problems and instead can include a query encoder and an item encoder that can map the query and the item titles into a semantic space to determine the similarity between the query and the items to better determine a relevancy between the query and the items in the catalog.
In [0030]:
The item encoder 112 can determine an item vector that is projected onto the semantic space 114. As further shown, the query 102 is projected into the semantic space 114 and is represented by query projection 116,
In [0030]:
The item 104 is projected onto the semantic space 114 and is represented by the projection 118 and the item 106 is projected onto the semantic space 114 and is represented by the projection 120. The distance between the projection 116 and the projection 118 and the distance between the projection 116 and the projection 118 can both be measured.
and determining a relevance score for the second target item using another machine-trained model, different from the chosen machine-trained model, based on the item encoding vector that is retrieved,
in [0030]:
The separation between the projections can correspond to a relevancy between the query and the items. As can be seen, the running shoe, item 104 with projection 118, is positioned closer to query 102 with projection 116 than the running book, item 106 with projection 120. As such, the product retrieval systems of the present disclosure can determine that the running shoe is more relevant to the user query than the running book. In this manner, more relevant search results can be returned to the user when the user enters a user query.
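As a minimal sketch of the relationship described in Chai [0030], the example below ranks items by their distance from the query projection in a shared semantic space. The two-dimensional coordinates, the use of Euclidean distance, and the function names are assumptions made for illustration; Chai states only that the separation between projections can correspond to relevancy.

```python
import math
from typing import Dict, List


def euclidean(a: List[float], b: List[float]) -> float:
    """Straight-line distance between two projections."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def rank_by_distance(query: List[float], items: Dict[str, List[float]]) -> List[str]:
    """Return item names ordered from closest to farthest from the query."""
    return sorted(items, key=lambda name: euclidean(query, items[name]))


# Hypothetical projections standing in for projections 116, 118 and 120.
query_projection = [1.0, 1.0]
item_projections = {
    "running book": [3.0, 4.0],
    "running shoe": [1.2, 0.9],
}

ranking = rank_by_distance(query_projection, item_projections)
print(ranking)  # ['running shoe', 'running book']
```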
In [0058]:
 the determination of the second set of search results by the embedding-based product retrieval system 600 can be performed offline or in real time. In one implementation, the processing of the embedding-based product retrieval system 600 can be performed offline. In such an example, shown in FIG. 7, the embedding model 702 and the blending model (as will be further described below) can be trained offline,
in [0058]:
the query-item data pairs and the item information can be collected and used to train the embedding model. The trained model can then be implemented to generate a query-item list. The query-item list can determine items to return as relevant search results when a query is entered by a customer.
wherein the chosen machine-trained model is used in response to determining that an item encoding vector has not yet been generated for the first target item.
In [0029]:
The problems associated with many existing search methods and systems is that existing systems can use lexical matching in which the tokens in the user's search query are compared against tokens in item titles to determine a relevance between the query and the item. A token is a separable element in the search query or in the item title such as a word, number, symbol or the like. For example, in a search string that is entered by a user such as “running gear”, the search can include two tokens “running” and “gear”. 
In [0050]:
 The feature generator 402 can also operate to tokenize a query and to tokenize a product title. The feature generator 402 can, for example, generate a vector from a search string by separating the search string into the individual words or other tokens. The feature generator 402 can use any suitable tokenization process including tokenizing the query into unigrams, bi-grams, or tri-grams, for example. The feature generator 402 can also partition the obtained data into training data, test data and development data. Each of these sets can be partitioned by a query.
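The tokenization described in Chai [0050] can be illustrated with a short sketch that splits a search string into tokens and forms unigrams and bi-grams. Lowercasing and whitespace splitting are assumptions of this sketch; Chai permits any suitable tokenization process, and the helper names are hypothetical.

```python
from typing import List


def tokenize(text: str) -> List[str]:
    """Split a search string into lowercase whitespace-delimited tokens."""
    return text.lower().split()


def ngrams(tokens: List[str], n: int) -> List[str]:
    """Join every run of n consecutive tokens into one n-gram string."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


tokens = tokenize("running gear")
print(ngrams(tokens, 1))  # ['running', 'gear']
print(ngrams(tokens, 2))  # ['running gear']
```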
In [0011]:
 In another aspect, the embedding-based machine learning model can be trained using a training method comprising obtaining query-item pair data comprising a queries and item titles and tokenizing the queries and the item titles in the query-item pair data. 
(BRI: the tokenization method has not yet used the encoding vector. The embedding model does not always use encoder vectors)
In [0030]:
 The present disclosure uses methods and apparatuses that can address these problems and instead can include a query encoder and an item encoder that can map the query and the item titles into a semantic space to determine the similarity between the query and the items
(BRI: the problem is now addressed using an encoding vector)
-	determining a relevance score for the first target item using the chosen machine-trained model, the relevance score measuring a relevance of the query to the first target item
in [0004]:
The methods and apparatuses of the present disclosure deliver improved results over existing systems by delivering items with a higher probability of relevancy to the customer's query.
retrieving an item encoding vector for a second target item, the item encoding vector representing semantic content in the second target item 
in [0027]:
The search string often includes multiple words that are then searched against the catalog of items available on the ecommerce marketplace. The search tool can then return a list of items that are available on the ecommerce marketplace. The user can then select (e.g., click) the items in the search results listing to view or purchase the item.
in [0005]:
The computing device can also be configured to obtain legacy search results comprising a second list of items and to blend the embedding-based search results with the legacy search results to obtain blended search results,
In [0030]:
 The present disclosure uses methods and apparatuses that can address these problems and instead can include a query encoder and an item encoder that can map the query and the item titles into a semantic space to determine the similarity between the query and the items to better determine a relevancy between the query and the items in the catalog.
In [0030]:
The item encoder 112 can determine an item vector that is projected onto the semantic space 114. As further shown, the query 102 is projected into the semantic space 114 and is represented by query projection 116,
In [0030]:
The item 104 is projected onto the semantic space 114 and is represented by the projection 118 and the item 106 is projected onto the semantic space 114 and is represented by the projection 120. The distance between the projection 116 and the projection 118 and the distance between the projection 116 and the projection 118 can both be measured.
-	retrieving an item encoding vector for a second target item, the item encoding vector representing semantic content in the second target item and having been generated in an offline process prior to receipt of the query: 
in [0027]:
The search string often includes multiple words that are then searched against the catalog of items available on the ecommerce marketplace. The search tool can then return a list of items that are available on the ecommerce marketplace. The user can then select (e.g., click) the items in the search results listing to view or purchase the item.
in [0005]:
The computing device can also be configured to obtain legacy search results comprising a second list of items and to blend the embedding-based search results with the legacy search results to obtain blended search results,
In [0030]:
 The present disclosure uses methods and apparatuses that can address these problems and instead can include a query encoder and an item encoder that can map the query and the item titles into a semantic space to determine the similarity between the query and the items to better determine a relevancy between the query and the items in the catalog.
In [0030]:
The item encoder 112 can determine an item vector that is projected onto the semantic space 114. As further shown, the query 102 is projected into the semantic space 114 and is represented by query projection 116,
In [0030]:
The item 104 is projected onto the semantic space 114 and is represented by the projection 118 and the item 106 is projected onto the semantic space 114 and is represented by the projection 120. The distance between the projection 116 and the projection 118 and the distance between the projection 116 and the projection 118 can both be measured.
-	and determining a relevance score for the second target item using another machine-trained model, different from the chosen machine-trained model, based on the item encoding vector that is retrieved, the relevance score for the second target item measuring a relevance of the query to the second target item,
In [0027]:
The search string often includes multiple words that are then searched against the catalog of items available on the ecommerce marketplace. The search tool can then return a list of items that are available on the ecommerce marketplace. The user can then select (e.g., click) the items in the search results listing to view or purchase the item.
in [0005]:
The computing device can also be configured to obtain legacy search results comprising a second list of items and to blend the embedding-based search results with the legacy search results to obtain blended search results,
In [0030]:
 The present disclosure uses methods and apparatuses that can address these problems and instead can include a query encoder and an item encoder that can map the query and the item titles into a semantic space to determine the similarity between the query and the items to better determine a relevancy between the query and the items in the catalog.
In [0030]:
The item encoder 112 can determine an item vector that is projected onto the semantic space 114. As further shown, the query 102 is projected into the semantic space 114 and is represented by query projection 116,
In [0030]:
The item 104 is projected onto the semantic space 114 and is represented by the projection 118 and the item 106 is projected onto the semantic space 114 and is represented by the projection 120. The distance between the projection 116 and the projection 118 and the distance between the projection 116 and the projection 118 can both be measured.
-	determining that an item encoding vector has not yet been generated for the first target item, wherein the chosen machine-trained model is used, and not said another machine-trained model, in response to the determining that the item encoding vector has not yet been generated for the first target item.
In [0029]:
The problems associated with many existing search methods and systems is that existing systems can use lexical matching in which the tokens in the user's search query are compared against tokens in item titles to determine a relevance between the query and the item. A token is a separable element in the search query or in the item title such as a word, number, symbol or the like. For example, in a search string that is entered by a user such as “running gear”, the search can include two tokens “running” and “gear”. 
In [0050]:
 The feature generator 402 can also operate to tokenize a query and to tokenize a product title. The feature generator 402 can, for example, generate a vector from a search string by separating the search string into the individual words or other tokens. The feature generator 402 can use any suitable tokenization process including tokenizing the query into unigrams, bi-grams, or tri-grams, for example. The feature generator 402 can also partition the obtained data into training data, test data and development data. Each of these sets can be partitioned by a query.
In [0011]:
 In another aspect, the embedding-based machine learning model can be trained using a training method comprising obtaining query-item pair data comprising a queries and item titles and tokenizing the queries and the item titles in the query-item pair data. 
(BRI: the tokenization method has not yet used the encoding vector. The embedding model does not always use encoder vectors)
In [0030]:
 The present disclosure uses methods and apparatuses that can address these problems and instead can include a query encoder and an item encoder that can map the query and the item titles into a semantic space to determine the similarity between the query and the items
(BRI: the problem is now addressed using an encoding vector)
It would have been obvious to one of ordinary skill in the art before the effective filing date of the present application to combine LU and Chai.
LU teaches using neural architecture search to produce the chosen machine-trained model and selecting models that meet the latency constraint.
Chai teaches a relevancy score and an encoding vector representation.
One of ordinary skill would have been motivated to combine LU and Chai to provide improved results with a higher probability of relevancy to the customer’s query (Chai [0004]).
	In regard to claim 14:  (Currently Amended)  	
 LU discloses:
-	selecting a parent model from the collection of candidate machine-trained models, the parent model being a neural network having plural layers;
In [A , A.1, SUMMARY OF EVOLUTIONARY SEARCH, Page 34:24]:
For each evolutionary search iteration, we select the fittest individuals as parents for reproduction, which will survive in the next generation and also breed new individuals through crossover. For example, if our population size is 1000 and the parent ratio is 0.25, we have 250 fittest individuals as parents. Then, we randomly select a pair of parents each time for crossover and generate a child. Within the crossover process, each element in the child’s vector is chosen randomly from one of the parents’. 
In [B.4, Page 34:30]:
Search Space. Similar to MobileNet-V2, the FBNet search space is also layer-wise with a fixed macro-architecture, which defines the number of layers and input/output dimensions of each layer and fixes the first and last three layers, with the remaining layers to be searched. 
In [6.1.1 , Page 34:16]:
The search space consists of depth of each stage, kernel size of convolutional layers, and expansion ratio of each block. 
-	mutating the parent model using a mutating model, to produce a child model, the mutating model including two neural networks that operate in two consecutive stages, wherein a first neural network of the two neural networks has been trained to select a level of the parent model, 
In [6.1.1, Page 34:16]:
NAS Method. We consider one-shot NAS and use the Once-For-All network [9] as a supernet that has the same search space as ours. We run evolutionary search to find optimal architectures
Our parameter settings are: population size is 1000, parent ratio is 0.25, mutation probability is 0.1, mutation ratio is 0.25, and we search for 50 generations given each latency constraint. 
(BRI: neural network supernet (or SuperNet) contains multiple sub-networks (SubNets) within one large, overparameterized network, allowing efficient exploration of many architectures)
In [A SUMMARY OF EVOLUTIONARY SEARCH, Page 34:24]:
To run evolutionary search, we first randomly sample the initial population of individuals according to the population size. Next, we evaluate the fitness of each individual in the population, where the fitness function is defined as: 
(t − 1) · accuracy + t · latency (4) 

where t ∈ [0, 1] is the weight parameter to balance the tradeoff between accuracy and latency of each individual model, and accuracy and latency are predicted values given by the accuracy and latency predictors, respectively. 
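Equation (4) as quoted above can be written out as a small function. This is a sketch under stated assumptions: the function name is hypothetical, the inputs are treated as already-predicted scalar values, and any normalization of latency is not specified in the quoted passage. With the signs as quoted, a higher accuracy or a lower latency yields a smaller value.

```python
def fitness(accuracy: float, latency: float, t: float) -> float:
    """Evaluate (t - 1) * accuracy + t * latency, as quoted in equation (4).

    t in [0, 1] balances the two terms: t = 0 considers accuracy only,
    t = 1 considers latency only.
    """
    if not 0.0 <= t <= 1.0:
        raise ValueError("t must lie in [0, 1]")
    return (t - 1.0) * accuracy + t * latency


# Hypothetical predicted values for two candidate models.
print(fitness(accuracy=0.80, latency=0.50, t=0.5))
print(fitness(accuracy=0.90, latency=0.50, t=0.5))  # smaller than the line above
```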
In [A SUMMARY OF EVOLUTIONARY SEARCH, Page 34:24]:
For each evolutionary search iteration, we select the fittest individuals as parents for reproduction, which will survive in the next generation and also breed new individuals through crossover. For example, if our population size is 1000 and the parent ratio is 0.25, we have 250 fittest individuals as parents. Then, we randomly select a pair of parents each time for crossover and generate a child. Within the crossover process, each element in the child’s vector is chosen randomly from one of the parents’. Also, based on the mutation ratio setting, part of the offsprings will further perform mutation operations. For example, with mutation ratio 0.25 and mutation probability 0.1, 250 out of 750 children have a possibility of 0.1 to mutate. If a child is chosen to mutate, its kernel size, expansion ratio, and depth will be randomly sampled out of all the possible values for exploration. After crossover and mutation, we have a new population consisting of parents, bred children, and mutated children. Next, the fittest individuals are selected as new parents for next iteration. The above crossover and mutation steps will be repeated for the maximum evolutionary search iteration number
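The loop described in the passage above (select the fittest as parents, breed children by element-wise crossover, let a subset of children mutate with a fixed probability, repeat) can be sketched as follows. This is an illustrative reading of the quoted text, not LU's implementation: the genome encoding, the `CHOICES` list, the reduced population size, and the convention that a lower score is fitter are all assumptions.

```python
import random
from typing import Callable, List

CHOICES = [3, 5, 7]  # hypothetical per-element options, e.g. kernel sizes


def evolve(score: Callable[[List[int]], float], genome_len: int = 6,
           population_size: int = 40, parent_ratio: float = 0.25,
           mutation_ratio: float = 0.25, mutation_prob: float = 0.1,
           generations: int = 10, seed: int = 0) -> List[int]:
    rng = random.Random(seed)
    population = [[rng.choice(CHOICES) for _ in range(genome_len)]
                  for _ in range(population_size)]
    n_parents = int(population_size * parent_ratio)
    n_children = population_size - n_parents
    n_mutable = int(population_size * mutation_ratio)
    for _ in range(generations):
        # Assumption of this sketch: a lower score means a fitter individual.
        parents = sorted(population, key=score)[:n_parents]
        children = []
        for _ in range(n_children):
            p1, p2 = rng.sample(parents, 2)
            # Crossover: each element is taken at random from one parent.
            children.append([rng.choice(pair) for pair in zip(p1, p2)])
        # A subset of the children may mutate, each with probability mutation_prob.
        for child in children[:n_mutable]:
            if rng.random() < mutation_prob:
                child[:] = [rng.choice(CHOICES) for _ in range(genome_len)]
        # Parents survive into the next generation alongside the children.
        population = parents + children
    return min(population, key=score)


# Toy score: the sum of the genome stands in for a predicted latency.
best = evolve(score=sum)
print(best, sum(best))
```

Because the parents are carried over unchanged, the best score in the population cannot get worse from one generation to the next in this sketch.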
-	given sparsity levels of the parent model, to provide a selected level, and wherein a second neural network of the two neural networks has been trained to vary a sparsity level of the selected layer, given the selected layer produced by the first neural network
In [5.4.2, Page 34:15]:
we consider the proxy device’s latency predictor in a linear form: $L_{d_0}(x) = w^T x$, where w is the weight and x is the architecture representation (e.g., one-hot encoding of the searchable operators, penultimate layer output in a neural network-based predictor, or encoding of the execution units). We measure the latencies of a small set of sample architectures x ∈ A on the target device, noting that this step is also needed to check the SRCC value and incurs a negligible overhead compared to SOTA approaches (i.e., tens of hours of latency measurement).
In [5.4.2, Page 34:15]:
Then, with the latency measurement samples denoted by $(x_i, y_i)$, we quickly adapt the proxy device’s latency predictor as


[Equation image: media_image4.png]

where I is the identity vector with all the elements being 1, the operator “◦” denotes the element-wise multiplication, and λ ≥ 0 is a hyperparameter controlling the weight for the sparsity regularization term |b| and tuned based on a small validation set of architectures (20 architectures in our experiment) split from the sample architecture set A.
The interpretation of using Eqn. (3) is as follows. First, the scaling factor α reflects our intuition that a more complex operator that is slower on one device is generally also slower on another device. Second, the sparsity term b accounts for the fact that the slow-down factors for an operator on two devices are not necessarily the same.
(BRI: adapting a latency predictor with sparsity regularization is often designed to provide variation of sparsity levels across different layers, rather than a uniform sparsity. The latency predictor, by incorporating real-world hardware feedback (or a learned model of it), identifies which layers benefit most from increased sparsity in terms of actual speedup)
in [5.4.2, Page  34:15]:
we consider the proxy device’s latency predictor in a linear form: $L_{d_0}(x) = w^T x$, where w is the weight and x is the architecture representation (e.g., one-hot encoding of the searchable operators, penultimate layer output in a neural network-based predictor or encoding of the execution units)
-	generating a reward score for the child model that takes into consideration at least accuracy and latency of the child model, wherein producing the accuracy that is used to generate the reward score 
In [A SUMMARY OF EVOLUTIONARY SEARCH, Page 34:24]:
To run evolutionary search, we first randomly sample the initial population of individuals according to the population size. Next, we evaluate the fitness of each individual in the population, where the fitness function is defined as: 
(t − 1) · accuracy + t · latency (4) 

where t ∈ [0, 1] is the weight parameter to balance the tradeoff between accuracy and latency of each individual model, and accuracy and latency are predicted values given by the accuracy and latency predictors, respectively. 
(BRI: the fitness is the reward that depends on the accuracy and latency (see equation (4)))
In [7, Page 34:20]:
fast evaluation of accuracy and inference latency to rank different architectures is crucial for efficient hardware-aware NAS
In [7, Page 34:20]:
Given many diverse devices, scalability of latency evaluation is critically important. A straight forward approach is to build a meta latency predictor that incorporates hardware features as additional input
In [3.3, Page 34:7]:
 To quantify the degree of latency monotonicity in practice, we use the metric of Spearman’s Rank Correlation Coefficient (SRCC), which lies between -1 and 1 and assesses statistical dependence between the rankings of two variables using a monotonic function. The greater the SRCC of CNN latencies on two devices, the better the latency monotonicity. SRCC of 0.9 to 1.0 is usually viewed as strongly dependent in terms of monotonicity [3].
(BRI: Latency monotonicity is the stability of the latency showing a stable upward/downward trend rather than erratic spikes)
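To make the SRCC check quoted from LU [3.3] concrete, the sketch below computes Spearman's rank correlation between the latencies of the same architectures on two devices and compares it with the 0.9 level mentioned in the quote. The latency values are hypothetical, and the helper functions are a plain-Python illustration (Pearson correlation of average ranks) rather than LU's code. The sketch does not handle constant inputs, for which the correlation is undefined.

```python
from typing import List, Sequence


def ranks(values: Sequence[float]) -> List[float]:
    """Return 1-based ranks, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = avg
        i = j + 1
    return result


def srcc(a: Sequence[float], b: Sequence[float]) -> float:
    """Spearman's rank correlation: Pearson correlation of the two rank lists."""
    ra, rb = ranks(a), ranks(b)
    n = len(ra)
    mean_a, mean_b = sum(ra) / n, sum(rb) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(ra, rb))
    var_a = sum((x - mean_a) ** 2 for x in ra)
    var_b = sum((y - mean_b) ** 2 for y in rb)
    return cov / (var_a * var_b) ** 0.5


# Hypothetical latencies (ms) of four sample architectures on two devices.
proxy_latency_ms = [10.0, 14.0, 21.0, 30.0]
target_latency_ms = [25.0, 31.0, 52.0, 47.0]  # last two swap order

value = srcc(proxy_latency_ms, target_latency_ms)
print(value, value >= 0.9)
```

In this example the rankings agree except for one swapped pair, so the coefficient is below 1 and falls short of the 0.9 level.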
In [A SUMMARY OF EVOLUTIONARY SEARCH, Page 34:24]:
For each evolutionary search iteration, we select the fittest individuals as parents for reproduction, which will survive in the next generation and also breed new individuals through crossover.
(BRI: fitness serves as the reward used to generate the child, i.e., the new individuals bred from the selected parents that survive into the next generation)
In [5.3.2, Page 34:14]:
 Checking latency monotonicity. To check whether strong latency monotonicity is satisfied between the selected proxy device and a target device, we estimate the SRCC based on a small set A of sample architectures and then compare it against a threshold.
In [6.1, Page 34:16]:
As a result, the imperfection in the accuracy predictor explains why a strong, but not perfect, latency monotonicity (e.g., SRCC>0.9) is enough for our one-proxy approach to find Pareto-optimal architectures for a new target device
(BRI: SRCC (Spearman's Rank Correlation Coefficient) is a statistical method to measure how consistently the order of latencies for different tasks or models stays the same across various devices or platforms, indicating latency monotonicity)
-	 includes pruning weights of the child model, given sparsity levels of the child model
with the latency measurement samples denoted by $(x_i, y_i)$, we quickly adapt the proxy device’s latency predictor as

[Equation image: media_image5.png]


 	tailored to the target device, by solving the following problem:

[Equation image: media_image6.png]

 I is the identity vector with all the elements being 1, the operator “◦” denotes the element-wise multiplication, and λ ≥ 0 is a hyperparameter controlling the weight for the sparsity regularization term |b| and tuned based on a small validation set of architectures (20 architectures in our experiment) split from the sample architecture set A.
(BRI: tuning a hyperparameter that controls the weight for sparsity regularization indirectly represents an approach to achieving pruning, as it aims to learn a sparse model, which is a key goal of pruning)
In [2.1.2, Page 34:4]:
In view of the extremely diverse devices and platforms for model deployment, one-shot NAS and its variants such as few-shot NAS have recently been proposed to reduce the search cost by exploiting the weight sharing mechanism 
(BRI: the process of discovering optimal architectures via weight-sharing Neural Architecture Search often involves adjusting (fine-tuning) the weights that the optimal architectures inherit from the supernet).
In [A , A.1, SUMMARY OF EVOLUTIONARY SEARCH, Page 34:24]:
For each evolutionary search iteration, we select the fittest individuals as parents for reproduction, which will survive in the next generation and also breed new individuals through crossover. For example, if our population size is 1000 and the parent ratio is 0.25, we have 250 fittest individuals as parents. Then, we randomly select a pair of parents each time for crossover and generate a child. 
In [A, A.1, SUMMARY OF EVOLUTIONARY SEARCH, Page 34:24]:
Also, based on the mutation ratio setting, part of the offsprings will further perform mutation operations. For example, with mutation ratio 0.25 and mutation probability 0.1, 250 out of 750 children have a possibility of 0.1 to mutate
To run evolutionary search, we first randomly sample the initial population of individuals according to the population size. Next, we evaluate the fitness of each individual in the population, where the fitness function is defined as
(t − 1) · accuracy + t · latency (4) 

where t ∈ [0, 1] is the weight parameter to balance the tradeoff between accuracy and latency of each individual model, and accuracy and latency are predicted values given by the accuracy and latency predictors, respectively. 
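The arithmetic in the evolutionary-search passages quoted above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names, the candidate names, and the accuracy and latency numbers are hypothetical stand-ins for predictor outputs, and treating a lower value of Eq. (4) as fitter is an assumption inferred from the signs in the quoted formula, not something the cited passage states.

```python
def fitness(accuracy, latency, t):
    """Eq. (4) exactly as quoted: (t - 1) * accuracy + t * latency."""
    if not 0.0 <= t <= 1.0:
        raise ValueError("t must lie in [0, 1]")
    return (t - 1.0) * accuracy + t * latency


def generation_sizes(population_size, parent_ratio, mutation_ratio):
    """Head counts for one generation, following the quoted example."""
    parents = int(population_size * parent_ratio)
    children = population_size - parents  # bred through crossover
    mutation_candidates = int(population_size * mutation_ratio)
    return parents, children, mutation_candidates


# Hypothetical predictor outputs: (accuracy as a fraction, latency in seconds).
candidates = {"a": (0.76, 0.040), "b": (0.78, 0.065), "c": (0.72, 0.025)}
t = 0.5

# ASSUMPTION: a lower Eq. (4) value means a fitter individual.
ranked = sorted(candidates, key=lambda name: fitness(*candidates[name], t))

# Population 1000, parent ratio 0.25, mutation ratio 0.25.
sizes = generation_sizes(1000, 0.25, 0.25)
```

With the quoted settings, `sizes` gives 250 parents, 750 children, and 250 mutation candidates; at a mutation probability of 0.1, roughly 25 of those candidates would be expected to mutate.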
In [Abstract, Page 34:1]:
In this work, we address the scalability challenge by exploiting latency monotonicity — the architecture latency rankings on different devices are often correlated. 
In [B.1 Latency Monotonicity, Page 34:25]: 
We show the results in Fig. 18, which are in line with our experiments: latency
In [B.1 Latency Monotonicity, Page 34:26]: 
monotonicity among mobile devices is strong (>0.95), while FLOP-latency ranking correlation for mobile devices 
(BRI: the fitness value serves as the reward score. Within hardware-aware Neural Architecture Search (NAS), the latency (or latency ranking) of a model is commonly incorporated as a reward score or as part of a multi-objective fitness function)
-	adjusting weights of the mutating model that performs said mutating based on the reward score to increase a likelihood that the mutating model will make decisions that are rewarded by said generating;
In [2.1.2, Page 34:4]:
In view of the extremely diverse devices and platforms for model deployment, one-shot NAS and its variants such as few-shot NAS have recently been proposed to reduce the search cost by exploiting the weight sharing mechanism 
(BRI: discovering optimal architectures via weight-sharing Neural Architecture Search often involves adjusting (fine-tuning) the weights that the candidate architectures inherit from the supernet).
In [A , A.1, SUMMARY OF EVOLUTIONARY SEARCH, Page 34:24]:
For each evolutionary search iteration, we select the fittest individuals as parents for reproduction, which will survive in the next generation and also breed new individuals through crossover. For example, if our population size is 1000 and the parent ratio is 0.25, we have 250 fittest individuals as parents. Then, we randomly select a pair of parents each time for crossover and generate a child. 
In [A, A.1, SUMMARY OF EVOLUTIONARY SEARCH, Page 34:24]:
Also, based on the mutation ratio setting, part of the offsprings will further perform mutation operations. For example, with mutation ratio 0.25 and mutation probability 0.1, 250 out of 750 children have a possibility of 0.1 to mutate
To run evolutionary search, we first randomly sample the initial population of individuals according to the population size. Next, we evaluate the fitness of each individual in the population, where the fitness function is defined as
(t − 1) · accuracy + t · latency (4) 

where t ∈ [0, 1] is the weight parameter to balance the tradeoff between accuracy and latency of each individual model, and accuracy and latency are predicted values given by the accuracy and latency predictors, respectively. 
In [Abstract, Page 34:1]:
In this work, we address the scalability challenge by exploiting latency monotonicity — the architecture latency rankings on different devices are often correlated. 
In [B.1 Latency Monotonicity, Page 34:25]: 
We show the results in Fig. 18, which are in line with our experiments: latency
In [B.1 Latency Monotonicity, Page 34:26]: 
monotonicity among mobile devices is strong (>0.95), while FLOP-latency ranking correlation for mobile devices 
(BRI: the fitness value serves as the reward score. Within hardware-aware Neural Architecture Search (NAS), the latency (or latency ranking) of a model is commonly incorporated as a reward score or as part of a multi-objective fitness function)
-	updating the collection of candidate machine-trained models based on the child model; 
In [6.1.1, Page 34:16]:
NAS Method. We consider one-shot NAS and use the Once-For-All network [9] as a supernet that has the same search space
In [7, Page 34:20]:
NAS uses a super net that includes all the weights for candidate architectures
In [6.1.1, Page 34:16]:
Accuracy Predictor. The evolutionary search is assisted by an accuracy predictor for fast architecture performance evaluation.
In [6.1.1, Page 34:16]:
Our accuracy predictor is a neural network with four fully-connected layers and updated with 176 samples on top of the predictor 
In [6.1.1, Page 34:16]:
The accuracy predictor takes a 128-dimensional feature vector (which is converted from a 21-dimensional architecture configuration within the search space) as input. Fig. 12(a) compares the actual and predicted accuracies 
(BRI: in the context of Neural Architecture Search (NAS), a supernet is essentially a large, overparameterized neural network that contains many different candidate architectures (subnetworks) within it)
(BRI: an "accuracy predictor" that is updated with new samples is a key component of continuous learning (continuous training), which ultimately results in an updated collection of machine-trained models)
In [2.1.1, Page 34:3]:
 search process is entangled with the model training process.
In [2.1.1, Page 34:3]:
Specifically, as illustrated in the left subfigure of Fig. 3, the NAS process is governed by a controller
In [2.1.1, Page 34:4]:
(e.g., a reinforcement learning agent): given each candidate architecture produced by the controller, the model is trained on the training dataset and then evaluated for its performance, based on which the controller produces another candidate architecture
(BRI: a machine learning model is trained on a training dataset and its performance is assessed on separate validation and test data; the assessment informs a "controller", which then produces an updated or new model architecture. This amounts to continuous training.)
-	and repeating the selecting, mutating, generating, adjusting, and updating until a specified objective is achieved, to produce the chosen machine-trained model.
In [2.1.2, Page 34:4]:
a search process based on evolutionary algorithms or reinforcement learning to find an optimal architecture
In [2.1.1, Page 34:4]:
given each candidate architecture produced by the controller, the model is trained on the training dataset and then evaluated for its performance, based on which the controller produces another candidate architecture. This process repeats until convergence or the maximum search iteration is reached. 
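The loop described in this passage (propose a candidate, evaluate it, repeat until convergence or an iteration cap) can be sketched as follows. This is a generic sketch, not LU's Algorithm 1: the `propose` and `evaluate` callables, the patience-based convergence test, and the toy objective are all assumptions introduced for illustration.

```python
import random


def search(propose, evaluate, max_iterations=100, patience=10):
    """Keep the best candidate seen so far.

    Stops at max_iterations, or after `patience` consecutive proposals
    that fail to improve the best score (an assumed convergence test).
    """
    best, best_score = None, float("-inf")
    stale = 0
    iteration = 0
    for iteration in range(1, max_iterations + 1):
        candidate = propose(best)
        score = evaluate(candidate)
        if score > best_score:
            best, best_score, stale = candidate, score, 0
        else:
            stale += 1
            if stale >= patience:
                break
    return best, best_score, iteration


# Toy stand-in: a "candidate architecture" is one integer width,
# and the hypothetical score peaks at width 32.
rng = random.Random(0)


def propose(current):
    if current is None:
        return rng.randint(1, 64)
    return max(1, current + rng.choice([-4, -1, 1, 4]))


def evaluate(width):
    return -abs(width - 32)


best, best_score, iterations = search(propose, evaluate)
```

In a real hardware-aware search, `evaluate` would be replaced by trained accuracy and latency predictors or by on-device measurement; the control flow stays the same.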
In [5.2, Page 34:12]:
Our scalable hardware-aware NAS approach is illustrated in Fig. 10 and described in Algorithm 1.




[Image: media_image7.png]

In [5.2, Page 34:12]:

[Image: media_image8.png]

(BRI: a controller that iteratively proposes candidate architectures, which are then trained and evaluated to reach an objective, corresponds to the claimed "repeating the selecting, mutating, generating, adjusting, and updating until a specified objective is achieved," which describes the general principle of an evolutionary algorithm)
In regard to claim 16: (Currently Amended)
LU discloses:
-	wherein the latency that is used to generate the reward score is produced using trainable logic that performs prediction 
In [A SUMMARY OF EVOLUTIONARY SEARCH, Page 34:24]:
To run evolutionary search, we first randomly sample the initial population of individuals according to the population size. Next, we evaluate the fitness of each individual in the population, where the fitness function is defined as: 
(t − 1) · accuracy + t · latency (4) 

where t ∈ [0, 1] is the weight parameter to balance the tradeoff between accuracy and latency of each individual model, and accuracy and latency are predicted values given by the accuracy and latency predictors, respectively. 
In [2.3, Page 34:5]:
While the actual execution time of NAS may be further reduced by parallel processing, the total cost in terms of machine-hours does not decrease. For example, latency measurements on multiple devices in parallel and assigning more GPUs for supernet training can both speed up the overall NAS process, but the total resources needed by NAS still remain unchanged (or possibly even higher due to communications overheads among GPUs for distributed training)
Claim 15 is rejected under 35 U.S.C. 103 as being unpatentable over
BINGQIAN LU et al. (hereinafter LU), One Proxy Device Is Enough for Hardware-Aware Neural Architecture Search, Proc. ACM Meas. Anal. Comput. Syst., Vol. 5, No. 3, Article 34, publication date: December 2021,
in view of Suthee Chaidaroon et al. (hereinafter Chai), US 2022/0245706 A1,
further in view of Wouter Kool et al. (hereinafter Kool), ATTENTION, LEARN TO SOLVE ROUTING PROBLEMS!, published as a conference paper at ICLR 2019, arXiv:1803.08475v3 [stat.ML], 7 Feb 2019,
and further in view of Sairam Sundaresan et al. (hereinafter Sundar), US 2022/0036194 A1.

In regard to claim 15: (Previously Presented) 
LU and Chai do not explicitly disclose:
-	wherein the mutating model includes a router that routes between an attention layer mutating process and a feed-forward mutating process 
-	depending on whether the selected layer specifies an attention layer or a feed-forward neural network layer, 
However, Kool discloses:
-	wherein the mutating model includes a router that routes between an attention layer mutating process and a feed-forward mutating process 
In [3, ATTENTION MODEL, Page 2]:
We define the Attention Model in terms of the TSP. For other problems, the model is the same but the input, mask and decoder context need to be defined accordingly,
In [3, ATTENTION MODEL, Page 2]:
We define a problem instance $s$ as a graph with $n$ nodes, where node $i \in \{1, \ldots, n\}$ is represented by features $x_i$,
In [3, ATTENTION MODEL, Page 3]:
We define a solution (tour) $\pi = (\pi_1, \ldots, \pi_n)$ as a permutation of the nodes, so $\pi_t \in \{1, \ldots, n\}$ and $\pi_t \neq \pi_{t'}$ $\forall t \neq t'$. Our attention based encoder-decoder model defines a stochastic policy $p(\pi|s)$ for selecting a solution $\pi$ given a problem instance $s$. It is factorized and parameterized by $\theta$ as

[Equation image: media_image9.png]

In [D.3, Page 21]:
default parameters, only adding --op --op-ea4op to indicate that the Genetic Algorithm for the Orienteering Problem should be used. 
In [C, Page 17]:
The Capacitated Vehicle Routing Problem (CVRP) is a generalization of the TSP in which case there is a depot and multiple routes should be created, each starting and ending at the depot. In our graph based formulation, we add a special depot node with index 0 and coordinates $x_0$. A vehicle (route) has capacity $D > 0$ and each (regular) node $i \in \{1, \ldots, n\}$ has a demand $0 < \delta_i \le D$. Each route starts and ends at the depot and the total demand in each route should not exceed the capacity, $\sum_{i \in R_j} \delta_i \le D$, where $R_j$ is the set of node indices assigned to route $j$. Without loss of generality, we assume a normalized $\hat{D} = 1$ as we can use normalized demands $\hat{\delta}_i = \delta_i / D$.
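The capacity constraint and the demand normalization in the quoted CVRP formulation can be checked mechanically. The sketch below is an illustration only: the function names, the dictionary layout, the tolerance `eps`, and the four-customer instance are assumptions, not taken from Kool.

```python
def normalize_demands(demands, capacity):
    """Return normalized demands delta_i / D, so that the capacity becomes 1."""
    if capacity <= 0:
        raise ValueError("capacity D must be positive")
    return {node: d / capacity for node, d in demands.items()}


def routes_feasible(routes, demands, capacity=1.0, eps=1e-9):
    """Check that every route's total demand stays within the capacity.

    Each route lists regular node indices only; the depot (index 0) is
    implied at both ends and carries no demand. `eps` absorbs floating
    point rounding and is an assumed tolerance.
    """
    return all(
        sum(demands[i] for i in route) <= capacity + eps for route in routes
    )


# Hypothetical instance: D = 10 and four customers.
raw = {1: 4, 2: 6, 3: 5, 4: 3}
norm = normalize_demands(raw, 10)

ok = routes_feasible([[1, 2], [3, 4]], norm)   # route loads 1.0 and 0.8
bad = routes_feasible([[1, 2, 3], [4]], norm)  # first route load 1.5 exceeds 1
```

Working with normalized demands means the feasibility check needs no knowledge of the original capacity, which is the point of the "without loss of generality" remark in the quoted passage.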
In [E.2, Page 23]:
Encoder. Again, we use separate parameters for the depot node embedding. Additionally, we provide the node prize $\hat{\rho}_i$ and the penalty $\hat{\beta}_i$ as input features:

[Equation image: media_image10.png]

In [E.2, Page 23]:
Decoder context 
The context for the decoder for the PCTSP at time $t$ is the current/last location $\pi_{t-1}$ and the remaining prize to collect $P_t$. Again, we do not need placeholders if $t = 1$ as the route starts at the depot and we do not need to provide information about the first node as the route should end at the depot. The information about the prizes collected is implicitly provided to the model in the form of $P_t$ and we do not need to provide any information about the penalties as this is irrelevant for the remaining decisions:

[Equation image: media_image11.png]


-	depending on whether the selected layer specifies an attention layer or a feed-forward neural network layer, 
In [3.1, Page 3]:
The encoder that we use (Figure 1) is similar to the encoder used in the Transformer architecture by Vaswani et al. (2017), but we do not use positional encoding such that the resulting node embeddings are invariant to the input order. 

[Image: media_image12.png]

Figure 1: Attention based encoder. Input nodes are embedded and processed by N sequential layers, each consisting of a multi-head attention (MHA) and node-wise feed-forward (FF) sub-layer. The graph embedding is computed as the mean of node embeddings. Best viewed in color.
in [6, Page 9]:
The multi-head attention mechanism can be seen as a message passing algorithm that allows nodes to communicate relevant information over different channels, such that the node embeddings from the encoder can learn to include valuable information about the node in the context of the graph. This information is important in our setting where decisions relate directly to the nodes in a graph. Being a graph based method, our model has increased scaling potential (compared to LSTMs) as it can be applied on a sparse graph and operate locally.
In [A ATTENTION MODEL DETAILS, Page 13]:
Attention mechanism 
We interpret the attention mechanism by Vaswani et al. (2017) as a weighted message passing algorithm between nodes in a graph. The weight of the message value that a node receives from a neighbor depends on the compatibility of its query with the key of the neighbor, as illustrated in Figure 4. 

[Image: media_image13.png]

Figure 4: Illustration of weighted message passing using a dot-attention mechanism. Only computation of messages received by node 1 are shown for clarity
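The weighted message passing described in the quoted passage can be illustrated with a small single-head sketch. It is a simplification under stated assumptions: the scaling by the square root of the key dimension follows the usual scaled dot-product form of Vaswani et al., learned projection matrices are omitted, and the three toy node vectors are hypothetical.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def attend(queries, keys, values):
    """Single-head dot-product attention over a fully connected graph.

    Node i receives from every node j the value v_j, weighted by
    softmax over j of (q_i . k_j / sqrt(d_k)).
    """
    d_k = len(keys[0])
    outputs, all_weights = [], []
    for q in queries:
        scores = [dot(q, k) / math.sqrt(d_k) for k in keys]
        peak = max(scores)  # subtract the max for numerical stability
        exps = [math.exp(s - peak) for s in scores]
        total = sum(exps)
        weights = [e / total for e in exps]
        message = [
            sum(w * v[c] for w, v in zip(weights, values))
            for c in range(len(values[0]))
        ]
        outputs.append(message)
        all_weights.append(weights)
    return outputs, all_weights


# Toy graph: queries, keys, and values all reuse the raw node features.
nodes = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
messages, weights = attend(nodes, nodes, nodes)
```

Each row of `weights` sums to 1, so every node's message is a convex combination of the values it receives, with more weight on nodes whose keys are more compatible with its query.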
In [3.1, Page 3]:
Attention layer 
Each attention layer consists of two sublayers: a multi-head attention (MHA) layer that executes message passing between the nodes and a node-wise fully connected feed-forward (FF) layer.
It would have been obvious to one of ordinary skill in the art before the effective filing date of the present application to combine LU, Chai, and Kool.
LU teaches using neural architecture search to produce the chosen machine-trained model and selecting models that meet the latency constraint.
Chai teaches a relevancy score and an encoding vector representation.
Kool teaches an attention router.
One of ordinary skill would have been motivated to combine LU, Chai, and Kool to obtain a better-performing attention model (Kool [5.2, Page 9]).
LU, Chai, and Kool do not explicitly disclose:
-	wherein the attention layer mutating process includes selecting a sparsity ratio that defines how many attention heads to remove in the attention layer, and wherein the feed-forward mutating process includes selecting another sparsity ratio that defines a reduction in weights in the feed-forward neural network layer.
However, Sundar discloses: 
-	wherein the attention layer mutating process includes selecting a sparsity ratio that defines how many attention heads to remove in the attention layer,
in [0037]:
 Referring to FIG. 2, an input image 220 (e.g., a still picture or a video frame) is fed to both the supernet 205 (also referred to herein as “teacher 205” or “Φ.sub.T”) and the subnet 201 (also referred to herein as “student 201” or “Φ.sub.S”) of the sparse distillation system 200. The supernet 205 may be the same or similar to the supernet 105 of FIG. 1, and the subnet 201 may be the same or similar to the subnet 101 of FIG. 1. During distillation 202, an output of the supernet 205 is used to train the subnet 201. This is discussed in more detail infra in section 1.1. FIG. 2 also shows an attention layer 210 that is part of the subnet 201. The attention layer 210 may be included as part of the subnet 201, or the subnet 201 may be communicatively coupled with the attention layer 210. The attention layer 210 includes a pruning mechanism 400 and an SA mechanism 300. FIGS. 4, 3, and 5 show the internal attention layer 210 computation mechanics of the student model Φ.sub.S. In particular, FIG. 4 shows the mechanics of the pruning mechanism 400 (discussed infra in section 1.3) and FIGS. 3 and 5 show the mechanics of the SA mechanism 300 (discussed infra in section 1.2).
 -	and wherein the feed-forward mutating process includes selecting another sparsity ratio that defines a reduction in weights in the feed-forward neural network layer.
in [0055]:
To prune the student model Φ.sub.S while simultaneously distilling knowledge from the teacher Φ.sub.T, the Φ.sub.S's total trainable parameters is/are updated, and then a mask is used to forcibly set a fraction of these parameters to zero. Here, the “mask” for individual layers is a binary tensor used to make sure the trainable parameters have a fixed number of non-zeros based on a given pruning budget.
(BRI: the fraction of the parameters set to zero provides the sparsity ratio)
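The masking step quoted from Sundar [0055] (a binary mask that forces a fixed fraction of a layer's parameters to zero) can be sketched as below. The sketch is illustrative only: the function names and the eight-weight layer are hypothetical, and choosing the smallest-magnitude weights for removal is an assumption, since the quoted paragraph does not say which parameters the mask selects.

```python
def sparsity_mask(weights, sparsity_ratio):
    """Binary mask that zeroes a `sparsity_ratio` fraction of the weights.

    ASSUMPTION: the smallest-magnitude weights are the ones dropped.
    """
    if not 0.0 <= sparsity_ratio <= 1.0:
        raise ValueError("sparsity_ratio must lie in [0, 1]")
    n_zero = int(round(sparsity_ratio * len(weights)))
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    dropped = set(order[:n_zero])
    return [0 if i in dropped else 1 for i in range(len(weights))]


def apply_mask(weights, mask):
    """Element-wise product of the weights and the binary mask."""
    return [w * m for w, m in zip(weights, mask)]


# Hypothetical flattened layer with eight weights, pruned at ratio 0.5.
layer = [0.9, -0.05, 0.4, 0.01, -0.7, 0.2, -0.3, 0.08]
mask = sparsity_mask(layer, 0.5)
pruned = apply_mask(layer, mask)
```

The number of ones in `mask` is the fixed count of non-zeros allowed by the pruning budget; here a ratio of 0.5 leaves four of the eight weights in place.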
In [0238]:
Model parameters are usually required by a model when making predictions, and their values define the skill of the model on a particular problem. Examples of such model parameters/parameters include weights (e.g., in an ANN); 
In [0037]:
The attention layer 210 may be included as part of the subnet 201, or the subnet 201 may be communicatively coupled with the attention layer 210. The attention layer 210 includes a pruning mechanism 400 and an SA mechanism 300. FIGS. 4, 3, and 5 show the internal attention layer 210 computation mechanics of the student model Φ.sub.S,
in [0203]:
 The terms “artificial neural network”, “neural network”, or “NN” refer to an ML technique comprising a collection of connected artificial neurons or nodes 
In [0203]:
the artificial neurons can be aggregated or grouped into one or more layers where different layers may perform different transformations on their inputs. Signals travel from the first layer (the input layer), to the last layer (the output layer), possibly after traversing the layers multiple times. NNs are usually used for supervised learning, but can be used for unsupervised learning as well. Examples of NNs include deep NN (DNN), feed forward NN (FFN). 
-	selecting another sparsity ratio that defines a reduction in weights in the feed-forward neural network layer.
in [0055]:
To prune the student model Φ.sub.S while simultaneously distilling knowledge from the teacher Φ.sub.T, the Φ.sub.S's total trainable parameters is/are updated, and then a mask is used to forcibly set a fraction of these parameters to zero. Here, the “mask” for individual layers is a binary tensor used to make sure the trainable parameters have a fixed number of non-zeros based on a given pruning budget.
(BRI: the fraction of the parameters set to zero provides the sparsity ratio and indicates the reduction in weights)
In [0238]:
Model parameters are usually required by a model when making predictions, and their values define the skill of the model on a particular problem. Examples of such model parameters/parameters include weights (e.g., in an ANN); 
in [0203]:
The terms “artificial neural network”, “neural network”, or “NN” refer to an ML technique comprising a collection of connected artificial neurons or nodes 
In [0203]:
the artificial neurons can be aggregated or grouped into one or more layers where different layers may perform different transformations on their inputs. Signals travel from the first layer (the input layer), to the last layer (the output layer), possibly after traversing the layers multiple times. NNs are usually used for supervised learning, but can be used for unsupervised learning as well. Examples of NNs include deep NN (DNN), feed forward NN (FFN). 
It would have been obvious to one of ordinary skill in the art before the effective filing date of the present application to combine LU, Chai, Kool and Sundar.
LU teaches using neural architecture search to produce the chosen machine-trained model and selecting models that meet the latency constraint.
Chai teaches a relevancy score and an encoding vector representation.
Kool teaches an attention model.
Sundar teaches parent and student models.
One of ordinary skill would have been motivated to combine LU, Chai, Kool and Sundar to improve the performance of the student model (Sundar [0039]).

Conclusion
Applicant's amendment necessitated the new ground(s) of rejection presented in this Office action. Accordingly, THIS ACTION IS MADE FINAL. See MPEP § 706.07(a). Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a).
A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action. In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any nonprovisional extension fee (37 CFR 1.17(a)) pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action. In no event, however, will the statutory period for reply expire later than SIX MONTHS from the mailing date of this final action.
Any inquiry concerning this communication or earlier communications from the examiner should be directed to TIRUMALE KRISHNASWAMY RAMESH whose telephone number is (571)272-4605. The examiner can normally be reached by phone.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Li B Zhen can be reached on phone (571-272-3768). The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format.



For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.


/TIRUMALE K RAMESH/Examiner, Art Unit 2121     


/Li B. Zhen/Supervisory Patent Examiner, Art Unit 2121
Prosecution Timeline

Jun 18, 2025: Non-Final Rejection mailed (§103)
Sep 18, 2025: Response Filed
Sep 22, 2025: Applicant Interview (Telephonic)
Sep 24, 2025: Examiner Interview Summary
Dec 19, 2025: Final Rejection mailed (§103)
Mar 03, 2026: Response after Non-Final Action
May 19, 2026: Request for Continued Examination
May 22, 2026: Response after Non-Final Action
Precedent Cases

Applications granted by this same examiner with similar technology

16/739,694; Patent 12518153; TRAINING MACHINE LEARNING SYSTEMS; 5y 12m to grant; Granted Jan 06, 2026
17/136,054; Patent 12293284; META COOPERATIVE TRAINING PARADIGMS; 4y 4m to grant; Granted May 06, 2025
17/064,561; Patent 12229651; BLOCK-BASED INFERENCE METHOD FOR MEMORY-EFFICIENT CONVOLUTIONAL NEURAL NETWORK IMPLEMENTATION AND SYSTEM THEREOF; 4y 4m to grant; Granted Feb 18, 2025
17/039,178; Patent 12131244; HARDWARE-OPTIMIZED NEURAL ARCHITECTURE SEARCH; 4y 1m to grant; Granted Oct 29, 2024
16/844,335; Patent 11803745; TERMINAL DEVICE AND METHOD FOR ESTIMATING FIREFIGHTING DATA; 3y 6m to grant; Granted Oct 31, 2023
Study what changed to get past this examiner. Based on 5 most recent grants.
Prosecution Projections

Expected OA Rounds: 2-3
Grant Probability: 18%
With Interview (+2.1%): 20%
Median Time to Grant: 4y 7m (~5m remaining)
PTA Risk: Moderate
Based on 40 resolved cases by this examiner. Grant probability derived from career allowance rate.