DETAILED ACTION
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Continued Examination Under 37 CFR 1.114
A request for continued examination under 37 CFR 1.114, including the fee set forth in 37 CFR 1.17(e), was filed in this application after final rejection. Since this application is eligible for continued examination under 37 CFR 1.114, and the fee set forth in 37 CFR 1.17(e) has been timely paid, the finality of the previous Office action has been withdrawn pursuant to 37 CFR 1.114. Applicant's submission filed on 12/12/2025 has been entered.
Response to Amendment
Applicant’s arguments with respect to claims 1, 9 and 15 have been considered but are moot because the new ground of rejection does not rely on any reference applied in the prior rejection of record for any teaching or matter specifically challenged in the argument.
In regard to 101 rejections
- The applicant argues on Page 9 that claim 1 provides a technical solution to a technical problem, supported by [0024] reciting "Weight-sharing NAS has become very important for hardware-aware NAS. Hardware-aware NAS is an automated technique to build neural networks and produces very efficient deep networks for a given hardware. However, instability in weight-sharing NAS is a critical problem. In addition to NAS, weight-sharing also presents significant challenges to other well-known and challenging problems like multi-task learning, multi-modal learning, etc.", and states that the solution to the problem is to determine a second loss (smoothing loss). The applicant further argues on Page 10, citing [0023], that the accuracy improves as a result of reduction of the Lipschitz constant (higher smoothness).
Examiner’s Response
The examiner recognizes that reducing the Lipschitz constant in a machine learning algorithm can provide a technical improvement to computer functionality, as it demonstrates a specific, technical improvement (such as increased stability, faster convergence, or reduced computational complexity) rather than just an abstract mathematical calculation. It shows that a model is more stable and less sensitive to small changes in input (noise). This can be argued as a technical improvement in machine learning architecture, such as requiring less processing power or improving the reliability of the output, which directly relates to better computer functionality and moves the claim beyond abstract ideas.
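For illustration only (not part of the record): the Lipschitz constant of a linear layer x → Wx is its spectral norm (largest singular value), and it bounds how much a small input perturbation can change the output, which is the stability property discussed above. A minimal numpy sketch, with arbitrary illustrative shapes:

```python
import numpy as np

def spectral_norm(W: np.ndarray) -> float:
    """Largest singular value of W = Lipschitz constant of x -> W @ x."""
    return float(np.linalg.svd(W, compute_uv=False)[0])

rng = np.random.default_rng(0)
W = rng.normal(size=(4, 4))
L = spectral_norm(W)

# The output perturbation is bounded by L times the input perturbation,
# so a smaller L means less sensitivity to input noise.
x = rng.normal(size=4)
dx = rng.normal(size=4) * 1e-3
out_change = np.linalg.norm(W @ (x + dx) - W @ x)
assert out_change <= L * np.linalg.norm(dx) + 1e-12
```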
In CONCLUSION, the examiner WITHDRAWS the 101 rejections on claims 1-7, 9-13, and 15-19. The examiner submits that claims 21-22 are new dependent claims added under the current amendment.
In regard to 103 rejections
- The applicant argues on Page 13, with respect to the amended claims and the teachings of reference Jiang2, that Roth and Jiang2 fail to teach the problem of instability. The applicant also argues on Page 13 that the portions of the supernet are not the same as the "sample sub-network" of claim 1 as amended.
Examiner’s Response
Without conceding the merits of the previous prior art, the examiner submits that the applicant's arguments are MOOT because the new ground of rejection does not rely on any reference applied in the prior rejection of record for any teaching or matter specifically challenged in the argument. The examiner has used new references "Peng", "Mok" and "Dong", which strongly teach all the amendments of the claims.
Claim Rejections – 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.
Claims 1-4, 7, 9-13, 15-19 and 21 are rejected under 35 U.S.C. 103 as being unpatentable over
Holger Roth et al. (hereinafter Roth), US 2021/0374502 A1,
in view of Jiefeng Peng et al. (hereinafter Peng), Pi-NAS: Improving Neural Architecture Search by Reducing Supernet Training Consistency Shift, Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021,
in view of Jisoo Mok et al. (hereinafter Mok), AdvRush: Searching for Adversarially Robust Neural Architectures, Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021.
In regard to Claim 1: (Currently Amended)
Roth discloses:
- A computer-implemented method of automated selection of neural architecture, the method comprising:
In [0078]:
one or more processors selecting a sub-network for an input 302, at each FL client site, is performed by passing input (accessible only to each individual FL client site) through supernet 304 and choosing an optimal path to construct sub-network accordingly. In at least one embodiment, selecting a sub-network in this manner leverages concepts from a Neural Architecture Search (NAS), which is used to design neural network automatically with limited human heuristics to meet different user requirements (e.g., light-weight model, or small amount of computation)
In [0061] :
In one embodiment, once a supernet has been trained sufficiently, one or more processors at each FL client site selects a neural network. In one embodiment, a selected neural network is an optimal neural network, which may also be referred to as a sub-network, with a best path selected from a plurality of neural networks. In one embodiment, for each unseen data point, at each FL client site, one or more processors determine, with guidance of additional unsupervised loss functions at inference, which neural network to select as a sub-network
(BRI: the additional guidance provided for an unseen data point is a stability consideration for the selection)
- accessing training data for a chosen task, the training data including a plurality of training inputs and corresponding training outputs for a chosen task;
In [0227]:
In at least one embodiment, server(s) 1278 may be used to train machine learning models (e.g., neural networks) based at least in part on training data. In at least one embodiment, training data may be generated by vehicles, and/or may be generated in a simulation (e.g., using a game engine).
In [0071]:
a supernet 204 is a combination of two or more networks with super blocks 106, which contains candidate block choices that are formed into a larger network.
in [0059]:
In at least one embodiment, techniques described herein are applicable such that a neural network from a plurality of neural networks is selected for an image and, more specifically, for a medical image; however, techniques described herein are also applicable to other types of inputs (non-limiting examples include video, integers, audio, or characters) inferenced by neural networks.
In [0088]:
In at least one embodiment, when doing computer vision tasks, one or more processors, at each FL client site, process different images using different neural networks.
In [0185]:
Embodiments described herein allow for multiple neural networks to be performed simultaneously and/or sequentially, and for results to be combined together to enable Level 3-5 autonomous driving functionality. For example, in at least one embodiment, a CNN executing on a DLA or a discrete GPU (e.g., GPU(s) 1220) may include text and word recognition, allowing reading and understanding of traffic signs, including signs for which a neural network has not been specifically trained.
In [0128]:
In at least one embodiment, a steering wheel may be optional for full automation (Level 5) functionality. In at least one embodiment, a brake sensor system 1246 may be used to operate vehicle brakes in response to receiving signals from brake actuator(s) 1248 and/or brake sensors.
(BRI: all of the above (image, text, sensors) constitute a plurality of training inputs)
in [0078]:
In one embodiment, a supernet 304 is a neural network that comprises a plurality of neural networks. In one embodiment, a supernet 304 is a large neural network with candidate modules 312 in parallel at different levels. In one embodiment, a network is trained jointly or with sampled paths/modules from an entire network, using Reinforcement Learning (RL) algorithms,
In [0078]:
selecting a sub-network in this manner leverages concepts from a Neural Architecture Search (NAS), which is used to design neural network automatically
In [0324]:
In one embodiment, inference and/or training logic 915 are used to select a neural network for a data point in a federated learning (FL) setting. In one embodiment, inference and/or training logic 915 provides results from training different portions of a supernet, at different computing systems, to train a supernet. Once a supernet has been trained, an optimal neural network for data point, at each different computing system, is determined. In one embodiment, inference and/or training logic 915 determines an optimal neural network, at a computing system, with guidance from a local validation set and/or loss functions.
in [0052]:
FIG. 39 is a system diagram for an example system for training, adapting, instantiating and deploying machine learning models in an advanced computing pipeline, in accordance with at least one embodiment
In [0094]:
In at least one embodiment, with respect to implementation, supernet is trained using randomly cropped patches of size 256×256×32 from input images and labels. In at least one embodiment, a mini-batch size of 4 is used by selecting two random crops from any two random input image and label pairs.
Examiner’s BRI
(Selecting two or more random crops from an input image, when coupled with a proper labeling strategy (such as ensuring the label reflects the visible content in each crop), is a powerful data augmentation technique that represents training outputs corresponding to a plurality of potential inputs.)
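For illustration only: the random-crop training described in [0094] (paired crops taken identically from an image and its label map) can be sketched in 2D as below; the 64×64 arrays, 16×16 crop size, and toy label map are illustrative stand-ins, not Roth's 256×256×32 patches:

```python
import numpy as np

def random_crop_pair(image, label, size, rng):
    """Take the same random crop from an image and its label map."""
    h, w = image.shape
    y = int(rng.integers(0, h - size + 1))
    x = int(rng.integers(0, w - size + 1))
    return image[y:y+size, x:x+size], label[y:y+size, x:x+size]

rng = np.random.default_rng(0)
image = rng.normal(size=(64, 64))
label = (image > 0).astype(np.int64)  # toy per-pixel ground-truth map

# A mini-batch of two random crops from one image/label pair.
batch = [random_crop_pair(image, label, 16, rng) for _ in range(2)]
assert all(img.shape == (16, 16) and lab.shape == (16, 16)
           for img, lab in batch)
```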
In [0070]:
In at least one embodiment, p_i is a predicted probability from a final sigmoid activated output layer of supernet f(X) and g_i is a ground truth label map at a given voxel i.
in [0110]:
In at least one embodiment, untrained neural network 1006 is trained using supervised learning, wherein training dataset 1002 includes an input paired with a desired output for an input, or where training dataset 1002 includes input having a known output and an output of neural network 1006 is manually graded.
In [0081] :
In at least one embodiment, after one or more processors at client server trains a supernet, a unique sub-network for each input, at each FL client's computing system
In [0063]:
In at least one embodiment, neural network 104 is a supernet, which may also be referred to as a supernetwork, model architecture, and/or a neural network comprising a plurality of neural networks.
In [0077]:
In at least one embodiment, during training, one or more processors at each FL client site sample one path from each searched layer from super blocks 206 of a supernet 204 uniformly at each iteration, and parameters of new sub-networks are updated during gradient back-propagation,
In [0077]:
weights of paths are also updated during training
in [0078]:
FIG. 3 illustrates a diagram 300 of an overall framework on how a neural network (e.g., sub-network) is selected for an input (e.g., 3D images) 302 at a federated learning (FL) client site, according to at least one embodiment. In at least one embodiment, one or more processors selecting a sub-network for an input 302, at each FL client site, is performed by passing input (accessible only to each individual FL client site) through supernet 304 and choosing an optimal path to construct sub-network accordingly. In at least one embodiment, selecting a sub-network in this manner leverages concepts from a Neural Architecture Search (NAS), which is used to design neural network automatically with limited human heuristics to meet different user requirements (e.g., light-weight model, or small amount of computation).
Examiner’s BRI
(The paths and modules (configuration) are "architecture parameters". The concept of "optimal path" or "subnetwork selection" is crucial in computer networking and related fields. It refers to finding the best route or subset of a larger network to optimize performance, manageability, security, or other desired criteria.)
- configuring, by a supervised learning controller, a super-net to implement a sample of sub-networks of the super-net, the super-net having a plurality of network weights and a plurality of architecture parameters, where the architecture parameters represent the importance of different architecture choices at various locations inside the super-net;
In [0098]:
In at least one embodiment, inference and/or training logic 915 may include, without limitation, code and/or data storage 901 to store forward and/or output weight and/or input/output data, and/or other parameters to configure neurons or layers of a neural network trained and/or used for inferencing in aspects of one or more embodiments. In at least one embodiment, training logic 915 may include, or be coupled to code and/or data storage 901 to store graph code or other software to control timing and/or order, in which weight and/or other parameter information is to be loaded to configure, logic, including integer and/or floating point units (collectively, arithmetic logic units (ALUs). In at least one embodiment, code, such as graph code, loads weight or other parameter information into processor ALUs based on an architecture of a neural network to which such code corresponds.
In [0094]:
In at least one embodiment, with respect to implementation, supernet is trained using randomly cropped patches of size 256×256×32 from input images and labels. In at least one embodiment, a mini-batch size of 4 is used by selecting two random crops from any two random input image and label pairs.
In [0068]:
In at least one embodiment, during training, at each client 106, 108, one or more processors chose an arbitrary path m from module candidates M following a uniform sampling scheme (as shown and described in more detail in FIG. 3) to define a sub-network s sampled from supernet.
[Image omitted: media_image1.png (greyscale, 722×977)]
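For illustration only: the uniform sampling scheme quoted from [0068] (choose one candidate module per searched layer to define a sub-network) can be sketched as follows; the layer and candidate counts are made up for the example:

```python
import numpy as np

def sample_subnetwork(num_layers, num_candidates, rng):
    """Pick one candidate module per searched layer, uniformly at random.

    The returned index list defines one path (sub-network) through the
    supernet, as in the uniform sampling scheme cited above.
    """
    return [int(rng.integers(0, num_candidates)) for _ in range(num_layers)]

rng = np.random.default_rng(0)
path = sample_subnetwork(num_layers=5, num_candidates=3, rng=rng)
assert len(path) == 5 and all(0 <= m < 3 for m in path)
```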
In [0061] :
In one embodiment, once a supernet has been trained sufficiently, one or more processors at each FL client site selects a neural network. In one embodiment, a selected neural network is an optimal neural network, which may also be referred to as a sub-network, with a best path selected from a plurality of neural networks.
in [0110]:
In at least one embodiment, untrained neural network 1006 is trained using supervised learning, wherein training dataset 1002 includes an input paired with a desired output for an input, or where training dataset 1002 includes input having a known output and an output of neural network 1006 is manually graded.
Examiner’s BRI
(Supervised learning to train a supernet (a network comprising a plurality of neural network paths or sub-networks) is a foundational technique in NAS that provides a mechanism to implement, evaluate, and select from a sample of sub-networks.)
In [0064]:
a trained portion of supernet 104 is a selected neural network, which may also be referred to as an optimal neural network, and/or a sub-network for each FL client site
In [0064]:
after one or more processors conduct several training rounds in a FL setting, trained portions from each client A 106 and client B 108 are converged. In at least one embodiment, each client 106, 108 is allowed to select a locally best model (e.g., sub-network) by monitoring a certain performance metric on a local hold out validation set.
Examiner’s BRI
(Selecting the best sub-network (or sub-architecture) is a critical component of architecture choices, as it directly impacts model performance, computational efficiency, and generalization)
- generating, by the super-net, sub-network outputs responsive to the training inputs for the sample of sub-networks;
In [0098]:
In at least one embodiment, code and/or data storage 901 stores weight parameters and/or input/output data of each layer of a neural network trained or used in conjunction with one or more embodiments during forward propagation of input/output data and/or weight parameters during training and/or inferencing using aspects of one or more embodiments.
In [0089]:
In at least one embodiment, second image is fed into a trained supernet, and a reconstruction decoder generates segmentation masks accordingly. In at least one embodiment, during training, one path is sampled from each searched layer from super blocks of supernet uniformly at each iteration, and parameters of new sub-networks are updated during gradient back-propagation. In at least one embodiment, sampling one path is achieved by setting weights of selected path to 1, and remaining to 0. In at least one embodiment, a path with large weights will have enough updates, and ones with less weight did not process enough training samples. In at least one embodiment, a subnetwork (e.g., path, optimal neural network) is selected for second image based on updated weights.
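For illustration only: the mechanism in [0089] (weight of the selected path set to 1, remaining paths to 0) amounts to one-hot gating over candidate outputs; the toy candidate functions below are stand-ins, not Roth's modules:

```python
import numpy as np

def gated_layer(x, candidates, selected):
    """Mix candidate outputs with one-hot gates: selected path 1, others 0."""
    gates = np.zeros(len(candidates))
    gates[selected] = 1.0
    return sum(g * f(x) for g, f in zip(gates, candidates))

# Toy candidate modules for one searched layer (illustrative only).
candidates = [lambda x: x + 1, lambda x: 2 * x, lambda x: x ** 2]
x = np.array([3.0])

# Selecting path 1 passes only the 2*x candidate through.
assert float(gated_layer(x, candidates, selected=1)[0]) == 6.0
```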
- training, by the supervised learning controller, network weights and architecture parameters of a super-net including a plurality of sub-networks, where the architecture parameters of [[a]] super-net represent, the training including:
In [0063]:
In at least one embodiment, neural network 104 is a supernet, which may also be referred to as a supernetwork, model architecture, and/or a neural network comprising a plurality of neural networks.
In [0077]:
In at least one embodiment, during training, one or more processors at each FL client site sample one path from each searched layer from super blocks 206 of a supernet 204 uniformly at each iteration, and parameters of new sub-networks are updated during gradient back-propagation,
In [0077]:
weights of paths are also updated during training
- accessing the training data for [[a]] the chosen task;
in [0110]:
In at least one embodiment, untrained neural network 1006 is trained using supervised learning, wherein training dataset 1002 includes an input paired with a desired output for an input, or where training dataset 1002 includes input having a known output and an output of neural network 1006 is manually graded.
In [0062]:
In at least one embodiment, supernet includes a plurality of neural network models, where each of these neural network models are adapted according to inputs or domains for 3D medical image segmentation
- [[and]] selecting a sub-network of the plurality of sub-networks for the chosen task based on the largest adjusted architecture parameters;
In [0061]:
In at least one embodiment, once a supernet has been trained sufficiently, one or more processors at each FL client site selects a neural network. In at least one embodiment, a selected neural network is an optimal neural network, which may also be referred to as a sub-network, with a best path selected from a plurality of neural networks.
In [0087]:
In at least one embodiment, during training, one path is sampled from each searched layer from super blocks of supernet uniformly at each iteration, and parameters of new sub-networks are updated during gradient back-propagation.
In [0121]:
For example, in at least one embodiment, a machine learning model may be trained by calculating weight parameters according to a neural network architecture
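For illustration only: selection "based on the largest adjusted architecture parameters", as recited in the limitation, corresponds to a per-layer argmax over architecture parameters; the α values below are made up:

```python
import numpy as np

# Architecture parameters: one row per searched layer, one column per
# candidate operation (illustrative values only).
alphas = np.array([[0.1, 0.7, 0.2],
                   [0.5, 0.3, 0.2]])

# Select, per layer, the candidate with the largest parameter.
selected = alphas.argmax(axis=1)
assert selected.tolist() == [1, 0]
```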
- and outputting a description of the network architecture of the selected sub-network for the chosen task.
In [0067]:
In at least one embodiment, training different portions of supernet 104, at each FL client site, is performed by a processor at each FL client site passing a data point through supernet 104 that results in selecting a sub-network from supernet 104. In at least one embodiment, selected sub-network is a trained portion of supernet 104. In at least one embodiment, supernet S comprises various DL module candidates M suitable for 3D medical imaging tasks shown in Table 1 below
PNG
media_image2.png
257
543
media_image2.png
Greyscale
Roth does not explicitly disclose:
- determining, from the sub-network outputs generated by the super-net hardware, a first loss based on accumulated differences between the sub-network outputs and corresponding training outputs
- accumulating a second loss over the sample of sub-networks, the second loss based, at least in part, on a sum, over layers of a sub-network, of the measures of smoothness of the layers;
- and adjusting network weights and architecture parameters of the super-net to reduce a combination of the first and second losses, where the network weights of the super-net are shared with one or more sub-networks of the plurality of sub-networks and are trained jointly, where the second loss penalizes sub- networks with lower smoothness and improves stability of the adjustment;
However, Peng discloses:
- determining, from the sub-network outputs generated by the super-net hardware, a first loss based on accumulated differences between the sub-network outputs and corresponding training outputs
In [1, Page 12355]:
In this paper, we attribute the ranking correlation problem to the supernet training consistency shift, including feature shift and parameter shift. Feature shift is identified as dynamic input distributions of a hidden layer. Specifically, a given layer’s input feature maps always have an uncertain distribution due to random path sampling (see Figure 1a, left). This distribution uncertainty can hurt the architecture ranking correlation. Precisely, we can use the loss to measure the architecture accuracy, and we can link the accuracy ascent to gradient descent. Based on the back-propagation rule, a stable input distribution can guarantee a good ranking correlation. In contrast, the input distribution dynamic affects the loss descent and finally affects architecture ranking. Parameter shift is identified as contradictory parameter updates for a given layer. In supernet training, a given layer will always be present in different paths from iteration to iteration (see Figure 1b, left). The parameter in this layer may have a contradictory update from iteration to iteration. These unstable updates lead to varying parameters’ distributions, hurting the architecture ranking correlation in two ways.
Examiner’s BRI
(Parameter shifts (updates) via gradient descent are specifically designed to reduce the training loss, while feature shifts generally refer to changes in data distribution that can cause the training loss to become unreliable (increase or diverge). The shift (loss) is the first loss.)
- accumulating a second loss over the sample of sub-networks, the second loss based, at least in part, on a sum, over layers of a sub-network, of the measures of smoothness of the layers;
In [1, Page 12355]:
In this paper, we attribute the ranking correlation problem to the supernet training consistency shift, including feature shift and parameter shift.
Examiner’s BRI
(Parameter shifts (updates) via gradient descent are specifically designed to reduce the training loss, while feature shifts generally refer to changes in data distribution that can cause the training loss to become unreliable (increase or diverge). The shift (loss) is the first loss.)
Motivated by consistency regularization methods [29, 44], we propose a nontrivial supernet-Π model, called Π-NAS, to reduce these two shifts simultaneously. Specifically, to cope with the feature shift, we propose a novel supernet-Π model. We evaluate each data point through two randomly sampled paths, then apply a consistency cost between the two predictions to penalize the feature consistency shift between different paths.
As shown in Figure 1a (right), our method can significantly reduce the feature shift and thus can improve the architecture ranking correlation. To address the parameter shift, we propose a novel non-trivial mean teacher model by maintaining an exponential moving average of weights in supernet teacher.
Examiner’s BRI
(Evaluating each data point through two randomly sampled paths (or perturbations) and applying a consistency cost (or loss) between the two predictions is a standard technique used to penalize loss of smoothness, often called consistency regularization or consistency training, which is the second loss.)
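For illustration only: the consistency cost Peng describes (same input through two sampled paths, penalize disagreement between the two predictions) can be sketched as a mean-squared consistency term; the two toy "paths" below are stand-ins for sampled sub-networks:

```python
import numpy as np

def consistency_loss(pred_a, pred_b):
    """Mean-squared disagreement between two paths' predictions."""
    return float(np.mean((pred_a - pred_b) ** 2))

x = np.array([1.0, 2.0, 3.0])
path_a = lambda v: 1.0 * v   # toy sampled path 1
path_b = lambda v: 1.1 * v   # toy sampled path 2

# Disagreeing paths incur a positive consistency cost.
assert consistency_loss(path_a(x), path_b(x)) > 0.0
# Identical predictions incur zero consistency cost.
assert consistency_loss(path_a(x), path_a(x)) == 0.0
```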
In [3.2, Page 12358]:
we propose to maintain an exponential moving average weights for teacher model rather than barely replicate from student model in supernet-Π model training. Formally, we denote W_t as parameters of student mapping function f at training step t. Then, weights of mean teacher model f′ can be defined as:
[Equation image omitted: media_image3.png (greyscale, 46×560)]
where λ ∈ [0, 1] is a smoothing coefficient hyper-parameter. A low λ close to 0 provides greater smoothing and a higher λ close to 1 provides less smoothing. With high smoothing (λ closer to 0), the weights are the same.
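For illustration only, and under the reading of the quoted passage given above (low λ yields greater smoothing, so the teacher barely moves), the mean-teacher update would take the form W′_t = (1 − λ)·W′_{t−1} + λ·W_t. The exact equation is in the omitted image, so this form is an assumption; a sketch:

```python
import numpy as np

def ema_update(teacher, student, lam):
    """Mean-teacher exponential moving average (assumed form: low lam =>
    heavier smoothing, per the reading above)."""
    return (1.0 - lam) * teacher + lam * student

teacher = np.array([1.0, 1.0])
student = np.array([3.0, 5.0])

# lam -> 0: maximal smoothing, teacher weights stay the same.
assert np.allclose(ema_update(teacher, student, lam=0.0), teacher)
# lam -> 1: no smoothing, teacher copies the student.
assert np.allclose(ema_update(teacher, student, lam=1.0), student)
```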
Examiner’s BRI
(Architecture ranking in Neural Architecture Search (NAS) can provide measures of search space smoothness, particularly in the context of ensuring that small, incremental changes to an architecture produce correspondingly small changes in performance.)
- and adjusting network weights and architecture parameters of the super-net to reduce a combination of the first and second losses, where the network weights of the super-net are shared with one or more sub-networks of the plurality of sub-networks and are trained jointly, where the second loss penalizes sub- networks with lower smoothness and improves stability of the adjustment;
In [3.2, Page 12358]:
we propose to maintain an exponential moving average weights for teacher model rather than barely replicate from student model in supernet-Π model training. Formally, we denote W_t as parameters of student mapping function f at training step t. Then, weights of mean teacher model f′ can be defined as:
[Equation image omitted: media_image3.png (greyscale, 46×560)]
where λ ∈ [0, 1] is a smoothing coefficient hyper-parameter. A low λ close to 0 provides greater smoothing and a higher λ close to 1 provides less smoothing. With high smoothing (λ closer to 0), the weights are the same.
In [1, Page 12355]:
Parameter shift is identified as contradictory parameter updates for a given layer. In supernet training, a given layer will always be present in different paths from iteration to iteration (see Figure 1b, left).
In [1, Page 12354]:
[Figure 1 image omitted: media_image4.png (greyscale, 46×684)]
(b) Parameter shift. Different colors represent the distribution of parameters in different iterations. Left: without our nontrivial mean teacher, the parameter has significantly varying distributions in training. Right: with our nontrivial mean teacher, the parameter shift is significantly reduced.
Figure 1: Illustration of supernet training consistency shift
In [1, 12355]:
The parameter in this layer may have a contradictory update from iteration to iteration. These unstable updates lead to varying parameters’ distributions, hurting the architecture ranking correlation in two ways. On the one hand, stable parameters can ensure a correct loss descent and guarantee an accurate architecture ranking, while frequent parameter change could not preserve architecture ranking. On the other hand, varying parameters can also result in a feature shift, further hurting architecture ranking correlation. In summary, both feature shift and parameter shift can hurt the architecture ranking correlation.
In [1, 12355]:
Motivated by consistency regularization methods [29, 44], we propose a nontrivial supernet-Π model, called Π-NAS, to reduce these two shifts simultaneously. Specifically, to cope with the feature shift, we propose a novel supernet-Π model. We evaluate each data point through two randomly sampled paths, then apply a consistency cost between the two predictions to penalize the feature consistency shift between different paths. As shown in Figure 1a (right), our method can significantly reduce the feature shift and thus can improve the architecture ranking correlation.
In [2, Pages 12355-12356]:
To alleviate the computational overhead caused by the training process, researchers starts to share the weights among candidate architectures.
In [2, Page 12356]:
Gradient-based weight sharing methods [36, 9, 54, 63] jointly optimize the shared network parameters and the architecture choosing factors by gradient descent
In [2, Page 12356]:
the supernet is first optimized with path sampling, and then sub-models are sampled and evaluated with the weights inherited from the supernet
Examiner’s BRI
(It is perhaps known to the POSITA that gradient-based weight sharing methods, particularly in Neural Architecture Search (NAS), are designed to jointly optimize a shared network (often called a supernet), which represents multiple potential subnetworks trained simultaneously with one set of shared weights, allowing different subnetworks to converge together.)
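For illustration only: gradient-based weight sharing as summarized in the quote (jointly optimizing shared network parameters and architecture choosing factors) is commonly implemented by mixing candidate outputs with a softmax over architecture parameters, in the style of differentiable NAS. The candidate ops and shapes below are made up; only the forward pass is sketched:

```python
import numpy as np

def softmax(a):
    e = np.exp(a - a.max())
    return e / e.sum()

def mixed_op(x, weights, alphas):
    """Continuous relaxation: weight each candidate op by softmax(alpha).

    `weights` are the shared network parameters; `alphas` are the
    architecture choosing factors, both updated by gradient descent.
    """
    candidates = [x @ W for W in weights]
    mix = softmax(alphas)
    return sum(m * c for m, c in zip(mix, candidates))

rng = np.random.default_rng(0)
x = rng.normal(size=(2, 3))
weights = [rng.normal(size=(3, 3)) for _ in range(3)]
alphas = np.array([0.2, 1.5, -0.3])
out = mixed_op(x, weights, alphas)
assert out.shape == (2, 3)
```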
The examiner interprets the invention as "using a neural network to learn to extend the quantization and provide a high accuracy to reflect the original signal, in which the learning is performed by computing and inferring the loss function".
Within the context of the core of the invention, the prior art combination teaches the invention, making the case for "motivation to combine".
It would have been obvious to one of ordinary skill in the art before the effective filing date of the present application to combine Roth and Peng.
Roth teaches a super-net including a plurality of sub-networks and first loss.
Peng teaches second loss accumulating a loss over a sample of sub-networks, and adjusting network weights to reduce the accumulated loss and measures of smoothness.
Within the context of the theme and teaching of Roth and Peng, it would have been obvious to a POSITA to combine Roth and Peng.
One of ordinary skill would have had motivation to combine Roth and Peng because combining a first loss related to training data (e.g., standard cross-entropy or MSE) with a second loss related to a sample of sub-networks (e.g., ranking loss) generally provides improved architecture ranking correlation and better performance in Neural Architecture Search (NAS) (Peng [1, 12355]).
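For illustration only: the combination of losses referred to above is conventionally written as a task loss plus a weighted secondary term, L_total = L_first + β·L_second; the values and the weighting factor β below are made up:

```python
import numpy as np

def combined_loss(pred, target, second_loss, beta):
    """First loss (mean-squared task error) plus beta-weighted second loss."""
    first = float(np.mean((pred - target) ** 2))
    return first + beta * second_loss

pred = np.array([0.9, 0.1])
target = np.array([1.0, 0.0])

# A positive second loss strictly increases the combined objective,
# penalizing the corresponding sub-network.
with_penalty = combined_loss(pred, target, second_loss=0.5, beta=0.1)
without_penalty = combined_loss(pred, target, second_loss=0.0, beta=0.1)
assert with_penalty > without_penalty
```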
Roth and Peng do not explicitly disclose:
- for layers of sub-networks in a sample of sub-networks, determining a measure of a smoothness of a layer from network weights in the layer, the measure of smoothness related to the maximum difference between outputs from the layer relative to any possible difference between inputs to the layer, a higher maximum change in output indicating a lower smoothness;
However, Mok discloses:
- for layers of sub-networks in a sample of sub-networks, determining a measure of a smoothness of a layer from network weights in the layer, the measure of smoothness related to the maximum difference between outputs from the layer relative to any possible difference between inputs to the layer, a higher maximum change in output indicating a lower smoothness;
In [4, Page 12325]:
the problem of searching for an adversarially robust neural architecture can be re-formulated into the problem of searching for a neural architecture with a smooth input loss landscape. Since it is computationally infeasible to calculate the curvature of f_A(w_t) for every w_t, we opt to evaluate candidate architectures after standard training.
(Note: f_A(w_t) is the input loss landscape.)
AdvRush penalizes candidate architectures with large λ_max(H_std). By favoring a candidate neural architecture with a smoother loss landscape, AdvRush effectively searches for a robust neural architecture.
(Note: Within the context of adversarial loss, λ_max is a hyperparameter acting as a weighting factor to balance the importance of the adversarial loss over other losses. A lower λ_max indicates a smoother loss landscape.)
In [3.2, Page 12324]:
Consider f_A(w_std) and f_A(w_adv), two independently trained neural networks with an identical neural architecture f_A(·). We define Ω to be a set of w_t which interpolates between w_std and w_adv along some parametric curve.
In [3.2, Page 12325]:
Geometrically speaking, adversarial attack methods perturb x_std in a direction that maximizes the change in L_std. The resulting adversarial examples x_adv fool f_A(w_t) by targeting the steep trajectories on the input loss landscape, crossing the decision boundary of a neural network with as little effort as possible.
In [3.2, Page 12325]:
Therefore, the more curved the input loss landscape of f_A(w_t) is, the more likely its predictions are to be corrupted by adversarial examples.
Examiner’s BRI:
(BRI: A parametric curve can represent a layer in a neural network, specifically in the context of modeling or designing certain architectures. Based on the description, the concept refers to using Lipschitz continuity as a proxy for measuring the smoothness of neural network layers. Specifically, this involves analyzing the spectral norm, or maximum singular value, of the weight matrices within each layer, which determines the maximum possible ratio between the difference in outputs and the difference in inputs (i.e., the Lipschitz constant). Perhaps it is known to the POSITA that a spectral norm is the largest singular value of the matrix.)
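The BRI above can be illustrated with a brief numerical sketch (illustrative code, not the applicant's or any cited reference's implementation): for a linear layer, the spectral norm of the weight matrix bounds the ratio of any output difference to the corresponding input difference, i.e., it is the layer's Lipschitz constant.

```python
import numpy as np

# Hypothetical linear layer: its Lipschitz constant is the spectral norm of W,
# the largest singular value, so ||W x1 - W x2|| <= lip * ||x1 - x2|| always.
rng = np.random.default_rng(42)
W = rng.standard_normal((8, 16))   # weights of an illustrative linear layer
lip = np.linalg.norm(W, 2)         # spectral norm = maximum singular value

x1 = rng.standard_normal(16)
x2 = rng.standard_normal(16)
ratio = np.linalg.norm(W @ x1 - W @ x2) / np.linalg.norm(x1 - x2)
print(ratio <= lip + 1e-12)  # the output change never exceeds lip times the input change
```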
In [3.2, Page 12325]:
Consider two types of loss functions for updating ω of an arbitrary neural network f_w(ω): a standard loss and an adversarial loss. On one hand, standard training uses clean input data x_std to update ω, such that the standard loss L_std = L(f_w(ω), x_std) is minimized. On the other hand, adversarial training uses adversarially perturbed input data x_adv to update ω, such that the adversarial loss L_adv = L(f_w(ω), x_adv) is minimized. From here on, we refer to f_w(ω) after standard training and after adversarial training as f_w(w_std) and as f_w(w_adv), respectively.
In [2.2, Page 12324]:
a gradient-based method, suggests to use the Lipschitz characteristics of the architecture parameters to achieve the target Lipschitz constant.
In [4, Page 12325]:
robust neural architecture can be re-formulated into the problem of searching for a neural architecture with a smooth input loss landscape. Since it is computationally infeasible to calculate the curvature of f_A(w_t) for every w_t, we opt to evaluate candidate architectures after standard training:
[Equation image: media_image5.png]
where H_std refers to the Hessian of L_std of f_A(w_t) at clean input x_std, and λ_max(H_std) refers to the largest eigenvalue of H_std.
In [4.2, Page 12326]: Approximation of L_λ
‖H_std‖_F can be expressed in terms of the l2 norm:
[Equation image: media_image6.png]
where the expectation is taken over z ∼ N(0, I_d). Because the direct computation of H_std is expensive, we linearly approximate it through the finite difference approximation of the Hessian:
[Equation image: media_image7.png]
where h controls the scale of the loss landscape on which we induce smoothness. However, computing multiple H_std·z in directions drawn from z ∼ N(0, I_d) and taking its average would be computationally inefficient because each computation of H_std·z requires calculation of the gradient. Therefore, we minimize the input loss landscape along the high curvature direction,
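The finite-difference Hessian-vector approximation quoted above can be illustrated with a small numerical sketch (hypothetical code, not Mok's implementation), here checked on a quadratic loss whose Hessian is known exactly:

```python
import numpy as np

# Hypothetical quadratic loss L(x) = 0.5 * x^T A x, whose Hessian is exactly A,
# so the finite-difference Hessian-vector product can be checked directly.
rng = np.random.default_rng(0)
M = rng.standard_normal((5, 5))
A = M @ M.T  # symmetric positive semi-definite Hessian

def grad(x):
    """Gradient of L(x) = 0.5 * x^T A x, which is A x."""
    return A @ x

def hvp_fd(x, z, h=1e-4):
    """Finite-difference Hessian-vector product: H z ≈ (∇L(x+hz) − ∇L(x−hz)) / (2h)."""
    return (grad(x + h * z) - grad(x - h * z)) / (2 * h)

x = rng.standard_normal(5)
z = rng.standard_normal(5)
print(np.allclose(hvp_fd(x, z), A @ z, atol=1e-6))  # estimate matches the true H z
```

Only gradient evaluations are needed, which is the efficiency point the quoted passage makes.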
In [4.1, Page 12325]:
AdvRush accomplishes the above objective by driving the eigenvalues of H_std of f_super to be small. Consequently, their maximum, λ_max(H_std), will also be small.
Examiner’s BRI
(The standard loss's l2 operator norm is the same as the spectral norm, which is defined as the maximum singular value of a matrix, representing the maximum expansion applied to a vector, and is calculated as the square root of the largest eigenvalue. Adversarial robustness is linked to the smoothness of the loss, where flatter landscapes (lower Lipschitz constant) prevent small input perturbations from altering predictions. The maximum eigenvalue represents the maximum curvature of a function at a specific point. It is fundamental in optimization, determining stability and learning rates. A smaller λ_max indicates a flatter, smoother function where the gradient changes slowly, and a larger λ_max indicates a sharp, high-curvature region, indicating less smoothness.)
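As a small illustrative check of the identity relied on in the BRI above (not code from any cited reference), the spectral norm of a weight matrix equals the square root of the largest eigenvalue of WᵀW:

```python
import numpy as np

# Illustrative check: spectral norm (largest singular value) of W equals the
# square root of the largest eigenvalue of W^T W.
rng = np.random.default_rng(1)
W = rng.standard_normal((6, 4))
spectral = np.linalg.norm(W, 2)                 # largest singular value of W
lam_max = np.max(np.linalg.eigvalsh(W.T @ W))   # largest eigenvalue of W^T W
print(np.isclose(spectral, np.sqrt(lam_max)))
```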
The examiner interprets the invention as "using a neural network to learn to extend and provide a high accuracy to reflect the original signal, in which the learning is performed by computing and inferring the loss function".
Within the context of the core of the invention, the combination of prior art teaches the invention, making the case for "motivation to combine".
It would have been obvious to one of ordinary skill in the art before the effective filing date of the present application to combine Roth, Peng and Mok.
Roth teaches a super-net including a plurality of sub-networks and first loss.
Peng teaches second loss accumulating a loss over a sample of sub-networks, and adjusting network weights to reduce the accumulated loss and measures of smoothness.
Mok teaches relating the measure of smoothness to the maximum difference between outputs from the layer relative to any possible difference between inputs to the layer.
One of ordinary skill would have had motivation to combine Roth, Peng and Mok because a proper choice of architecture can improve the performance of the neural network on a target task (Mok [1, Page 12323]).
In regard to Claim 2: (Previously Presented)
Roth discloses:
- training network weights of the selected sub-network using the training data;
in [0052]:
FIG. 39 is a system diagram for an example system for training, adapting, instantiating and deploying machine learning models in an advanced computing pipeline, in accordance with at least one embodiment;
In [0005]:
FIG. 3 illustrates a diagram of an overall framework on how a neural network is selected for an input at a federated learning (FL) client site, according to at least one embodiment;
In [0061]:
a selected neural network is an optimal neural network, which may also be referred to as a sub-network, with a best path selected from a plurality of neural networks.
In [0085]:
after constructing a sub-network, path weights from each FL client site are sent to a client server for aggregation to train the supernet.
- storing the trained network weights of the selected sub-network
In [0064]:
in at least one embodiment, results 114, 120 include updated model weights (or their gradients) from trained portion of supernet for client A 112 and trained portion of supernet for client B 118, and updated model weights are sent to client server 102 for aggregation. In at least one embodiment, after aggregation, new weights are redistributed to client A 106 and client B 108 and a next round of local training is executed.
(BRI: Within the context of learning, updated model weights may be sent to a central server, and then those stored (or aggregated) weights are used for redistribution.)
In regard to Claim 3: (Currently Amended)
Roth and Peng do not explicitly disclose:
- and the second loss is a product of a smoothness scale factor and the sum, over layers of the sample sub-network, of measure of smoothness of the layers
However, Mok discloses:
- and the second loss is a product of a smoothness scale factor and the sum, over layers of the sample sub-network, of measures of smoothness of the layers
In [2.1, Page 12323]:
Our work is closely related to the defense approaches that utilize a regularization term
derived from the curvature information of the neural network’s loss landscape
In [6.1. Page 12327]: Effect of Regularization Strength
The regularization strength γ is empirically set to be 0.01 to match the scale of L_val and L_λ.
(BRI: The regularization strength is a smoothness scale factor as a result of its matching the scale)
In [4.2, Page 12326]:
‖H_std‖_F can be expressed in terms of the l2 norm:
[Equation image: media_image6.png]
where the expectation is taken over z ∼ N(0, I_d). Because the direct computation of H_std is expensive, we linearly approximate it through the finite difference approximation of the Hessian:
[Equation image: media_image7.png]
where h controls the scale of the loss landscape on which we induce smoothness. However, computing multiple H_std·z in directions drawn from z ∼ N(0, I_d) and taking its average would be computationally inefficient because each computation of H_std·z requires calculation of the gradient. Therefore, we minimize the input loss landscape along the high curvature direction,
[Equation image: media_image8.png]
to maximize the effect of L_λ. With the approximated L_λ, the bi-level optimization problem of AdvRush can be expressed as:
[Equation image: media_image9.png]
x_val is the clean input data from D_val. The value of h in the denominator of Eq. (9) is absorbed by the regularization strength γ.
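As a minimal sketch of the scale-matching role of γ described above (hypothetical values and names, not the references' implementation), the smoothness term is simply weighted into the overall objective:

```python
# Illustrative combined objective: a task/validation loss plus a smoothness
# regularizer L_lambda scaled by a regularization strength gamma, matching the
# "second loss" framing in the rejection. Values are hypothetical.
def combined_loss(l_val, l_lambda, gamma=0.01):
    """Total objective: task loss plus gamma-scaled smoothness loss."""
    return l_val + gamma * l_lambda

print(combined_loss(0.7, 5.0))  # ≈ 0.75: the gamma=0.01 keeps both terms on a comparable scale
```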
The examiner interprets the invention as "using a neural network to learn to extend and provide a high accuracy to reflect the original signal, in which the learning is performed by computing and inferring the loss function".
Within the context of the core of the invention, the combination of prior art teaches the invention, making the case for "motivation to combine".
It would have been obvious to one of ordinary skill in the art before the effective filing date of the present application to combine Roth, Peng and Mok.
Roth teaches a super-net including a plurality of sub-networks and first loss.
Peng teaches second loss accumulating a loss over a sample of sub-networks, and adjusting network weights to reduce the accumulated loss and measures of smoothness.
Mok teaches relating the measure of smoothness to the maximum difference between outputs from the layer relative to any possible difference between inputs to the layer.
One of ordinary skill would have had motivation to combine Roth, Peng and Mok because a proper choice of architecture can improve the performance of the neural network on a target task (Mok [1, Page 12323]).
In regard to Claim 4: (Original)
Roth and Peng do not explicitly disclose:
- the second loss for sub-network k is:
[Equation image: media_image10.png]
where λ is a smoothness scale factor, W_{k,j} is a matrix of network weights in layer j of sub-network k, and σ_max(W_{k,j}) is the maximum singular value of the matrix W_{k,j}, or an estimate thereof.
However, Mok discloses:
[Equation image: media_image10.png]
where λ is a smoothness scale factor, W_{k,j} is a matrix of network weights in layer j of sub-network k, and σ_max(W_{k,j}) is the maximum singular value of the matrix W_{k,j}, or an estimate thereof.
In [4.2, Page 12326]: Approximation of L_λ
‖H_std‖_F can be expressed in terms of the l2 norm:
[Equation image: media_image6.png]
where the expectation is taken over z ∼ N(0, I_d). Because the direct computation of H_std is expensive, we linearly approximate it through the finite difference approximation of the Hessian:
[Equation image: media_image7.png]
where h controls the scale of the loss landscape on which we induce smoothness. However, computing multiple H_std·z in directions drawn from z ∼ N(0, I_d) and taking its average would be computationally inefficient because each computation of H_std·z requires calculation of the gradient. Therefore, we minimize the input loss landscape along the high curvature direction,
[Equation image: media_image11.png]
to maximize the effect of L_λ.
(BRI: The standard loss's l2 operator norm is the same as the spectral norm, which is defined as the maximum singular value of a matrix, representing the maximum expansion applied to a vector, and is calculated as the square root of the largest eigenvalue.)
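A hedged sketch of the claimed second-loss computation, assuming a layerwise sum of largest singular values scaled by λ (illustrative code, not the applicant's or Mok's implementation; σ_max is estimated by power iteration, one common "estimate thereof"):

```python
import numpy as np

# Illustrative second loss for one sub-network k:
#   L2,k = λ * sum_j σ_max(W_{k,j})
# with σ_max estimated by power iteration on W^T W. Names/values are hypothetical.
def sigma_max_power_iter(W, iters=50):
    """Estimate the largest singular value of W by power iteration on W^T W."""
    v = np.ones(W.shape[1]) / np.sqrt(W.shape[1])
    for _ in range(iters):
        v = W.T @ (W @ v)
        v /= np.linalg.norm(v)
    return np.linalg.norm(W @ v)

def second_loss(layers, smooth_scale=0.1):
    """λ times the sum over layers of each layer's estimated σ_max."""
    return smooth_scale * sum(sigma_max_power_iter(W) for W in layers)

# Diagonal test matrices with known singular values (3 and 2), so the loss
# should come out near 0.1 * (3 + 2) = 0.5.
W1 = np.diag([3.0, 1.0, 0.5])
W2 = np.diag([2.0, 0.2])
print(np.isclose(second_loss([W1, W2]), 0.5))
```

Penalizing this quantity discourages layers with large spectral norm, i.e., low smoothness in the sense used throughout this action.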
In regard to Claim 7: (Previously Presented)
Roth and Peng do not explicitly disclose:
- where said adjust the architecture parameters and the network weights includes:
determining gradients of the accumulated first and second loss functions;
However, Mok discloses:
- where said adjust the architecture parameters and the network weights includes:
determining gradients of the accumulated first and second loss functions;
In [4.2, Page 12326]: Approximation of L_λ
‖H_std‖_F can be expressed in terms of the l2 norm:
[Equation image: media_image6.png]
where the expectation is taken over z ∼ N(0, I_d). Because the direct computation of H_std is expensive, we linearly approximate it through the finite difference approximation of the Hessian:
[Equation image: media_image7.png]
where h controls the scale of the loss landscape on which we induce smoothness. However, computing multiple H_std·z in directions drawn from z ∼ N(0, I_d) and taking its average would be computationally inefficient because each computation of H_std·z requires calculation of the gradient. Therefore, we minimize the input loss landscape along the high curvature direction,
in [4.2, Page 12326]:
each computation of H_std·z requires calculation of the gradient. Therefore, we minimize the input loss landscape along the high curvature direction,
[Equation image: media_image12.png]
Examiner’s BRI
(Minimizing the input loss (specifically, the loss in the vicinity of training data, often termed "input loss curvature") along high-curvature directions generally provides gradients that act as a regularization term, facilitating both the minimization of the standard training loss and the enforcement of smoothness.)
- and update the architecture parameters and the network weights of the super-net based on the gradients.
In [4.1, Page 12325]:
To make the search space continuous for gradient-based optimization, categorical choice of a particular operation is continuously relaxed by applying a softmax function over all the possible operations:
[Equation image: media_image13.png]
where α^(i,j) is a set of operation mixing weights (i.e., architecture parameters). O is the pre-defined set of operations that are used to construct the supernet. By definition, the size of α^(i,j) must be equal to |O|.
Through continuous relaxation, both the architecture parameters α and the weight parameters ω in the supernet can be updated via gradient descent.
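The continuous relaxation quoted above can be sketched as follows (toy operations and illustrative α values, not Mok's code): the edge output is a softmax-weighted mixture of candidate operations, so the architecture parameters can receive gradients.

```python
import numpy as np

# Toy candidate operation set O for one edge; each op maps an input to an output
# of the same shape. These stand in for real NAS operations (conv, skip, etc.).
ops = [lambda x: x,                   # identity / skip connection
       lambda x: np.maximum(x, 0.0),  # ReLU-like operation
       lambda x: 0.0 * x]             # "zero" operation
alpha = np.array([1.0, 2.0, -1.0])    # architecture parameters; size of alpha == |O|

def mixed_op(x, alpha, ops):
    """Softmax over alpha weights each candidate operation's output."""
    w = np.exp(alpha - alpha.max())   # numerically stable softmax
    w /= w.sum()
    return sum(wi * op(x) for wi, op in zip(w, ops))

x = np.array([-1.0, 2.0])
y = mixed_op(x, alpha, ops)
print(y.shape == x.shape)  # the mixture preserves the input shape
```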
It would have been obvious to one of ordinary skill in the art before the effective filing date of the present application to combine Roth, Peng and Mok.
Roth teaches a super-net including a plurality of sub-networks and first loss.
Peng teaches second loss accumulating a loss over a sample of sub-networks, and adjusting network weights to reduce the accumulated loss and measures of smoothness.
Mok teaches relating the measure of smoothness to the maximum difference between outputs from the layer relative to any possible difference between inputs to the layer.
One of ordinary skill would have had motivation to combine Roth, Peng and Mok because a proper choice of architecture can improve the performance of the neural network on a target task (Mok [1, Page 12323]).
In regard to Claim 9: (Currently Amended)
Roth discloses:
- A system for automated selection of neural architecture comprising: a supervised learning controller, a super-net operatively coupled to the supervised learning controller and including a plurality of sub-networks
In [0078] :
FIG. 3 illustrates a diagram 300 of an overall framework on how a neural network (e.g., sub-network) is selected for an input (e.g., 3D images) 302
In [0078]:
in at least one embodiment, selecting a sub-network in this manner leverages concepts from a Neural Architecture Search (NAS), which is used to design neural network automatically with limited human heuristics to meet different user requirements
in [0063]:
In at least one embodiment, each client 106, 108 comprises one or more computing devices that execute instructions submitted by a user and/or according to instructions submitted by an automated process.
In [0063]:
in at least one embodiment, each client 106, 108 trains supernet 104 locally using data points 110, 116.
Examiner’s BRI
(Fig 3 is an automated selection system)
In [0110] :
In at least one embodiment, untrained neural network 1006 is trained using supervised learning, wherein training dataset 1002 includes an input paired with a desired output for an input,
In [0060]:
In at least one embodiment, a processor with one or more circuits generates a supernet, which may also be referred to as a supernetwork or a neural network comprising a plurality of neural networks, to enable a mixture of candidate modules in parallel to represent multi-scale appearance features at different network levels, respectively.
In [0109]:
in at least one embodiment, training framework 1004 trains an untrained neural network 1006 and enables it to be trained using processing resources described herein to generate a trained neural network 1008. In at least one embodiment, weights may be chosen randomly or by pre-training using a deep belief network. In at least one embodiment, training may be performed in either a supervised, partially supervised, or unsupervised manner.
In [0081] :
In at least one embodiment, after one or more processors at client server trains a supernet, a unique sub-network for each input, at each FL client's computing system
in [0233]:
Embodiments may be used in other devices such as handheld devices and embedded applications.
In [0233]:
In at least one embodiment, embedded applications may include a microcontroller, a digital signal processor (“DSP”), system on a chip
In [0078]:
one or more processors selecting a sub-network for an input 302, at each FL client site, is performed by passing input (accessible only to each individual FL client site) through supernet 304 and choosing an optimal path to construct sub-network accordingly.
(BRI: an input passing through the supernet is the coupling within a client site)
In [0061] :
In one embodiment, once a supernet has been trained sufficiently, one or more processors at each FL client site selects a neural network. In one embodiment, a selected neural network is an optimal neural network, which may also be referred to as a sub-network, with a best path selected from a plurality of neural networks. In one embodiment, for each unseen data point, at each FL client site, one or more processors determine, with guidance of additional unsupervised loss functions at inference, which neural network to select as a sub-network
- a data loader configured to provide the training inputs to the super-net hardware to generate corresponding super-net outputs
In [0098]:
In at least one embodiment, inference and/or training logic 915 may include, without limitation, code and/or data storage 901 to store forward and/or output weight and/or input/output data, and/or other parameters to configure neurons or layers of a neural network trained and/or used for inferencing in aspects of one or more embodiments. In at least one embodiment, training logic 915 may include, or be coupled to code and/or data storage 901 to store graph code or other software to control timing and/or order, in which weight and/or other parameter information is to be loaded to configure, logic, including integer and/or floating point units (collectively, arithmetic logic units (ALUs). In at least one embodiment, code, such as graph code, loads weight or other parameter information into processor ALUs based on an architecture of a neural network to which such code corresponds.
In [0005]:
FIG. 3 illustrates a diagram of an overall framework on how a neural network is selected for an input at a federated learning (FL) client site, according to at least one embodiment;
In [0061]:
one or more processors determine, with guidance of additional unsupervised loss functions at inference, which neural network to select as a sub-network.
In [0098]:
training logic 915 may include, or be coupled to code and/or data storage 901 to store graph code or other software to control timing and/or order, in which weight and/or other parameter information is to be loaded to configure, logic, including integer and/or floating point units (collectively, arithmetic logic units (ALUs).
In [0227]:
In at least one embodiment, server(s) 1278 may be used to train machine learning models (e.g., neural networks) based at least in part on training data. In at least one embodiment, training data may be generated by vehicles, and/or may be generated in a simulation (e.g., using a game engine).
- wherein the supervised learning controller is configured to train weights and architecture parameters of the super-net, where the architecture parameters represent the importance of different architecture choices at various locations inside the super-net, the training including:
In [0098]:
In at least one embodiment, inference and/or training logic 915 may include, without limitation, code and/or data storage 901 to store forward and/or output weight and/or input/output data, and/or other parameters to configure neurons or layers of a neural network trained and/or used for inferencing in aspects of one or more embodiments. In at least one embodiment, training logic 915 may include, or be coupled to code and/or data storage 901 to store graph code or other software to control timing and/or order, in which weight and/or other parameter information is to be loaded to configure, logic, including integer and/or floating point units (collectively, arithmetic logic units (ALUs). In at least one embodiment, code, such as graph code, loads weight or other parameter information into processor ALUs based on an architecture of a neural network to which such code corresponds.
In [0005]:
FIG. 3 illustrates a diagram of an overall framework on how a neural network is selected for an input at a federated learning (FL) client site, according to at least one embodiment;
In [0061] :
In one embodiment, once a supernet has been trained sufficiently, one or more processors at each FL client site selects a neural network. In one embodiment, a selected neural network is an optimal neural network, which may also be referred to as a sub-network, with a best path selected from a plurality of neural networks.
in [0110]:
In at least one embodiment, untrained neural network 1006 is trained using supervised learning, wherein training dataset 1002 includes an input paired with a desired output for an input, or where training dataset 1002 includes input having a known output and an output of neural network 1006 is manually graded.
In [0064]:
in at least one embodiment, results 114, 120 include updated model weights (or their gradients) from trained portion of supernet for client A 112 and trained portion of supernet for client B 118, and updated model weights are sent to client server 102 for aggregation.
In [0079]:
In at least one embodiment, input and output of convolutional operations 306 share a same spatial shape.
Examiner’s BRI
(spatial shape sharing (such as in convolutional neural networks) is a direct form of weight sharing in which a set of weights is applied across different spatial locations of an input, enabling translational invariance and reducing the total number of parameters.)
In [0094]:
In at least one embodiment, with respect to implementation, supernet is trained using randomly cropped patches of size 256×256×32 from input images and labels.
In [0064]:
a trained portion of supernet 104 is a selected neural network, which may also be referred to as an optimal neural network, and/or a sub-network for each FL client site
In [0064]:
after one or more processors conduct several training rounds in a FL setting, trained portions from each client A 106 and client B 108 are converged. In at least one embodiment, each client 106, 108 is allowed to select a locally best model (e.g., sub-network) by monitoring a certain performance metric on a local hold out validation set.
- for a plurality of training inputs and corresponding training outputs in the training data for the chosen task
In [0227]:
In at least one embodiment, server(s) 1278 may be used to train machine learning models (e.g., neural networks) based at least in part on training data. In at least one embodiment, training data may be generated by vehicles, and/or may be generated in a simulation (e.g., using a game engine).
In [0013]:
FIG. 10 illustrates training and deployment of a neural network, according to at least one embodiment;
In [0059]:
building robust deep learning (DL) based models requires large amounts of training data.
In [0059]:
a processor executes a supernet training strategy that is performed in a FL setting. In at least one embodiment, FL is configured to communicate model gradients after a local round of training, at each of a FL client's site
in [0062]:
In at least one embodiment, supernet includes a plurality of neural network models, where each of these neural network models are adapted according to inputs or domains for 3D medical image segmentation tasks.
- configuring, by the supervised learning controller, the super-net to implement a sample of sub-networks;
In [0098]:
In at least one embodiment, inference and/or training logic 915 may include, without limitation, code and/or data storage 901 to store forward and/or output weight and/or input/output data, and/or other parameters to configure neurons or layers of a neural network trained and/or used for inferencing in aspects of one or more embodiments. In at least one embodiment, training logic 915 may include, or be coupled to code and/or data storage 901 to store graph code or other software to control timing and/or order, in which weight and/or other parameter information is to be loaded to configure, logic, including integer and/or floating point units (collectively, arithmetic logic units (ALUs). In at least one embodiment, code, such as graph code, loads weight or other parameter information into processor ALUs based on an architecture of a neural network to which such code corresponds.
In [0078]:
FIG. 3 illustrates a diagram 300 of an overall framework on how a neural network (e.g., sub-network) is selected for an input (e.g., 3D images) 302.
In [0094]:
In at least one embodiment, with respect to implementation, supernet is trained using randomly cropped patches of size 256×256×32 from input images and labels.
In [0068]:
In at least one embodiment, during training, at each client 106, 108, one or more processors chose an arbitrary path m from module candidates M following a uniform sampling scheme (as shown and described in more detail in FIG. 3) to define a sub-network s sampled from supernet.
[Image: media_image1.png]
In [0061] :
In one embodiment, once a supernet has been trained sufficiently, one or more processors at each FL client site selects a neural network. In one embodiment, a selected neural network is an optimal neural network, which may also be referred to as a sub-network, with a best path selected from a plurality of neural networks.
in [0110]:
In at least one embodiment, untrained neural network 1006 is trained using supervised learning, wherein training dataset 1002 includes an input paired with a desired output for an input, or where training dataset 1002 includes input having a known output and an output of neural network 1006 is manually graded.
Examiner’s BRI
(supervised learning to train a supernet (a network comprising a plurality of neural network paths or sub-networks) is a foundational technique in NAS that provides a mechanism to implement, evaluate, and select from a sample of sub-networks)
- generating, by the super-net, sub-network outputs responsive to the training inputs for the sample of sub-networks;
In [0070]:
predicted probability from a final sigmoid activated output layer of supernet f (X) and g.sub.i is a ground truth label map at a given voxel i. In at least one embodiment, once supernet 104 is trained, a sub-network s.sub.0 is found
Examiner’s BRI
(the predicted probability from a final sigmoid activated output layer of a trained supernet, when evaluating a specific sub-network, is intended to represent that sub-network's output response (i.e., its prediction for the given input))
- and where the supervised learning controller is further configured to: select a sub-network of the plurality of sub-networks based on the largest adjusted architecture parameters.
In [0098]:
In at least one embodiment, inference and/or training logic 915 may include, without limitation, code and/or data storage 901 to store forward and/or output weight and/or input/output data, and/or other parameters to configure neurons or layers of a neural network trained and/or used for inferencing in aspects of one or more embodiments. In at least one embodiment, training logic 915 may include, or be coupled to code and/or data storage 901 to store graph code or other software to control timing and/or order, in which weight and/or other parameter information is to be loaded to configure, logic, including integer and/or floating point units (collectively, arithmetic logic units (ALUs). In at least one embodiment, code, such as graph code, loads weight or other parameter information into processor ALUs based on an architecture of a neural network to which such code corresponds.
In [0094]:
In at least one embodiment, with respect to implementation, supernet is trained using randomly cropped patches of size 256×256×32 from input images and labels. In at least one embodiment, a mini-batch size of 4 is used by selecting two random crops from any two random input image and label pairs
Examiner’s BRI
(a SuperNet trained using randomly cropped patches from input images and labels represents a supervised learning approach)
In [0061]:
In at least one embodiment, once a supernet has been trained sufficiently, one or more processors at each FL client site selects a neural network. In at least one embodiment, a selected neural network is an optimal neural network, which may also be referred to as a sub-network, with a best path selected from a plurality of neural networks.
In [0087]:
In at least one embodiment, during training, one path is sampled from each searched layer from super blocks of supernet uniformly at each iteration, and parameters of new sub-networks are updated during gradient back-propagation.
In [0121]:
For example, in at least one embodiment, a machine learning model may be trained by calculating weight parameters according to a neural network architecture
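The selection step mapped above, choosing the sub-network from the largest adjusted architecture parameters, can be sketched as follows (illustrative names and values, not Roth's implementation): at each searched location, the operation with the largest α is kept.

```python
import numpy as np

# Illustrative discretization: after training, read off the sub-network by
# taking the argmax of the architecture parameters at each searched location.
op_names = ["conv3x3", "conv5x5", "skip"]
# One row of architecture parameters per searched layer/edge of the super-net.
alphas = np.array([[0.1, 1.3, -0.2],
                   [2.0, 0.4,  0.1],
                   [-0.5, 0.2, 0.9]])
selected = [op_names[int(i)] for i in alphas.argmax(axis=1)]
print(selected)  # ['conv5x5', 'conv3x3', 'skip']
```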
- and output a description of the network architecture of the selected sub-network for the chosen task.
In [0067]:
In at least one embodiment, training different portions of supernet 104, at each FL client site, is performed by a processor at each FL client site passing a data point through supernet 104 that results in selecting a sub-network from supernet 104. In at least one embodiment, selected sub-network is a trained portion of supernet 104. In at least one embodiment, supernet S comprises various DL module candidates M suitable for 3D medical imaging tasks shown in Table 1 below
[Table 1 image: media_image2.png, greyscale]
Roth does not explicitly disclose:
- accumulate a first loss based on accumulated differences between the sub-network outputs and corresponding training outputs
- accumulate a second loss over the sample of sub-networks, the accumulated loss based, at least in part, on a sum, over layers of sub-networks in the sample of sub-networks, of measures of smoothness based on network weights in the layers, where a measure of smoothness relates to the maximum change in output from a layer relative to any possible change in input to the layer, a higher maximum change in output indicating a lower smoothness
- and adjust the architecture parameters and network weights of the super-net to reduce [[the]] a combination of the accumulated first and second losses, where the network weights of the super-net are shared with one or more sub- networks of the plurality of sub-networks and are trained jointly[[;]], where the second loss penalizes sub-networks with lower smoothness and improves stability of the adjustment,
However, Peng discloses:
- accumulate a first loss based on accumulated differences between the sub-network outputs and corresponding training outputs
In [1, Page 12355]:
In this paper, we attribute the ranking correlation problem to the supernet training consistency shift, including feature shift and parameter shift. Feature shift is identified as dynamic input distributions of a hidden layer. Specifically, a given layer’s input feature maps always have an uncertain distribution due to random path sampling (see Figure 1a, left). This distribution uncertainty can hurt the architecture ranking correlation. Precisely, we can use the loss to measure the architecture accuracy, and we can link the accuracy ascent to gradient descent. Based on the back-propagation rule, a stable input distribution can guarantee a good ranking correlation. In contrast, the input distribution dynamic affects the loss descent and finally affects architecture ranking. Parameter shift is identified as contradictory parameter updates for a given layer. In supernet training, a given layer will always be present in different paths from iteration to iteration (see Figure 1b, left). The parameter in this layer may have a contradictory update from iteration to iteration. These unstable updates lead to varying parameters’ distributions, hurting the architecture ranking correlation in two ways.
Examiner’s BRI
( parameter shifts (updates) via gradient descent are specifically designed to reduce the training loss, while feature shifts generally refer to changes in data distribution that can cause the training loss to become unreliable (increase or diverge). The shift (loss) is the first loss)
- accumulating a second loss over the sample of sub-networks, the second loss based, at least in part, on a sum, over layers of a sub-network, of the measures of smoothness of the layers;
In [1, Page 12355]:
In this paper, we attribute the ranking correlation problem to the supernet training consistency shift, including feature shift and parameter shift.
Examiner’s BRI
(parameter shifts (updates) via gradient descent are specifically designed to reduce the training loss, while feature shifts generally refer to changes in data distribution that can cause the training loss to become unreliable (increase or diverge). The shift (loss) is the first loss)
Motivated by consistency regularization methods [29, 44], we propose a nontrivial supernet-Π model, called Π-NAS, to reduce these two shifts simultaneously. Specifically, to cope with the feature shift, we propose a novel supernet-Π model. We evaluate each data point through two randomly sampled paths, then apply a consistency cost between the two predictions to penalize the feature consistency shift between different paths.
As shown in Figure 1a (right), our method can significantly reduce the feature shift and thus can improve the architecture ranking correlation. To address the parameter shift, we propose a novel non-trivial mean teacher model by maintaining an exponential moving average of weights in supernet teacher.
Examiner’s BRI
(evaluating each data point through two randomly sampled paths (or perturbations) and applying a consistency cost (or loss) between the two predictions is a standard technique used to penalize loss of smoothness, often called consistency regularization or constancy training which is the second loss)
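For illustration only, the consistency cost described in this BRI can be sketched as follows. This is a minimal numpy sketch under stated assumptions; the candidate operations, layer count, and function names (CANDIDATES, sample_path, consistency_cost) are hypothetical and are not drawn from Peng.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy supernet: each searched layer offers several candidate operations
# (here, fixed random linear maps); a "path" picks one per layer.
CANDIDATES = [
    [rng.standard_normal((8, 8)) * 0.1 for _ in range(3)],  # searched layer 0
    [rng.standard_normal((8, 8)) * 0.1 for _ in range(3)],  # searched layer 1
]

def sample_path():
    """Uniformly sample one candidate operation per searched layer."""
    return [int(rng.integers(len(layer_ops))) for layer_ops in CANDIDATES]

def forward(x, path):
    """Run the input through the sub-network selected by the path."""
    for layer_ops, choice in zip(CANDIDATES, path):
        x = np.tanh(layer_ops[choice] @ x)
    return x

def consistency_cost(x):
    """Evaluate one data point through two randomly sampled paths and
    penalize the squared difference between the two predictions."""
    p1 = forward(x, sample_path())
    p2 = forward(x, sample_path())
    return float(np.mean((p1 - p2) ** 2))

x = rng.standard_normal(8)
print(consistency_cost(x))  # non-negative scalar; zero only if the paths agree
```

The cost is zero when both samples select the same path, so minimizing it drives predictions of different paths toward one another.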
In [3.2, Page 12358]:
we propose to maintain an exponential moving average weights for teacher model rather than barely replicate from student model in supernet-Π model training. Formally, we denote W_t as parameters of student mapping function f at training step t. Then, weights of mean teacher model f′ can be defined as:
[Equation image: media_image3.png]
where λ ∈ [0, 1] is a smoothing coefficient hyper-parameter. Under the usual mean-teacher convention, a λ close to 1 provides greater smoothing, while a λ close to 0 causes the teacher weights to track the student weights; at λ = 0 the two sets of weights are the same.
Examiner’s BRI
( architecture ranking in Neural Architecture Search (NAS) can provide measures of search-space smoothness, particularly in the context of ensuring that small, incremental changes to an architecture produce correspondingly small changes in performance)
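The exponential-moving-average teacher update quoted above can be sketched as follows. This is an illustrative sketch assuming the common mean-teacher convention W′_t = λ·W′_{t−1} + (1 − λ)·W_t; the cited equation image is not reproduced here, so the exact convention in Peng may differ.

```python
import numpy as np

def ema_update(teacher_w, student_w, lam):
    """Mean-teacher update (assumed convention):
    W'_t = lam * W'_{t-1} + (1 - lam) * W_t."""
    return lam * teacher_w + (1.0 - lam) * student_w

teacher = np.zeros(4)  # teacher weights at step t-1 (illustrative values)
student = np.ones(4)   # student weights at step t

# lam = 0: the teacher immediately copies the student (no smoothing).
assert np.allclose(ema_update(teacher, student, 0.0), student)

# lam close to 1: the teacher moves only slightly toward the student,
# heavily smoothing the weight trajectory across iterations.
print(ema_update(teacher, student, 0.99))  # each entry moves from 0 toward 1 by only 0.01
```

Under this convention, the EMA teacher damps the "contradictory parameter updates" Peng identifies, since each single student update shifts the teacher only fractionally.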
- and adjust the architecture parameters and network weights of the super-net to reduce [[the]] a combination of the accumulated first and second losses, where the network weights of the super-net are shared with one or more sub- networks of the plurality of sub-networks and are trained jointly[[;]], where the second loss penalizes sub-networks with lower smoothness and improves stability of the adjustment,
In [3.2, Page 12358]:
we propose to maintain an exponential moving average weights for teacher model rather than barely replicate from student model in supernet-Π model training. Formally, we denote W_t as parameters of student mapping function f at training step t. Then, weights of mean teacher model f′ can be defined as:
[Equation image: media_image3.png]
where λ ∈ [0, 1] is a smoothing coefficient hyper-parameter. Under the usual mean-teacher convention, a λ close to 1 provides greater smoothing, while a λ close to 0 causes the teacher weights to track the student weights; at λ = 0 the two sets of weights are the same.
In [1, Page 12355]:
Parameter shift is identified as contradictory parameter updates for a given layer. In supernet training, a given layer will always be present in different paths from iteration to iteration (see Figure 1b, left).
In [1, Page 12354]:
[Figure 1 image: media_image4.png, greyscale]
(b) Parameter shift. Different colors represent the distribution of parameters in different iterations. Left: without our nontrivial mean teacher, the parameter has significantly varying distributions in training. Right: with our nontrivial mean teacher, the parameter shift is significantly reduced.
Figure 1: Illustration of supernet training consistency shift
In [1, Page 12355]:
The parameter in this layer may have a contradictory update from iteration to iteration. These unstable updates lead to varying parameters’ distributions, hurting the architecture ranking correlation in two ways. On the one hand, stable parameters can ensure a correct loss descent and guarantee an accurate architecture ranking, while frequent parameter change could not preserve architecture ranking. On the other hand, varying parameters can also result in a feature shift, further hurting architecture ranking correlation. In summary, both feature shift and parameter shift can hurt the architecture ranking correlation.
In [1, Page 12355]:
Motivated by consistency regularization methods [29, 44], we propose a nontrivial supernet-Π model, called Π-NAS, to reduce these two shifts simultaneously. Specifically, to cope with the feature shift, we propose a novel supernet-Π model. We evaluate each data point through two randomly sampled paths, then apply a consistency cost between the two predictions to penalize the feature consistency shift between different paths. As shown in Figure 1a (right), our method can significantly reduce the feature shift and thus can improve the architecture ranking correlation.
In [2, Pages 12355-12356]:
To alleviate the computational overhead caused by the training process, researchers start to share the weights among candidate architectures.
In [2, Page 12356]:
Gradient-based weight sharing methods [36, 9, 54, 63] jointly optimize the shared network parameters and the architecture choosing factors by gradient descent
In [2, Page 12356]:
the supernet is first optimized with path sampling, and then sub-models are sampled and evaluated with the weights inherited from the supernet
Examiner’s BRI
(It is perhaps known to the POSITA that gradient-based weight-sharing methods, particularly in Neural Architecture Search (NAS), are designed to jointly optimize a shared network (often called a supernet) representing multiple potential sub-networks trained simultaneously over one set of shared weights, allowing different sub-networks to converge together)
The examiner interprets the invention as “using a neural network to learn to extend the quantization and provide high accuracy to reflect the original signal, in which the learning is performed by computing and inferring the loss function”.
Within the context of the core of the invention, the prior art combination teaches the invention, making the case for “motivation to combine”.
It would have been obvious to one of ordinary skill in the art before the effective filing date of the present application to combine Roth and Peng.
Roth teaches a super-net including a plurality of sub-networks and first loss.
Peng teaches second loss accumulating a loss over a sample of sub-networks, and adjusting network weights to reduce the accumulated loss and measures of smoothness.
Within the context of the theme and teaching of Roth and Peng, it may be obvious to the POSITA to combine Roth and Peng.
One of ordinary skill would have been motivated to combine Roth and Peng because combining a first loss related to training data (e.g., standard cross-entropy or MSE) with a second loss related to a sample of sub-networks (e.g., ranking loss) generally provides improved architecture ranking correlation and better performance in Neural Architecture Search (NAS) (Peng [1, Page 12355]).
Roth and Peng do not explicitly disclose:
- the measure of smoothness related to the maximum difference between outputs from the layer relative to any possible difference between inputs to the layer, a higher maximum change in output indicating a lower smoothness;
However, Mok discloses:
- the measure of smoothness related to the maximum difference between outputs from the layer relative to any possible difference between inputs to the layer, a higher maximum change in output indicating a lower smoothness;
(BRI: Based on the description, the concept refers to using Lipschitz continuity as a proxy for measuring the smoothness of neural network layers. Specifically, this involves analyzing the spectral norm, or maximum singular value, of the weight matrices within each layer, which determines the maximum possible ratio between the difference in outputs and the difference in inputs (i.e., the Lipschitz constant). Perhaps it is known to the POSITA that a spectral norm is the largest singular value of the matrix)
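The spectral-norm reading of the claimed smoothness measure in this BRI can be sketched as follows. This is an illustrative numpy sketch, not the applicant's or Mok's implementation; the function names and the 0.01 scale factor are hypothetical (the scale factor echoes the regularization strength discussed for claim 11).

```python
import numpy as np

def spectral_norm(W):
    """Largest singular value of W: an upper bound on how much the linear
    map x -> W @ x can stretch any input difference (its Lipschitz
    constant). Higher norm means lower smoothness."""
    return float(np.linalg.svd(W, compute_uv=False)[0])

def smoothness_loss(layer_weights, scale=0.01):
    """Illustrative second loss: a scale factor times the sum, over layers,
    of each layer's spectral norm."""
    return scale * sum(spectral_norm(W) for W in layer_weights)

# A diagonal matrix makes the bound easy to verify: sigma_max = 3.
W = np.diag([3.0, 1.0])
assert abs(spectral_norm(W) - 3.0) < 1e-9

layers = [W, np.eye(2)]
print(smoothness_loss(layers))  # 0.01 * (3 + 1) = 0.04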
In [3.2, Page 12325]:
Consider two types of loss functions for updating ω of an arbitrary neural network f_w(ω): a standard loss and an adversarial loss. On one hand, standard training uses clean input data x_std to update ω, such that the standard loss L_std = L(f_w(ω), x_std) is minimized. On the other hand, adversarial training uses adversarially perturbed input data x_adv to update ω, such that the adversarial loss L_adv = L(f_w(ω), x_adv) is minimized. From here on, we refer to f_w(ω) after standard training and after adversarial training as f_w(w_std) and as f_w(w_adv), respectively.
In [2.2, Page 12324]:
a gradient-based method, suggests to use the Lipschitz characteristics of the architecture parameters to achieve the target Lipschitz constant.
In [4, Page 12325]:
robust neural architecture can be re-formulated into the problem of searching for a neural architecture with a smooth input loss landscape. Since it is computationally infeasible to calculate the curvature of f_A(w_t) for every w_t, we opt to evaluate candidate architectures after standard training:
[Equation image: media_image5.png]
where H_std refers to the Hessian of L_std of f_A(w_t) at clean input x_std, and λ_max(H_std) refers to the largest eigenvalue of H_std.
In [4.2, Page 12326]: Approximation of L_λ
‖H_std‖_F can be expressed in terms of l2 norm:
[Equation image: media_image6.png]
where the expectation is taken over z ∼ N(0, I_d). Because the direct computation of H_std is expensive, we linearly approximate it through the finite difference approximation of the Hessian:
[Equation image: media_image7.png]
where h controls the scale of the loss landscape on which we induce smoothness. However, computing multiple H_std·z in directions drawn from z ∼ N(0, I_d) and taking its average would be computationally inefficient because each computation of H_std·z requires calculation of the gradient. Therefore, we minimize the input loss landscape along the high curvature direction,
In [4.1, Page 12325]:
AdvRush accomplishes the above objective by driving the eigenvalues of H_std of f_super to be small. Consequently, their maximum, λ_max(H_std), will also be small.
Examiner’s BRI
(The standard loss uses the l2 operator norm, which is the same as the spectral norm, defined as the maximum singular value of a matrix, representing the maximum expansion applied to a vector, and calculated as the square root of the largest eigenvalue of the matrix's Gram matrix. The adversarial robustness is linked to the smoothness of the loss, where a flatter landscape (lower Lipschitz constant) prevents small input perturbations from altering predictions. The maximum eigenvalue λ_max represents the maximum curvature of a function at a specific point; it is fundamental in optimization, determining stability and learning rates. A smaller λ_max indicates a flatter, smoother function where the gradient changes slowly, while a large λ_max indicates a sharp, high-curvature region that is less smooth)
The examiner interprets the invention as “using a neural network to learn to extend and provide high accuracy to reflect the original signal, in which the learning is performed by computing and inferring the loss function”.
Within the context of the core of the invention, the prior art combination teaches the invention, making the case for “motivation to combine”.
It would have been obvious to one of ordinary skill in the art before the effective filing date of the present application to combine Roth, Peng and Mok.
Roth teaches a super-net including a plurality of sub-networks and first loss.
Peng teaches second loss accumulating a loss over a sample of sub-networks, and adjusting network weights to reduce the accumulated loss and measures of smoothness.
Mok teaches relating the measure of smoothness to the maximum difference between outputs from the layer relative to any possible difference between inputs to the layer.
One of ordinary skill would have been motivated to combine Roth, Peng and Mok because a proper choice of architecture can improve the performance of the neural network on a target task (Mok [1, Page 12323]).
In regard to Claim 10: (Currently Amended)
Roth discloses:
- where the supervised learning controller further configured to train network weights of the selected sub-network using the training data;
In [0098]:
In at least one embodiment, inference and/or training logic 915 may include, without limitation, code and/or data storage 901 to store forward and/or output weight and/or input/output data, and/or other parameters to configure neurons or layers of a neural network trained and/or used for inferencing in aspects of one or more embodiments. In at least one embodiment, training logic 915 may include, or be coupled to code and/or data storage 901 to store graph code or other software to control timing and/or order, in which weight and/or other parameter information is to be loaded to configure, logic, including integer and/or floating point units (collectively, arithmetic logic units (ALUs). In at least one embodiment, code, such as graph code, loads weight or other parameter information into processor ALUs based on an architecture of a neural network to which such code corresponds.
In [0052]:
FIG. 39 is a system diagram for an example system for training, adapting, instantiating and deploying machine learning models in an advanced computing pipeline, in accordance with at least one embodiment;
In [0005]:
FIG. 3 illustrates a diagram of an overall framework on how a neural network is selected for an input at a federated learning (FL) client site, according to at least one embodiment;
In [0061]:
a selected neural network is an optimal neural network, which may also be referred to as a sub-network, with a best path selected from a plurality of neural networks.
In [0085]:
after constructing sub-network, path weights from each FL client site are sent to a client server for aggregation to train supernet.
In regard to Claim 11: (Previously Presented)
Roth and Peng do not explicitly disclose:
- and the second loss is a product of a smoothness scale factor and the sum, over layers of the sample sub-network, of measure of smoothness of the layers
However, Mok discloses:
- and the second loss is a product of a smoothness scale factor and the sum, over layers of the sample sub-network, of measures of smoothness of the layers
In [2.1, Page 12323]:
Our work is closely related to the defense approaches that utilize a regularization term
derived from the curvature information of the neural network’s loss landscape
In [6.1, Page 12327]: Effect of Regularization Strength
The regularization strength γ is empirically set to be 0.01 to match the scale of L_val and L_λ.
(BRI: The regularization strength is a smoothness scale factor as a result of its matching the scale)
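The claimed combination of losses with a smoothness scale factor can be sketched as follows. This is an illustrative sketch; the function name and the input values are hypothetical, and γ = 0.01 simply echoes the regularization strength reported in the quoted passage.

```python
def total_loss(first_loss, smoothness_terms, gamma=0.01):
    """Combined objective: the first (task) loss plus gamma times the
    summed per-layer smoothness penalties, with gamma chosen to match
    the scales of the two terms."""
    return first_loss + gamma * sum(smoothness_terms)

# Example: task loss 0.5, per-layer smoothness measures 3.0 and 1.0.
print(total_loss(0.5, [3.0, 1.0]))  # 0.5 + 0.01 * (3 + 1), approximately 0.54
```

Because γ multiplies the whole smoothness sum, tuning it trades off fitting the training data against penalizing less smooth (higher Lipschitz) sub-networks.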
In [4.2, Page 12326]:
‖H_std‖_F can be expressed in terms of l2 norm:
[Equation image: media_image6.png]
where the expectation is taken over z ∼ N(0, I_d). Because the direct computation of H_std is expensive, we linearly approximate it through the finite difference approximation of the Hessian:
[Equation image: media_image7.png]
where h controls the scale of the loss landscape on which we induce smoothness. However, computing multiple H_std·z in directions drawn from z ∼ N(0, I_d) and taking its average would be computationally inefficient because each computation of H_std·z requires calculation of the gradient. Therefore, we minimize the input loss landscape along the high curvature direction,
[Equation image: media_image8.png]
to maximize the effect of L_λ. With the approximated L_λ, the bi-level optimization problem of AdvRush can be expressed as:
[Equation image: media_image9.png]
x_val is the clean input data from D_val. The value of h in the denominator of Eq. (9) is absorbed by the regularization strength γ.
The examiner interprets the invention as “using a neural network to learn to extend and provide high accuracy to reflect the original signal, in which the learning is performed by computing and inferring the loss function”.
Within the context of the core of the invention, the prior art combination teaches the invention, making the case for “motivation to combine”.
It would have been obvious to one of ordinary skill in the art before the effective filing date of the present application to combine Roth, Peng and Mok.
Roth teaches a super-net including a plurality of sub-networks and first loss.
Peng teaches second loss accumulating a loss over a sample of sub-networks, and adjusting network weights to reduce the accumulated loss and measures of smoothness.
Mok teaches relating the measure of smoothness to the maximum difference between outputs from the layer relative to any possible difference between inputs to the layer.
One of ordinary skill would have been motivated to combine Roth, Peng and Mok because a proper choice of architecture can improve the performance of the neural network on a target task (Mok [1, Page 12323]).
In regard to Claim 12: (Original)
Roth and Peng do not explicitly disclose:
[Claim equation image: media_image10.png]
where λ is a smoothness scale factor, W_{k,j} is a matrix of network weights in layer j of sub-network k, and σ_max(W_{k,j}) is the maximum singular value of the matrix W_{k,j}, or an estimate thereof.
However, Mok discloses:
[Claim equation image: media_image10.png]
where λ is a smoothness scale factor, W_{k,j} is a matrix of network weights in layer j of sub-network k, and σ_max(W_{k,j}) is the maximum singular value of the matrix W_{k,j}, or an estimate thereof.
In [4.2, Page 12326]: Approximation of L_λ
‖H_std‖_F can be expressed in terms of l2 norm:
[Equation image: media_image6.png]
where the expectation is taken over z ∼ N(0, I_d). Because the direct computation of H_std is expensive, we linearly approximate it through the finite difference approximation of the Hessian:
[Equation image: media_image7.png]
where h controls the scale of the loss landscape on which we induce smoothness. However, computing multiple H_std·z in directions drawn from z ∼ N(0, I_d) and taking its average would be computationally inefficient because each computation of H_std·z requires calculation of the gradient. Therefore, we minimize the input loss landscape along the high curvature direction,
[Equation image: media_image11.png]
to maximize the effect of L_λ.
Examiner’s BRI
(The standard loss uses the l2 operator norm, which is the same as the spectral norm, defined as the maximum singular value of a matrix, representing the maximum expansion applied to a vector, and calculated as the square root of the largest eigenvalue of the matrix's Gram matrix)
In regard to Claim 13: (Previously Presented)
Roth and Peng do not explicitly disclose:
- where said adjust the architecture parameters and the network weights includes:
determining gradients of the accumulated first and second loss functions;
However, Mok discloses:
- where said adjust the architecture parameters and the network weights includes:
determining gradients of the accumulated first and second loss functions;
In [4.2, Page 12326]: Approximation of L_λ
‖H_std‖_F can be expressed in terms of l2 norm:
[Equation image: media_image6.png]
where the expectation is taken over z ∼ N(0, I_d). Because the direct computation of H_std is expensive, we linearly approximate it through the finite difference approximation of the Hessian:
[Equation image: media_image7.png]
where h controls the scale of the loss landscape on which we induce smoothness. However, computing multiple H_std·z in directions drawn from z ∼ N(0, I_d) and taking its average would be computationally inefficient because each computation of H_std·z requires calculation of the gradient. Therefore, we minimize the input loss landscape along the high curvature direction,
[Equation image: media_image12.png]
Examiner’s BRI
( minimizing the input loss (specifically, the loss in the vicinity of training data, often termed "input loss curvature") along high-curvature directions generally provides gradients that act as a regularization term, facilitating both the minimization of the standard training loss and the enforcement of smoothness)
- and update the architecture parameters and the network weights of the super-net based on the gradients.
In [4.1, Page 12325]:
To make the search space continuous for gradient-based optimization, categorical choice of a particular operation is continuously relaxed by applying a softmax function over all the possible operations:
PNG
media_image13.png
63
442
media_image13.png
Greyscale
where α^(i,j) is a set of operation mixing weights (i.e., architecture parameters). O is the pre-defined set of operations that are used to construct the supernet. By definition, the size of α^(i,j) must be equal to |O|.
Through continuous relaxation, both the architecture parameters α and the weight parameters ω in the supernet can be updated via gradient descent.
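The continuous relaxation quoted above can be sketched as follows: the edge's output is the softmax(α)-weighted sum over all candidate operations, which makes α differentiable. This is an illustrative numpy sketch; the operation set OPS and all values are hypothetical stand-ins, not the operations used by Mok.

```python
import numpy as np

def softmax(a):
    e = np.exp(a - np.max(a))  # subtract max for numerical stability
    return e / e.sum()

# Candidate operations on one supernet edge (illustrative stand-ins).
OPS = [
    lambda x: x,                   # identity
    lambda x: np.maximum(x, 0.0),  # ReLU
    lambda x: np.zeros_like(x),    # "zero" operation
]

def mixed_op(x, alpha):
    """Continuous relaxation: output the softmax(alpha)-weighted sum over
    all candidate operations, so the architecture parameters alpha (and
    the operation weights) can both be updated by gradient descent."""
    w = softmax(alpha)
    return sum(wi * op(x) for wi, op in zip(w, OPS))

alpha = np.array([1.0, 0.0, -1.0])  # architecture parameters for this edge
x = np.array([-2.0, 2.0])
print(softmax(alpha))      # |O| = 3 mixing weights forming a distribution
print(mixed_op(x, alpha))  # softmax-weighted combination of the three ops
```

After search, the discrete architecture is typically recovered by keeping the operation with the largest mixing weight on each edge.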
It would have been obvious to one of ordinary skill in the art before the effective filing date of the present application to combine Roth, Peng and Mok.
Roth teaches a super-net including a plurality of sub-networks and first loss.
Peng teaches second loss accumulating a loss over a sample of sub-networks, and adjusting network weights to reduce the accumulated loss and measures of smoothness.
Mok teaches relating the measure of smoothness to the maximum difference between outputs from the layer relative to any possible difference between inputs to the layer.
One of ordinary skill would have been motivated to combine Roth, Peng and Mok because a proper choice of architecture can improve the performance of the neural network on a target task (Mok [1, Page 12323]).
In regard to Claim 15: (Currently Amended)
Roth discloses:
- a data processor for automated selection of neural architecture comprising:
In [0005]:
FIG. 3 illustrates a diagram of an overall framework on how a neural network is selected for an input at a federated learning (FL) client site, according to at least one embodiment;
Examiner’s BRI
(A data processor for the automated selection of neural architecture is a component within a Neural Architecture Search (NAS) framework)
In [0078]:
one or more processors selecting a sub-network for an input 302, at each FL client site, is performed by passing input (accessible only to each individual FL client site) through supernet 304 and choosing an optimal path to construct sub-network accordingly. In at least one embodiment, selecting a sub-network in this manner leverages concepts from a Neural Architecture Search (NAS), which is used to design neural network automatically with limited human heuristics to meet different user requirements (e.g., light-weight model, or small amount of computation)
In [0061] :
In one embodiment, once a supernet has been trained sufficiently, one or more processors at each FL client site selects a neural network. In one embodiment, a selected neural network is an optimal neural network, which may also be referred to as a sub-network, with a best path selected from a plurality of neural networks. In one embodiment, for each unseen data point, at each FL client site, one or more processors determine, with guidance of additional unsupervised loss functions at inference, which neural network to select as a sub-network
Examiner’s BRI
( the additional guidance provided for an unseen data point is a stability consideration for the selection)
- super-net hardware configured to implement a super-net including a plurality of selectable sub-networks, the super-net including network weights and architecture parameters,
In [0098]:
In at least one embodiment, inference and/or training logic 915 may include, without limitation, code and/or data storage 901 to store forward and/or output weight and/or input/output data, and/or other parameters to configure neurons or layers of a neural network trained and/or used for inferencing in aspects of one or more embodiments. In at least one embodiment, training logic 915 may include, or be coupled to code and/or data storage 901 to store graph code or other software to control timing and/or order, in which weight and/or other parameter information is to be loaded to configure, logic, including integer and/or floating point units (collectively, arithmetic logic units (ALUs). In at least one embodiment, code, such as graph code, loads weight or other parameter information into processor ALUs based on an architecture of a neural network to which such code corresponds.
In [0063]:
In at least one embodiment, a processor with one or more circuits associated with client server 102 provide neural network 104
In [0060]:
in at least one embodiment, a processor executes a supernet training strategy that is performed in a FL setting. In at least one embodiment, FL is configured to communicate model gradients after a local round of training, at each of a FL client's site,
In [0064]:
a trained portion of supernet 104 is a selected neural network, which may also be referred to as an optimal neural network, and/or a sub-network for each FL client site (client A 106, client B 108). In at least one embodiment, results 114, 120 include updated model weights (or their gradients) from trained portion of supernet for client A 112 and trained portion of supernet for client B 118, and updated model weights are sent to client server 102 for aggregation.
In [0061]:
one or more processors determine, with guidance of additional unsupervised loss functions at inference, which neural network to select as a sub-network.
In [0063]:
In at least one embodiment, neural network 104 is a supernet, which may also be referred to as a supernetwork, model architecture, and/or a neural network comprising a plurality of neural networks.
In [0077]:
In at least one embodiment, during training, one or more processors at each FL client site sample one path from each searched layer from super blocks 206 of a supernet 204 uniformly at each iteration, and parameters of new sub-networks are updated during gradient back-propagation,
- where the architecture parameters represent the importance of different architecture choices at various locations inside super-net
In [0061] :
In one embodiment, once a supernet has been trained sufficiently, one or more processors at each FL client site selects a neural network. In one embodiment, a selected neural network is an optimal neural network, which may also be referred to as a sub-network, with a best path selected from a plurality of neural networks.
in [0110]:
In at least one embodiment, untrained neural network 1006 is trained using supervised learning, wherein training dataset 1002 includes an input paired with a desired output for an input, or where training dataset 1002 includes input having a known output and an output of neural network 1006 is manually graded.
Examiner’s BRI
(Supervised learning to train a supernet (a network comprising a plurality of neural network paths or sub-networks) is a foundational technique in neural architecture search (NAS) that provides a mechanism to implement, evaluate, and select from a sample of sub-networks.)
In [0064]:
a trained portion of supernet 104 is a selected neural network, which may also be referred to as an optimal neural network, and/or a sub-network for each FL client site
In [0064]:
after one or more processors conduct several training rounds in a FL setting, trained portions from each client A 106 and client B 108 are converged. In at least one embodiment, each client 106, 108 is allowed to select a locally best model (e.g., sub-network) by monitoring a certain performance metric on a local hold out validation set.
- a data loader configured to provide the training inputs to the super-net hardware to generate corresponding super-net outputs;
In [0098]:
In at least one embodiment, inference and/or training logic 915 may include, without limitation, code and/or data storage 901 to store forward and/or output weight and/or input/output data, and/or other parameters to configure neurons or layers of a neural network trained and/or used for inferencing in aspects of one or more embodiments. In at least one embodiment, training logic 915 may include, or be coupled to, code and/or data storage 901 to store graph code or other software to control timing and/or order in which weight and/or other parameter information is to be loaded to configure logic, including integer and/or floating point units (collectively, arithmetic logic units (ALUs)). In at least one embodiment, code, such as graph code, loads weight or other parameter information into processor ALUs based on an architecture of a neural network to which such code corresponds.
In [0061]:
one or more processors determine, with guidance of additional unsupervised loss functions at inference, which neural network to select as a sub-network.
In [0098]:
training logic 915 may include, or be coupled to, code and/or data storage 901 to store graph code or other software to control timing and/or order in which weight and/or other parameter information is to be loaded to configure logic, including integer and/or floating point units (collectively, arithmetic logic units (ALUs)).
In [0227]:
In at least one embodiment, server(s) 1278 may be used to train machine learning models (e.g., neural networks) based at least in part on training data. In at least one embodiment, training data may be generated by vehicles, and/or may be generated in a simulation (e.g., using a game engine).
- and provide training inputs of the training data to sample sub-networks of the super-net to generate sub-network outputs,
In [0061]:
In at least one embodiment, once a supernet has been trained sufficiently, one or more processors at each FL client site selects a neural network
In [0061]:
one or more processors determine, with guidance of additional unsupervised loss functions at inference, which neural network to select as a sub-network.
In [0060]:
In at least one embodiment, one or more processors at each FL client site then uses a selected neural network to train a portion of supernet accordingly.
In [0083]:
In at least one embodiment, in order to achieve on-the-fly neural network selection, at each FL client site, additional information is utilized from a reconstruction branch. In an embodiment, an outline on achieving on-the-fly neural network selection is shown in a below algorithm (Algorithm 2) as follows:
[Image: media_image14.png, 275x363, greyscale — Algorithm 2]
In [0084]:
Algorithm 2 indicates that a first step includes feeding a new testing data point x to a supernet so that reconstruction loss Lrecon is computed.
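Algorithm 2's reconstruction-guided selection step can be sketched as below. This is a simplified illustration under assumed list-valued inputs; `reconstruction_loss` and `select_path` are hypothetical names, not code from the reference:

```python
def reconstruction_loss(x, x_hat):
    # Mean squared error between a test point and its reconstruction.
    return sum((a - b) ** 2 for a, b in zip(x, x_hat)) / len(x)

def select_path(x, reconstructions):
    # Choose the candidate path whose reconstruction branch best
    # reproduces the input, i.e., the path with the lowest L_recon.
    losses = {k: reconstruction_loss(x, r) for k, r in reconstructions.items()}
    return min(losses, key=losses.get)

x = [1.0, 2.0, 3.0]
recons = {"path_a": [1.1, 2.0, 2.9], "path_b": [0.0, 0.0, 0.0]}
assert select_path(x, recons) == "path_a"
```

The unsupervised reconstruction loss thus guides which sub-network is selected for each new data point, consistent with the on-the-fly selection described in [0083].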
- supervised learning controller, coupled to the super-net hardware, configured to train network weights and architecture parameters of a super-net
In [0098]:
In at least one embodiment, inference and/or training logic 915 may include, without limitation, code and/or data storage 901 to store forward and/or output weight and/or input/output data, and/or other parameters to configure neurons or layers of a neural network trained and/or used for inferencing in aspects of one or more embodiments. In at least one embodiment, training logic 915 may include, or be coupled to, code and/or data storage 901 to store graph code or other software to control timing and/or order in which weight and/or other parameter information is to be loaded to configure logic, including integer and/or floating point units (collectively, arithmetic logic units (ALUs)). In at least one embodiment, code, such as graph code, loads weight or other parameter information into processor ALUs based on an architecture of a neural network to which such code corresponds.
In [0110]:
In at least one embodiment, untrained neural network 1006 is trained using supervised learning, wherein training dataset 1002 includes an input paired with a desired output for an input,
In [0060]:
In at least one embodiment, a processor with one or more circuits generates a supernet, which may also be referred to as a supernetwork or a neural network comprising a plurality of neural networks, to enable a mixture of candidate modules in parallel to represent multi-scale appearance features at different network levels, respectively.
In [0109]:
in at least one embodiment, training framework 1004 trains an untrained neural network 1006 and enables it to be trained using processing resources described herein to generate a trained neural network 1008. In at least one embodiment, weights may be chosen randomly or by pre-training using a deep belief network. In at least one embodiment, training may be performed in either a supervised, partially supervised, or unsupervised manner.
In [0108]:
In at least one embodiment, each of code and/or data storage 901 and 905 and corresponding computational hardware 902 and 906, respectively, correspond to different layers of a neural network, such that resulting activation from one storage/computational pair 901/902 of code and/or data storage 901 and computational hardware 902 is provided as an input to a next storage/computational pair 905/906 of code and/or data storage 905 and computational hardware 906,
In [0081]:
In at least one embodiment, after one or more processors at client server trains a supernet, a unique sub-network for each input, at each FL client's computing system
In [0233]:
Embodiments may be used in other devices such as handheld devices and embedded applications.
In [0233]:
In at least one embodiment, embedded applications may include a microcontroller, a digital signal processor (“DSP”), system on a chip
In [0224]:
In at least one embodiment, inference and/or training logic 915 are used to select a neural network for a data point in a federated learning (FL) setting.
In [0061]:
In one embodiment, once a supernet has been trained sufficiently, one or more processors at each FL client site selects a neural network
In [0063]:
In at least one embodiment, neural network 104 is a supernet, which may also be referred to as a supernetwork, model architecture, and/or a neural network comprising a plurality of neural networks.
- and adjust the architecture parameters of the super-net to reduce the accumulated loss
In [0065]:
In one embodiment, an algorithm that trains high-quality models using relatively few rounds of communication by combining local stochastic gradient descent (SGD) on each client with a server that performs model averaging is utilized. In one embodiment, FL minimizes a global loss function L, which can be a weighted combination of K local losses {L_k}_{k=1}^{K}, each of which is computed on a client k's local data. In one embodiment, FL is formulated, as Equation 1 below, as a task of finding model parameters ϕ that minimize L given some local data X:
[Image: media_image15.png, 75x418, greyscale — Equation 1]
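Equation 1's weighted combination of local losses can be sketched as follows. This is a generic FedAvg-style weighting by client data size, which is an assumption for illustration; the reference's exact weights may differ:

```python
def global_loss(local_losses, client_sizes):
    # L = sum_k w_k * L_k, with weights w_k proportional to each
    # client k's local data size and summing to one.
    total = sum(client_sizes)
    return sum((n / total) * lk for n, lk in zip(client_sizes, local_losses))

# Two clients with 100 and 300 samples and local losses 0.5 and 1.0.
L = global_loss([0.5, 1.0], [100, 300])
assert abs(L - 0.875) < 1e-12  # 0.25 * 0.5 + 0.75 * 1.0
```

The server would compute this weighted objective (or, equivalently, average the clients' weight updates with the same weights) after each communication round.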
In [0110]:
training framework 1004 trains untrained neural network 1006 repeatedly while adjusting weights to refine an output of untrained neural network 1006 using a loss function and an adjustment algorithm, such as stochastic gradient descent.
In [0070]:
In one embodiment, p_i is a predicted probability from a final sigmoid-activated output layer of supernet f(X) and g_i is a ground truth label map at a given voxel i. In one embodiment, once supernet 104 is trained, a sub-network s_0 is found at each client 106, 108 through supernet 104, effectively adapting a model to a target domain. In one embodiment, during adaptation, model parameters ϕ stay fixed and only path weights are optimized for one epoch on a local validation set. In one embodiment, this results in an optimal path m_0 ∈ M that defines a locally adapted sub-network s_0 ∈ S.
(BRI: A path is an architecture parameter. Optimizing path weights is essentially an update of those path weights, as the optimization process aims to adjust the weights to achieve a desired outcome, such as finding the shortest or most efficient path.)
- where the supervised learning controller is further configured to: select a sub-network of the plurality of sub-networks for the chosen task based on the largest adjusted architecture parameters, and
In [0098]:
In at least one embodiment, inference and/or training logic 915 may include, without limitation, code and/or data storage 901 to store forward and/or output weight and/or input/output data, and/or other parameters to configure neurons or layers of a neural network trained and/or used for inferencing in aspects of one or more embodiments. In at least one embodiment, training logic 915 may include, or be coupled to, code and/or data storage 901 to store graph code or other software to control timing and/or order in which weight and/or other parameter information is to be loaded to configure logic, including integer and/or floating point units (collectively, arithmetic logic units (ALUs)). In at least one embodiment, code, such as graph code, loads weight or other parameter information into processor ALUs based on an architecture of a neural network to which such code corresponds.
In [0005]:
FIG. 3 illustrates a diagram of an overall framework on how a neural network is selected for an input at a federated learning (FL) client site, according to at least one embodiment;
In [0061]:
In at least one embodiment, once a supernet has been trained sufficiently, one or more processors at each FL client site selects a neural network. In at least one embodiment, a selected neural network is an optimal neural network, which may also be referred to as a sub-network, with a best path selected from a plurality of neural networks.
In [0087]:
In at least one embodiment, during training, one path is sampled from each searched layer from super blocks of supernet uniformly at each iteration, and parameters of new sub-networks are updated during gradient back-propagation.
In [0121]:
For example, in at least one embodiment, a machine learning model may be trained by calculating weight parameters according to a neural network architecture
- and output a description of the network architecture of the selected sub-network for the chosen task.
In [0067]:
In at least one embodiment, training different portions of supernet 104, at each FL client site, is performed by a processor at each FL client site passing a data point through supernet 104 that results in selecting a sub-network from supernet 104. In at least one embodiment, selected sub-network is a trained portion of supernet 104. In at least one embodiment, supernet S comprises various DL module candidates M suitable for 3D medical imaging tasks shown in Table 1 below
[Image: media_image2.png, 257x543, greyscale — Table 1]
Roth does not explicitly disclose:
- accumulate a loss over the sample of sub-networks, the accumulated loss based, at least in part, on a sum, over layers of sub-networks in the sample, of measures of smoothness based on network weights in the layers, where a measure of smoothness relates to the maximum change in output from a layer relative to any possible change in input to the layer, a higher maximum change in output indicating a lower smoothness;
- and adjust the architecture parameters and network weights of the super-network to reduce the accumulated loss, where the network weights of the super-net are shared with one or more sub-networks of the plurality of sub-networks and are trained jointly, where the accumulated loss penalizes sub-networks with lower smoothness and improves stability of the adjustment [[; and]],
However, Peng discloses:
- accumulate a loss over the sample of sub-networks, the accumulated loss based, at least in part, on a sum, over layers of sub-networks in the sample, of measures of smoothness based on network weights in the layers
In [1, Page 12355]:
In this paper, we attribute the ranking correlation problem to the supernet training consistency shift, including feature shift and parameter shift.
Examiner’s BRI
(parameter shifts (updates) via gradient descent are specifically designed to reduce the training loss, while feature shifts generally refer to changes in data distribution that can cause the training loss to become unreliable (increase or diverge). The shift (loss) is the first loss)
Motivated by consistency regularization methods [29, 44], we propose a nontrivial supernet-Π model, called Π-NAS, to reduce these two shifts simultaneously. Specifically, to cope with the feature shift, we propose a novel supernet-Π model. We evaluate each data point through two randomly sampled paths, then apply a consistency cost between the two predictions to penalize the feature consistency shift between different paths.
As shown in Figure 1a (right), our method can significantly reduce the feature shift and thus can improve the architecture ranking correlation. To address the parameter shift, we propose a novel non-trivial mean teacher model by maintaining an exponential moving average of weights in the supernet teacher.
Examiner’s BRI
(evaluating each data point through two randomly sampled paths (or perturbations) and applying a consistency cost (or loss) between the two predictions is a standard technique used to penalize loss of smoothness, often called consistency regularization or constancy training which is the second loss)
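The consistency cost between two path predictions, as mapped by the examiner to the second loss, can be sketched as a mean squared difference. The exact cost function in Peng may differ; this is an illustrative stand-in with hypothetical helper names:

```python
def consistency_cost(pred_a, pred_b):
    # Penalize disagreement between predictions produced by two
    # randomly sampled paths for the same input (feature shift).
    return sum((a - b) ** 2 for a, b in zip(pred_a, pred_b)) / len(pred_a)

def total_loss(task_loss, pred_a, pred_b, beta=1.0):
    # First loss (supervised task loss) plus second loss (consistency),
    # mirroring the examiner's two-loss mapping.
    return task_loss + beta * consistency_cost(pred_a, pred_b)

assert consistency_cost([0.9, 0.1], [0.9, 0.1]) == 0.0
assert abs(total_loss(0.5, [1.0, 0.0], [0.0, 1.0]) - 1.5) < 1e-12
```

Identical predictions from the two paths incur no penalty; divergent predictions inflate the accumulated loss, which is what stabilizes supernet training.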
In [3.2, Page 12358]:
we propose to maintain an exponential moving average of weights for the teacher model rather than barely replicate from the student model in supernet-Π model training. Formally, we denote W_t as parameters of student mapping function f at training step t. Then, weights of mean teacher model f′ can be defined as:
[Image: media_image3.png, 46x560, greyscale]
where λ ∈ [0, 1] is a smoothing coefficient hyper-parameter. A low λ close to 0 provides greater smoothing and a higher λ close to 1 provides low smoothing. With high smoothing (λ closer to 0), the teacher weights remain essentially the same.
Examiner’s BRI
(architecture ranking in Neural Architecture Search (NAS) can provide measures of search space smoothness, particularly in the context of ensuring that small, incremental changes to an architecture produce correspondingly small changes in its evaluated performance)
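The mean-teacher update quoted from Peng [3.2] can be sketched as one EMA step. The sketch follows the convention in the quoted passage, where λ near 0 gives stronger smoothing; the paper's exact formulation may differ:

```python
def ema_update(teacher_w, student_w, lam):
    # w'_t = (1 - lam) * w'_(t-1) + lam * w_t: with lam near 0 the
    # teacher barely moves (strong smoothing); with lam near 1 it
    # simply copies the student (no smoothing).
    return [(1 - lam) * t + lam * s for t, s in zip(teacher_w, student_w)]

teacher = [1.0, 1.0]
student = [3.0, 5.0]
assert ema_update(teacher, student, 0.0) == [1.0, 1.0]  # unchanged teacher
assert ema_update(teacher, student, 1.0) == [3.0, 5.0]  # copies student
assert ema_update(teacher, student, 0.5) == [2.0, 3.0]
```

The averaged teacher damps the contradictory per-iteration parameter updates that the reference identifies as parameter shift.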
- and adjust the architecture parameters and network weights of the super-net to reduce the accumulated loss, where the network weights of the super-net are shared with one or more sub- networks of the plurality of sub-networks and are trained jointly[[;]], where the second loss penalizes sub-networks with lower smoothness and improves stability of the adjustment,
In [3.2, Page 12358]:
we propose to maintain an exponential moving average of weights for the teacher model rather than barely replicate from the student model in supernet-Π model training. Formally, we denote W_t as parameters of student mapping function f at training step t. Then, weights of mean teacher model f′ can be defined as:
[Image: media_image3.png, 46x560, greyscale]
where λ ∈ [0, 1] is a smoothing coefficient hyper-parameter. A low λ close to 0 provides greater smoothing and a higher λ close to 1 provides low smoothing. With high smoothing (λ closer to 0), the teacher weights remain essentially the same.
In [1, Page 12355]:
Parameter shift is identified as contradictory parameter updates for a given layer. In supernet training, a given layer will always be present in different paths from iteration to iteration (see Figure 1b, left).
In [1, Page 12354]:
[Image: media_image4.png, 46x684, greyscale]
(b) Parameter shift. Different colors represent the distribution of parameters in different iterations. Left: without our nontrivial mean teacher, the parameter has significantly varying distributions in training. Right: with our nontrivial mean teacher, the parameter shift is significantly reduced.
Figure 1: Illustration of supernet training consistency shift
In [1, 12355]:
The parameter in this layer may have a contradictory update from iteration to iteration. These unstable updates lead to varying parameters’ distributions, hurting the architecture ranking correlation in two ways. On the one hand, stable parameters can ensure a correct loss descent and guarantee an accurate architecture ranking, while frequent parameter change could not preserve architecture ranking. On the other hand, varying parameters can also result in a feature shift, further hurting architecture ranking correlation. In summary, both feature shift and parameter shift can hurt the architecture ranking correlation.
In [1, 12355]:
Motivated by consistency regularization methods [29, 44], we propose a nontrivial supernet-Π model, called Π-NAS, to reduce these two shifts simultaneously. Specifically, to cope with the feature shift, we propose a novel supernet-Π model. We evaluate each data point through two randomly sampled paths, then apply a consistency cost between the two predictions to penalize the feature consistency shift between different paths. As shown in Figure 1a (right), our method can significantly reduce the feature shift and thus can improve the architecture ranking correlation.
The examiner interprets the invention as “using a neural network to learn to extend the quantization and provide high accuracy to reflect the original signal, in which the learning is performed by computing and inferring the loss function”.
Within the context of the core of the invention, the combination of prior art references teaches the invention, making the case for “motivation to combine”.
It would have been obvious to one of ordinary skill in the art before the effective filing date of the present application to combine Roth and Peng.
Roth teaches a super-net including a plurality of sub-networks and first loss.
Peng teaches second loss accumulating a loss over a sample of sub-networks, and adjusting network weights to reduce the accumulated loss and measures of smoothness.
Within the context of the theme and teaching of Roth and Peng, it would have been obvious to a person of ordinary skill in the art (POSITA) to combine Roth and Peng.
One of ordinary skill would have been motivated to combine Roth and Peng because combining a first loss related to training data (e.g., standard cross-entropy or MSE) with a second loss related to a sample of sub-networks (e.g., ranking loss) generally provides improved architecture ranking correlation and better performance in Neural Architecture Search (NAS) (Peng [1, Page 12355]).
However, Mok discloses:
- where a measure of smoothness relates to the maximum change in output from a layer relative to any possible change in input to the layer, a higher maximum change in output indicating a lower smoothness;
In regard to Claim 16: (Previously Presented)
Roth discloses:
- train network weights of the selected sub-network using the training data;
In [0052]:
FIG. 39 is a system diagram for an example system for training, adapting, instantiating and deploying machine learning models in an advanced computing pipeline, in accordance with at least one embodiment;
In [0005]:
FIG. 3 illustrates a diagram of an overall framework on how a neural network is selected for an input at a federated learning (FL) client site, according to at least one embodiment;
- store the trained network weights.
In [0064]:
in at least one embodiment, results 114, 120 include updated model weights (or their gradients) from trained portion of supernet for client A 112 and trained portion of supernet for client B 118, and updated model weights are sent to client server 102 for aggregation. In at least one embodiment, after aggregation, new weights are redistributed to client A 106 and client B 108 and a next round of local training is executed.
(BRI: Within the context of learning, updated model weights may be sent to a central server, and those stored (or aggregated) weights are then used for redistribution.)
In regard to Claim 19: (Previously Presented)
Roth and Peng do not explicitly disclose:
- where said adjust the architecture parameters and the network weights includes:
determining gradients of the accumulated first and second loss functions;
However, Mok discloses:
- where said adjust the architecture parameters and the network weights includes:
determining gradients of the accumulated first and second loss functions;
In [4.2, Page 12326]: Approximation of L_λ
‖H_std‖_F can be expressed in terms of the ℓ2 norm:
[Image: media_image6.png, 4x146, greyscale]
where the expectation is taken over z ∼ N(0, I_d). Because the direct computation of H_std is expensive, we linearly approximate it through the finite difference approximation of the Hessian:
[Image: media_image7.png, 77x422, greyscale]
where h controls the scale of the loss landscape on which we induce smoothness. However, computing multiple H_std·z in directions drawn from z ∼ N(0, I_d) and taking its average would be computationally inefficient because each computation of H_std·z requires calculation of the gradient. Therefore, we minimize the input loss landscape along the high curvature direction,
In [4.2, Page 12326]:
each computation of H_std·z requires calculation of the gradient. Therefore, we minimize the input loss landscape along the high curvature direction,
[Image: media_image12.png, 23x176, greyscale]
- and update the architecture parameters and the network weights of the super-net based on the gradients.
In [4.1, Page 12325]:
To make the search space continuous for gradient-based optimization, categorical choice of a particular operation is continuously relaxed by applying a softmax function over all the possible operations:
[Image: media_image13.png, 63x442, greyscale]
where α^(i,j) is a set of operation mixing weights (i.e., architecture parameters). O is the pre-defined set of operations that are used to construct the supernet. By definition, the size of α^(i,j) must be equal to |O|.
Through continuous relaxation, both the architecture parameters α and the weight parameters ω in the supernet can be updated via gradient descent.
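The continuous relaxation quoted above, a softmax over the operation mixing weights α^(i,j), can be sketched as follows. This is illustrative only; in the actual supernet the resulting mixture weights multiply the outputs of the candidate operations:

```python
import math

def relax(arch_params):
    # Softmax over architecture parameters alpha(i, j): turns a hard
    # categorical operation choice into a differentiable mixture whose
    # weights are positive and sum to one, enabling gradient descent
    # on both alpha and the network weights.
    exps = [math.exp(a) for a in arch_params]
    z = sum(exps)
    return [e / z for e in exps]

weights = relax([0.0, 1.0, 2.0])
assert abs(sum(weights) - 1.0) < 1e-12
assert weights[2] > weights[1] > weights[0]  # larger alpha, larger share
```

After search, the discrete architecture is typically recovered by keeping, at each location, the operation with the largest mixing weight, which matches the claim's selection "based on the largest adjusted architecture parameters."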
It would have been obvious to one of ordinary skill in the art before the effective filing date of the present application to combine Roth, Peng and Mok.
Roth teaches a super-net including a plurality of sub-networks and first loss.
Peng teaches second loss accumulating a loss over a sample of sub-networks, and adjusting network weights to reduce the accumulated loss and measures of smoothness.
Mok teaches relating the measure of smoothness to the maximum difference between outputs from the layer relative to any possible difference between inputs to the layer.
One of ordinary skill would have been motivated to combine Roth, Peng and Mok because a proper choice of architecture can improve the performance of the neural network on a target task (Mok [1, Page 12323]).
In regard to Claim 17: (Original)
Roth and Peng do not explicitly disclose:
- and the second loss is a product of a smoothness scale factor and the sum, over layers of the sample sub-network, of measures of smoothness of the layers
However, Mok discloses:
- and the second loss is a product of a smoothness scale factor and the sum, over layers of the sample sub-network, of measures of smoothness of the layers
In [2.1, Page 12323]:
Our work is closely related to the defense approaches that utilize a regularization term
derived from the curvature information of the neural network’s loss landscape
In [6.1. Page 12327]: Effect of Regularization Strength
The regularization strength γ is empirically set to be 0.01 to match the scale of L_val and L_λ.
(BRI: The regularization strength is a smoothness scale factor, as it matches the scale of the losses.)
In [4.2, Page 12326]:
‖H_std‖_F can be expressed in terms of the ℓ2 norm:
[Image: media_image6.png, 4x146, greyscale]
where the expectation is taken over z ∼ N(0, I_d). Because the direct computation of H_std is expensive, we linearly approximate it through the finite difference approximation of the Hessian:
[Image: media_image7.png, 77x422, greyscale]
where h controls the scale of the loss landscape on which we induce smoothness. However, computing multiple H_std·z in directions drawn from z ∼ N(0, I_d) and taking its average would be computationally inefficient because each computation of H_std·z requires calculation of the gradient. Therefore, we minimize the input loss landscape along the high curvature direction,
[Image: media_image8.png, 27x178, greyscale]
to maximize the effect of L_λ. With the approximated L_λ, the bi-level optimization problem of AdvRush can be expressed as:
[Image: media_image9.png, 121x472, greyscale]
where x_val is the clean input data from D_val. The value of h in the denominator of Eq. (9) is absorbed by the regularization strength γ.
The examiner interprets the invention as “using a neural network to learn to extend and provide high accuracy to reflect the original signal, in which the learning is performed by computing and inferring the loss function”.
Within the context of the core of the invention, the combination of prior art references teaches the invention, making the case for “motivation to combine”.
It would have been obvious to one of ordinary skill in the art before the effective filing date of the present application to combine Roth, Peng and Mok.
Roth teaches a super-net including a plurality of sub-networks and first loss.
Peng teaches second loss accumulating a loss over a sample of sub-networks, and adjusting network weights to reduce the accumulated loss and measures of smoothness.
Mok teaches relating the measure of smoothness to the maximum difference between outputs from the layer relative to any possible difference between inputs to the layer.
One of ordinary skill would have been motivated to combine Roth, Peng and Mok because a proper choice of architecture can improve the performance of the neural network on a target task (Mok [1, Page 12323]).
In regard to Claim 18: (Original)
Roth and Peng do not explicitly disclose:
[Image: media_image10.png, 47x318, greyscale]
where λ is a smoothness scale factor, W_{k,j} is a matrix of network weights in layer j of sub-network k, and σ_max(W_{k,j}) is the maximum singular value of the matrix W_{k,j}, or an estimate thereof.
However, Mok discloses:
[Image: media_image10.png, 47x318, greyscale]
where λ is a smoothness scale factor, W_{k,j} is a matrix of network weights in layer j of sub-network k, and σ_max(W_{k,j}) is the maximum singular value of the matrix W_{k,j}, or an estimate thereof.
In [4.2, Page 12326]: Approximation of L_λ
‖H_std‖_F can be expressed in terms of the ℓ2 norm:
[Image: media_image6.png, 4x146, greyscale]
where the expectation is taken over z ∼ N(0, I_d). Because the direct computation of H_std is expensive, we linearly approximate it through the finite difference approximation of the Hessian:
[Image: media_image7.png, 77x422, greyscale]
where h controls the scale of the loss landscape on which we induce smoothness. However, computing multiple H_std·z in directions drawn from z ∼ N(0, I_d) and taking its average would be computationally inefficient because each computation of H_std·z requires calculation of the gradient. Therefore, we minimize the input loss landscape along the high curvature direction,
[Image: media_image11.png, 27x182, greyscale]
to maximize the effect of L_λ.
Examiner’s BRI
(The ℓ2 operator norm is the same as the spectral norm, which is defined as the maximum singular value of a matrix, representing the maximum expansion the matrix applies to a vector; it is calculated as the square root of the largest eigenvalue of WᵀW.)
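The maximum singular value σ_max(W) referenced in the examiner's BRI above can be estimated with power iteration. Below is a minimal pure-Python sketch using list-of-lists matrices; a production implementation would use an optimized linear algebra library:

```python
def sigma_max(W, iters=50):
    # Power iteration on W^T W: v converges to the top right singular
    # vector, and ||W v|| converges to the largest singular value, i.e.
    # the spectral norm bounding how much a linear layer can stretch
    # any change in its input (the layer's Lipschitz constant).
    m, n = len(W), len(W[0])
    v = [1.0] * n
    for _ in range(iters):
        u = [sum(W[i][j] * v[j] for j in range(n)) for i in range(m)]   # W v
        v = [sum(W[i][j] * u[i] for i in range(m)) for j in range(n)]   # W^T u
        norm = sum(x * x for x in v) ** 0.5
        v = [x / norm for x in v]
    u = [sum(W[i][j] * v[j] for j in range(n)) for i in range(m)]
    return sum(x * x for x in u) ** 0.5

# diag(3, 1) has singular values {3, 1}.
assert abs(sigma_max([[3.0, 0.0], [0.0, 1.0]]) - 3.0) < 1e-6
```

Summing such per-layer estimates, scaled by a factor λ, is one way the claimed smoothness penalty over sub-network layers could be accumulated.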
The examiner interprets the invention as “using a neural network to learn to extend and provide high accuracy to reflect the original signal, in which the learning is performed by computing and inferring the loss function”.
Within the context of the core of the invention, the combination of prior art references teaches the invention, making the case for “motivation to combine”.
Roth teaches a super-net including a plurality of sub-networks and first loss.
Peng teaches second loss accumulating a loss over a sample of sub-networks, and adjusting network weights to reduce the accumulated loss and measures of smoothness.
Mok teaches relating the measure of smoothness to the maximum difference between outputs from the layer relative to any possible difference between inputs to the layer.
One of ordinary skill would have been motivated to combine Roth, Peng and Mok because a proper choice of architecture can improve the performance of the neural network on a target task (Mok [1, Page 12323]).
In regard to Claim 21: (New)
Roth and Peng do not explicitly disclose:
- where the super-net has at least 100,000 parameters including the plurality of network weights and the plurality of architecture parameters.
However, Mok discloses:
- where the super-net has at least 100,000 parameters including the plurality of network weights and the plurality of architecture parameters.
In [5.2, Page 12326]: White-box Attacks
We evaluate the adversarial robustness of architectures standard and adversarially trained on CIFAR-10 using various white-box attacks.
In [5.2, Page 12326]: White-box Attacks
White-box attack evaluation results are presented in Table 1.
In [5.2. Page 12327]:
[Image: media_image16.png, 380x995, greyscale — Table 1]
(BRI: The newly added claim uses the CIFAR-10 data set (see [0068]); as shown, AdvRush uses 4.2M parameters.)
It would have been obvious to one of ordinary skill in the art before the effective filing date of the present application to combine Roth, Peng and Mok.
Roth teaches a super-net including a plurality of sub-networks and a first loss.
Peng teaches a second loss accumulating a loss over a sample of sub-networks, adjusting network weights to reduce the accumulated loss, and measures of smoothness.
Mok teaches relating the measure of smoothness to the maximum difference between outputs from the layer relative to any possible difference between inputs to the layer.
One of ordinary skill would have had motivation to combine Roth, Peng and Mok because a proper choice of architecture can improve the performance of the neural network on a target task (Mok [1, Page 12323]).
Claims 5-6 and 22 are rejected under 35 U.S.C. 103 as being unpatentable over
Holger Roth et al. (hereinafter Roth), US 2021/0374502 A1,
in view of Jiefeng Peng et al. (hereinafter Peng), "Pi-NAS: Improving Neural Architecture Search by Reducing Supernet Training Consistency Shift," Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021,
in view of Jisoo Mok et al. (hereinafter Mok), "AdvRush: Searching for Adversarially Robust Neural Architectures," Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021,
and further in view of Minjing Dong et al. (hereinafter Dong), arXiv:2009.00902v1 [cs.CV], 2 Sep 2020.
In regard to Claim 5: (Currently Amended)
Roth, Peng and Mok do not explicitly disclose:
- the second loss is determined by: for each layer of the sub-network, computationally estimating a maximum singular value of a matrix of network weights of the layer,
However, Dong discloses:
In [2.1, Page 2]:
The network architecture is represented by A, and its filter weight is denoted as W. The objective of adversarial attacks is to find the perturbed input x̃ which leads to wrong predictions through maximizing the classification loss as:

x̃ = argmax L(f(x̃; W, A), y)  [media_image17.png]

where

‖x̃ − x‖_p ≤ ε  [media_image18.png]

and the perturbation is constrained by its ℓp norm.
in [2.2, Page 4]:
we focus on the ℓ2-bounded perturbations and according to the definition of spectral norm, the Lipschitz constant of these operations is the spectral norm of its weight matrix W.
(BRI: Adversarial attacks, which are designed to cause incorrect predictions by maximizing the model's classification loss (typically through techniques like FGSM or PGD), can be interpreted as exploiting a form of non-smoothness. The spectral norm provides the maximum singular value.)
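The claimed step of "computationally estimating a maximum singular value" can be realized, for example, by power iteration. The sketch below (illustrative only; not taken from any reference of record) estimates the spectral norm of a weight matrix and compares the estimate against an exact SVD:

```python
import numpy as np

def spectral_norm_estimate(W, iters=500, seed=0):
    """Estimate the maximum singular value of W by power iteration on W^T W."""
    rng = np.random.default_rng(seed)
    v = rng.standard_normal(W.shape[1])
    v /= np.linalg.norm(v)
    for _ in range(iters):
        u = W @ v             # forward pass through the layer weights
        v = W.T @ u           # backward pass
        v /= np.linalg.norm(v)
    return np.linalg.norm(W @ v)  # never exceeds the true sigma_max, since ||v|| = 1

W = np.random.default_rng(3).standard_normal((10, 7))
est = spectral_norm_estimate(W)
exact = np.linalg.svd(W, compute_uv=False)[0]  # singular values in descending order

assert est <= exact + 1e-9
assert est >= 0.99 * exact
```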
- and accumulating the estimated maximum singular values over layers of a sub-network.
In [2.2, Page 4]:
the Lipschitz constant λ_F is bounded by the product of the Lipschitz constants of intermediate nodes as

λ_F ≤ ∏_n λ(I_n)  [media_image19.png]
Examiner’s BRI
(Accumulating the estimated maximum singular values over the layers of a sub-network is primarily a technique used to bound the Lipschitz constant of the network, ensuring stable gradient propagation and controlling the stability of deep neural networks.)
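The accumulation described in the BRI above can be sketched as follows (illustrative only; purely linear layers are assumed, and 1-Lipschitz activations such as ReLU would not loosen the bound): the product of per-layer maximum singular values bounds the Lipschitz constant of the composed network.

```python
import numpy as np

rng = np.random.default_rng(1)
layers = [rng.standard_normal((8, 8)) * 0.5 for _ in range(3)]  # hypothetical layer weights

# Per-layer estimate of the maximum singular value (spectral norm).
per_layer = [np.linalg.norm(W, ord=2) for W in layers]

# Accumulating over layers: the product bounds the Lipschitz
# constant of the composed linear network W3 @ W2 @ W1.
lipschitz_bound = float(np.prod(per_layer))
composed = layers[2] @ layers[1] @ layers[0]
true_constant = np.linalg.norm(composed, ord=2)

assert true_constant <= lipschitz_bound + 1e-9
```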
It would have been obvious to one of ordinary skill in the art before the effective filing date of the present application to combine Roth, Peng, Mok and Dong.
Roth teaches a super-net including a plurality of sub-networks and a first loss.
Peng teaches a second loss accumulating a loss over a sample of sub-networks, adjusting network weights to reduce the accumulated loss, and measures of smoothness.
Mok teaches relating the measure of smoothness to the maximum difference between outputs from the layer relative to any possible difference between inputs to the layer.
Dong teaches the maximum singular value and the training set.
One of ordinary skill would have had motivation to combine Roth, Peng, Mok, and Dong to constrain the Lipschitz constant to improve robustness (Dong [2.2, Page 2]).
In regard to Claim 6: (Currently Amended)
Roth, Peng and Mok do not explicitly disclose:
- where the measure of smoothness of a layer of a sub-network is based on an estimate of the maximum ratio between variations in the output from the layer and variations in the input to the layer, a higher maximum ratio indicating a lower smoothness.
However, Dong discloses:
- where the measure of smoothness of a layer of a sub-network is based on an estimate of the maximum ratio between variations in the output from the layer and variations in the input to the layer, a higher maximum ratio indicating a lower smoothness.
In [2.3, Page 3]:
One concern of NAS for adversarial robustness is the computational cost, since both adversarial training and supernet optimization can be time-consuming.
In [3.2, Page 4]:
the relationship between architecture parameters α, β and the Lipschitz constant of the network. Since the entire neural network is constructed by stacking cells in series as [I1, I2, ..., IN], Eq. 2 can be further decomposed as

λ_F ≤ λ_ℓ · λ_C · λ(I_N)  [media_image20.png]

where λ_ℓ, λ_C, λ(I_N) denote the Lipschitz constants of the loss function, classifier, and cell I_N respectively.
In [3.2, Page 4]:
Eq. 6 can be unfolded recursively and rewritten as

λ_F ≤ λ_ℓ · λ_C · ∏_{i=1}^{N} λ(I_i)  [media_image21.png]
It is obvious that the adversarial robustness can be bounded by the Lipschitz constants of cells. Eq. 7 also suggests that the impact of perturbation grows exponentially with the number of cells, which further highlights the influence of cell designing.
Examiner’s BRI
(The measure of smoothness of a layer of a sub-network based on the maximum ratio between variations in the output and input is known as the Lipschitz constant. Within the context of dynamical systems and mathematical modeling, the impact of a perturbation can grow exponentially; such exponential growth corresponds to a higher Lipschitz constant, indicating lower smoothness. The maximum ratio indicates a gain, or Lyapunov exponent, which is proportional to the rate at which a perturbation grows exponentially and leads to instability when the maximum ratio is > 1. This is known to the POSITA.)
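The exponential growth noted in the BRI can be demonstrated numerically. In the sketch below (illustrative only; a single linear "cell" scaled to a Lipschitz constant of 1.5 is assumed), the amplification of a small input perturbation after n cells stays within the bound 1.5^n and grows exponentially since 1.5 > 1:

```python
import numpy as np

rng = np.random.default_rng(2)
W = rng.standard_normal((6, 6))
W = W / np.linalg.norm(W, ord=2) * 1.5   # scale the cell so its Lipschitz constant is 1.5

x = rng.standard_normal(6)
delta = 1e-6 * rng.standard_normal(6)    # small input perturbation

growth = []
a, b = x, x + delta
for n in range(1, 6):
    a, b = W @ a, W @ b                  # pass both signals through another cell
    growth.append(np.linalg.norm(b - a) / np.linalg.norm(delta))

# After n cells the amplification ratio is bounded by 1.5**n,
# the product of the per-cell maximum ratios.
assert all(r <= 1.5 ** (i + 1) * (1 + 1e-9) for i, r in enumerate(growth))
```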
In [2.2, Page 3]:
the Lipschitz constant of operations without convolutional layers can be summarized as follows: (1) average pooling: S^{-0.5}, where S denotes the stride of the pooling layer; (2) max pooling: 1; (3) identity connection: 1; (4) zeroize: 0. For the rest of the operations, including depthwise separable conv and dilated depthwise separable conv, we focus on the ℓ2-bounded perturbations and, according to the definition of spectral norm, the Lipschitz constant of these operations is the spectral norm of its weight matrix, where

‖W‖₂ = σ_max(W)  [media_image22.png]

which is also the maximum singular value of W.
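The quoted figure for average pooling (Lipschitz constant S^{-0.5}) can be checked directly. The sketch below (illustrative only; non-overlapping 1-D average pooling is assumed) builds the pooling operator as an explicit matrix and computes its spectral norm:

```python
import numpy as np

S = 4                         # pooling window / stride
n_out = 3
A = np.zeros((n_out, n_out * S))
for j in range(n_out):
    A[j, j * S:(j + 1) * S] = 1.0 / S   # each output averages S inputs

# The rows are orthogonal with norm sqrt(S)/S, so every singular
# value of A equals S**-0.5, matching the quoted Lipschitz constant.
assert np.isclose(np.linalg.norm(A, ord=2), S ** -0.5)
```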
It would have been obvious to one of ordinary skill in the art before the effective filing date of the present application to combine Roth, Peng, Mok and Dong.
Roth teaches a super-net including a plurality of sub-networks and a first loss.
Peng teaches a second loss accumulating a loss over a sample of sub-networks, adjusting network weights to reduce the accumulated loss, and measures of smoothness.
Mok teaches relating the measure of smoothness to the maximum difference between outputs from the layer relative to any possible difference between inputs to the layer.
Dong teaches the maximum singular value and the training set.
One of ordinary skill would have had motivation to combine Roth, Peng, Mok, and Dong to constrain the Lipschitz constant to improve robustness (Dong [2.2, Page 2]).
In regard to Claim 22: (New)
Roth, Peng and Mok do not explicitly disclose:
- where the training data includes at least 50,000 training inputs and corresponding training outputs.
However, Dong discloses:
In [3.1, Page 6]:
we search the robust neural architectures on CIFAR-10 dataset which contains 50K training images and 10K validation images over 10 classes
It would have been obvious to one of ordinary skill in the art before the effective filing date of the present application to combine Roth, Peng, Mok and Dong.
Roth teaches a super-net including a plurality of sub-networks and a first loss.
Peng teaches a second loss accumulating a loss over a sample of sub-networks, adjusting network weights to reduce the accumulated loss, and measures of smoothness.
Mok teaches relating the measure of smoothness to the maximum difference between outputs from the layer relative to any possible difference between inputs to the layer.
Dong teaches the maximum singular value and the training set.
One of ordinary skill would have had motivation to combine Roth, Peng, Mok, and Dong to constrain the Lipschitz constant to improve robustness (Dong [2.2, Page 2]).
Conclusion
Any inquiry concerning this communication or earlier communications from the
examiner should be directed to TIRUMALE KRISHNASWAMY RAMESH whose telephone number is (571)272-4605. The examiner can normally be reached by phone.
Examiner interviews are available via telephone, in-person, and video conferencing
using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at
http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Li B Zhen can be reached on phone (571-272-3768). The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be
obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit:
https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for
information about filing in DOCX format. For additional questions, contact the Electronic
Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO
Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.
/TIRUMALE K RAMESH/Examiner, Art Unit 2121
/Li B. Zhen/Supervisory Patent Examiner, Art Unit 2121