DETAILED ACTION
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Response to Arguments
Applicant's arguments filed 01/05/2026 have been fully considered but they are not persuasive.
Regarding applicant’s remarks directed to the rejection of claims under 35 USC § 103,
Alleged lack of teaching of assignment of the same probability to each edge
In Remarks p. 9, Applicant contends:
“Nothing in Figure 2 or in the uniform sampling in the quote taken from Section 3 of Li
discusses the assignment of the same probability to each edge. In addition, the gloss that the
Patent Office places on this aspect of Li ("wherein sampling uniformly is randomly sampling
from the list of options such that each option (ie directed edge from node i to subsequent nodes)
has the same probability") is unproven because the Patent Office fails to provide any reasoning
establishing that "sampling uniformly" is interchangeable in meaning with the claimed
assignment of the same probability to each edge.”
The relevant claim limitations appear to be “each respective edge of the edges being assigned a probability which characterizes with which probability the respective edge is selected” in claim 1.
As noted in the previous Office Action, Li teaches (emphasis added):
(Li, Section 3, …”3. Finally, moving from node to node, we sample uniformly from the set of possible choices for each decision that needs to be made [each respective edge of the edges being assigned a probability which characterizes with which probability the respective edge is selected; wherein sampling uniformly is randomly sampling from the list of options such that each option (ie directed edge from node i to subsequent nodes) has the same probability].”)
After careful consideration, the argument is considered unpersuasive because Li discloses “moving from node to node, we sample uniformly from the set of possible choices for each decision that needs to be made.” Examiner notes that the disclosed algorithm of uniformly sampling the possible choices (edges) at each node reads upon the limitation, as Examiner breaks down the interpretation of uniform sampling at each node:
Sampling from the possible choices at each node teaches selecting (“…with which probability the respective edge is selected”).
Uniformly teaches selecting with equal probability, “uniform” meaning in the same way/equally/evenly.
Performing uniform sampling at each node is thus interpreted as a sampling rule for selecting the path through the directed graph that initializes an equal probability across the edges. In other words, choosing this particular algorithm assigns the probabilities of the edges.
Examiner further notes that uniform sampling differs from an exemplary algorithm of always selecting the first edge, which would not be uniform (i.e., selecting with equal probability) because the selection is biased toward an arbitrary first edge.
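The distinction drawn above can be illustrated with a minimal sketch (a hypothetical illustration, not drawn from any cited reference; the edge labels and sample count are illustrative only):

```python
import random
from collections import Counter

def sample_uniform(edges):
    # Uniform sampling: each outgoing edge is drawn with probability 1/len(edges).
    return random.choice(edges)

def sample_first(edges):
    # Biased rule for contrast: always take the first edge, i.e., probability 1
    # for edge 0 and probability 0 for every other edge.
    return edges[0]

edges = ["e0", "e1", "e2", "e3"]
counts = Counter(sample_uniform(edges) for _ in range(100_000))
freqs = {e: counts[e] / 100_000 for e in edges}
# Empirical frequencies cluster near 0.25 per edge (equal probability),
# whereas sample_first always returns "e0".
```

Under uniform sampling each of the four edges is selected with the same probability, which is the interpretation applied to Li above.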
The examiner refers to the rejection under 35 USC § 103 in the current office action for more details.
Claim Rejections - 35 USC § 103
In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis (i.e., changing from AIA to pre-AIA) for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.
The factual inquiries for establishing a background for determining obviousness under 35 U.S.C. 103 are summarized as follows:
1. Determining the scope and contents of the prior art.
2. Ascertaining the differences between the prior art and the claims at issue.
3. Resolving the level of ordinary skill in the pertinent art.
4. Considering objective evidence present in the application indicating obviousness or nonobviousness.
Claim(s) 1, 3, 5-6 and 8-9 are rejected under 35 U.S.C. 103 as being unpatentable over US Pub. No. US20200175362A1 Zhang et al. (“Zhang”) in view of Li, Liam, and Ameet Talwalkar. "Random search and reproducibility for neural architecture search." (“Li”) in further view of Veniat, Tom, and Ludovic Denoyer. "Learning time/memory-efficient deep architectures with budgeted super networks." (“Veniat”)
In regards to claim 1,
Zhang teaches A computer-implemented method for creating a machine learning system, which is configured for segmentation and object description, the machine learning system including an input for receiving an image and two outputs, a first output outputting the segmentation of the image and a second output outputting the object description, comprising the following steps:
(Zhang, “[0024] Various embodiments of the present disclosure provide an efficient AutoML algorithm for lifelong learning. In some embodiments, this efficient AutoML algorithm is referred to as a Regularize, Expand and Compress (REC). In these embodiments, REC involves first searching a best new neural network architecture for the given tasks in a continuous learning mode. Tasks may include image classification, image segmentation [segmentation], object detection [object description] and/or many other computer vision tasks. The best neural network architecture can solve multiple different tasks simultaneously [a first output outputting the segmentation of the image and a second output outputting the object description; ie solving (providing outputs for) image segmentation and object detection], without catastrophic forgetting of old tasks' information, even when there is no access to old tasks' training data.”)
However, Zhang does not explicitly teach providing a directed graph, the directed graph including an input node, an output node, and a plurality of further nodes, the input node and the output node being connected via the further nodes using directed edges, the input node, output node, and further nodes representing data and the edges representing operations, which convert a first node of each respective edge into a further node connected to the respective edge, each respective edge of the edges being assigned a probability which characterizes with which probability the respective edge is selected; selecting a path through the directed graph, a subset of nodes being determined from the plurality of further nodes, all of which satisfy a predefined property with respect to a data resolution, at least one additional node being selected from the subset, which serves as the second output, the path through the directed graph from the input node along the edges via the additional node up to the output node being selected as a function of the probability assigned to the edges; creating the machine learning system as a function of the selected path and training the created machine learning system, adapted parameters of the trained machine learning system being stored in the corresponding edges of the directed graph and the probabilities of the edges of the path being adapted; multiple repeating of the selecting a path step and the creating and training a machine learning system step; and creating the machine learning system as a function of the directed graph; wherein the probabilities of the edges are set initially to one value, so that all paths through the directed graph are selected with equal probability.
Li teaches providing a directed graph, the directed graph including an input node, an output node, and a plurality of further nodes, the input node and the output node being connected via the further nodes using directed edges, the input node, output node, and further nodes representing data and the edges representing operations, which convert a first node of each respective edge into a further node connected to the respective edge, each respective edge of the edges being assigned a probability which characterizes with which probability the respective edge is selected;
(Li, Section 3, “Our algorithm is designed for an arbitrary search space with a DAG representation [providing a directed graph, the directed graph including an input node, an output node, and a plurality of further nodes, the input node and the output node being connected via the further nodes using directed edges; see figure 2], and in our experiments in Section 4, we use the same search spaces as that considered by DARTS [34] for the standard CIFAR-10 and PTB NAS benchmarks…
1. For each node in the DAG, determine what decisions must be made. In the case of the PTB search space, we need to choose a node as input and a corresponding operation to apply to generate the output of the node.
2. For each decision, identify the possible choices for the given node. In the case of the PTB search space, if we number the nodes from 1 to N, node i can take the outputs of nodes 0 to node i − 1 as input (the initial input to the cell is index 0 and is also a possible input). Additionally, we can choose an operation from {tanh, relu, sigmoid, and identity} to apply to the output of node i [the input node, output node, and further nodes representing data and the edges representing operations, which convert a first node of each respective edge into a further node connected to the respective edge].
3. Finally, moving from node to node, we sample uniformly from the set of possible choices for each decision that needs to be made [each respective edge of the edges being assigned a probability which characterizes with which probability the respective edge is selected; wherein sampling uniformly is randomly sampling from the list of options such that each option (ie directed edge from node i to subsequent nodes) has the same probability].”)
[image: media_image1.png]
Li teaches selecting a path through the directed graph, a subset of nodes being determined from the plurality of further nodes, all of which satisfy a predefined property with respect to a data resolution, at least one additional node being selected from the subset, which serves as the second output, the path through the directed graph from the input node along the edges via the additional node up to the output node being selected as a function of the probability assigned to the edges;
(Li, Figure 2: Recurrent Cell on PTB Benchmark. The best architecture found by random search with weight-sharing in Section A.3 is depicted [selecting a path through the directed graph, a subset of nodes being determined from the plurality of further nodes, all of which satisfy a predefined property with respect to a data resolution; wherein the completion of the path from input node to the output node is interpreted to be the “predefined property with respect to a data resolution” as resolution is interpreted to mean completion and all of the subset of nodes in the path (all of which) satisfy said predefined property ie a completed path]. Each numbered square is a node of the DAG and each edge represents the flow of data from one node to another after applying the indicated operation along the edge. Nodes with multiple incoming edges (i.e., node 0 and output node h_{t}) [at least one additional node being selected from the subset, which serves as the second output] concatenate the inputs to form the output of the node [the path through the directed graph from the input node along the edges via the additional node up to the output node being selected as a function of the probability assigned to the edges; wherein the path is selected from uniform sampling ie a function of probabilities assigned to the edges]”).
Li teaches creating the machine learning system as a function of the selected path and training the created machine learning system, adapted parameters of the trained machine learning system being stored in the corresponding edges of the directed graph and the probabilities of the edges of the path being adapted;
multiple repeating of the selecting a path step and the creating and training a machine learning system step; and creating the machine learning system as a function of the directed graph;
(Li, Section 3, “In order to combine random search with weight-sharing, we simply use randomly sampled architectures to train the shared weights. Shared weights are updated by selecting a single architecture for a given minibatch and updating the shared weights by back-propagating through the network with only the edges and operations as indicated by the architecture activated [creating a machine learning system ie the network provided after training as a function of the selected path ie architecture activated and training the created machine learning system ie updating the shared weights by backpropagation, adapted parameters of the trained machine learning system being stored in the corresponding edges of the directed graph wherein since each edge is an operation, the weights of the particular operation are updated per the shared weights and the probabilities of the edges of the path being adapted wherein since the architecture is selected, the probability of the edges of the path is 100% and thus adapted to update the shared weights]. Hence, the number of architectures used to update the shared weights is equivalent to the total number of minibatch training iterations [multiple repeating of the selecting a path step and the creating and training a machine learning system step and creating the machine learning system as a function of the directed graph].”)
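The weight-sharing procedure characterized above can be sketched as follows (a hypothetical simplification, not Li's actual implementation; the dictionary of "weights" and the counting `step` function are illustrative stand-ins for a real network and gradient step):

```python
import random

def train_with_weight_sharing(shared, architectures, num_minibatches, step):
    # Per the quoted passage: one architecture is sampled uniformly per
    # minibatch, and only that architecture's active edges and operations
    # receive an update of the shared weights.
    for _ in range(num_minibatches):
        arch = random.choice(architectures)  # uniform path sampling
        step(shared, arch)                   # stand-in for backprop on the active subgraph
    return shared

# Toy demonstration: count how often each "operation" is updated.
shared = {"op_a": 0, "op_b": 0}
def step(weights, arch):
    weights[arch] += 1  # placeholder for a gradient step
trained = train_with_weight_sharing(shared, ["op_a", "op_b"], 1000, step)
```

As in the quoted passage, the number of sampled architectures equals the total number of minibatch training iterations.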
Li teaches wherein the probabilities of the edges are set initially to one value, so that all paths through the directed graph are selected with equal probability.
(Li, Section 3, 2. For each decision, identify the possible choices for the given node. In the case of the PTB search space, if we number the nodes from 1 to N, node i can take the outputs of nodes 0 to node i − 1 as input (the initial input to the cell is index 0 and is also a possible input). Additionally, we can choose an operation from {tanh, relu, sigmoid, and identity} to apply to the output of node i.
3. Finally, moving from node to node, we sample uniformly from the set of possible choices for each decision that needs to be made [wherein the probabilities of the edges are set initially to one value, so that all paths through the directed graph are selected with equal probability; wherein sampling uniformly is randomly sampling from the list of options such that each option has the same probability].”)
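As an arithmetic illustration of the equal-probability interpretation (a hypothetical layered graph, not taken from Li): when every decision point offers the same number of choices, uniform sampling gives every complete path the same probability, namely the product of one over the number of choices at each decision.

```python
# Hypothetical layered DAG: three decision points with two choices each.
choices_per_decision = [2, 2, 2]

path_prob = 1.0
num_paths = 1
for k in choices_per_decision:
    path_prob *= 1.0 / k  # uniform sampling at each decision
    num_paths *= k

# Each of the 8 complete paths has probability 1/8, and the path
# probabilities sum to 1 over all paths.
```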
However, Zhang and Li do not explicitly teach wherein during training of the machine learning system, a cost function is optimized, the cost function including one first function, which evaluates an efficiency of the machine learning system with respect to its outputs, and includes one second function, which estimates a latency and/or a computer resource consumption of the machine learning system as a function of a length of the path and of the operations of the edges.
Veniat teaches wherein during training of the machine learning system, a cost function is optimized, the cost function including one first function, which evaluates an efficiency of the machine learning system with respect to its outputs, and includes one second function, which estimates a latency and/or a computer resource consumption of the machine learning system as a function of a length of the path and of the operations of the edges
(Veniat, Supplemental Material Stochastic costs in the REINFORCE algorithm, “Distributed computation cost Taking the real-life example of a network which will, once optimized, have to run on a given computing infrastructure, the distributed computation cost is a measure of how ”parallelizable” an architecture is. This cost function takes the following three elements as inputs (i)A network architecture (represented as a graph for instance) [one second function, which estimates a latency and/or a computer resource consumption of the machine learning system as a function of a length of the path and of the operations of the edges; see fig. 8], (ii)An allocation algorithm and (iii) a maximum number of concurrent possible operations. The cost function then returns the number of computation cycles required to run the architecture given the allocation strategy [a cost function is optimized, the cost function including one first function, which evaluates an efficiency of the machine learning system with respect to its outputs].
[image: media_image2.png]
”)
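The two-term cost structure mapped above can be sketched as follows (a hypothetical simplification, not Veniat's actual implementation; the operation costs and the weighting factor `lam` are illustrative assumptions):

```python
def total_cost(task_loss, path_ops, op_cost, lam=0.1):
    # First term: evaluates the efficiency of the system with respect to its
    # outputs (here, a task loss). Second term: estimates resource consumption
    # as a function of the length of the path and the operations on its edges.
    resource = sum(op_cost[op] for op in path_ops)
    return task_loss + lam * resource

op_cost = {"conv3x3": 9.0, "conv1x1": 1.0, "identity": 0.0}  # illustrative costs
c = total_cost(0.5, ["conv3x3", "identity", "conv1x1"], op_cost)
# 0.5 + 0.1 * (9.0 + 0.0 + 1.0) = 1.5
```

Longer paths and costlier operations increase the second term, which is the sense in which the cost is a function of path length and edge operations.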
Zhang and Li are both considered to be analogous to the claimed invention because they are in the same field of neural architecture search. Zhang is further reasonably pertinent to the problem the inventor faced (multi-task learning). Therefore, it would have been obvious to someone of ordinary skill in the art before the effective filing date of the claimed invention to have modified Zhang to incorporate the teachings of Li in order to provide a novel random search with weight-sharing algorithm that outperforms random search with early-stopping (Li, Abstract, “Neural architecture search (NAS) is a promising research direction that has the potential to replace expert-designed networks with learned, task-specific architectures. In this work, in order to help ground the empirical results in this field, we propose new NAS baselines that build off the following observations: (i) NAS is a specialized hyperparameter optimization problem; and (ii) random search is a competitive baseline for hyperparameter optimization. Leveraging these observations, we evaluate both random search with early-stopping and a novel random search with weight-sharing algorithm on two standard NAS benchmarks—PTB and CIFAR-10. Our results show that random search with early-stopping is a competitive NAS baseline, e.g., it performs at least as well as ENAS [41], a leading NAS method, on both benchmarks. Additionally, random search with weight-sharing outperforms random search with early-stopping, achieving a state-of-the-art NAS result on PTB and a highly competitive result on CIFAR-10. Finally, we explore the existing reproducibility issues of published NAS results. We note the lack of source material needed to exactly reproduce these results, and further discuss the robustness of published results given the various sources of variability in NAS experimental setups. 
Relatedly, we provide all information (code, random seeds, documentation) needed to exactly reproduce our results, and report our random search with weight-sharing results for each benchmark on multiple runs.”)
Veniat is considered to be analogous to the claimed invention because it is in the same field of neural architecture search and is further reasonably pertinent to a problem the inventor faced (making efficient use of computational resources). Therefore, it would have been obvious to someone of ordinary skill in the art before the effective filing date of the claimed invention to have modified Zhang and Li to incorporate the teachings of Veniat in order to provide a real-world computation cost of the network and a means for optimizing the network per the obtained cost (Veniat, Supplemental Material Stochastic costs in the REINFORCE algorithm, “Distributed computation cost Taking the real-life example of a network which will, once optimized, have to run on a given computing infrastructure, the distributed computation cost is a measure of how ”parallelizable” an architecture is.”)
In regards to claim 3,
Zhang and Li and Veniat teach The method as recited in claim 1,
Li teaches wherein the nodes of the subset, which all satisfy a predefined property with respect to a data resolution, are each also assigned a probability, the probabilities of the nodes of the subset being normalized.
Examiner interprets “normalized” to mean “that a drawing of the respective elements is equally probable, i.e., initially there is no preference for certain NOIs and/or edges and/or paths present” in light of the specification of the instant application (specification, pg. 8 line 20 - pg. 9 line 4).
(Li, Section 3, 2. For each decision, identify the possible choices for the given node. In the case of the PTB search space, if we number the nodes from 1 to N, node i can take the outputs of nodes 0 to node i − 1 as input (the initial input to the cell is index 0 and is also a possible input). Additionally, we can choose an operation from {tanh, relu, sigmoid, and identity} to apply to the output of node i.
3. Finally, moving from node to node, we sample uniformly from the set of possible choices for each decision that needs to be made [wherein the nodes of the subset, which all satisfy a predefined property with respect to a data resolution, are each also assigned a probability, the probabilities of the nodes of the subset being normalized; wherein uniform sampling is used to obtain the nodes of the subset (completed path which satisfy the predefined property)].”)
In regards to claim 5,
Zhang and Li and Veniat teach The method as recited in claim 3,
Li teaches wherein the probabilities of the nodes of the subset are initially set to a probability that all nodes of the subset are initially selected with equal probability.
(Li, Section 3, 2. For each decision, identify the possible choices for the given node. In the case of the PTB search space, if we number the nodes from 1 to N, node i can take the outputs of nodes 0 to node i − 1 as input (the initial input to the cell is index 0 and is also a possible input). Additionally, we can choose an operation from {tanh, relu, sigmoid, and identity} to apply to the output of node i.
3. Finally, moving from node to node, we sample uniformly from the set of possible choices for each decision that needs to be made [wherein the probabilities of the nodes of the subset are initially set to a probability that all nodes of the subset are initially selected with equal probability; wherein uniform sampling is used to set the probabilities of the nodes equally].”)
In regards to claim 6,
Zhang and Li and Veniat teach The method as recited in claim 1,
Zhang teaches wherein when selecting the path, at least two additional nodes are selected, a path through the directed graph including at least two paths, each of which extends via one of the additional nodes to the output node, and the two paths from the input node to the additional nodes being created separately from one another starting at the additional nodes up to the input node.
(Zhang, “[0077] The system then adaptively trains a network architecture of the machine learning model to generate an adapted machine learning model based on incorporating inherent correlations between the new task and the existing task (step 710). For example, in various embodiments, in step 710, the system may generate and identify an adapted network architecture based on MWC as discussed above. In using MWC, the system may incorporate inherent correlations between the existing task and the new task and identify the added layer as a task-specific layer for the new task. Also, for example, the system may train the ML model to perform the new task using training data for the new task without access to the training data for the old task.
[0078] In some embodiments, to adapt the network architecture in step 710 [wherein when selecting the path], the system may expand the network architecture for the ML model to perform the new task using AutoML, for example, by training child network architectures using wider and deeper operators as discussed with regard to FIG. 5 above. The expanded network architecture may include adding a layer to the network architecture and expanding one or more existing layers of the network architecture [at least two additional nodes are selected, a path through the directed graph including at least two paths, each of which extends via one of the additional nodes to the output node, and the two paths from the input node to the additional nodes being created separately from one another starting at the additional nodes up to the input node; wherein Zhang provides deeper and wider operators to expand the network architecture (see fig. 5)].”)
[image: media_image3.png]
Zhang, Li, and Veniat are all considered to be analogous to the claimed invention because they are in the same field of neural architecture search. Zhang is further reasonably pertinent to the problem the inventor faced (multi-task learning). Therefore, it would have been obvious to someone of ordinary skill in the art before the effective filing date of the claimed invention to have modified Li and Veniat to incorporate the teachings of Zhang in order to provide a mechanism for multi-task based lifelong learning to improve the network for better performance adaptive to new tasks (Zhang, “[0004] In many real-world applications, batches of data arrive periodically (e.g., daily, weekly, or monthly) with the data distribution changing over time. This presents an opportunity (or demand) for lifelong learning or continual learning and is an important issue in improving artificial intelligence. The primary goal of lifelong learning is to learn consecutive tasks without forgetting the knowledge learned from previously trained tasks and leverage the previous knowledge to obtain better performance or faster convergence on the newly coming task. One simple way is to finetune the model for every new task. However, such retraining typically degenerates the model performance on both new tasks and the old ones. If the new tasks are largely different from the old ones, it might not be possible to learn the optimal model for the new tasks. Meanwhile, the retrained representations may adversely affect the old tasks, causing them to drift from their optimal solution. This can cause “catastrophic forgetting”—a phenomenon where training a model to perform new tasks interferes the previously learned old knowledge. This leads to a performance degradation or even overwriting of the old knowledge by the new knowledge. Another issue for lifelong learning is resource consumption. 
A model that is continually trained may increase dramatically in terms of consumed resources (e.g., model size), which may be disadvantageous in applications where resources are limited, for example, in mobile device or mobile computing applications.”)
Claims 8 and 9 are rejected on the same rationale under 35 U.S.C. 103 as claim 1.
Claim(s) 4 is rejected under 35 U.S.C. 103 as being unpatentable over Zhang in view of Li and Veniat in further view of Bram28 (2018, December 12th). Re: What is the probability of passing through a node in a directed graph [Discussion post]. {Link: https://math.stackexchange.com/questions/3036994/what-is-the-probability-of-passing-through-a-node-in-a-directed-graph} (“Bram28”)
In regards to claim 4,
Zhang and Li and Veniat teach The method as recited in claim 3,
Bram28 teaches wherein the probabilities of the nodes of the subset are initially set to a probability that a first number of paths is set by the respective node of the subset divided by a total number of paths through the directed graph.
Examiner’s note: Since the algorithm of Li sets each node to the probability of 1/[number of available paths from that node] ie uniform sampling on each node, Li must teach the probabilities of the nodes of the subset are initially set to a probability that a first number of paths is set by the respective node of the subset divided by a total number of paths through the directed graph; however, for clarity, Examiner provides Bram to teach the probability of a respective node in view of the total number of paths.
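The notion of a number of available paths from a node, referenced in the note above, can be illustrated with a minimal path-counting sketch (the DAG and its successor lists below are hypothetical and illustrative only, not drawn from any cited reference):

```python
from functools import lru_cache

# Hypothetical DAG as successor lists: paths in -> out are
# in-a-out, in-b-a-out, and in-b-out.
succ = {"in": ["a", "b"], "a": ["out"], "b": ["a", "out"], "out": []}

def count_paths(graph, src, dst):
    # Counts distinct directed paths from src to dst by summing the path
    # counts of each successor (memoized dynamic programming).
    @lru_cache(maxsize=None)
    def n(u):
        if u == dst:
            return 1
        return sum(n(v) for v in graph[u])
    return n(src)
```

For example, `count_paths(succ, "in", "out")` counts the three paths listed in the comment.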
(Bram28, “OK, so then just compute the probability of getting to a node by computing the probability of getting to any of its predecessors, and multiplying that by the probability of following the edge from that predecessor to the node in question. The image below shows the results (green means the probability of taking the edge, while red means the probability of getting to the node [the probabilities of the nodes of the subset are initially set to a probability that a first number of paths is set by the respective node of the subset divided by a total number of paths through the directed graph; see red]):
[image: media_image4.png]
Bram28 further provides an exemplary calculation:
Just as an example: the probability of going through node 7 is the probability of going through either of nodes 4, 5, or 6, respectively multiplied by the probability of taking the edge from that node to node 7. Thus:
[image: media_image5.png]
Also, just for a sanity check, let's make sure the probability of getting to node
[image: media_image6.png]
”)
Bram28 is considered to be analogous to the claimed invention because it is in the same field of probabilities. Therefore, it would have been obvious to someone of ordinary skill in the art before the effective filing date of the claimed invention to have modified Zhang and Li to incorporate the teachings of Bram28 in order to provide clarity to the probabilities of each respective node having equal likelihood of going to any available paths in view of the total number of paths (John Slaine, first reply under Bram28 response,
[image: media_image7.png]
”)
Claim(s) 2 is rejected under 35 U.S.C. 103 as being unpatentable over Zhang in view of Li and Veniat in further view of Su, Xiu, et al. "Prioritized architecture sampling with monto-carlo tree search." (“Su”)
In regards to claim 2,
Zhang and Li and Veniat teach The method as recited in claim 1,
Su teaches wherein for each respective node of the subset, a total number of first subpaths from the respective node of the subset up to the input node and a total number of second subpaths from the respective node of the subset up to the output node are counted, the probabilities of those edges contained in the first subpaths are each initially set to a number of possible paths which connect the input node to the respective node of the subset and extend over those edges contained in the first subpaths, divided by the total number of the first subpaths, and the probabilities of those edges contained in the second subpaths are each initially set to a number of possible paths which connect the output node to the respective node of the subset and extend over those edges contained in the second subpaths, divided by the total number of the second subpaths.
(Su, Section 3.1, “However, we argue that in a chain-structured network, the selection of operation at each layer should depend on operations in the previous layers.
To capture the dependency among layers and leverage the limited combinations of operations for better understanding of the search space, we replace P (o(l)) in Eq.(1) with a conditional distribution for each 2 ≤ l ≤ L. Therefore, we reformulate Eq. (1) as follows:
[Equation (2): media_image8.png]
where P (o(l)|o(1), … , o(l−1)) is the conditional probability distribution of the operation selection in the layer l conditioned on its previous layers 1 to l − 1. Note that l = 1 has no previous layer, so P (o(1)) is still independent.
Inspired by Eq.(2), we find this conditional probability distribution [the probabilities of those edges contained in the first subpaths ie conditional probabilities of the ancestor nodes are each initially set to a number of possible paths which connect the input node to the respective node of the subset and extend over those edges contained in the first subpaths, divided by the total number of the first subpaths, and the probabilities of those edges contained in the second subpaths ie conditional probabilities of subsequent nodes from the respective node to the output node are each initially set to a number of possible paths which connect the output node to the respective node of the subset and extend over those edges contained in the second subpaths, divided by the total number of the second subpaths] of search space can be naturally modeled into a tree-based structure; the MCTS is targeting this structure for a better exploration-exploitation trade-off. As a result, we propose to model the search space with a MCT T. In MCT, each node v(l)i∈T [for each respective node of the subset] corresponds to selecting an operation o(l)i∈O for the layer l under the condition of its ancestor nodes [a total number of first subpaths from the respective node of the subset up to the input node ie ancestor nodes], so the architecture representation α={O(l)}l∈{1,…,L} can also be uniquely identified in the MCT [a total number of second subpaths from the respective node of the subset up to the output node are counted]. As Figure 2 shows, the architectures are independently represented by paths in the MCT, and different choices of operations lead to different child trees; thus, the dependencies of all the operation selections can be naturally formed.”)
Su is considered analogous to the claimed invention because both are in the same field of neural architecture search with a particular focus on the probabilities of paths. Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified Zhang, Li, and Veniat to incorporate the teachings of Su in order to provide a method that considers previous layers by incorporating a Monte Carlo tree search to capture the dependency among layers, as doing so provides the benefit of improved search efficiency and performance (Su, Abstract, “One-shot neural architecture search (NAS) methods significantly reduce the search cost by considering the whole search space as one network, which only needs to be trained once. However, current methods select each operation independently without considering previous layers. Besides, the historical information obtained with huge computation cost is usually used only once and then discarded. In this paper, we introduce a sampling strategy based on Monte Carlo tree search (MCTS) with the search space modeled as a Monte Carlo tree (MCT), which captures the dependency among layers. Furthermore, intermediate results are stored in the MCT for future decisions and a better exploration-exploitation balance. Concretely, MCT is updated using the training loss as a reward to the architecture performance; for accurately evaluating the numerous nodes, we propose node communication and hierarchical node selection methods in the training and search stages, respectively, which make better uses of the operation rewards and hierarchical information. Moreover, for a fair comparison of different NAS methods, we construct an open-source NAS benchmark of a macro search space evaluated on CIFAR-10, namely NAS-Bench-Macro. Extensive experiments on NAS-Bench-Macro and ImageNet demonstrate that our method significantly improves search efficiency and performance.
For example, by only searching 20 architectures, our obtained architecture achieves 78.0% top-1 accuracy with 442M FLOPs on ImageNet.”)
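To illustrate the examiner's interpretation above, the following minimal sketch (not taken from Su or the claims; the layered search space and operation names are hypothetical) shows that initializing each edge probability as the number of input-to-output paths traversing that edge, divided by the total number of paths, yields equal probabilities for the options at each decision, consistent with uniform sampling at each node:

```python
# Hypothetical illustration: a two-layer search space where an
# architecture is one path choosing one operation per layer.
from itertools import product

layers = [["op_a", "op_b"], ["op_c", "op_d", "op_e"]]

# Enumerate every complete path through the layered DAG.
paths = list(product(*layers))
total = len(paths)  # 2 * 3 = 6 paths

# Initialize each (layer, option) edge probability as the count of
# complete paths using that edge, divided by the total path count.
edge_prob = {}
for l, options in enumerate(layers):
    for op in options:
        uses = sum(1 for p in paths if p[l] == op)
        edge_prob[(l, op)] = uses / total

# Within each layer, the options are equiprobable and sum to 1,
# matching uniform sampling of the possible choices at each node.
print(edge_prob[(0, "op_a")])  # 0.5
print(edge_prob[(1, "op_c")])  # about 0.333
```

The equal per-layer values reflect that path counting through a fully connected layered space and uniform per-decision sampling assign the same probability to each edge leaving a node.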
Conclusion
The prior art made of record and not relied upon is considered pertinent to applicant's disclosure.
US Pub. No. US20200265315A1: Zoph et al. teaches Neural architecture search
US Pub. No. US20210142166A1: Chu et al. teaches Hypernetwork training method and device, electronic device and storage medium
US Pub. No. US20190370648A1: Zoph et al. teaches Neural architecture search for dense image prediction tasks
NPL: Guo, Zichao, et al. "Single path one-shot neural architecture search with uniform sampling." Computer Vision – ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XVI. Springer International Publishing, 2020.
NPL: Casale, Francesco Paolo, Jonathan Gordon, and Nicolo Fusi. "Probabilistic neural architecture search." arXiv preprint arXiv:1902.05116 (2019).
NPL: Cenciarelli, Pietro, Daniele Gorla, and Ivano Salvo. "A Polynomial-time Algorithm for Detecting the Possibility of Braess Paradox in Directed Graphs." arXiv preprint arXiv:1610.09320 (2016).
US Pub. No. US20080052692A1: LinkedIn teaches System, Method and Computer Program Product for Checking a Software Entity
Applicant's amendment necessitated the new ground(s) of rejection presented in this Office action. Accordingly, THIS ACTION IS MADE FINAL. See MPEP § 706.07(a). Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a).
A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action. In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any nonprovisional extension fee (37 CFR 1.17(a)) pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action. In no event, however, will the statutory period for reply expire later than SIX MONTHS from the mailing date of this final action.
Any inquiry concerning this communication or earlier communications from the examiner should be directed to JASMINE THAI whose telephone number is (703)756-5904. The examiner can normally be reached M-F 8-4.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Michael Huntley can be reached at (303) 297-4307. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.
/J.T.T./Examiner, Art Unit 2129
/MICHAEL J HUNTLEY/Supervisory Patent Examiner, Art Unit 2129