Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Claim Rejections - 35 USC § 101
35 U.S.C. 101 reads as follows:
Whoever invents or discovers any new and useful process, machine, manufacture, or composition of matter, or any new and useful improvement thereof, may obtain a patent therefor, subject to the conditions and requirements of this title.
Claims 1-14 are rejected under 35 U.S.C. 101 because the claimed invention is directed to non-statutory subject matter.
With regard to Claims 1 and 10,
These claims recite a “system” that is implemented on a computer but do not positively recite the computer or any other structural component. The claimed “system” is therefore software per se, which is non-statutory subject matter. The examiner recommends positively reciting a structural component, such as a processor.
Claims 2-9 and 11-14 depend from claims 1 and 10, respectively, and fail to remedy the deficiencies of the independent claims.
Claim Rejections - 35 USC § 102
The following is a quotation of the appropriate paragraphs of 35 U.S.C. 102 that form the basis for the rejections under this section made in this Office action:
A person shall be entitled to a patent unless –
(a)(1) the claimed invention was patented, described in a printed publication, or in public use, on sale, or otherwise available to the public before the effective filing date of the claimed invention.
Claim(s) 1-3, 7-11, 13, and 15-17 is/are rejected under 35 U.S.C. 102(a)(1) as being anticipated by Gu et al. (NPL from IDS: Search for Better Students to Learn Distilled Knowledge, hereinafter “Gu”).
Regarding claim 1, Gu teaches a machine learning system implemented by one or more computers, the system having access to a base neural network and being configured to determine a simplified neural network by iteratively performing a training process comprising:
forming sample data by sampling an architecture of a current candidate neural network (Gu, Section 3 Paragraph 1 – “The search space is the topology graph of the teacher model. Each channel in the model is taken as a node, and weights connecting nodes as edges. By removing nodes (channels) and all edges directly connected to those nodes, we can obtain a subgraph, which corresponds to smaller neural network architecture” – teaches forming sample data by sampling an architecture of a current candidate neural network (search space is topology graph of teacher model, which is the current candidate neural network));
selecting, in dependence on the sample data, an architecture for a second candidate neural network (Gu, Section 3 Paragraph 1 – “The search space is the topology graph of the teacher model. Each channel in the model is taken as a node, and weights connecting nodes as edges. By removing nodes (channels) and all edges directly connected to those nodes, we can obtain a subgraph, which corresponds to smaller neural network architecture” – teaches selecting, in dependence on the sample data, an architecture for a second candidate neural network (search space is topology graph of teacher model, by removing nodes and edges a subgraph, or architecture, can be obtained for a second candidate neural network, or student));
forming a trained candidate neural network by training the second candidate neural network, wherein the training of the second candidate neural network comprises applying feedback to the second candidate neural network in dependence on a comparison of behaviours of the second candidate neural network and the base neural network (Gu, Section 3.1 Paragraph 2 – “In knowledge distillation process, the logits of teacher network a_t can offer more information to train the student network. One way to leverage such information is to match the softened outputs of student softmax(a_s/τ) and teacher f_t(x_i) = softmax(a_t/τ) via a KL-divergence loss” – teaches forming a trained candidate neural network by training the second candidate neural network (trains the student network), wherein the training of the second candidate neural network comprises applying feedback to the second candidate neural network based on a comparison of behaviors of the second candidate neural network and the base neural network (trains student network by matching softened outputs, or behaviors, of student and teacher via a KL-divergence loss, thus training the second candidate neural network in dependence on a comparison of behaviors of the second candidate and base neural networks)); and
adopting the trained candidate neural network as a current candidate neural network for a subsequent iteration of the training process (Gu, Section 3.2 Paragraph 1 – “In the last subsection, we introduce our distillation-ware loss function. The weights w and the scaling factors g are updated to minimize the loss function. The loss function is differentiable to the weights. The weights w can be updated by Stochastic Gradient Descent (SGD) with momentum or its variants.” and Fig. 1 description – “A new model is constructed on the teacher model by multiplying scaling factors. After an optimization process, the channels with zero scaling factors are removed. The remained small architecture is the selected student architecture.” – teaches adopting the trained candidate neural network as a current candidate neural network for a subsequent iteration of the training process (loss functions are minimized by updating weights w and scaling factors, new model is built on teacher model and remaining small architecture is selected as student architecture, thus adopting trained candidate neural network as current candidate neural network for subsequent iterations of training)).
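As an editorial illustration of the distillation mechanism Gu is cited for (matching the softened outputs of student and teacher via a KL-divergence loss), a minimal NumPy sketch follows; the function names and temperature value are illustrative assumptions, not part of the record:

```python
import numpy as np

def softened(logits, tau):
    """Temperature-softened softmax, as in softmax(a/τ)."""
    z = np.asarray(logits, dtype=float) / tau
    z -= z.max()  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def kd_loss(student_logits, teacher_logits, tau=2.0):
    """KL divergence between softened teacher and student outputs."""
    p_t = softened(teacher_logits, tau)
    p_s = softened(student_logits, tau)
    return float(np.sum(p_t * (np.log(p_t) - np.log(p_s))))
```

Minimizing this loss pushes the student's softened outputs toward the teacher's, which is the behaviour-comparison feedback recited in the claim.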
Claim 15 incorporates substantively all of the limitations of claim 1 in a computer-implemented method and is rejected on the same grounds as above.
Regarding claim 2, Gu teaches a machine learning system as claimed in claim 1, comprising, after multiple iterations of the training process, outputting a current candidate neural network as the simplified neural network (Gu, Section 3.1 Paragraph 5 (Last Paragraph) – “At the end of the optimization, we remove all the channels with closed gated from the constructed model. The remaining small architecture is taken as the student architecture” – teaches outputting a current candidate neural network as the simplified network (at the end of optimization, the small architecture is taken as the student architecture)).
Claim 16 is similar to claim 2, hence similarly rejected.
Regarding claim 3, Gu teaches a machine learning system as claimed in claim 1, wherein the simplified neural network has a smaller capacity and/or is less computationally intensive to implement than the base neural network (Gu, Abstract – “The architecture of the small student is often chosen to be similar to their teacher’s, with fewer layers or fewer channels, or both”, Section 3 Paragraph 1 – “The search space is the topology graph of the teacher model. Each channel in the model is taken as a node, and weights connecting nodes as edges. By removing nodes (channels) and all edges directly connected to those nodes, we can obtain a subgraph, which corresponds to smaller neural network architecture” and in Section 3.1 Paragraph 5 (Last Paragraph) – “At the end of the optimization, we remove all the channels with closed gated from the constructed model. The remaining small architecture is taken as the student architecture” – teaches wherein the simplified neural network has a smaller capacity and/or is less computationally intensive to implement than the base neural networks (selected student network is smaller than teacher network, thus smaller in capacity and/or less computationally intensive to implement than the teacher network)).
Claim 17 is similar to claim 3, hence similarly rejected.
Regarding claim 7, Gu teaches a machine learning system as claimed in claim 1, wherein the sample data is formed by sampling the current candidate neural network according to a predetermined acquisition function (Gu, Section 3 Paragraph 1 – “The search space is the topology graph of the teacher model. Each channel in the model is taken as a node, and weights connecting nodes as edges”, Section 3 Paragraph 2 – “In our search space, channels are taken as individual units. Therefore, we apply a structured pruning method to get student architectures”, and in Section 3 Paragraph 3 – “More concretely, we specify a gate on each channel by multiplying the activation map of the channel by a scaling factor g. At the end of the optimization, the open gate (g ≠ 0) means the corresponding channel is important to the distillation process, while the closed gate (g = 0) means the corresponding channel can be removed safely. For a layer with K channels in a teacher neural network, the corresponding g is a K-element vector. The number of channels of the obtained student architecture in this layer is identified by the number of non-zero elements in the vector g. A simple demonstration is shown in Figure 1” – teaches wherein the sample data (topology graph is the search space) is formed by sampling the current candidate neural network (topology graph of teacher model) according to a predetermined acquisition function (apply structured pruning method to get student architectures, by specifying a gate, thus an acquisition function, on each channel)).
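To illustrate the gating mechanism quoted above (a scaling factor g per channel, with zero-valued gates marking prunable channels), a minimal sketch; the array shapes and names are illustrative, not taken from Gu:

```python
import numpy as np

def gate_channels(activations, g):
    """Multiply each channel's activation map by its scaling factor g_k.
    activations: K x H x W array; g: length-K vector of gates."""
    return activations * np.asarray(g, dtype=float)[:, None, None]

def surviving_channels(g, eps=1e-8):
    """Channels with non-zero gates are kept in the student architecture;
    channels with g_k = 0 are removed."""
    return np.flatnonzero(np.abs(np.asarray(g, dtype=float)) > eps)
```

As Gu states, the number of channels in the resulting student layer is the number of non-zero elements of g.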
Regarding claim 8, Gu teaches a machine learning system as claimed in claim 1, wherein the selecting an architecture for a second candidate neural network is performed by optimization over a stochastic graph of network architectures (Gu, Section 3 Paragraph 1 – “The search space is the topology graph of the teacher model. Each channel in the model is taken as a node, and weights connecting nodes as edges. By removing nodes (channels) and all edges directly connected to those nodes, we can obtain a subgraph, which corresponds to smaller neural network architecture.” and in Section 3 Paragraph 2 – “Given the teacher topology graph, there are three approaches to achieve a subgraph, namely, non-sturctured pruning, groups sparsity, and structured pruning” – teaches wherein the selecting an architecture for a second candidate neural network (selecting architecture for student model) is performed by optimization over a stochastic graph of network architectures (architecture for student model is selected by optimizing over topology graph of teacher model by structured pruning)).
Regarding claim 9, Gu teaches a machine learning system as claimed in claim 1, wherein the forming the trained candidate neural network comprises causing the second candidate neural network to perform a plurality of tasks, causing the base neural network to perform the plurality of tasks, and modifying the second candidate neural network in dependence on a variance in performance between the second candidate neural network and the base neural network in performing the tasks (Gu, Section 3.1 Paragraph 2 – “In knowledge distillation process, the logits of teacher network a_t can offer more information to train the student network. One way to leverage such information is to match the softened outputs of student softmax(a_s/τ) and teacher f_t(x_i) = softmax(a_t/τ) via a KL-divergence loss… The overall loss to train the student network is L_s = L_KD + λL_CE where the hyperparameter λ is often set to a very small value, the second term works by regularizing the training process.” and in Section 3.1 Paragraph 3 – “Given an input x_i, the softened output of a teacher model is f_t(x_i), and the softened output of the model constructed by adding gates is f_s(x_i, w, g), i.e., the constructed model in Figure 1. The weights and scaling factors therein are updated during the optimization. The loss function we propose is mathematically defined as follows. Eq. (3)” – teaches wherein the forming the trained candidate neural network comprises causing the second candidate neural network to perform a plurality of tasks, causing the base neural network to perform the plurality of tasks, and modifying the second candidate neural network in dependence on a variance in performance between the second candidate neural network and the base neural network (in Eq. (3) the teacher f_t and student f_s perform over the same task x_i, and the difference in the performance between the teacher and student on the same task is used to modify the student)).
Regarding claim 10, Gu teaches a machine learning system as claimed in claim 1, wherein the system has access to a trained neural network and is configured to determine the base neural network by iteratively performing a training process comprising:
forming sample data by sampling the architecture of a current candidate base neural network (Gu, Section 3 Paragraph 1 – “The search space is the topology graph of the teacher model. Each channel in the model is taken as a node, and weights connecting nodes as edges. By removing nodes (channels) and all edges directly connected to those nodes, we can obtain a subgraph, which corresponds to smaller neural network architecture” – teaches forming sample data by sampling an architecture of a current candidate neural network (search space is topology graph of teacher model, which is the current candidate base neural network));
selecting, in dependence on the sample data, an architecture for a second candidate base neural network (Gu, Section 3 Paragraph 1 – “The search space is the topology graph of the teacher model. Each channel in the model is taken as a node, and weights connecting nodes as edges. By removing nodes (channels) and all edges directly connected to those nodes, we can obtain a subgraph, which corresponds to smaller neural network architecture” – teaches selecting, in dependence on the sample data, an architecture for a second candidate neural network (search space is topology graph of teacher model, by removing nodes and edges a subgraph, or architecture, can be obtained for a second candidate neural network, or student));
forming a trained candidate base neural network by training the second candidate base neural network, wherein the training of the second candidate base neural network comprises applying feedback to the second candidate base neural network in dependence on a comparison of behaviours of the second candidate base neural network and the trained neural network (Gu, Section 3.1 Paragraph 2 – “In knowledge distillation process, the logits of teacher network a_t can offer more information to train the student network. One way to leverage such information is to match the softened outputs of student softmax(a_s/τ) and teacher f_t(x_i) = softmax(a_t/τ) via a KL-divergence loss” – teaches forming a trained candidate base neural network by training the second candidate base neural network (trains the student network), wherein the training of the second candidate neural network comprises applying feedback to the second candidate neural network based on a comparison of behaviors of the second candidate base neural network and the trained neural network (trains student network by matching softened outputs, or behaviors, of student and teacher via a KL-divergence loss, thus training the second candidate neural network in dependence on a comparison of behaviors of the second candidate and trained neural network)); and
adopting the trained candidate base neural network as a current candidate base neural network for a subsequent iteration of the training process (Gu, Section 3.2 Paragraph 1 – “In the last subsection, we introduce our distillation-ware loss function. The weights w and the scaling factors g are updated to minimize the loss function. The loss function is differentiable to the weights. The weights w can be updated by Stochastic Gradient Descent (SGD) with momentum or its variants.” and Fig. 1 description – “A new model is constructed on the teacher model by multiplying scaling factors. After an optimization process, the channels with zero scaling factors are removed. The remained small architecture is the selected student architecture.” – teaches adopting the trained candidate neural network as a current candidate neural network for a subsequent iteration of the training process (new model is built on teacher model and remaining small architecture is selected as student architecture, thus adopting trained candidate neural network as current candidate neural network for subsequent iterations of training by updating weights and scaling factors to minimize a loss function)); and
after multiple iterations of the training process, adopting a current candidate base neural network as the base neural network (Gu, Fig. 1 description – “A new model is constructed on the teacher model by multiplying scaling factors. After an optimization process, the channels with zero scaling factors are removed. The remained small architecture is the selected student architecture.” – teaches adopting a current candidate base neural network as the base neural network (remaining architecture that was built on teacher model is the selected student architecture)).
Regarding claim 11, Gu teaches a machine learning system as claimed in claim 10, wherein the base neural network has a smaller capacity and/or is less computationally intensive to implement than the trained neural network (Gu, Abstract – “The architecture of the small student is often chosen to be similar to their teacher’s, with fewer layers or fewer channels, or both”, Section 3 Paragraph 1 – “The search space is the topology graph of the teacher model. Each channel in the model is taken as a node, and weights connecting nodes as edges. By removing nodes (channels) and all edges directly connected to those nodes, we can obtain a subgraph, which corresponds to smaller neural network architecture” and in Section 3.1 Paragraph 5 (Last Paragraph) – “At the end of the optimization, we remove all the channels with closed gated from the constructed model. The remaining small architecture is taken as the student architecture” – teaches wherein the base neural network (selected student architecture) has a smaller capacity and/or is less computationally intensive to implement than the trained neural networks (selected student network is smaller than teacher network, thus smaller in capacity and/or less computationally intensive to implement than the teacher network)).
Regarding claim 13, Gu teaches a machine learning system as claimed in claim 1, the system being configured to install the simplified neural network for execution on a device having lower computational complexity than the one or more computers (Gu, Introduction Paragraph 1 – “The computationally expensive inferences prevent the deploy of deep neural networks in small devices with limited memory size or latency-critical applications such as smartphones and self-driving cars.” and in Paragraph 3 – “The students trained under distillation are closer in performance to their larger teacher. The lower computational cost and memory footprint of the powerful student make its deployment much easier” – teaches the system configured to install the simplified network for execution on a device having lower computational complexity than the one or more computers (student network has lower computational cost and memory footprint thus easier to deploy on small device with limited memory size or latency-critical applications)).
Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.
Claim(s) 4-6 and 18-20 is/are rejected under 35 U.S.C. 103 as being unpatentable over Gu in view of Shi et al. (NPL: Multi-objective Neural Architecture Search via Predictive Network Performance Optimization, hereinafter “Shi”).
Regarding claim 4, Gu teaches a machine learning system as claimed in claim 1.
Gu fails to explicitly teach wherein the selecting an architecture for the second candidate neural network is performed by Bayesian optimization.
However, analogous to the field of the claimed invention, Shi teaches:
wherein the selecting an architecture for the second candidate neural network is performed by Bayesian optimization (Shi, Section 3.4 Paragraph 2 – “The algorithm of our proposed BOGCN-NAS is illustrated in Algorithm 1. Given the search space A, we initialize trained architecture sets U containing architectures (Ai , Xi) with their performance ti = {f1i , . . . , fmi}…Based on tˆj and multi-objective formulation (Section 3.1), we can generate a estimated Pareto Front and sample estimated Pareto optimal models as set S and fully-train them to obtain the true objective values tj .” – teaches selecting an architecture for the second candidate neural network performed by multi-objective Bayesian optimization (selects architecture from architecture sets based on multi-objective function)).
Therefore, it would have been obvious to a person of ordinary skill in the art, before the effective filing date of the claimed invention, to incorporate the Bayesian-optimization-based architecture selection of Shi into the candidate architecture selection of Gu. The known search strategies for NAS include random search and Bayesian optimization (Gu, Introduction), and doing so would enable NAS algorithms to consider additional objectives such as the speed/accuracy trade-off (Shi, Introduction).
Claim 18 is similar to claim 4, hence similarly rejected.
Regarding claim 5, the combination of Gu and Shi teaches a machine learning system as claimed in claim 4, wherein the selecting an architecture for the second candidate neural network is performed by multi-objective Bayesian optimization (Shi, Section 3.4 Paragraph 2 – “The algorithm of our proposed BOGCN-NAS is illustrated in Algorithm 1. Given the search space A, we initialize trained architecture sets U containing architectures (Ai , Xi) with their performance ti = {f1i , . . . , fmi}…Based on tˆj and multi-objective formulation (Section 3.1), we can generate a estimated Pareto Front and sample estimated Pareto optimal models as set S and fully-train them to obtain the true objective values tj .” – teaches selecting an architecture for the second candidate neural network performed by multi-objective Bayesian optimization (selects architecture from architecture sets based on multi-objective function)).
Claim 19 is similar to claim 5, hence similarly rejected.
Regarding claim 6, the combination of Gu and Shi teaches a machine learning system as claimed in claim 4, wherein the selecting an architecture for the second candidate neural network is performed by Bayesian optimization having one or more objectives (Shi, Section 3.4 Paragraph 2 – “The algorithm of our proposed BOGCN-NAS is illustrated in Algorithm 1. Given the search space A, we initialize trained architecture sets U containing architectures (Ai , Xi) with their performance ti = {f1i , . . . , fmi}…Based on tˆj and multi-objective formulation (Section 3.1), we can generate a estimated Pareto Front and sample estimated Pareto optimal models as set S and fully-train them to obtain the true objective values tj .” – teaches selecting an architecture for the second candidate neural network performed by multi-objective Bayesian optimization (selects architecture from architecture sets based on multi-objective function)), wherein at least one of said objectives comprise (i) improved classification accuracy of the second candidate neural network and/or (ii) reduced computational intensiveness of the second candidate neural network (Shi, Section 3.1 Paragraph 1 – “We formulate NAS problem as a multi-objective optimization problem over the architecture search space A where objective functions can be accuracy, latency, number of parameters, etc. We aim to find architectures on the Pareto front of A. Specifically, when m = 1, it reduces to single-objective (usually accuracy) NAS” – teaches wherein at least one of said objectives comprises improved classification accuracy of the second candidate neural network (objective functions can be accuracy… single-objective function is usually accuracy) and/or reduced computational intensiveness of the second candidate neural network (objective functions can be latency or number of parameters, which would reduce the computational intensiveness of the candidate neural network)).
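The multi-objective formulation Shi is cited for selects architectures on a Pareto front over objectives such as accuracy and latency. As an editorial illustration, a minimal sketch of Pareto-front extraction (objectives are assumed to be minimized, and the candidate tuples are illustrative, not drawn from Shi):

```python
def pareto_front(candidates):
    """Return the candidates not dominated by any other candidate.
    Each candidate is a tuple of objective values, all to be minimized;
    b dominates a if b is <= a in every objective and b differs from a."""
    front = []
    for i, a in enumerate(candidates):
        dominated = any(
            b != a and all(bk <= ak for bk, ak in zip(b, a))
            for j, b in enumerate(candidates) if j != i
        )
        if not dominated:
            front.append(a)
    return front
```

Models on this front are those for which no other candidate is at least as good in every objective, matching Shi's goal of finding architectures on the Pareto front of the search space.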
Claim 20 is similar to claim 6, hence similarly rejected.
Claim(s) 12 is/are rejected under 35 U.S.C. 103 as being unpatentable over Gu in view of Mirzadeh et al. (NPL from IDS: Improved Knowledge Distillation via Teacher Assistant, hereinafter “Mirzadeh”).
Regarding claim 12, Gu teaches a machine learning system as claimed in claim 10.
Gu fails to explicitly teach wherein the base neural network is a teaching assistant network for facilitating formation of the simplified neural network.
However, analogous to the field of the claimed invention, Mirzadeh teaches:
wherein the base neural network is a teaching assistant network for facilitating formation of the simplified neural network (Mirzadeh, Page 3 Col. 2 Paragraph 4 – “The teacher assistant (TA) lies somewhere in between teacher and student in terms of size or capacity. First, the TA network is distilled from the teacher. Then, the TA plays the role of a teacher and trains the student via distillation. This strategy will alleviate factor 2 in the previous subsection by being closer to the student than the teacher. Therefore, the student is able to fit TA’s logit distribution more effectively than that of the teacher’s.” – teaches wherein the base neural network is a teaching assistant network for facilitating formation of the simplified network (teacher assistant lies between teacher and student, plays role of teacher and trains student via distillation, student is able to fit TA’s logit distribution more effectively)).
Therefore, it would have been obvious to a person of ordinary skill in the art, before the effective filing date of the claimed invention, to incorporate the teaching assistant networks of Mirzadeh to the base neural network of Gu in order to utilize teacher assistant networks for facilitating formation of the simplified network. Doing so would allow softer, less confident targets (Mirzadeh, Page 3 Col. 2 Paragraph 4) and fill the gap in size between teacher and student models (Mirzadeh, Introduction).
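To illustrate the teacher-assistant strategy Mirzadeh is cited for (an intermediate network between teacher and student in capacity, each network distilled from the previous, larger one), a minimal sketch; the evenly spaced capacity values are an illustrative assumption, not Mirzadeh's prescription:

```python
def assistant_path(teacher_size, student_size, n_assistants=1):
    """Plan a distillation chain teacher -> TA(s) -> student, with each
    teacher assistant's capacity lying between its neighbours'. Each
    network in the chain would be distilled from the one before it."""
    step = (teacher_size - student_size) / (n_assistants + 1)
    path = [teacher_size]
    for i in range(1, n_assistants + 1):
        path.append(round(teacher_size - i * step))
    path.append(student_size)
    return path
```

Because each assistant is closer in capacity to the network it teaches, the student can fit the assistant's logit distribution more effectively than the original teacher's, per Mirzadeh.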
Claim(s) 14 is/are rejected under 35 U.S.C. 103 as being unpatentable over Gu in view of Xu et al. (US Pub. No. 2022/0130142, hereinafter “Xu”).
Regarding claim 14, Gu teaches a machine learning system as claimed in claim 13, wherein the selecting an architecture for a second candidate neural network is performed by optimization over a stochastic graph of network architectures (Gu, Section 3 Paragraph 1 – “The search space is the topology graph of the teacher model. Each channel in the model is taken as a node, and weights connecting nodes as edges. By removing nodes (channels) and all edges directly connected to those nodes, we can obtain a subgraph, which corresponds to smaller neural network architecture.” and in Section 3 Paragraph 2 – “Given the teacher topology graph, there are three approaches to achieve a subgraph, namely, non-sturctured pruning, groups sparsity, and structured pruning” – teaches wherein the selecting an architecture for a second candidate neural network (selecting architecture for student model) is performed by optimization over a stochastic graph of network architectures (architecture for student model is selected by optimizing over topology graph of teacher model by structured pruning)).
Gu fails to explicitly teach the stochastic graph having been predetermined in dependence on one or more capabilities of the device.
However, analogous to the field of the claimed invention, Xu teaches:
the stochastic graph having been predetermined in dependence on one or more capabilities of the device (Xu, [0217] – “Specifically, the types and the quantity of operations included in the search space may be first determined based on the application requirement of the target neural network, and then the types and the quantity of operations included in the search space are adjusted based on the condition of the video random access memory resource of the device performing neural architecture search,” – teaches the types and operations having been predetermined in dependence on one or more capabilities of the device (operations first determined by application requirement of the target neural network, and search space of NAS is based on one or more capabilities of the device)).
Therefore, it would have been obvious to a person of ordinary skill in the art, before the effective filing date of the claimed invention, to incorporate the operations predetermined based on device capabilities of Xu to the stochastic graph of Gu in order to optimize over a stochastic graph determined based on device capabilities. Doing so would enable the deployment of deep neural networks in small devices with limited memory size or latency-critical applications such as smartphones and self-driving cars (Gu, Introduction) and adjust the architecture search space based on conditions of the device (Xu, [0217]).
Conclusion
The prior art made of record and not relied upon is considered pertinent to applicant's disclosure.
Kandasamy et al. (NPL: Neural Architecture Search with Bayesian Optimization and Optimal Transport) teaches using Bayesian optimization for neural architecture selection, including modifying neural architectures to satisfy an objective function such as improved accuracy.
Yang et al. (US Pub. No. 2021/0056378) teaches systems and method for neural network architecture search based on an optimization of a stochastic graph. The stochastic graph is determined based on device constraints.
Any inquiry concerning this communication or earlier communications from the examiner should be directed to LOUIS C NYE whose telephone number is 571-272-0636. The examiner can normally be reached Monday - Friday 9:00AM - 5:00PM.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, MATT ELL can be reached at 571-270-3264. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.
/LOUIS CHRISTOPHER NYE/Examiner, Art Unit 2141
/MATTHEW ELL/Supervisory Patent Examiner, Art Unit 2141