Detailed Action
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.
Claim(s) 1–4, 8–11, and 13–14 are rejected under 35 U.S.C. 103 as being unpatentable over Liu et al., "Darts: Differentiable architecture search," in view of Li et al., "Block-wisely supervised neural architecture search with knowledge distillation."
Regarding claim 1, Liu teaches: An information processing apparatus which executes a search of a network model
(Liu, page: 2, “We introduce a novel algorithm for differentiable network architecture search [executes a search of a network model] based on bilevel optimization, which is applicable to both convolutional and recurrent architectures.”)
including an architecture coefficient and a weight coefficient to determine the architecture of the network model, the information processing apparatus comprising:
(Liu, page: 2, “2 DIFFERENTIABLE ARCHITECTURE SEARCH We describe our search space in general form in Sect. 2.1, where the computation procedure for an architecture (or a cell in it) is represented as a directed acyclic graph. We then introduce a simple continuous relaxation scheme for our search space which leads to a differentiable learning objective for the joint optimization of the architecture and its weights (Sect. 2.2) [including an architecture coefficient and a weight coefficient to determine the architecture of the network model]. Finally, we propose an approximation technique to make the algorithm computationally feasible and efficient (Sect. 2.3).”)
at least one memory storing instructions; and at least one processor that, upon execution of the instructions, is configured to:
(Liu, page: 8, “3.3 RESULTS ANALYSIS The CIFAR-10 results for convolutional architectures are presented in Table 1. Notably, DARTS achieved comparable results with the state of the art (Zoph et al., 2018; Real et al., 2018) while using three orders of magnitude less computation resources (i.e. 1.5 or 4 GPU days vs 2000 GPU days for NASNet and 3150 GPU days for AmoebaNet) [at least one memory storing instructions; and at least one processor that, upon execution of the instructions].”)
execute first learning of an architecture coefficient with a weight coefficient fixed; execute second learning of a weight coefficient with an architecture coefficient fixed; and
([Execute first learning of an architecture coefficient with a weight coefficient fixed]: i.e., the algorithm looks at the validation set to see which architecture performs best on new data; it treats the network weights as temporarily fixed to see how the architecture itself influences performance. [Execute second learning of a weight coefficient with an architecture coefficient fixed]: i.e., the algorithm looks at the training set to minimize error; it treats the architecture coefficient as fixed, focusing only on making the existing connections more accurate.)
(Liu, Algorithm 1: DARTS – Differentiable Architecture Search)
execute control to advance the search by causing the first learning and the second learning to execute learning alternately on a network model having an architecture coefficient and a weight coefficient acquired at present time
(Liu, page: 4, “The iterative procedure is outlined in Alg. 1 [execute control to advance the search by causing the first learning and the second learning to execute learning alternately on a network model having an architecture coefficient and a weight coefficient acquired at present time] (i.e.: alternating bi-level optimization loop of DARTS, where weight (w) are updated while architecture (α) is fixed). While we are not currently aware of the convergence guarantees for our optimization algorithm, in practice it is able to reach a fixed point with a suitable choice of ξ 1 . We also note that when momentum is enabled for weight optimisation, the one-step unrolled learning objective in equation 6 is modified accordingly and all of our analysis still applies.”)
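For illustration only, the alternating procedure of Alg. 1 cited above can be sketched numerically. The following is a hypothetical Python sketch, not Liu's implementation: the real training and validation losses are replaced by toy quadratics so the gradients are analytic, with `alpha` standing in for the architecture coefficient and `w` for the weight coefficient.

```python
# Toy sketch of a DARTS-style alternating loop (cf. Liu, Alg. 1).
# Assumption: L_train(w, a) = (w - a)^2 and L_val(w, a) = (a - 0.5*w)^2
# are illustrative stand-ins for the real network losses.

def grad_w(w, alpha):
    # dL_train/dw with the architecture coefficient alpha held fixed
    return 2.0 * (w - alpha)

def grad_alpha(w, alpha):
    # dL_val/dalpha with the weight coefficient w held fixed
    return 2.0 * (alpha - 0.5 * w)

def darts_search(steps=300, lr=0.05):
    w, alpha = 0.0, 1.0
    for _ in range(steps):
        # First learning: update the architecture coefficient, weight fixed.
        alpha = alpha - lr * grad_alpha(w, alpha)
        # Second learning: update the weight coefficient, architecture fixed.
        w = w - lr * grad_w(w, alpha)
    return w, alpha

w, alpha = darts_search()
```

The two updates alternate exactly as in the mapping above: the architecture step treats `w` as fixed, and the weight step treats `alpha` as fixed; the iterates settle toward the joint fixed point.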
Liu does not teach:
using an output value output from a network model set as a teacher model which is configured based on an architecture coefficient and a weight coefficient acquired before the present time
Li teaches:
using an output value output from a network model set as a teacher model which is configured based on an architecture coefficient and a weight coefficient acquired before the present time
(Li, page: 1993, “Thanks to our block-wise search, we can train several cells with different channel numbers or layer numbers independently in each stage to ensure channel and layer variability without the interference of identity operation, As shown in Figure 2, in each training step, the teacher’s previous feature map [before the present time] is first fed to several cells (as suggested by the solid line) [using an output value output from a network model set as a teacher model], and one of the candidate operations of each layer in the cell is randomly chosen to form a path (as suggested by the dotted line). The weight of the supernet is optimized by minimizing the MSE loss with the teacher’s feature map [which is configured based on an architecture coefficient and a weight coefficient acquired].”)
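As an illustration only (a hypothetical toy, not Li's code), the block-wise supervision described in the cited passage — feed the teacher's previous feature map to candidate cells, randomly sample one path, and minimize the MSE against the teacher's feature map — can be sketched as:

```python
import numpy as np

rng = np.random.default_rng(0)

# Teacher block (fixed): its input plays the "teacher's previous feature map".
teacher_in = rng.normal(size=(8, 4))       # feature map fed to every candidate
teacher_W = rng.normal(size=(4, 4))
teacher_out = teacher_in @ teacher_W       # target feature map from the teacher

# Three candidate operations (paths) of a student cell in the supernet.
candidates = [rng.normal(size=(4, 4)) for _ in range(3)]

def mse(W):
    return float(np.mean((teacher_in @ W - teacher_out) ** 2))

initial = [mse(W) for W in candidates]

for _ in range(600):
    k = int(rng.integers(len(candidates)))   # randomly sample one path
    err = teacher_in @ candidates[k] - teacher_out
    # Gradient of the MSE w.r.t. the sampled operation's weights.
    candidates[k] -= 0.1 * (2.0 / err.size) * teacher_in.T @ err

final = [mse(W) for W in candidates]
```

Each sampled path descends only its own MSE against the same teacher feature map, so every candidate's loss falls over the course of training, mirroring the independent per-cell supervision in Figure 2.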
Li and Liu are related to the same field of endeavor (i.e., optimization of neural architecture search). It would have been obvious to a person having ordinary skill in the art before the effective filing date of the invention to combine the teachings of Li with the teachings of Liu to add a block-wise NAS framework with architecture-level knowledge distillation, thereby reducing bias from shared weights, improving search effectiveness, and achieving better final performance (Li, Abstract).
Regarding claim 2, Liu in view of Li teach the information processing apparatus of claim 1.
Liu further teaches: wherein the at least one processor, upon execution of the instructions, is further configured to execute control to cause the second learning to execute, after the at least one processor executes the first learning, learning on a network model
(Liu, page: 4, “The iterative procedure [further configured to execute control to cause the second learning to execute] is outlined in Alg. 1. While we are not currently aware of the convergence guarantees for our optimization algorithm in practice it is able to reach a fixed point with a suitable choice of ξ 1 . We also note that when momentum is enabled for weight optimisation, the one-step unrolled learning objective [after the at least one processor executes the first learning, learning on a network model] in equation 6 is modified accordingly and all of our analysis still applies.”)
Li further teaches: having an architecture coefficient acquired through first learning using an output value output from the teacher model.
(Li, page: 1993, “Thanks to our block-wise search, we can train several cells with different channel numbers or layer numbers independently in each stage to ensure channel and layer variability without the interference of identity operation, As shown in Figure 2, in each training step, the teacher’s previous feature map [having an architecture coefficient acquired through first learning using an output value output from the teacher model] is first fed to several cells (as suggested by the solid line), and one of the candidate operations of each layer in the cell is randomly chosen to form a path (as suggested by the dotted line). The weight of the supernet is optimized by minimizing the MSE loss with the teacher’s feature map.”)
It would have been obvious to one of ordinary skill in the art before the effective filing date of the present application to combine the teachings of Li with the teachings of Liu for the same reasons disclosed for claim 1.
Regarding claim 3, Liu in view of Li teach the information processing apparatus of claim 2.
Li further teaches: wherein the at least one processor, upon execution of the instructions, is further configured to execute control to cause the first learning to be executed using, as input data, a teacher data set.
(Li, Figure 2)
(Li, page: 1992, “Figure 2. Illustration of our DNA. The teacher’s previous feature map is used as input for both teacher and student block [is further configured to execute control to cause the first learning to be executed using, as input data, a teacher data set]. Each cell of the supernet is trained independently to mimic the behavior of the corresponding teacher block by minimizing the l2-distance between their output feature maps. The dotted lines indicate randomly sampled paths in a cell.”)
It would have been obvious to one of ordinary skill in the art before the effective filing date of the present application to combine the teachings of Li with the teachings of Liu for the same reasons disclosed for claim 1.
Regarding claim 4, Liu in view of Li teach the information processing apparatus of claim 2.
Li further teaches: wherein the at least one processor, upon execution of the instructions, is further configured to execute control to set a network model having an architecture coefficient and a weight coefficient acquired
(Li, page: 1993, “Thanks to our block-wise search, we can train several cells with different channel numbers or layer numbers independently in each stage to ensure channel and layer variability without the interference of identity operation, As shown in Figure 2, in each training step, the teacher’s previous feature map is first fed to several cells (as suggested by the solid line), and one of the candidate operations of each layer in the cell is randomly chosen to form a path (as suggested by the dotted line). The weight of the supernet is optimized by minimizing the MSE loss with the teacher’s feature map [configured to execute control to set a network model having an architecture coefficient and a weight coefficient acquired].”)
immediately before the first learning as a first teacher model, and execute control to cause the second learning to execute learning on a network model having an architecture coefficient acquired through the first learning using an output value output from the first teacher model.
(Li, Figure 2)
(Annotated in Figure 2: [immediately before the first learning as a first teacher model]; [execute control to cause the second learning to execute learning on a network model having an architecture coefficient acquired through the first learning using an output value output from the first teacher model].)
It would have been obvious to one of ordinary skill in the art before the effective filing date of the present application to combine the teachings of Li with the teachings of Liu for the same reasons disclosed for claim 1.
Regarding claim 8, Liu in view of Li teach the information processing apparatus of claim 1.
Li further teaches: wherein the at least one processor, upon execution of the instructions, is further configured to execute control to generate the teacher model by using a plurality of architecture coefficients and a plurality of weight coefficients acquired over a course of different searches.
(Li, page: 1992, “3.2. Block-wise Supervision with Distilled Architecture Knowledge Although we motivate well in Section 3.1, a technical barrier in our block-wise NAS is that we lack of internal ground truth in Eqn. (3). Fortunately, we find that different blocks of an existing architecture have different knowledge [by using a plurality of architecture coefficients and a plurality of weight coefficients] in extracting different patterns of an image. We also find that the knowledge not only lies, as the literature suggests, in the network parameters, but also in the network architecture. Hence, we use the block-wise representation of existing models to supervise our architecture search [acquired over a course of different searches]. Let Yi be the output feature maps of the i-th block of the supervising model (i.e., teacher model) [configured to execute control to generate the teacher model] and Yˆ i(X ) be the output feature maps of the i-th block of the supernet. We take L2 norm as the cost function. The loss function in Eqn. (3)”)
It would have been obvious to one of ordinary skill in the art before the effective filing date of the present application to combine the teachings of Li with the teachings of Liu for the same reasons disclosed for claim 1.
Regarding claim 9, Liu in view of Li teach the information processing apparatus of claim 1.
Li further teaches: wherein the at least one processor, upon execution of the instructions, is further configured to execute control to cause both the first learning and the second learning to execute learning using, as input, a teacher data set
(Li, 1992, “Figure 2. Illustration of our DNA. The teacher’s previous feature map is used as input for both teacher and student block [configured to execute control to cause both the first learning and the second learning to execute learning using, as input, a teacher data set]. Each cell of the supernet is trained independently to mimic the behavior of the corresponding teacher block by minimizing the l2-distance between their output feature maps. The dotted lines indicate randomly sampled paths in a cell.”)
after at least one processor executes at least any one of the first learning and the second learning
(Li, 1990, “To address the above-mentioned issues, we propose a new solution to NAS where the search space is large, while the potential candidate architectures can be fully and fairly trained. We consider a network architecture that has several blocks, conceptualized as analogous to the ventral visual blocks V1, V2, V4, and IT [27] (see Fig. 1). We then train each block of the candidate architectures separately [after at least one processor executes at least any one of the first learning and the second learning].”)
using an output value output from the teacher model.
(Li, page: 1993, “Thanks to our block-wise search, we can train several cells with different channel numbers or layer numbers independently in each stage to ensure channel and layer variability without the interference of identity operation, As shown in Figure 2, in each training step, the teacher’s previous feature map is first fed to several cells (as suggested by the solid line) [using an output value output from the teacher model], and one of the candidate operations of each layer in the cell is randomly chosen to form a path (as suggested by the dotted line). The weight of the supernet is optimized by minimizing the MSE loss with the teacher’s feature map.”)
It would have been obvious to one of ordinary skill in the art before the effective filing date of the present application to combine the teachings of Li with the teachings of Liu for the same reasons disclosed for claim 1.
Regarding claim 10, Liu in view of Li teach the information processing apparatus of claim 1.
Li further teaches: wherein, through the control executed by the at least one processor, the at least one processor, upon execution of the instructions, is further configured to execute the first learning to acquire a first output value by inputting input data to a network model
(Li, page: 1990, “we propose to parallelize the block-wise search in an analogous way. Specifically, for each block, we use the output [further configured to execute the first learning to acquire a first output value] of the previous block of the supervising model as the input for each of our blocks [by inputting input data to a network model]. Thus, the search can be sped up in a parallel way.”)
acquire a second output value by inputting the input data to the teacher model, and
(Li, page: 1993, “Thanks to our block-wise search, we can train several cells with different channel numbers or layer numbers independently in each stage to ensure channel and layer variability without the interference of identity operation, As shown in Figure 2, in each training step, the teacher’s previous feature map is first fed to several cells (as suggested by the solid line) [by inputting the input data to the teacher model], and one of the candidate operations of each layer in the cell is randomly chosen to form a path (as suggested by the dotted line) [acquire a second output value]. The weight of the supernet is optimized by minimizing the MSE loss with the teacher’s feature map.”)
execute learning of an architecture coefficient based on a loss calculated from the first output value and the second output value.
(Li, page: 1991 – 1992, “In our experiment, the single weight-sharing search space in a block reduces significantly (e.g., Drop rate ≈ 1/(1e 15 N )), ensuring each candidate architecture αi ∈ Ai [execute learning of an architecture coefficient] to be optimized sufficiently. Finally, the architecture is searched across the different blocks in the whole search space A: α ∗ = arg min α∈A XN i=1 λiLval (W∗ i (αi), αi; X, Y), (5) where λi represents the loss weights [based on a loss calculated from the first output value and the second output value]. Here, W∗ i (αi) denotes the learned shared network parameters of the sub-net αi and the supernet. Note that different from the learning of the supernet, we use the validation set to evaluate the performance of the candidate architectures.”)
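For illustration only (hypothetical numbers, not Li's data), the selection in Eqn. (5) — choose the architecture minimizing the λ-weighted sum of per-block validation losses — decomposes into independent per-block argmins because the blocks are trained and evaluated separately:

```python
# Hypothetical per-block validation losses for each candidate operation;
# block index -> list of L_val for that block's candidates.
val_losses = {
    0: [0.42, 0.31, 0.55],
    1: [0.20, 0.28],
    2: [0.90, 0.40, 0.35, 0.60],
}
lambdas = {0: 1.0, 1: 0.5, 2: 2.0}   # loss weights lambda_i from Eqn. (5)

# The weighted sum is separable across blocks, so the global argmin over
# the whole search space is just the per-block argmin of each list.
best = {i: min(range(len(ls)), key=ls.__getitem__) for i, ls in val_losses.items()}
total = sum(lambdas[i] * val_losses[i][best[i]] for i in val_losses)
```

With these made-up losses the search selects candidate 1 in block 0, candidate 0 in block 1, and candidate 2 in block 2, and `total` is the weighted validation loss of the assembled architecture.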
Liu further teaches: having an architecture coefficient and a weight coefficient acquired at the present time,
(Liu, page: 2, “2 DIFFERENTIABLE ARCHITECTURE SEARCH We describe our search space in general form in Sect. 2.1, where the computation procedure for an architecture (or a cell in it) is represented as a directed acyclic graph. We then introduce a simple continuous relaxation scheme for our search space which leads to a differentiable learning objective for the joint optimization of the architecture and its weights (Sect. 2.2) [having an architecture coefficient and a weight coefficient acquired at the present time]. Finally, we propose an approximation technique to make the algorithm computationally feasible and efficient (Sect. 2.3).”)
It would have been obvious to one of ordinary skill in the art before the effective filing date of the present application to combine the teachings of Li with the teachings of Liu for the same reasons disclosed for claim 1.
Regarding claim 11, Liu in view of Li teach the information processing apparatus of claim 1.
Li further teaches: wherein, through the control executed by the at least one processor, the at least one processor, upon execution of the instructions, is further configured to execute the second learning to acquire a first output value by inputting input data to a network model
(Li, page: 1993, “Thanks to our block-wise search, we can train several cells with different channel numbers or layer numbers independently in each stage to ensure channel [further configured to execute the second learning] and layer variability without the interference of identity operation, As shown in Figure 2, in each training step, the teacher’s previous feature map is first fed to several cells (as suggested by the solid line) [to acquire a first output value by inputting input data to a network model], and one of the candidate operations of each layer in the cell is randomly chosen to form a path (as suggested by the dotted line). The weight of the supernet is optimized by minimizing the MSE loss with the teacher’s feature map.”)
acquire a second output value by inputting the input data to the teacher model, and
(Li, page: 1993, “Thanks to our block-wise search, we can train several cells with different channel numbers or layer numbers independently in each stage to ensure channel and layer variability without the interference of identity operation, As shown in Figure 2, in each training step, the teacher’s previous feature map is first fed to several cells (as suggested by the solid line) [by inputting the input data to the teacher model], and one of the candidate operations of each layer in the cell is randomly chosen to form a path (as suggested by the dotted line) [acquire a second output value]. The weight of the supernet is optimized by minimizing the MSE loss with the teacher’s feature map.”)
execute learning of an architecture coefficient based on a loss calculated from the first output value and the second output value.
(Li, page: 1991 – 1992, “In our experiment, the single weight-sharing search space in a block reduces significantly (e.g., Drop rate ≈ 1/(1e 15 N )), ensuring each candidate architecture αi ∈ Ai [execute learning of an architecture coefficient] to be optimized sufficiently. Finally, the architecture is searched across the different blocks in the whole search space A: α ∗ = arg min α∈A XN i=1 λiLval (W∗ i (αi), αi; X, Y), (5) where λi represents the loss weights [based on a loss calculated from the first output value and the second output value]. Here, W∗ i (αi) denotes the learned shared network parameters of the sub-net αi and the supernet. Note that different from the learning of the supernet, we use the validation set to evaluate the performance of the candidate architectures.”)
Liu further teaches: having an architecture coefficient and a weight coefficient acquired at the present time,
(Liu, page: 2, “2 DIFFERENTIABLE ARCHITECTURE SEARCH We describe our search space in general form in Sect. 2.1, where the computation procedure for an architecture (or a cell in it) is represented as a directed acyclic graph. We then introduce a simple continuous relaxation scheme for our search space which leads to a differentiable learning objective for the joint optimization of the architecture and its weights (Sect. 2.2) [having an architecture coefficient and a weight coefficient acquired at the present time]. Finally, we propose an approximation technique to make the algorithm computationally feasible and efficient (Sect. 2.3).”)
It would have been obvious to one of ordinary skill in the art before the effective filing date of the present application to combine the teachings of Li with the teachings of Liu for the same reasons disclosed for claim 1.
Regarding claim 13, Liu teaches: A control method for an information processing apparatus which executes a search of a network model
(Liu, page: 2, “We introduce a novel algorithm for differentiable network architecture search [executes a search of a network model] based on bilevel optimization, which is applicable to both convolutional and recurrent architectures.”)
including an architecture coefficient and a weight coefficient to determine the architecture of the network model, the information processing apparatus comprising:
(Liu, page: 2, “2 DIFFERENTIABLE ARCHITECTURE SEARCH We describe our search space in general form in Sect. 2.1, where the computation procedure for an architecture (or a cell in it) is represented as a directed acyclic graph. We then introduce a simple continuous relaxation scheme for our search space which leads to a differentiable learning objective for the joint optimization of the architecture and its weights (Sect. 2.2) [including an architecture coefficient and a weight coefficient to determine the architecture of the network model]. Finally, we propose an approximation technique to make the algorithm computationally feasible and efficient (Sect. 2.3).”)
executing first learning of an architecture coefficient by first learning with a weight coefficient fixed; executing second learning of a weight coefficient by second learning with an architecture coefficient fixed; and
([Executing first learning of an architecture coefficient by first learning with a weight coefficient fixed]: i.e., the first learning treats the network weights as temporarily fixed to see how the architecture itself influences performance. [Executing second learning of a weight coefficient by second learning with an architecture coefficient fixed]: i.e., the second learning treats the architecture coefficient as fixed, focusing only on making the existing connections more accurate.)
(Liu, Algorithm 1: DARTS – Differentiable Architecture Search)
The remaining limitations are analogous to those of claim 1 and are rejected under a similar rationale.
Claim 14 recites limitations analogous to those of claim 13 and is rejected under the same rationale.
Claim 12 is rejected under 35 U.S.C. 103 as being unpatentable over Liu in view of Li, and further in view of Park et al., Pub. No. US20220156596A1.
Regarding claim 12, Liu in view of Li teach the information processing apparatus of claim 1.
Liu in view of Li do not teach:
wherein the information processing apparatus executes a search of a network model by using a technique of a neural architecture search (NAS).
Park teaches:
wherein the information processing apparatus executes a search of a network model by using a technique of a neural architecture search (NAS).
(Park, “[0017] In an aspect of the present disclosure, the neural architecture search [wherein the information processing apparatus executes a search of a network model by using a technique of a neural architecture search (NAS)] method further comprises the step (e) of verifying the candidate learning model of the student network, selected in the step (d).”)
Park, Liu, and Li are related to the same field of endeavor (i.e., optimization of neural architecture search). It would have been obvious to a person having ordinary skill in the art before the effective filing date of the invention to combine the teachings of Park with the teachings of Liu and Li to add a capacity reallocation strategy that dynamically increases or decreases block capacity based on distillation loss (Park, Abstract).
Allowable Subject Matter
Claim(s) 5–7 are objected to as being dependent upon a rejected base claim, but would be allowable if rewritten in independent form including all of the limitations of the base claim and any intervening claims. The prior art of record does not teach, suggest, or otherwise render obvious the claim limitations as recited in applicant's claims.
Claim 5 recites:
The information processing apparatus according to claim 4, wherein the at least one processor, upon execution of the instructions, is further configured to execute control to set a network model having an architecture coefficient acquired through the first learning as a second teacher model, and execute control to cause the first learning to execute learning on a network model having a weight coefficient acquired through second learning using an output value output from the second teacher model after the at least one processor executes the second learning using the output value output from the first teacher model.
Closest prior art:
Liu et al., "Darts: Differentiable architecture search," (2018).
Liu teaches an algorithm for differentiable network architecture search based on bilevel optimization, which is applicable to both convolutional and recurrent architectures. However, Liu does not teach performing two interlinked learning stages in which architecture learning and weight learning alternately act as teachers for each other: architecture coefficients acquired through a first learning are used to configure a network model that is set as a second teacher model; second learning is then performed to obtain weight coefficients using output values from a first teacher model; and the first learning is thereafter re-executed so that architecture learning is guided by output values from the second teacher model.
Li, Changlin, et al., "Block-wisely supervised neural architecture search with knowledge distillation," (2020).
Li teaches modularizing the large search space of NAS into blocks, ensuring that the potential candidate architectures are fairly trained and that the representation shift caused by the shared parameters is reduced, which leads to correct ratings of the candidates. However, Li does not teach performing two interlinked learning stages in which architecture learning and weight learning alternately act as teachers for each other: architecture coefficients acquired through a first learning are used to configure a network model that is set as a second teacher model; second learning is then performed to obtain weight coefficients using output values from a first teacher model; and the first learning is thereafter re-executed so that architecture learning is guided by output values from the second teacher model.
Claims 6 and 7 are allowable by virtue of their dependency from claim 5.
Conclusion
The prior art made of record and not relied upon is considered pertinent to applicant's disclosure.
Zhang et al., "iDARTS: Differentiable architecture search with stochastic implicit gradients," (2021).
Zhang teaches a hypergradient in the differentiable NAS with the implicit function theorem (IFT), which can thus gracefully handle many inner optimization steps without increasing the memory requirement.
Fukuda et al., Pub. No.: US20200034702A1.
Fukuda teaches selecting a teacher neural network among a plurality of teacher neural networks, inputting an input data to the selected teacher neural network to obtain a soft label output generated by the selected teacher neural network, and training a student neural network with at least the input data and the soft label output from the selected teacher neural network.
Any inquiry concerning this communication or earlier communications from the examiner
should be directed to MATIYAS T MARU whose telephone number is (571)270-0902 or via email: matiyas.maru@uspto.gov. The examiner can normally be reached Monday 8:00am - Friday 4:00pm EST.
Examiner interviews are available via telephone, in-person, and video conferencing using a
USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to
use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor,
Michelle Bechtold, can be reached on (571) 431-0762. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from
Patent Center. Unpublished application information in Patent Center is available to registered users.
To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit
https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and
https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional
questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like
assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA)
or 571-272-1000.
/M.T.M./ Examiner, Art Unit 2148
/MICHELLE T BECHTOLD/ Supervisory Patent Examiner, Art Unit 2148