DETAILED ACTION
Status of Claims
This Office action is responsive to communications filed on 2025-05-19 and 2025-06-03. Claims 1-30 are pending and are examined herein.
Claims 22-30 invoke 35 USC 112(f).
Claims 1-30 are rejected under 35 USC 112(b).
Claims 1-30 are rejected under 35 USC 112(a).
Claims 1-30 are rejected under 35 USC 103.
Notice of Pre-AIA or AIA Status
The present application, filed on or after 2013-03-16, is being examined under the first inventor to file provisions of the AIA.
Continued Examination Under 37 CFR 1.114
A request for continued examination under 37 CFR 1.114, including the fee set forth in 37 CFR 1.17(e), was filed in this application after final rejection. Since this application is eligible for continued examination under 37 CFR 1.114, and the fee set forth in 37 CFR 1.17(e) has been timely paid, the finality of the previous Office action has been withdrawn pursuant to 37 CFR 1.114. Applicant's submission filed on 2025-05-19 has been entered.
Response to Arguments
Regarding rejections under 35 USC 112, the applicant “disagrees with the above rejections” [remarks, pages 12-13] but provides no rationale for the disagreement. A bare assertion of disagreement fails to comply with 37 CFR 1.111(b) because it does not distinctly and specifically point out the reasons for disagreement. The applicant’s amendments do not adequately address all of the concerns raised in the previous Office actions and raise further concerns. Issues in the pending claims are described below.
Regarding rejections under 35 USC 103, the applicant’s remarks have been fully considered but they are unpersuasive.
The applicant argues that “the combination of Meyer, Li, Kaehler, and Hinton do not describe generating an accuracy predictor” [remarks, page 16; sic]. However, Meyer clearly discloses that the modelling ANN is “trained to estimate one or more performance characteristics of a candidate ANN” [Meyer, 0016] and describes how the modelling ANN is trained [Meyer, 0025]. As explained in the previous Office actions and below, training a neural network which predicts performance characteristics of candidate networks evidently falls under the broadest reasonable interpretation of “generating an accuracy predictor” as recited by the claim.
The applicant argues that “there is no description or suggestion in either of Meyer, Li, Kaehler, and Hinton (individually or in combination) regarding quality metrics” [remarks, page 16]. However, Li discloses losses computed during blockwise knowledge distillation [Li, figure 2] which fall under the broadest reasonable interpretation of the “quality metrics” of the claim.
An updated prior art mapping is given below.
Examiner’s Remarks
Claims 2, 12, and 23 recite a second plurality of the plurality of blockwise knowledge distillation trained search blocks while their respective parent claims recite a second plurality of blockwise knowledge distillation trained search blocks. The examiner notes that these two pluralities bear similar but non-identical names and appear to refer to distinct entities. In other words, the “second plurality of the plurality of blockwise knowledge distillation trained search blocks” of the dependent claims is not bound in scope by the “second plurality of blockwise knowledge distillation trained search blocks” of the parent claims.
Claim Interpretation - 35 USC 112(f)
The following is a quotation of 35 USC 112(f):
(f) Element in Claim for a Combination. – An element in a claim for a combination may be expressed as a means or step for performing a specified function without the recital of structure, material, or acts in support thereof, and such claim shall be construed to cover the corresponding structure, material, or acts described in the specification and equivalents thereof.
The following is a quotation of pre-AIA 35 USC 112, sixth paragraph:
An element in a claim for a combination may be expressed as a means or step for performing a specified function without the recital of structure, material, or acts in support thereof, and such claim shall be construed to cover the corresponding structure, material, or acts described in the specification and equivalents thereof.
The claims in this application are given their broadest reasonable interpretation using the plain meaning of the claim language in light of the specification as it would be understood by one of ordinary skill in the art. The broadest reasonable interpretation of a claim element (also commonly referred to as a claim limitation) is limited by the description in the specification when 35 USC 112(f) or pre-AIA 35 USC 112, sixth paragraph, is invoked.
As explained in MPEP 2181, subsection I, claim limitations that meet the following three-prong test will be interpreted under 35 USC 112(f) or pre-AIA 35 USC 112, sixth paragraph:
the claim limitation uses the term “means” or “step” or a term used as a substitute for “means” that is a generic placeholder (also called a nonce term or a non-structural term having no specific structural meaning) for performing the claimed function;
the term “means” or “step” or the generic placeholder is modified by functional language, typically, but not always linked by the transition word “for” (e.g., “means for”) or another linking word or phrase, such as “configured to” or “so that”; and
the term “means” or “step” or the generic placeholder is not modified by sufficient structure, material, or acts for performing the claimed function.
Use of the word “means” (or “step”) in a claim with functional language creates a rebuttable presumption that the claim limitation is to be treated in accordance with 35 USC 112(f) or pre-AIA 35 USC 112, sixth paragraph. The presumption that the claim limitation is interpreted under 35 USC 112(f) or pre-AIA 35 USC 112, sixth paragraph, is rebutted when the claim limitation recites sufficient structure, material, or acts to entirely perform the recited function.
Absence of the word “means” (or “step”) in a claim creates a rebuttable presumption that the claim limitation is not to be treated in accordance with 35 USC 112(f) or pre-AIA 35 USC 112, sixth paragraph. The presumption that the claim limitation is not interpreted under 35 USC 112(f) or pre-AIA 35 USC 112, sixth paragraph, is rebutted when the claim limitation recites function without reciting sufficient structure, material or acts to entirely perform the recited function.
Claim limitations in this application that use the word “means” (or “step”) are being interpreted under 35 USC 112(f) or pre-AIA 35 USC 112, sixth paragraph, except as otherwise indicated in an Office action. Conversely, claim limitations in this application that do not use the word “means” (or “step”) are not being interpreted under 35 USC 112(f) or pre-AIA 35 USC 112, sixth paragraph, except as otherwise indicated in an Office action.
Claims 22-30 each recite and/or inherit recitations of limitations that invoke interpretation under 35 USC 112(f) due to the “means for” language used therein. In each case, the specification indicates that the functions recited by the means-plus-function limitations are intended to be performed by “a computing device including a processor configured with processor-executable instructions to perform the operations of” the claimed functions and “a non-transitory processor-readable storage medium having stored thereon software instructions configured to cause a processor to perform the operations of” the claimed functions [0014].
If applicant does not intend to have this/these limitation(s) interpreted under 35 USC 112(f) or pre-AIA 35 USC 112, sixth paragraph, applicant may:
amend the claim limitation(s) to avoid it/them being interpreted under 35 USC 112(f) or pre-AIA 35 USC 112, sixth paragraph (e.g., by reciting sufficient structure to perform the claimed function); or
present a sufficient showing that the claim limitation(s) recite(s) sufficient structure to perform the claimed function so as to avoid it/them being interpreted under 35 USC 112(f) or pre-AIA 35 USC 112, sixth paragraph.
Claim Rejections - 35 USC 112(b)
The following is a quotation of 35 USC 112(b):
(b) CONCLUSION.—The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the inventor or a joint inventor regards as the invention.
The following is a quotation of 35 USC 112 (pre-AIA), second paragraph:
The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the applicant regards as his invention.
Claims 1-30 are rejected under 35 USC 112(b) or 35 USC 112 (pre-AIA), second paragraph, as being indefinite for failing to particularly point out and distinctly claim the subject matter which the inventor or a joint inventor (or, for applications subject to pre-AIA 35 USC 112, the applicant) regards as the invention.
Claims 1, 11, and 21-22 are indefinite for at least the following reasons:
They recite a plurality of blockwise knowledge distillation trained search blocks of parameters and configurations of a reference neural network [emphasis added], but this is indefinite because it is not clear what the phrase means. Based on the specification, the applicant’s “search blocks” are portions of a search space represented as a neural network [specification, 0068], the blocks then being trained using blockwise knowledge distillation from a trained reference neural network [specification, 0073]. It is unclear what it means for search blocks to be “of parameters and configurations of a reference neural network” as recited in the claim, and the specification provides no guidance on this point (in fact, it includes no recitation of the phrase “of parameters and configurations of a reference neural network”). For the purpose of compact prosecution, the indefinite claim element is interpreted broadly as encompassing at least the interpretation suggested by the specification, i.e., a plurality of blockwise knowledge distillation trained search blocks, wherein the plurality of blockwise knowledge distillation trained search blocks is trained by distilling from a reference neural network.
They recite a first plurality of blockwise knowledge distillation trained search blocks that were trained from the search space [emphasis added]. However, it is unclear what it means for search blocks to be “trained from the search space” as recited. In the context of NAS, the plain meaning of “search space” is a collection of candidate architectures (from which one is to be selected for deployment), and it is not clear what it means for a search block to be “trained from” a collection of candidate architectures. As such, the scope of the claimed invention is indefinite. For the purpose of compact prosecution, these indefinite phrases are interpreted broadly as encompassing at least the interpretation that the search blocks are part of the search space. The examiner suggests removing “that were trained from the search space”, i.e., reciting simply “a first plurality of blockwise knowledge distillation trained search blocks”.
Dependent claims 2-10, 12-20, and 23-30 inherit the rejection.
Claims 4 and 14 recite Pareto-optimal with respect to the criteria of the predicted accuracy. The examiner notes that, while parent claims 2 and 12 respectively recite “criteria” and “a predicted accuracy”, the “criteria” introduced in the parent claims appear to consist of two elements (namely, the “predicted accuracy” and the “cost function”). The phrase “Pareto-optimal with respect to the criteria of the predicted accuracy” therefore makes it unclear whether the applicant intends for the search blocks to be Pareto-optimal with respect to both of the criteria of the parent, or only the predicted accuracy criterion. For the purpose of compact prosecution, the examiner notes that, since Pareto-optimality is typically only invoked when there are multiple objective functions, and since “criteria” is a plural noun, the claim is interpreted as indicating that the search blocks are Pareto-optimal with respect to both of the criteria introduced in the parent claims.
Claims 7, 17, and 28 recite each neural network of the plurality of neural networks comprises blockwise knowledge distillation trained search blocks of the first plurality of blockwise knowledge distillation trained search blocks but this limitation conflicts with the structure of the parent claims and disclosures in the specification. The “first plurality” is indicated in the parent claims as being part of the single selected neural network, while the “plurality of neural networks” introduced in this dependent appears to refer, based on the disclosures in the applicant’s specification, to networks in a Pareto-optimal front. However, it is not clear how each neural network in the Pareto-optimal front can comprise blocks that are found in one neural network in that front. For the purpose of compact prosecution, the claim is interpreted broadly as requiring just that each neural network comprise blocks (i.e., without requiring that the blocks be from the “first plurality”).
Claims 22-30 recite means-plus-function limitations invoking 35 USC 112(f) or pre-AIA 35 USC 112, sixth paragraph. However, the written description fails to disclose the corresponding structure, material, or acts for performing the entire claimed function and to clearly link the structure, material, or acts to the function. The specification indicates that the recited functions are intended to be performed by “a computing device including a processor configured with processor-executable instructions to perform the operations of” the claimed functions and “a non-transitory processor-readable storage medium having stored thereon software instructions configured to cause a processor to perform the operations of” the claimed functions [0014]. However, MPEP 2181(II)(B) indicates that “[i]n cases involving a special purpose computer-implemented means-plus-function limitation, the Federal Circuit has consistently required that the structure be more than simply a general purpose computer or microprocessor and that the specification must disclose an algorithm for performing the claimed function” [emphasis added], wherein “[a]n algorithm is defined, for example, as ‘a finite sequence of steps for solving a logical or mathematical problem or performing a task’”. Algorithms for the following means-plus-function limitations incorporated in claims 22-30, which invoke 35 USC 112(f), are not adequately disclosed by the specification:
[Claim 22 and dependents] means for generating an accuracy predictor trained on quality metrics of a plurality of blockwise knowledge distillation trained search blocks of parameters and configurations of a reference neural network; (The specification does not disclose an algorithm for generating an accuracy predictor.)
[Claim 22 and dependents] means for selecting from a search space using an accuracy predictor, a neural network comprising a first plurality of blockwise knowledge distillation trained search blocks from a plurality of blockwise knowledge distillation trained search blocks that were trained from the search space, (The specification does not disclose a sequence of steps indicating how the accuracy predictor is used to select a neural network from a search space. Moreover, the specification does not clearly disclose what is meant by “blockwise knowledge distillation”. It merely indicates that “blocks” refers to “partial neural networks” [specification, 0043]. While it can be inferred that “blockwise knowledge distillation” refers broadly to some technique in which knowledge distillation is applied blockwise (i.e., block-by-block) in order to train a neural network, this is at best an outline of an algorithm; it is not a sequence of steps.)
[Claim 22 and dependents] means for initializing a second plurality of blockwise knowledge distillation trained search blocks using weights of the first plurality of blockwise knowledge distillation trained search blocks (The specification does not provide a sequence of steps indicating how search blocks are initialized using weights of search blocks. For example, it does not clarify how the weights are to be “us[ed]” in the process of initializing.)
[Claim 22 and dependents] means for fine-tuning the neural network using knowledge distillation to generate a distilled neural network including the first plurality of blockwise knowledge distillation trained search blocks and the second plurality of blockwise knowledge distillation trained search blocks (The specification does not provide a sequence of steps for how the neural network is to be fine-tuned, and how it is to “include” the first plurality and the second plurality.)
[Claim 23 and dependents] means for selecting a second plurality of the plurality of blockwise knowledge distillation trained search blocks based on criteria of predicted accuracy and a cost function for measuring a cost of implementing the second plurality of the plurality of blockwise knowledge distillation trained search blocks (The specification does not disclose a sequence of steps indicating how the second plurality is to be selected. It does disclose the use of an “evolutionary search,” but this phrase does not name a specific algorithm: rather, an evolutionary algorithm is merely any algorithm used for search and optimization which happens to be inspired by biological evolution. A wide and extremely diverse variety of algorithms fall under this label, and new evolutionary algorithms continue to be devised; see the illustrative sketch following this list. The specification does not clarify which evolutionary algorithm is being used.)
[Claim 24 and dependents] means for using an evolutionary search to select the second plurality of the plurality of blockwise knowledge distillation trained search blocks (The specification does not disclose a sequence of steps indicating how the second plurality is to be selected. As explained above, “evolutionary search” describes a widely varying class of algorithms, not a specific algorithm.)
[Claim 25 and dependents] means for selecting the first plurality of blockwise knowledge distillation trained search blocks using a scenario-aware search (The specification does not disclose a sequence of steps indicating how the first subset is to be selected.)
[Claim 26 and dependents] means for initializing the first plurality of blockwise knowledge distillation trained search blocks using weights of the plurality of blockwise knowledge distillation trained search blocks (The specification does not provide a sequence of steps indicating how search blocks are initialized using weights of search blocks. For example, it does not clarify how the weights are to be “us[ed]” in the process of initializing.)
[Claim 27 and dependents] means for selecting a plurality of neural networks of the search space (As indicated above, the specification does not provide a sequence of steps indicating how to select a subset of neural networks.)
[Claim 27 and dependents] means for initializing the blockwise knowledge distillation trained search blocks of the plurality of neural networks using weights of the plurality of blockwise knowledge distillation trained search blocks (As indicated above, the specification does not provide a sequence of steps indicating how search blocks are initialized using weights of search blocks.)
[Claim 27 and dependents] means for fine-tuning the plurality of neural networks using knowledge distillation (As indicated above, the specification does not provide a sequence of steps indicating how to fine-tune the neural network.)
[Claim 28 and dependents] means for extracting the quality metrics by using blockwise knowledge distillation to train the neural network from the search space (The specification does not provide a sequence of steps indicating how a quality metric is to be extracted. As noted above, the specification does not even indicate clearly what is meant by “blockwise knowledge distillation”, and it certainly does not pin down a sequence of steps explaining how quality metrics are to be extracted from this process.)
[Claim 28 and dependents] means for extracting a target by fine-tuning the plurality of neural networks using knowledge distillation (The specification does not provide a sequence of steps indicating how a target is to be extracted; a mere recitation of “by fine-tuning the sub-set of neural networks using knowledge distillation” does not pin down a sequence of steps.)
[Claim 29 and dependents] means for selecting the neural network from the search space based on a search of the first plurality of blockwise knowledge distillation trained search blocks using a criterion of predicted accuracy using the accuracy predictor and a cost function for measuring a cost of implementing blockwise knowledge distillation trained search blocks of the neural network (As indicated above, the specification does not clearly indicate how a search based on the two recited criteria is to be conducted, as a mere recitation of “evolutionary search” does not pin down a specific algorithm. It also does not provide a sequence of steps for indicating how the neural network is to be selected based on the search.)
[Claim 30] means for using blockwise knowledge distillation to train neural network blocks from an extended search space to generate blockwise knowledge distillation trained search blocks and the quality metrics (The specification provides no sequence of steps for extending the search space [0012, 0131] nor any other indication as to how an “extended search space” is to be selected and/or constructed. Moreover, as indicated above, the specification does not clearly indicate what is meant by “blockwise knowledge distillation”.)
[Claim 30] means for using the accuracy predictor to predict accuracy of neural networks in the extended search space, wherein the accuracy predictor is built for the search space different from the extended search space (As indicated above, the specification provides no sequence of steps for either using the accuracy predictor or for extending the search space.)
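For illustration of the breadth of the phrase “evolutionary search” discussed above, the examiner notes that the following hypothetical Python skeleton (the examiner’s own sketch, not drawn from the specification; all identifiers are illustrative) is all that the phrase pins down. The selection, crossover, and mutation operators, whose choice materially changes the algorithm, are left entirely open:

    import random

    def evolutionary_search(init_population, fitness, select, crossover, mutate,
                            generations=100):
        # Generic evolutionary loop: every operator below is a free parameter.
        # "Evolutionary search" names this outline, not a specific algorithm.
        population = init_population()
        for _ in range(generations):
            scored = [(candidate, fitness(candidate)) for candidate in population]
            parents = select(scored)  # e.g., tournament, roulette, rank-based, ...
            population = [mutate(crossover(random.choice(parents),
                                           random.choice(parents)))
                          for _ in range(len(population))]
        return max(population, key=fitness)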
Therefore, claims 22-30 are indefinite and are rejected under 35 USC 112(b) or pre-AIA 35 USC 112, second paragraph.
Claim Rejections - 35 USC 112(a)
The following is a quotation of the first paragraph of 35 USC 112(a):
(a) IN GENERAL.—The specification shall contain a written description of the invention, and of the manner and process of making and using it, in such full, clear, concise, and exact terms as to enable any person skilled in the art to which it pertains, or with which it is most nearly connected, to make and use the same, and shall set forth the best mode contemplated by the inventor or joint inventor of carrying out the invention.
The following is a quotation of the first paragraph of pre-AIA 35 USC 112:
The specification shall contain a written description of the invention, and of the manner and process of making and using it, in such full, clear, concise, and exact terms as to enable any person skilled in the art to which it pertains, or with which it is most nearly connected, to make and use the same, and shall set forth the best mode contemplated by the inventor of carrying out his invention.
Claims 1-30 are rejected under 35 USC 112(a) or 35 USC 112 (pre-AIA), first paragraph, as failing to comply with the written description requirement. The claims contain subject matter which was not described in the specification in such a way as to reasonably convey to one skilled in the relevant art that the inventor or a joint inventor, or for applications subject to pre-AIA 35 USC 112, the inventor(s), at the time the application was filed, had possession of the claimed invention.
Claims 1, 11, and 21-22 recite limitations involving the second plurality of blockwise knowledge distillation trained search blocks which are new matter because they are not described in the originally filed specification in the manner recited in the claims. For example:
The claims recite initializing a second plurality of blockwise knowledge distillation trained search blocks using weights of the first plurality of blockwise knowledge distillation trained search blocks. The examiner notes that the specification describes initializing a second plurality “using weights of the blockwise knowledge distillation trained search blocks” [specification, 0011] but not specifically “using weights of the first plurality of blockwise knowledge distillation trained search blocks” as recited in the claim.
The claims recite generate a distilled neural network including the first plurality of blockwise knowledge distillation trained search blocks and the second plurality of blockwise knowledge distillation trained search blocks. The examiner notes that the specification mentions generating a distilled neural network using knowledge distillation [0027-0030, 0033, 0080, 0108-0111, 0115-0117, 0130] and separately mentions a “first plurality of blockwise knowledge distillation trained search blocks” [0004-0006, 0011, 0123-0125, 0130], but at no point does it mention a “distilled neural network including the first plurality… and the second plurality” as in the claim.
The examiner notes that, while the phrase “second plurality” occurs in the specification, the “second plurality” of the specification appears to refer to the “second plurality of the plurality of blockwise knowledge distillation trained search blocks” recited by dependent claims (i.e., to the Pareto-optimal architectures), not to the “second plurality” of the independent claims (cf. examiner’s remarks regarding this distinction). In other words, even the “second plurality” of the independent claims taken alone appears to be new matter. Dependent claims 2-10, 12-20, and 23-30 inherit the rejection.
Claims 22-30 invoke 35 USC 112(f) and are found to be indefinite under 35 USC 112(b) for failure to disclose sufficient corresponding structure in the specification. MPEP 2181(II)(B) indicates that “[w]hen a claim containing a computer-implemented 35 USC 112(f) claim limitation is found to be indefinite under 35 USC 112(b) for failure to disclose sufficient corresponding structure (e.g., the computer and the algorithm) in the specification that performs the entire claimed function, it will also lack written description under 35 USC 112(a)”. Consequently, these claims are further rejected for inadequate written description under 35 USC 112(a) or pre-AIA 35 USC 112, first paragraph.
Claim Rejections - 35 USC 103
The following is a quotation of 35 USC 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.
Claims 1-2, 5-7, 9-12, 15-17, 19-23, 25-27, and 29-30 are rejected under 35 USC 103 as being unpatentable over Changlin LI et al. (Blockwisely Supervised Neural Architecture Search with Knowledge Distillation, published 2020-03-06; hereafter “Li”) in view of Brett MEYER et al. (US20190138901A1, published 2019-05-09; hereafter “Meyer”) and Frederick TUNG et al. (US20200302295A1, published 2020-09-24; hereafter “Tung”).
Claim 1
Li discloses:
quality metrics of a plurality of blockwise knowledge distillation trained search blocks of parameters and configurations of a reference neural network; ([Li, abstract, section 3, and figure 2]: Li discloses a method of neural architecture search (NAS) in which a large search space is modularized into blocks and trained by distilling blockwise knowledge from a teacher model [Li, abstract]. More precisely, Li discloses “formulat[ing] the search space 𝒜 into an over-parameterized supernet such that each of the candidate architecture α is a sub-net of the supernet” [Li, section 3.1, paragraph beginning “Inaccurate evaluation”], “divid[ing the supernet] 𝒩 into N blocks” [Li, section 3.1, paragraph beginning “Block-wise NAS”], and training each block using corresponding blocks of a supervising/teacher model [Li, section 3.2]. For example, [Li, figure 2] depicts a situation where a search space of three candidate architectures is assembled into a student supernet, and each architecture is partitioned into five blocks corresponding to five blocks in a teacher network. The teacher model is the “reference neural network” of the claim, and the blocks of the supernet, after they have undergone the blockwise knowledge distillation procedure described in Li, map to the “plurality of blockwise knowledge distillation trained search blocks” of the claim. The “loss[es]” computed in the process of performing blockwise knowledge distillation [Li, figure 2 and/or section 3.2, equation (6)] map to the “quality metrics” of the claim.)
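For illustration of the mapping of Li’s per-block losses to the claimed “quality metrics”, the examiner provides the following minimal PyTorch sketch. It is the examiner’s own hypothetical paraphrase of the per-block supervision of [Li, section 3.2], assuming a simple MSE loss; it does not reproduce Li’s code, and all identifiers are illustrative:

    import torch
    import torch.nn.functional as F

    def blockwise_kd_losses(student_blocks, teacher_blocks, x):
        # Compute the teacher's activations entering and leaving each block.
        with torch.no_grad():
            teacher_acts = [x]
            for t_block in teacher_blocks:
                teacher_acts.append(t_block(teacher_acts[-1]))
        # Each student block receives the teacher's input activation for the
        # corresponding block and is penalized (MSE assumed here) for deviating
        # from the teacher's output activation (cf. Li, figure 2).
        losses = []
        for i, s_block in enumerate(student_blocks):
            losses.append(F.mse_loss(s_block(teacher_acts[i]), teacher_acts[i + 1]))
        return losses  # per-block losses: mapped to the claimed "quality metrics"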
selecting, from a search space [using the accuracy predictor,] a neural network comprising a first plurality of blockwise knowledge distillation trained search blocks from the plurality of blockwise knowledge distillation trained search blocks; ([Li, section 3]: As noted above, Li discloses a method of neural architecture search. This means that it searches an architecture search space 𝒜 to find an “optimal pair (α^*, ω_α^*) such that the model performance is maximized” [Li, section 3.1, first paragraph]. The architecture search space maps to the “search space” of the claim, and the optimal pair maps to the “neural network” of the claim. The blocks of the optimal architecture α^* map to the “first plurality” of the claim.)
Li does not distinctly disclose:
A method comprising: generating an accuracy predictor trained on [quality metrics] … [selecting, from a search space] using the accuracy predictor, [a neural network]
initializing a second plurality of blockwise knowledge distillation trained search blocks using weights of the first plurality of blockwise knowledge distillation trained search blocks;
and fine-tuning the neural network using knowledge distillation to generate a distilled neural network including the first plurality of blockwise knowledge distillation trained search blocks and the second plurality of blockwise knowledge distillation trained search blocks.
Meyer is in the field of neural architecture search [Meyer, abstract] and discloses a method of selecting architectures from candidate architectures in a “design space” [Meyer, abstract and 0017-0018]. In other words, Meyer’s design space corresponds to the search space of Li and of the claim. Moreover, Li in view of Meyer discloses:
A method comprising: generating an accuracy predictor trained on [quality metrics] … [selecting, from a search space] using the accuracy predictor, [a neural network] ([Meyer, 0016, 0029, 0031]: Meyer’s NAS method uses a “modelling ANN” which is “trained to estimate one or more performance characteristics of a candidate ANN”, where the performance characteristics include “error (or accuracy)” [Meyer, 0016]. More precisely, “the modelling ANN models the response surface using an MLP model with an input set representative of ANN hyper-parameters and a single output trained to predict the error of corresponding ANN” [Meyer, 0029] and the modelling ANN is “trained using stochastic gradient descent (SGD)” [Meyer, 0031]. The examiner notes that the modelling ANN is also called a response surface modelling (RSM) ANN/network/model therein [Meyer, 0016, 0029-0031]. The modelling ANN maps to the “accuracy predictor” of the claim, and its training maps to the “generating” step of the claim. In the combination, the input parameters for the modelling ANN are the “quality metrics” as mapped above.)
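As a non-limiting illustration of the kind of response surface model described in [Meyer, 0029-0031], the following hypothetical Python sketch trains an MLP with SGD to map fixed-length architecture descriptors to a predicted error. The descriptor encoding, dimensions, and identifiers are the examiner’s assumptions, not Meyer’s disclosure:

    import torch
    import torch.nn as nn

    # Assumption: each candidate architecture is encoded as a fixed-length
    # descriptor (e.g., hyper-parameters or per-block quality metrics) paired
    # with its measured validation error.
    predictor = nn.Sequential(nn.Linear(16, 64), nn.ReLU(), nn.Linear(64, 1))
    optimizer = torch.optim.SGD(predictor.parameters(), lr=1e-2)

    def train_predictor(descriptors, measured_errors, epochs=200):
        # descriptors: (N, 16) tensor; measured_errors: (N, 1) tensor.
        for _ in range(epochs):
            optimizer.zero_grad()
            loss = nn.functional.mse_loss(predictor(descriptors), measured_errors)
            loss.backward()
            optimizer.step()
        return predictor  # mapped to the claimed "accuracy predictor"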
Before the effective filing date of the invention, it would have been obvious to a person of ordinary skill in the art to combine Li’s NAS using blockwise knowledge distillation with Meyer’s use of an accuracy predictor because “training an artificial neural network to predict the performance of future candidate networks” provides “a multi-objective design space exploration method that may assist in reducing the number of solution networks” [Meyer, 0006], thereby resulting in a more effective system.
Li in view of Meyer does not distinctly disclose:
initializing a second plurality of blockwise knowledge distillation trained search blocks using weights of the first plurality of blockwise knowledge distillation trained search blocks;
and fine-tuning the neural network using knowledge distillation to generate a distilled neural network including the first plurality of blockwise knowledge distillation trained search blocks and the second plurality of blockwise knowledge distillation trained search blocks.
Tung is in the field of machine learning. Moreover, Li in view of Meyer and Tung discloses:
initializing a second plurality of blockwise knowledge distillation trained search blocks using weights of the first plurality of blockwise knowledge distillation trained search blocks; ([Tung, 0101]: Tung discloses a method where a “student network is initialized with source domain… pretrained weights” [Tung, 0101]. In the combination, the pretrained weights correspond to the weights ω_α^* of the optimal architecture α^* obtained by blockwise knowledge distillation as in Li (i.e., to the weights of the “first plurality” as mapped above). In other words, the blocks in Tung’s student network map to the “second plurality” of the claim.)
and fine-tuning the neural network using knowledge distillation to generate a distilled neural network including the first plurality of blockwise knowledge distillation trained search blocks and the second plurality of blockwise knowledge distillation trained search blocks. ([Tung, 0101]: Tung discloses that, after initialization, the student network is “fine-tuned” using “[k]nowledge distillation” [Tung, 0101]. The student network after it has been fine-tuned thus maps to the “distilled neural network” of the claim. This network “includ[es] the first plurality… and the second plurality” in the sense that its architecture contains the same blocks as the ones found in the recited pluralities.)
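The initialize-then-fine-tune pattern of [Tung, 0101] may be illustrated by the following hypothetical sketch, which assumes the conventional soft-target distillation loss; Tung does not limit the technique to this loss, and the sketch, including the optimizer choice, is the examiner’s own:

    import torch
    import torch.nn.functional as F

    def finetune_with_kd(student, teacher, pretrained_state, data_loader,
                         temperature=4.0, lr=1e-3, epochs=1):
        # Initialization: the student starts from previously trained weights
        # (in the combination, weights of the "first plurality").
        student.load_state_dict(pretrained_state, strict=False)
        optimizer = torch.optim.Adam(student.parameters(), lr=lr)
        teacher.eval()
        for _ in range(epochs):
            for x, _ in data_loader:
                with torch.no_grad():
                    t_logits = teacher(x)
                s_logits = student(x)
                # Knowledge distillation: match the teacher's softened outputs.
                loss = F.kl_div(F.log_softmax(s_logits / temperature, dim=1),
                                F.softmax(t_logits / temperature, dim=1),
                                reduction="batchmean") * temperature ** 2
                optimizer.zero_grad()
                loss.backward()
                optimizer.step()
        return student  # mapped to the claimed "distilled neural network"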
Before the effective filing date of the invention, it would have been obvious to a person of ordinary skill in the art to combine the NAS method disclosed by Li in view of Meyer with the fine-tuning disclosed by Tung because it is an effective and versatile method for handling “limited training data” [Tung, 0100] and “different network architectures” [Tung, 0101], so the combination would be more effective overall.
Claim 2
Li in view of Meyer and Tung discloses the elements of the parent claim(s). It also discloses:
[The method of claim 1, further comprising] selecting a second plurality of the plurality of blockwise knowledge distillation trained search blocks based on criteria of a predicted accuracy using the accuracy predictor and a cost function for measuring a cost of implementing the second plurality of the plurality of blockwise knowledge distillation trained search blocks. ([Meyer, 0015-0016, 0036 and figure 3]: Meyer discloses a method of identifying the “Pareto-optimal front” in a space of candidate architectures [Meyer, 0015] in order to “optimiz[e] one or more performance characteristics, including error (or accuracy) and at least one of computation time, latency, energy efficiency, implementation cost (e.g., time, hardware, power, etc.), computational complexity, and the like” [Meyer, 0016]. The Pareto-optimal front with respect to the two criteria of “accuracy (Error %) and performance (Normalized Cost)” is described in [Meyer, 0036 and figure 3]. The architectures in the Pareto-optimal front map to the “second plurality of the plurality of blockwise knowledge distillation trained search blocks” of this claim.)
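For illustration of selection against a Pareto-optimal front over the two criteria of [Meyer, 0036 and figure 3], the following hypothetical sketch (the examiner’s own; Meyer does not prescribe this procedure) keeps exactly those candidates not dominated on both error and cost:

    def pareto_front(candidates):
        # candidates: list of (error, cost, architecture) tuples. A candidate
        # is kept iff no other candidate is at least as good on both criteria
        # and strictly better on at least one.
        front = []
        for e, c, arch in candidates:
            dominated = any(e2 <= e and c2 <= c and (e2 < e or c2 < c)
                            for e2, c2, _ in candidates)
            if not dominated:
                front.append((e, c, arch))
        return front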
The same motivation to combine applies.
Claim 5
Li in view of Meyer and Tung discloses the elements of the parent claim(s). It also discloses:
[The method of claim 1, wherein using the accuracy predictor to select from the search space the neural network comprises] selecting the first plurality of blockwise knowledge distillation trained search blocks using a scenario-aware search to select the first plurality of blockwise knowledge distillation trained search blocks. ([Li, abstract; Meyer, abstract]: As noted above, Li and Meyer both disclose methods for selecting an architecture [Li, abstract; Meyer, abstract], and the blocks in the selected architecture map to the “first plurality” of the claim. Any act of selecting requires a search of the candidate options and is aware of the “scenario” in which it is performed.)
The same motivation to combine applies.
Claim 6
Li in view of Meyer and Tung discloses the elements of the parent claim(s). It also discloses:
[The method of claim 1, further comprising:] initializing the first plurality of blockwise knowledge distillation trained search blocks using weights of the plurality of blockwise knowledge distillation trained search blocks. ([Tung, 0101]: As noted under the parent claim, Tung discloses that a “student network is initialized with source domain… pretrained weights” [Tung, 0101]. In the combination, the pretrained weights are the weights in Li’s blockwise knowledge distillation trained supernet and map to the “weights of the plurality of blockwise knowledge distillation trained search blocks” of the claim. The student network of Tung can be taken to be the optimal network of Li, i.e., the network which comprises the “first plurality” of the claim.)
The same motivation to combine applies.
Claim 7
Li in view of Meyer and Tung discloses the elements of the parent claim(s). It also discloses:
[The method of claim 1, further comprising:] selecting a plurality of neural networks of the search space, ([Meyer, 0015-0016, 0036 and figure 3]: As noted under claim 2, Meyer discloses identifying a Pareto-optimal front of networks that optimize both accuracy and cost. The networks in the Pareto-optimal front map to the “plurality of neural networks of the search space” of the claim.)
wherein each neural network of the plurality of neural networks comprises blockwise knowledge distillation trained search blocks of the first plurality of blockwise knowledge distillation trained search blocks; ([Li, figure 2]: As noted under the parent claims, Li discloses that each neural network in the search space is partitioned into blocks. In particular, this means that each neural network in the Pareto-optimal front is also partitioned into blocks as recited by the claim. The applicant is directed to the related rejection under 35 USC 112(b) above.)
initializing the blockwise knowledge distillation trained search blocks of the plurality of neural networks using weights of the plurality of blockwise knowledge distillation trained search blocks; and fine-tuning the plurality of neural networks using knowledge distillation. ([Tung, 0101]: As noted above, Tung discloses that a “student network is initialized with source domain… pretrained weights” and then “fine-tuned” using “[k]nowledge distillation” [Tung, 0101]. In the combination, the pretrained weights are the weights in Li’s blockwise knowledge distillation trained supernet and map to the “weights of the plurality of blockwise knowledge distillation trained search blocks” of the claim. The student network of Tung can be taken to be a network in the Pareto-optimal front, i.e., one of the “plurality of neural networks” of the claim.)
The same motivation to combine applies.
Claim 9
Li in view of Meyer and Tung discloses the elements of the parent claim(s). It also discloses:
[The method of claim 1, wherein: selecting the neural network from the search space using the accuracy predictor comprises] selecting the neural network based on a search of the first plurality of blockwise knowledge distillation trained search blocks ([Li, section 3 and figure 2; Meyer, 0017-0018]: Li discloses selecting the optimal architecture based on searching the search space [Li, figure 2 and section 3]. Similarly, Meyer discloses selecting a network based on searching a design space [Meyer, 0017-0018]. The search space of Li and/or the design space of Meyer includes the “first plurality” as mapped above, so the selection as mapped under the parent claim is “based on a search of the first plurality” as recited in the claim.) using criteria of a predicted accuracy using the accuracy predictor and a cost function for measuring a cost of implementing blockwise knowledge distillation trained search blocks of the neural network. ([Meyer, 0015-0016, 0036 and figure 3]: Meyer discloses selecting network architectures to “optimiz[e] one or more performance characteristics, including error (or accuracy) and at least one of computation time, latency, energy efficiency, implementation cost (e.g., time, hardware, power, etc.), computational complexity, and the like” [Meyer, 0016].)
The same motivation to combine applies.
Claim 10
Li in view of Meyer and Tung discloses the elements of the parent claim(s). It also discloses:
[The method of claim 1, further comprising:] using blockwise knowledge distillation to train neural network blocks from an extended search space to generate blockwise knowledge distillation trained search blocks and the quality metrics; ([Li, abstract and figure 2]: As noted above, Li discloses using blockwise knowledge distillation to train search blocks. Moreover, Li discloses computing “loss[es]” during the blockwise knowledge distillation [Li, figure 2], and these losses map to the “quality metrics” recited by the claim.)
and using the accuracy predictor to predict accuracy of neural networks in the extended search space, wherein the accuracy predictor is built for the search space different from the extended search space. ([Meyer, figure 1 step 104 and 0016-0018]: As described under claim 1 above, Meyer discloses using a “modelling ANN” to predict the accuracy of candidate architectures. The examiner notes that this predictor can be applied to any architecture, including architectures in an extended search space, as long as the architecture is provided to the predictor in the right format. The recitation of the accuracy predictor being built “for the search space different from the extended search space” is being interpreted as a recitation of an intended use of the accuracy predictor; it does not require that the accuracy predictor be built in any particular way or using any particular data. At most, it requires that the accuracy predictor be capable of making accuracy predictions for different search spaces, with no requirement that it perform this function well on any particular search space.)
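This interpretation may be illustrated by the following hypothetical sketch (the examiner’s own; the encoder and its reuse are assumptions): a predictor built for one search space is applicable to architectures from an extended search space whenever they are encoded in the format the predictor expects:

    import torch

    def predict_for_extended_space(predictor, encode, extended_space):
        # predictor: a trained model taking fixed-length descriptors.
        # encode: maps any architecture, from the original or the extended
        # search space, to such a descriptor; nothing ties the predictor's
        # weights to the space its training data came from.
        with torch.no_grad():
            return [(arch, predictor(encode(arch)).item())
                    for arch in extended_space]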
The same motivation to combine applies.
Claim 11
Li discloses:
quality metrics of a plurality of blockwise knowledge distillation trained search blocks of parameters and configurations of a reference neural network; ([Li, abstract, section 3, and figure 2]: Li discloses a method of neural architecture search (NAS) in which a large search space is modularized into blocks and trained by distilling blockwise knowledge from a teacher model [Li, abstract]. More precisely, Li discloses “formulat[ing] the search space 𝒜 into an over-parameterized supernet such that each of the candidate architecture α is a sub-net of the supernet” [Li, section 3.1, paragraph beginning “Inaccurate evaluation”], “divid[ing the supernet] 𝒩 into N blocks” [Li, section 3.1, paragraph beginning “Block-wise NAS”], and training each block using corresponding blocks of a supervising/teacher model [Li, section 3.2]. For example, [Li, figure 2] depicts a situation where a search space of three candidate architectures is assembled into a student supernet, and each architecture is partitioned into five blocks corresponding to five blocks in a teacher network. The teacher model is the “reference neural network” of the claim, and the blocks of the supernet, after they have undergone the blockwise knowledge distillation procedure described in Li, map to the “plurality of blockwise knowledge distillation trained search blocks” of the claim. The “loss[es]” computed in the process of performing blockwise knowledge distillation [Li, figure 2 and/or section 3.2, equation (6)] map to the “quality metrics” of the claim.)
selecting, from a search space [using the accuracy predictor,] a neural network comprising a first plurality of blockwise knowledge distillation trained search blocks from the plurality of blockwise knowledge distillation trained search blocks; ([Li, section 3]: As noted above, Li discloses a method of neural architecture search. This means that it searches an architecture search space 𝒜 to find an “optimal pair (α^*, ω_α^*) such that the model performance is maximized” [Li, section 3.1, first paragraph]. The architecture search space maps to the “search space” of the claim, and the optimal pair maps to the “neural network” of the claim. The blocks of the optimal architecture α^* map to the “first plurality” of the claim.)
Li does not distinctly disclose:
A computing device, comprising a processor configured with processor-executable instructions to perform operations comprising:
generating an accuracy predictor trained on [quality metrics] … [selecting, from a search space] using the accuracy predictor, [a neural network]
initializing a second plurality of blockwise knowledge distillation trained search blocks using weights of the first plurality of blockwise knowledge distillation trained search blocks;
and fine-tuning the neural network using knowledge distillation to generate a distilled neural network including the first plurality of blockwise knowledge distillation trained search blocks and the second plurality of blockwise knowledge distillation trained search blocks.
Meyer is in the field of neural architecture search [Meyer, abstract] and discloses a method of selecting architectures from candidate architectures in a “design space” [Meyer, abstract and 0017-0018]. In other words, Meyer’s design space corresponds to the search space of Li and of the claim. Moreover, Li in view of Meyer discloses:
A computing device, comprising a processor configured with processor-executable instructions to perform operations comprising: ([Meyer, 0032]: Meyer discloses a “computing device” with a “processing unit” and “computer-executable instructions” for performing the NAS method disclosed therein.)
generating an accuracy predictor trained on [quality metrics] … [selecting, from a search space] using the accuracy predictor, [a neural network] ([Meyer, 0016, 0029, 0031]: Meyer’s NAS method uses a “modelling ANN” which is “trained to estimate one or more performance characteristics of a candidate ANN”, where the performance characteristics include “error (or accuracy)” [Meyer, 0016]. More precisely, “the modelling ANN models the response surface using an MLP model with an input set representative of ANN hyper-parameters and a single output trained to predict the error of corresponding ANN” [Meyer, 0029] and the modelling ANN is “trained using stochastic gradient descent (SGD)” [Meyer, 0031]. The examiner notes that the modelling ANN is also called a response surface modelling (RSM) ANN/network/model therein [Meyer, 0016, 0029-0031]. The modelling ANN maps to the “accuracy predictor” of the claim, and its training maps to the “generating” step of the claim. In the combination, the input parameters for the modelling ANN are the “quality metrics” as mapped above.)
Before the effective filing date of the invention, it would have been obvious to a person of ordinary skill in the art to combine Li’s NAS using blockwise knowledge distillation with Meyer’s use of an accuracy predictor because “training an artificial neural network to predict the performance of future candidate networks” provides “a multi-objective design space exploration method that may assist in reducing the number of solution networks” [Meyer, 0006], thereby resulting in a more effective system.
Li in view of Meyer does not distinctly disclose:
initializing a second plurality of blockwise knowledge distillation trained search blocks using weights of the first plurality of blockwise knowledge distillation trained search blocks;
and fine-tuning the neural network using knowledge distillation to generate a distilled neural network including the first plurality of blockwise knowledge distillation trained search blocks and the second plurality of blockwise knowledge distillation trained search blocks.
Tung is in the field of machine learning. Moreover, Li in view of Meyer and Tung discloses:
initializing a second plurality of blockwise knowledge distillation trained search blocks using weights of the first plurality of blockwise knowledge distillation trained search blocks; ([Tung, 0101]: Tung discloses a method where a “student network is initialized with source domain… pretrained weights” [Tung, 0101]. In the combination, the pretrained weights correspond to the weights ω_α^* of the optimal architecture α^* obtained by blockwise knowledge distillation as in Li (i.e., to the weights of the “first plurality” as mapped above). In other words, the blocks in Tung’s student network map to the “second plurality” of the claim.)
and fine-tuning the neural network using knowledge distillation to generate a distilled neural network including the first plurality of blockwise knowledge distillation trained search blocks and the second plurality of blockwise knowledge distillation trained search blocks. ([Tung, 0101]: Tung discloses that, after initialization, the student network is “fine-tuned” using “[k]nowledge distillation” [Tung, 0101]. The student network after it has been fine-tuned thus maps to the “distilled neural network” of the claim. This network “includ[es] the first plurality… and the second plurality” in the sense that its architecture contains the same blocks as the ones found in the recited pluralities.)
Before the effective filing date of the invention, it would have been obvious to a person of ordinary skill in the art to combine the NAS method disclosed by Li in view of Meyer with the fine-tuning disclosed by Tung because it is an effective and versatile method for handling “limited training data” [Tung, 0100] and “different network architectures” [Tung, 0101], so the combination would be more effective overall.