Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Remarks
This Office Action is responsive to Applicants' Amendment filed on November 25, 2025, in which claims 1, 6, 11, 16, and 18 are currently amended. Claims 5 and 15 are cancelled. Claims 1-4, 6-14, and 16-19 are currently pending.
Response to Arguments
Applicant’s arguments with respect to the rejection of claims 1-19 under 35 U.S.C. 101, as amended, have been fully considered but are not persuasive.
With respect to Applicant’s arguments on p. 8 of the Remarks submitted 11/25/2025 that “the pending claims as amended now more clearly recite a method that cannot be readily performed in the human mind”, Examiner respectfully disagrees. For example, in independent claim 1, with the exception of the recitation of generic computer components used to apply the judicial exception, the claim is wholly directed to a mental process that can be performed entirely in the human mind. Because the claim is wholly directed to a mental process performed on a generic computer, the claim is not seen as improving the functioning of a computer or any other technology (See MPEP 2106.05(a) "It is important to note, the judicial exception alone cannot provide the improvement. The improvement can be provided by one or more additional elements.").
For at least these reasons and those further detailed below, Examiner asserts that it is reasonable and appropriate to maintain the rejection under 35 U.S.C. 101.
Applicant’s arguments with respect to the rejection of claims 1-19 under 35 U.S.C. 102, as amended, have been fully considered but are not persuasive.
With respect to Applicant’s arguments on p. 11 of the Remarks submitted 11/25/2025 that “Dong fails to teach a separate, preceding step of generating a reduced neural search space or supernet before conducting a neural architecture search”, Examiner respectfully disagrees. First, Examiner notes that generating the search space before searching it is a foundational part of neural architecture search. Second, Examiner notes that Dong explicitly describes a generated, bounded search space and multiple search steps that necessarily require the bounded search space to be generated before the search is performed ([p. 5] "In Sec. III-B1 we present our search space of neural architectures. Given a latency constraint, we can first search feasible neural architectures and corresponding mixed-precision bitwidth settings by applying the aforementioned hardware latency model as well as a model quantifying the effect of quantization perturbation. We then use an accuracy predictor to compare across different networks and find the pareto-optimal architectures and quantization settings among all candidates [...] 1) Search Space of Neural Architectures: In HAO, we construct the neural network architectures from subgraphs with feasible hardware mappings on FPGAs. Our subgraphs are combinations of operations such as convolution or depthwise convolution with kernel size of 1 × 1 or k × k as mentioned in the previous section. Although only one subgraph can be chosen on hardware, the possible building blocks for neural architecture search include the sub-layers of the subgraph. This is because each layer in the subgraph can be decided whether to bypass or not using a skip signal in hardware. We set no limit on the total number of subgraphs"). This is also explicitly shown in FIG. 4 of Dong, where “Select Top Candidates” (the second-to-last step in FIG. 4) is a neural architecture search step occurring subsequent to the initial generation of the bounded search space (the first step in FIG. 4).
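To further illustrate the two-stage ordering described above, the following minimal Python sketch shows a bounded search space being generated as a preceding step, with the search then confined to that previously generated space. The function names and the toy scoring function are illustrative assumptions and are not drawn from Dong.

from itertools import product

def generate_search_space(kernel_sizes, bitwidths, channel_sizes):
    # Step 1: enumerate a bounded search space before any search is performed.
    return [
        {"kernel": k, "bits": b, "channels": c}
        for k, b, c in product(kernel_sizes, bitwidths, channel_sizes)
    ]

def search(space, score):
    # Step 2: a subsequent search confined to the previously generated space.
    return max(space, key=score)

# Toy usage: the full space exists before the search step runs.
space = generate_search_space(kernel_sizes=[1, 3], bitwidths=[4, 8], channel_sizes=[32, 64])
best = search(space, score=lambda cand: cand["channels"] / (cand["bits"] * cand["kernel"]))
print(best)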
For at least these reasons and those further detailed below, Examiner asserts that it is reasonable and appropriate to maintain the rejection under 35 U.S.C. 102.
Claim Rejections - 35 USC § 112
The following is a quotation of 35 U.S.C. 112(b):
(b) CONCLUSION.—The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the inventor or a joint inventor regards as the invention.
The following is a quotation of 35 U.S.C. 112 (pre-AIA), second paragraph:
The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the applicant regards as his invention.
Claims 1-19 are rejected under 35 U.S.C. 112(b) or 35 U.S.C. 112 (pre-AIA), second paragraph, as being indefinite for failing to particularly point out and distinctly claim the subject matter which the inventor or a joint inventor (or for applications subject to pre-AIA 35 U.S.C. 112, the applicant), regards as the invention.
Regarding claims 1, 11, and 18, "the neural architecture search space defining boundaries of a separate and subsequent neural architecture search" is indefinite. It is unclear what the neural architecture search is "separate" from, as one of ordinary skill in the art would expect that it cannot be entirely separate from the neural architecture search space which explicitly defines its boundaries. Because "separate" is ambiguous in the claim limitation, the scope of the claim cannot reasonably be determined. In the interest of further examination, the limitation is interpreted as "the neural architecture search space defining boundaries of a subsequent neural architecture search".
The remaining claims are rejected due to their dependence on the rejected claims.
Claim Rejections - 35 USC § 101
35 U.S.C. 101 reads as follows:
Whoever invents or discovers any new and useful process, machine, manufacture, or composition of matter, or any new and useful improvement thereof, may obtain a patent therefor, subject to the conditions and requirements of this title.
Claims 1-4, 6-14, and 16-19 are rejected under 35 U.S.C. § 101 because the claimed invention is directed to non-statutory subject matter.
Regarding Claim 1: Claim 1 is rejected under 35 U.S.C. 101 because the claimed invention is directed to an abstract idea without significantly more.
Step 1 Analysis: Claim 1 is directed to a method, which is directed to a process, one of the statutory categories.
Step 2A Prong One Analysis: Claim 1, under its broadest reasonable interpretation, recites a series of mental processes. For example, but for the generic computer components language, the following limitations in the context of this claim encompass mental processes that can be performed in the human mind:
identifying hardware computing resources for execution of a neural network-based inference engine (observation, evaluation, and judgement),
identifying one or more design constraints for a neural network architecture to implement a neural network based, at least in part, on the identified hardware computing resources (observation, evaluation, and judgement)
determine the neural architecture search space subject to the one or more design constraints, the neural architecture search space defining boundaries of a separate and subsequent neural architecture search for selection of the neural network architecture (observation, evaluation, and judgement)
Therefore, claim 1 recites an abstract idea which is a judicial exception.
Step 2A Prong Two Analysis: Claim 1 recites additional elements “executing computer-readable instructions by one or more processors of a computing device to”. However, these additional features are computer components recited at a high level of generality, such that they amount to no more than mere instructions to apply the judicial exception using a generic computer component. An additional element that merely recites the words “apply it” (or an equivalent) with the judicial exception, or merely includes instructions to implement an abstract idea on a computer, or merely uses a computer as a tool to perform an abstract idea, does not integrate the judicial exception into a practical application (See MPEP 2106.05(f)). Therefore, claim 1 is directed to a judicial exception.
Step 2B Analysis: Claim 1 does not include additional elements that are sufficient to amount to significantly more than the judicial exception. As discussed above with respect to the lack of integration of the abstract idea into a practical application, the additional elements recited in claim 1 amount to no more than mere instructions to apply the judicial exception using a generic computer component.
For the reasons above, claim 1 is rejected as being directed to non-patentable subject matter under § 101. This rejection applies equally to dependent claims 2-4 and 6-10. The additional limitations of the dependent claims are addressed briefly below:
Dependent claim 2 recites additional observation, evaluation, and judgement “determine the neural architecture search space based, at least in part, on combinations of activation bit width and weight bit width to be implemented in one or more layers of the neural network” and “quantify costs associated with the combinations of activation bit width and weight bit width”
Dependent claim 3 recites additional observation, evaluation, and judgement “determine the neural architecture search space further based, at least in part, on candidate channel sizes for at least one of the one or more layers of the neural network based, at least in part, on the combinations of activation bit width and weight bit width” and “quantify costs associated with the candidate channel sizes based, at least in part, on the quantified costs associated with the combinations of activation bit width and weight bit width”
Dependent claim 4 recites additional instructions to apply the judicial exception using generic computer components “at least one of the one or more layers of the neural network comprises a convolution layer to be implemented at least in part by application of a kernel” as well as additional observation, evaluation, and judgement “determine the neural architecture search space further based, at least in part, on available kernel sizes for the kernel”.
Dependent claim 6 recites additional observation, evaluation, and judgement “determine the neural architecture search space further based, at least in part, on candidate operator types for the at least one of the one or more layers of the neural network based, at least in part, on the combinations of activation bit width and weight bit width; and quantify costs associated with the candidate operator types based, at least in part, on the quantified costs associated with the candidate channel sizes”
Dependent claim 7 recites additional observation, evaluation, and judgement “determine a union of candidate design options over the candidate operator types for the at least one of the one or more layers of the neural network; and determine a union of the candidate design options over the combinations of activation bit width and weight bit width based, at least in part, on the determined union of the candidate design options over the candidate operator types for the at least one of the one or more layers of the neural network”
Dependent claim 8 recites additional observation, evaluation, and judgement “determine candidate design options for implementation of the neural network based, at least in part, on the identified hardware computing resources and the one or more design constraints” as well as additional insignificant extra-solution activity of gathering and outputting data (See MPEP 2106.05(g)) “express and/or structure the candidate design options as the neural architecture search space in a non-transitory storage medium” (expressing the candidate design options in a non-transitory storage medium interpreted as storing in memory) which is well-understood, routine, and conventional in the art (See MPEP 2106.05(d)(II)(i))
Dependent claim 9 recites additional observation, evaluation, and judgement “execute the NAS process to select a design option from the neural architecture search space to implement the neural network”
Dependent claim 10 recites additional observation, evaluation, and judgement “wherein at least one of the one or more design constraints are defined by execution latency, operation count, model size, power consumption or memory usage, or a combination thereof”
Regarding Claim 11: Claim 11 is rejected under 35 U.S.C. 101 because the claimed invention is directed to an abstract idea without significantly more.
Step 1 Analysis: Claim 11 is directed to a computing device, which is directed to a machine, one of the statutory categories.
Step 2A Prong One Analysis: Claim 11, under its broadest reasonable interpretation, recites a series of mental processes. For example, but for the generic computer components language, the following limitations in the context of this claim encompass mental processes that can be performed in the human mind:
identify hardware computing resources for execution of a neural network-based inference engine (observation, evaluation, and judgement),
identify one or more design constraints for a neural network architecture to implement a neural network-based inference engine based, at least in part, on the identified hardware computing resources (observation, evaluation, and judgement)
determine a neural network architecture search space subject to the one or more design constraints, the neural network architecture search space to define boundaries of a separate and subsequent neural architecture search for selection of the neural network architecture (observation, evaluation, and judgement)
Therefore, Claim 11 recites an abstract idea which is a judicial exception.
Step 2A Prong Two Analysis: Claim 11 recites additional elements “A computing device, comprising: one or more processors to”. However, these additional features are computer components recited at a high level of generality, such that they amount to no more than mere instructions to apply the judicial exception using a generic computer component. An additional element that merely recites the words “apply it” (or an equivalent) with the judicial exception, or merely includes instructions to implement an abstract idea on a computer, or merely uses a computer as a tool to perform an abstract idea, does not integrate the judicial exception into a practical application (See MPEP 2106.05(f)). Claim 11 also recites the additional element “express and/or structure the neural network architecture search space in a non-transitory storage medium” (expressing the search space in a non-transitory storage medium interpreted as storing in memory), which amounts to mere gathering and outputting of data (See MPEP 2106.05(g)). Therefore, Claim 11 is directed to a judicial exception.
Step 2B Analysis: Claim 11 does not include additional elements that are sufficient to amount to significantly more than the judicial exception. As discussed above with respect to the lack of integration of the abstract idea into a practical application, the additional elements recited in Claim 11 amount to no more than mere instructions to apply the judicial exception using a generic computer component.
For the reasons above, Claim 11 is rejected as being directed to non-patentable subject matter under § 101. This rejection applies equally to dependent claims 12-14 and 16-17. The additional limitations of the dependent claims are addressed briefly below:
Dependent claim 12 recites additional observation, evaluation, and judgement “determine the neural network architecture search space based, at least in part, on combinations activation bit width and weight bit width to be implemented in one or more layers of the neural network-based inference engine; and quantify costs associated with the combinations of activation bit width and weight bit width”
Dependent claim 13 recites additional observation, evaluation, and judgement “determine the neural network architecture search space further based, at least in part, on candidate channel sizes for at least one of the one or more layers of the neural network-based inference engine based, at least in part, on the combinations of activation bit width and weight bit width; and quantify costs associated with the candidate channel sizes based, at least in part, on the quantified costs associated with the combinations of activation bit width and weight bit width”
Dependent claim 14 recites additional instructions to apply the judicial exception using generic computer components “wherein at least one of the one or more layers of the neural network-based inference engine comprises a convolution layer to be implemented at least in part by application of a kernel, and wherein the one or more processors are further to” as well as additional observation, evaluation, and judgement “determine the neural network architecture search space further based, at least in part, on available kernel sizes for the kernel”.
Dependent claim 16 recites additional observation, evaluation, and judgement “determine the neural network architecture search space further based, at least in part, on candidate operator types for the at least one of the one or more layers of the neural network-based inference engine based, at least in part, on the combinations of activation bit width and weight bit width; and quantify costs associated with the candidate operator types based, at least in part, on the quantified costs associated with the candidate channel sizes”
Dependent claim 17 recites additional observation, evaluation, and judgement “wherein at least one of the one or more design constraints are defined by execution latency, operation count, model size, power consumption or memory usage, or a combination thereof”
Regarding Claim 18: Claim 18 is rejected under 35 U.S.C. 101 because the claimed invention is directed to an abstract idea without significantly more.
Step 1 Analysis: Claim 18 is directed to a method, which is directed to a process, one of the statutory categories.
Step 2A Prong One Analysis: Claim 18, under its broadest reasonable interpretation, recites a series of mental processes. For example, but for the generic computer components language, the following limitations in the context of this claim encompass mental processes that can be performed in the human mind:
select a design option for a neural network-based inference engine from a plurality of candidate design options expressed and/or structured as a neural network architecture search space in a non-transitory storage medium, the plurality of candidate design options defining a neural network architecture search space having been determined based, at least in part, on (observation, evaluation, and judgement),
identification of hardware computing resources for execution of the neural network-based inference engine (observation, evaluation, and judgement)
identification of one or more design constraints for a neural network architecture to implement the neural network-based inference engine based, at least in part, on the identified hardware computing resources (observation, evaluation, and judgement)
determination of a neural network architecture search space subject to the one or more design constraints by application of the one or more design constraints to the identification of the hardware computing resources, the neural network architecture search space defining boundaries of a separate and subsequent neural architecture search for an implementation of the neural network-based inference engine (observation, evaluation, and judgement)
Therefore, Claim 18 recites an abstract idea which is a judicial exception.
Step 2A Prong Two Analysis: Claim 18 recites additional elements “executing computer-readable instructions by one or more processors of a computing device to execute a neural network architecture search process to”. However, these additional features are computer components recited at a high level of generality, such that they amount to no more than mere instructions to apply the judicial exception using a generic computer component. An additional element that merely recites the words “apply it” (or an equivalent) with the judicial exception, or merely includes instructions to implement an abstract idea on a computer, or merely uses a computer as a tool to perform an abstract idea, does not integrate the judicial exception into a practical application (See MPEP 2106.05(f)). Therefore, Claim 18 is directed to a judicial exception.
Step 2B Analysis: Claim 18 does not include additional elements that are sufficient to amount to significantly more than the judicial exception. As discussed above with respect to the lack of integration of the abstract idea into a practical application, the additional elements recited in Claim 18 amount to no more than mere instructions to apply the judicial exception using a generic computer component.
For the reasons above, Claim 18 is rejected as being directed to non-patentable subject matter under § 101. This rejection applies equally to dependent claim 19. The additional limitation of the dependent claim is addressed briefly below:
Dependent claim 19 recites additional observation, evaluation, and judgement “wherein at least one of the one or more design constraints are defined by execution latency, operation count, model size, power consumption or memory usage, or a combination thereof”
Therefore, when considering the claim elements separately and in combination, the additional elements do not amount to significantly more than the judicial exception. Accordingly, claims 1-4, 6-14, and 16-19 are rejected under 35 U.S.C. § 101.
Claim Rejections - 35 USC § 102
In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
The following is a quotation of the appropriate paragraphs of 35 U.S.C. 102 that form the basis for the rejections under this section made in this Office action:
A person shall be entitled to a patent unless –
(a)(1) the claimed invention was patented, described in a printed publication, or in public use, on sale, or otherwise available to the public before the effective filing date of the claimed invention.
Claims 1-4, 6-14, and 16-19 are rejected under 35 U.S.C. § 102(a)(1) as being anticipated by Dong (“HAO: Hardware-aware Neural Architecture Optimization for Efficient Inference”, 2021).
[Image: FIG. 1 of Dong]
[Image: FIG. 4 of Dong]
Regarding claim 1, Dong teaches A method of generating a neural architecture search space for a neural architecture search (NAS) process, comprising:([p. 1] "Quantization [9], [19], [50], [53] is a general and effective technique that uses low bitwidth (such as 4-bit or 8-bit) to represent the floating-point weights/activations in neural networks [...] As an example, the quantization algorithm may select a mixture of every bitwidth from 1 bit to 8 bit, and the NAS algorithm may choose to jointly use convolution with different kernel and group sizes")
identifying hardware computing resources for execution of a neural network-based inference engine;([Abstract] "Differing from existing hardware-aware neural architecture search (NAS) algorithms that rely solely on the expensive learning-based approaches, our work incorporates integer programming into the search algorithm to prune the design space. Given a set of hardware resource constraints, our integer programming formulation directly outputs the optimal accelerator configuration for mapping a DNN subgraph that minimizes latency." DNN interpreted as deep neural network.)
identifying one or more design constraints for a neural network architecture to implement a neural network based, at least in part, on the identified hardware computing resources; and([Abstract] "Differing from existing hardware-aware neural architecture search (NAS) algorithms that rely solely on the expensive learning-based approaches, our work incorporates integer programming into the search algorithm to prune the design space. Given a set of hardware resource constraints, our integer programming formulation directly outputs the optimal accelerator configuration for mapping a DNN subgraph that minimizes latency." [p. 2 §III] "In HAO, we expose a large design space in both hardware and algorithm configurations to accelerate DNNs. To efficiently navigate the search space, we first apply integer programming to prune the hardware configuration space by minimizing the latency subject to a set of hardware resource constraints. We then narrow the DNN architecture space by adopting Monte Carlo tree search (MCTS) [24] to minimize the quantization accuracy perturbation while satisfying a given latency constraint")
executing computer-readable instructions by one or more processors of a computing device to determine the neural architecture search space subject to the one or more design constraints, ([p. 2 §III] "In HAO, we expose a large design space in both hardware and algorithm configurations to accelerate DNNs. To efficiently navigate the search space, we first apply integer programming to prune the hardware configuration space by minimizing the latency subject to a set of hardware resource constraints. We then narrow the DNN architecture space by adopting Monte Carlo tree search (MCTS) [24] to minimize the quantization accuracy perturbation while satisfying a given latency constraint. In addition, we develop an accuracy predictor to estimate the accuracy of the DNN to further reduce the overall feedback time for each sample. Our flow produces a pareto-optimal curve between latency and accuracy" See also FIG. 4.)
the neural architecture search space defining boundaries of a separate and subsequent neural architecture search for selection of the neural network architecture.([p. 5] "In Sec. III-B1 we present our search space of neural architectures. Given a latency constraint, we can first search feasible neural architectures and corresponding mixed-precision bitwidth settings by applying the aforementioned hardware latency model as well as a model quantifying the effect of quantization perturbation. We then use an accuracy predictor to compare across different networks and find the pareto-optimal architectures and quantization settings among all candidates [...] 1) Search Space of Neural Architectures: In HAO, we construct the neural network architectures from subgraphs with feasible hardware mappings on FPGAs. Our subgraphs are combinations of operations such as convolution or depthwise convolution with kernel size of 1 × 1 or k × k as mentioned in the previous section. Although only one subgraph can be chosen on hardware, the possible building blocks for neural architecture search include the sub-layers of the subgraph. This is because each layer in the subgraph can be decided whether to bypass or not using a skip signal in hardware. We set no limit on the total number of subgraphs" Dong explicitly presents the search space (defines boundaries of a neural architecture search) and then subsequently performs multiple subsequent search steps "search feasible neural architectures […] by applying the aforementioned hardware latency model", "then use an accuracy predictor to compare across different networks and find the pareto-optimal architectures and quantization settings among all candidates".).
Regarding claim 2, Dong teaches The method of claim 1, and further comprising executing computer-readable instructions by one or more processors of the computing device to: determine the neural architecture search space based, at least in part, on combinations of activation bit width and weight bit width to be implemented in one or more layers of the neural network; and(Dong [p. 1] "Quantization [9], [19], [50], [53] is a general and effective technique that uses low bitwidth (such as 4-bit or 8-bit) to represent the floating-point weights/activations in neural networks [...] As an example, the quantization algorithm may select a mixture of every bitwidth from 1 bit to 8 bit, and the NAS algorithm may choose to jointly use convolution with different kernel and group sizes")
quantify costs associated with the combinations of activation bit width and weight bit width.(Dong See Table II which quantifies costs (framerates) associated with mixed bit-width quantized models.).
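As a non-limiting illustration of enumerating combinations of activation bit width and weight bit width and quantifying a cost for each combination, consider the following minimal Python sketch; the bit-width values and the cost proxies are illustrative assumptions and do not reproduce Dong's cost model.

from itertools import product

activation_bits = [4, 6, 8]
weight_bits = [4, 6, 8]

# Hypothetical cost proxies: storage scales with weight bit width,
# compute scales with the product of activation and weight bit widths.
costs = {
    (a, w): {"storage_proxy": w, "compute_proxy": a * w}
    for a, w in product(activation_bits, weight_bits)
}

for combo in sorted(costs):
    print(combo, costs[combo])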
Regarding claim 3, Dong teaches The method of claim 2, and further comprising executing computer-readable instructions by one or more processors of the computing device to: determine the neural architecture search space further based, at least in part, on candidate channel sizes for at least one of the one or more layers of the neural network based, at least in part, on the combinations of activation bit width and weight bit width; and(Dong [p. 1] "As an example, the quantization algorithm may select a mixture of every bitwidth from 1 bit to 8 bit, and the NAS algorithm may choose to jointly use convolution with different kernel and group sizes" [p. 4 §III] "Given a layer with input channel size IC, output channel size" [p. 5] "We set no limit on the total number of subgraphs and choose the channel size for different layers")
quantify costs associated with the candidate channel sizes based, at least in part, on the quantified costs associated with the combinations of activation bit width and weight bit width.(Dong See Table II which quantifies costs (framerates) associated with models having a channel size of 192x192, 256x256, etc. on mixed bit-width quantized models.).
Regarding claim 4, Dong teaches The method of claim 3, wherein at least one of the one or more layers of the neural network comprises a convolution layer to be implemented at least in part by application of a kernel, the method further comprising executing computer-readable instructions by one or more processors of the computing device to:(Dong [p. 3 §IIIA] "We implement a parameterizable accelerator template in high-level synthesis (HLS). The generated dataflow accelerator can contain M convolution kernels chained through FIFOs to exploit pipeline-level parallelism. Each convolution kernel can be chosen from one of the three convolutions from the kernel pool")
determine the neural architecture search space further based, at least in part, on available kernel sizes for the kernel.(Dong [p. 1] "As an example, the quantization algorithm may select a mixture of every bitwidth from 1 bit to 8 bit, and the NAS algorithm may choose to jointly use convolution with different kernel and group sizes").
Regarding claim 6, Dong teaches The method of claim 3, and further comprising executing computer-readable instructions by one or more processors of the computing device to: determine the neural architecture search space further based, at least in part, on candidate operator types for the at least one of the one or more layers of the neural network based, at least in part, on the combinations of activation bit width and weight bit width; and ([p. 3 §IIIA] "1) Hardware Subgraph Template: As shown in Fig. 1, in HAO, we adopt a subgraph-based hardware design. A subgraph consists of several convolution kernels that are spatially mapped on hardware, which also corresponds to the major building block of neural architecture [...] We implement a parameterizable accelerator template in high-level synthesis (HLS). The generated dataflow accelerator can contain M convolution kernels chained through FIFOs to exploit pipeline-level parallelism. Each convolution kernel can be chosen from one of the three convolutions from the kernel pool: Conv k×k, Depthwise Conv k×k [8], and Conv 1 × 1. The hardware implementation of each kernel typically comprises a weight buffer, a line buffer, a MAC engine, and a quantization unit to rescale outputs. All the computational units are implemented using integer-only arithmetics" Conv kxk, DW Conv kxk, and Conv 1x1 are interpreted as candidate operator types the neural architecture search space is based on. See also FIG. 1.)
quantify costs associated with the candidate operator types based, at least in part, on the quantified costs associated with the candidate channel sizes.([p. 8] "In Fig. 7, we show one of the searched results by HAO. A subgraph {1x1 convolution, 3x3 depthwise convolution, 1x1 convolution} is used in this solution. As can be seen, HAO finds that a 6-bit/7-bit mixed-precision quantization setting is better than 8-bit uniform quantization for weights. In general, lower bit-width means more computation units under the same resource constraints, but it can lead to larger quantization perturbation [...] the results of HAO show that, for our implementation on Zynq ZU3EG, solutions with solely 3 × 3 depthwise convolution perform better than those with a mixture of 3 × 3 and 5 × 5 depthwise convolution. This is due to the fact that when using a mixture of 3 × 3 and 5 × 5 depthwise convolution, either 3×3 or 5×5 kernel will be idle when invoking the accelerator").
Regarding claim 7, Dong teaches The method of claim 6, and further comprising executing computer-readable instructions by the one or more processors of the computing device to: determine a union of candidate design options over the candidate operator types for the at least one of the one or more layers of the neural network; and(Dong [p. 8] "In Fig. 7, we show one of the searched results by HAO. A subgraph {1x1 convolution, 3x3 depthwise convolution, 1x1 convolution} is used in this solution. As can be seen, HAO finds that a 6-bit/7-bit mixed-precision quantization setting is better than 8-bit uniform quantization for weights. In general, lower bit-width means more computation units under the same resource constraints, but it can lead to larger quantization perturbation [...] the results of HAO show that, for our implementation on Zynq ZU3EG, solutions with solely 3 × 3 depthwise convolution perform better than those with a mixture of 3 × 3 and 5 × 5 depthwise convolution. This is due to the fact that when using a mixture of 3 × 3 and 5 × 5 depthwise convolution, either 3×3 or 5×5 kernel will be idle when invoking the accelerator")
determine a union of the candidate design options over the combinations of activation bit width and weight bit width based, at least in part, on the determined union of the candidate design options over the candidate operator types for the at least one of the one or more layers of the neural network.(Dong [p. 8] "In Fig. 7, we show one of the searched results by HAO. A subgraph {1x1 convolution, 3x3 depthwise convolution, 1x1 convolution} is used in this solution. As can be seen, HAO finds that a 6-bit/7-bit mixed-precision quantization setting is better than 8-bit uniform quantization for weights. In general, lower bit-width means more computation units under the same resource constraints, but it can lead to larger quantization perturbation [...] the results of HAO show that, for our implementation on Zynq ZU3EG, solutions with solely 3 × 3 depthwise convolution perform better than those with a mixture of 3 × 3 and 5 × 5 depthwise convolution. This is due to the fact that when using a mixture of 3 × 3 and 5 × 5 depthwise convolution, either 3×3 or 5×5 kernel will be idle when invoking the accelerator").
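As a non-limiting illustration of determining a union of candidate design options over candidate operator types, and then a union over bit-width combinations derived from it, consider the minimal Python sketch below; the operator names and the option tuples are illustrative assumptions, not Dong's data.

# Hypothetical candidate design options per operator type; each tuple is
# (channel size, activation bits, weight bits) and is illustrative only.
options_by_operator = {
    "conv_kxk":    {(32, 8, 8), (64, 8, 6)},
    "dw_conv_kxk": {(32, 6, 6), (64, 8, 6)},
    "conv_1x1":    {(32, 8, 8), (32, 6, 6)},
}

# Union of candidate design options over the candidate operator types.
union_over_operators = set().union(*options_by_operator.values())

# Union over combinations of activation and weight bit width, derived from
# the operator-level union determined above.
union_over_bitwidths = {(act, wt) for _, act, wt in union_over_operators}

print(sorted(union_over_operators))
print(sorted(union_over_bitwidths))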
Regarding claim 8, Dong teaches The method of claim 1, and further comprising executing computer-readable instructions by the one or more processors of the computing device to: determine candidate design options for implementation of the neural network based, at least in part, on the identified hardware computing resources and the one or more design constraints;(Dong [Abstract] "Differing from existing hardware-aware neural architecture search (NAS) algorithms that rely solely on the expensive learning-based approaches, our work incorporates integer programming into the search algorithm to prune the design space. Given a set of hardware resource constraints, our integer programming formulation directly outputs the optimal accelerator configuration for mapping a DNN subgraph that minimizes latency." [p. 2 §III] "In HAO, we expose a large design space in both hardware and algorithm configurations to accelerate DNNs. To efficiently navigate the search space, we first apply integer programming to prune the hardware configuration space by minimizing the latency subject to a set of hardware resource constraints. We then narrow the DNN architecture space by adopting Monte Carlo tree search (MCTS) [24] to minimize the quantization accuracy perturbation while satisfying a given latency constraint")
express and/or structure the candidate design options as the neural architecture search space in a non-transitory storage medium.(Dong [p. 4 §III] "The communication latency for loading the activation on-chip and off-chip can be roughly calculated as: [See Eqn. 11] where bw is the practical bandwidth of off-chip memory. Similarly, the latency of loading weights can be estimated as: [See Eqn. 12] Based on the latency model for a single layer, we can further derive the latency of computing a subgraph").
Regarding claim 9, Dong teaches The method of claim 1, and further comprising executing computer-readable instructions by the one or more processors of the computing device to: execute the NAS process to select a design option from the neural architecture search space to implement the neural network.(Dong [p. 2 §II] "To avoid manual efforts, neural architecture search (NAS) algorithms have been proposed to automatically design pareto-optimal network architectures" See FIG. 4 "Select Top Candidates" from "Neural Architecture Space".).
Regarding claim 10, Dong teaches The method of claim 1, wherein at least one of the one or more design constraints are defined by execution latency, operation count, model size, power consumption or memory usage, or a combination thereof.(Dong [p. 4 §III] "The communication latency for loading the activation on-chip and off-chip can be roughly calculated as: [See Eqn. 11] where bw is the practical bandwidth of off-chip memory. Similarly, the latency of loading weights can be estimated as: [See Eqn. 12] Based on the latency model for a single layer, we can further derive the latency of computing a subgraph").
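As a non-limiting illustration of a design constraint defined by execution latency of the kind quoted above, the minimal Python sketch below estimates an off-chip transfer time as bytes moved divided by bandwidth and checks it against a latency budget; the numeric values are illustrative assumptions and do not reproduce Dong's Eqns. 11-12.

def communication_latency_s(num_values, bits_per_value, bandwidth_bytes_per_s):
    # Rough off-chip transfer time: bytes moved divided by practical bandwidth.
    bytes_moved = num_values * bits_per_value / 8
    return bytes_moved / bandwidth_bytes_per_s

# Hypothetical constraint check for one layer's activation tensor.
latency = communication_latency_s(num_values=224 * 224 * 64,
                                  bits_per_value=8,
                                  bandwidth_bytes_per_s=4e9)
LATENCY_BUDGET_S = 0.002
print(latency, latency <= LATENCY_BUDGET_S)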
Regarding claim 11, Dong teaches A computing device, comprising: one or more processors to:([p. 2 §II] "which typically requires a large number of computational resources (48,000 GPU hours). [32] applies evolutionary algorithm to search for efficient neural architectures, which is feasible but also costly (75,600 GPU hours). Differential search based NAS methods [6], [27], [37], [43] significantly reduce the search cost")
identify hardware computing resources for execution of a neural network-based inference engine;([Abstract] "Differing from existing hardware-aware neural architecture search (NAS) algorithms that rely solely on the expensive learning-based approaches, our work incorporates integer programming into the search algorithm to prune the design space. Given a set of hardware resource constraints, our integer programming formulation directly outputs the optimal accelerator configuration for mapping a DNN subgraph that minimizes latency." DNN interpreted as deep neural network.)
identify one or more design constraints for a neural network architecture to implement a neural network-based inference engine based, at least in part, on the identified hardware computing resources;([Abstract] "Differing from existing hardware-aware neural architecture search (NAS) algorithms that rely solely on the expensive learning-based approaches, our work incorporates integer programming into the search algorithm to prune the design space. Given a set of hardware resource constraints, our integer programming formulation directly outputs the optimal accelerator configuration for mapping a DNN subgraph that minimizes latency." [p. 2 §III] "In HAO, we expose a large design space in both hardware and algorithm configurations to accelerate DNNs. To efficiently navigate the search space, we first apply integer programming to prune the hardware configuration space by minimizing the latency subject to a set of hardware resource constraints. We then narrow the DNN architecture space by adopting Monte Carlo tree search (MCTS) [24] to minimize the quantization accuracy perturbation while satisfying a given latency constraint")
determine a neural network architecture search space subject to the one or more design constraints, ([p. 2 §III] "In HAO, we expose a large design space in both hardware and algorithm configurations to accelerate DNNs. To efficiently navigate the search space, we first apply integer programming to prune the hardware configuration space by minimizing the latency subject to a set of hardware resource constraints. We then narrow the DNN architecture space by adopting Monte Carlo tree search (MCTS) [24] to minimize the quantization accuracy perturbation while satisfying a given latency constraint. In addition, we develop an accuracy predictor to estimate the accuracy of the DNN to further reduce the overall feedback time for each sample. Our flow produces a pareto-optimal curve between latency and accuracy")
the neural network architecture search space to define boundaries of a separate and subsequent neural architecture search for selection of the neural network architecture; and([p. 5] "In Sec. III-B1 we present our search space of neural architectures. Given a latency constraint, we can first search feasible neural architectures and corresponding mixed-precision bitwidth settings by applying the aforementioned hardware latency model as well as a model quantifying the effect of quantization perturbation. We then use an accuracy predictor to compare across different networks and find the pareto-optimal architectures and quantization settings among all candidates [...] 1) Search Space of Neural Architectures: In HAO, we construct the neural network architectures from subgraphs with feasible hardware mappings on FPGAs. Our subgraphs are combinations of operations such as convolution or depthwise convolution with kernel size of 1 × 1 or k × k as mentioned in the previous section. Although only one subgraph can be chosen on hardware, the possible building blocks for neural architecture search include the sub-layers of the subgraph. This is because each layer in the subgraph can be decided whether to bypass or not using a skip signal in hardware. We set no limit on the total number of subgraphs" Dong explicitly presents the search space (defines boundaries of a neural architecture search) and then subsequently performs multiple subsequent search steps "search feasible neural architectures […] by applying the aforementioned hardware latency model", "then use an accuracy predictor to compare across different networks and find the pareto-optimal architectures and quantization settings among all candidates".)
express and/or structure the neural network architecture search space in a non-transitory storage medium.([p. 4 §III] "The communication latency for loading the activation on-chip and off-chip can be roughly calculated as: [See Eqn. 11] where bw is the practical bandwidth of off-chip memory. Similarly, the latency of loading weights can be estimated as: [See Eqn. 12] Based on the latency model for a single layer, we can further derive the latency of computing a subgraph" Eqn. 11 interpreted as structuring the search space in a non-transitory storage medium (memory).).
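As a non-limiting illustration of expressing and structuring a search space in a non-transitory storage medium, the minimal Python sketch below enumerates a hypothetical space and writes it to persistent storage; the file name and the dictionary layout are illustrative assumptions, not Dong's implementation.

import json
from itertools import product

# Hypothetical search space: every (kernel size, activation bits, weight bits) combination.
search_space = [
    {"kernel": k, "activation_bits": a, "weight_bits": w}
    for k, a, w in product([1, 3, 5], [4, 8], [4, 8])
]

# Structuring the enumerated space in a non-transitory storage medium (a file on disk).
with open("search_space.json", "w") as f:
    json.dump(search_space, f, indent=2)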
Regarding claim 12, Dong teaches The computing device of claim 11, wherein the one or more processors are further to: determine the neural network architecture search space based, at least in part, on combinations activation bit width and weight bit width to be implemented in one or more layers of the neural network-based inference engine; and(Dong [p. 1] "Quantization [9], [19], [50], [53] is a general and effective technique that uses low bitwidth (such as 4-bit or 8-bit) to represent the floating-point weights/activations in neural networks [...] As an example, the quantization algorithm may select a mixture of every bitwidth from 1 bit to 8 bit, and the NAS algorithm may choose to jointly use convolution with different kernel and group sizes")
quantify costs associated with the combinations of activation bit width and weight bit width.(Dong See Table II which quantifies costs (framerates) associated with mixed bit-width quantized models.).
Regarding claim 13, Dong teaches The computing device of claim 12, wherein the one or more processors are further to: determine the neural network architecture search space further based, at least in part, on candidate channel sizes for at least one of the one or more layers of the neural network-based inference engine based, at least in part, on the combinations of activation bit width and weight bit width; and(Dong [p. 1] "As an example, the quantization algorithm may select a mixture of every bitwidth from 1 bit to 8 bit, and the NAS algorithm may choose to jointly use convolution with different kernel and group sizes" [p. 4 §III] "Given a layer with input channel size IC, output channel size" [p. 5] "We set no limit on the total number of subgraphs and choose the channel size for different layers")
quantify costs associated with the candidate channel sizes based, at least in part, on the quantified costs associated with the combinations of activation bit width and weight bit width.(Dong See Table II which quantifies costs (framerates) associated with models having a channel size of 192x192, 256x256, etc. on mixed bit-width quantized models.).
Regarding claim 14, Dong teaches The computing device of claim 13, wherein at least one of the one or more layers of the neural network-based inference engine comprises a convolution layer to be implemented at least in part by application of a kernel, and wherein the one or more processors are further to: (Dong [p. 3 §IIIA] "We implement a parameterizable accelerator template in high-level synthesis (HLS). The generated dataflow accelerator can contain M convolution kernels chained through FIFOs to exploit pipeline-level parallelism. Each convolution kernel can be chosen from one of the three convolutions from the kernel pool")
determine the neural network architecture search space further based, at least in part, on available kernel sizes for the kernel.(Dong [p. 1] "As an example, the quantization algorithm may select a mixture of every bitwidth from 1 bit to 8 bit, and the NAS algorithm may choose to jointly use convolution with different kernel and group sizes").
Regarding claim 16, Dong teaches The computing device of claim 13, wherein the one or more processors are further to: determine the neural network architecture search space further based, at least in part, on candidate operator types for the at least one of the one or more layers of the neural network-based inference engine based, at least in part, on the combinations of activation bit width and weight bit width; and(Dong [p. 3 §IIIA] "1) Hardware Subgraph Template: As shown in Fig. 1, in HAO, we adopt a subgraph-based hardware design. A subgraph consists of several convolution kernels that are spatially mapped on hardware, which also corresponds to the major building block of neural architecture [...] We implement a parameterizable accelerator template in high-level synthesis (HLS). The generated dataflow accelerator can contain M convolution kernels chained through FIFOs to exploit pipeline-level parallelism. Each convolution kernel can be chosen from one of the three convolutions from the kernel pool: Conv k×k, Depthwise Conv k×k [8], and Conv 1 × 1. The hardware implementation of each kernel typically comprises a weight buffer, a line buffer, a MAC engine, and a quantization unit to rescale outputs. All the computational units are implemented using integer-only arithmetic" Conv kxk, DW Conv kxk, and Conv 1x1 are interpreted as candidate operator types the neural architecture search space is based on. See also FIG. 1.)
quantify costs associated with the candidate operator types based, at least in part, on the quantified costs associated with the candidate channel sizes.(Dong [p. 8] "In Fig. 7, we show one of the searched results by HAO. A subgraph {1x1 convolution, 3x3 depthwise convolution, 1x1 convolution} is used in this solution. As can be seen, HAO finds that a 6-bit/7-bit mixed-precision quantization setting is better than 8-bit uniform quantization for weights. In general, lower bit-width means more computation units under the same resource constraints, but it can lead to larger quantization perturbation [...] the results of HAO show that, for our implementation on Zynq ZU3EG, solutions with solely 3 × 3 depthwise convolution perform better than those with a mixture of 3 × 3 and 5 × 5 depthwise convolution. This is due to the fact that when using a mixture of 3 × 3 and 5 × 5 depthwise convolution, either 3×3 or 5×5 kernel will be idle when invoking the accelerator").
Regarding claim 17, Dong teaches The computing device of claim 11, wherein at least one of the one or more design constraints are defined by execution latency, operation count, model size, power consumption or memory usage, or a combination thereof.(Dong [p. 4 §III] "The communication latency for loading the activation on-chip and off-chip can be roughly calculated as: [See Eqn. 11] where bw is the practical bandwidth of off-chip memory. Similarly, the latency of loading weights can be estimated as: [See Eqn. 12] Based on the latency model for a single layer, we can further derive the latency of computing a subgraph").
Regarding claim 18, Dong teaches A method comprising: executing computer-readable instructions by one or more processors of a computing device to execute a neural network architecture search process ([p. 2 §II] "which typically requires a large number of computational resources (48,000 GPU hours). [32] applies evolutionary algorithm to search for efficient neural architectures, which is feasible but also costly (75,600 GPU hours). Differential search based NAS methods [6], [27], [37], [43] significantly reduce the search cost")
to select a design option for a neural network-based inference engine from a plurality of candidate design options expressed and/or structured as a neural network architecture search space in a non-transitory storage medium, the plurality of candidate design options defining a neural network architecture search space having been determined based, at least in part, on:([p. 2 §II] "To avoid manual efforts, neural architecture search (NAS) algorithms have been proposed to automatically design pareto-optimal network architectures" See FIG. 4 "Select Top Candidates" from "Neural Architecture Space".)
identification of hardware computing resources for execution of the neural network-based inference engine;([Abstract] "Differing from existing hardware-aware neural architecture search (NAS) algorithms that rely solely on the expensive learning-based approaches, our work incorporates integer programming into the search algorithm to prune the design space. Given a set of hardware resource constraints, our integer programming formulation directly outputs the optimal accelerator configuration for mapping a DNN subgraph that minimizes latency." DNN interpreted as deep neural network.)
identification of one or more design constraints for a neural network architecture to implement the neural network-based inference engine based, at least in part, on the identified hardware computing resources;([Abstract] "Differing from existing hardware-aware neural architecture search (NAS) algorithms that rely solely on the expensive learning-based approaches, our work incorporates integer programming into the search algorithm to prune the design space. Given a set of hardware resource constraints, our integer programming formulation directly outputs the optimal accelerator configuration for mapping a DNN subgraph that minimizes latency." [p. 2 §III] "In HAO, we expose a large design space in both hardware and algorithm configurations to accelerate DNNs. To efficiently navigate the search space, we first apply integer programming to prune the hardware configuration space by minimizing the latency subject to a set of hardware resource constraints. We then narrow the DNN architecture space by adopting Monte Carlo tree search (MCTS) [24] to minimize the quantization accuracy perturbation while satisfying a given latency constraint")
and determination of a neural network architecture search space subject to the one or more design constraints by application of the one or more design constraints to the identification of the hardware computing resources([p. 2 §III] "In HAO, we expose a large design space in both hardware and algorithm configurations to accelerate DNNs. To efficiently navigate the search space, we first apply integer programming to prune the hardware configuration space by minimizing the latency subject to a set of hardware resource constraints. We then narrow the DNN architecture space by adopting Monte Carlo tree search (MCTS) [24] to minimize the quantization accuracy perturbation while satisfying a given latency constraint. In addition, we develop an accuracy predictor to estimate the accuracy of the DNN to further reduce the overall feedback time for each sample. Our flow produces a pareto-optimal curve between latency and accuracy")
the neural network architecture search space defining boundaries of a separate and subsequent neural architecture search for an implementation of the neural network-based inference engine.([p. 5] "In Sec. III-B1 we present our search space of neural architectures. Given a latency constraint, we can first search feasible neural architectures and corresponding mixed-precision bitwidth settings by applying the aforementioned hardware latency model as well as a model quantifying the effect of quantization perturbation. We then use an accuracy predictor to compare across different networks and find the pareto-optimal architectures and quantization settings among all candidates [...] 1) Search Space of Neural Architectures: In HAO, we construct the neural network architectures from subgraphs with feasible hardware mappings on FPGAs. Our subgraphs are combinations of operations such as convolution or depthwise convolution with kernel size of 1 × 1 or k × k as mentioned in the previous section. Although only one subgraph can be chosen on hardware, the possible building blocks for neural architecture search include the sub-layers of the subgraph. This is because each layer in the subgraph can be decided whether to bypass or not using a skip signal in hardware. We set no limit on the total number of subgraphs" Dong explicitly presents the search space (defines boundaries of a neural architecture search) and then subsequently performs multiple subsequent search steps "search feasible neural architectures […] by applying the aforementioned hardware latency model", "then use an accuracy predictor to compare across different networks and find the pareto-optimal architectures and quantization settings among all candidates".).
Regarding claim 19, Dong teaches The method of claim 18, wherein at least one of the one or more design constraints are defined by execution latency, operation count, model size, power consumption or memory usage, or a combination thereof.(Dong [p. 4 §III] "The communication latency for loading the activation on-chip and off-chip can be roughly calculated as: [See Eqn. 11] where bw is the practical bandwidth of off-chip memory. Similarly, the latency of loading weights can be estimated as: [See Eqn. 12] Based on the latency model for a single layer, we can further derive the latency of computing a subgraph").
Conclusion
The prior art made of record and not relied upon is considered pertinent to applicant's disclosure. Wu (“FBNet: Hardware-Aware Efficient ConvNet Design via Differentiable Neural Architecture Search”, 2019) is directed towards a hardware-aware neural architecture search that clearly distinguishes the search space generation from the search step.
THIS ACTION IS MADE FINAL. Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a).
A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action. In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any extension fee pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action. In no event, however, will the statutory period for reply expire later than SIX MONTHS from the mailing date of this final action.
Any inquiry concerning this communication or earlier communications from the examiner should be directed to SIDNEY VINCENT BOSTWICK whose telephone number is (571)272-4720. The examiner can normally be reached M-F 7:30am-5:00pm EST.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Miranda Huang can be reached on (571)270-7092. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.
/SIDNEY VINCENT BOSTWICK/Examiner, Art Unit 2124
/MIRANDA M HUANG/Supervisory Patent Examiner, Art Unit 2124