DETAILED ACTION
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Claim Rejections - 35 USC § 112(b)
The following is a quotation of 35 U.S.C. 112(b):
(b) CONCLUSION.—The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the inventor or a joint inventor regards as the invention.
The following is a quotation of 35 U.S.C. 112 (pre-AIA), second paragraph:
The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the applicant regards as his invention.
Claims 1-16 and 19-31 are rejected under 35 U.S.C. 112(b) or 35 U.S.C. 112 (pre-AIA), second paragraph, as being indefinite for failing to particularly point out and distinctly claim the subject matter which the inventor or a joint inventor (or for applications subject to pre-AIA 35 U.S.C. 112, the applicant), regards as the invention. Specifically, exemplary claim 1 recites, “the instance proposal distributions” twice without proper antecedent basis. For this reason, the above-listed claims are rejected for containing this language or being dependent on a claim that contains this language.
Claim Rejections - 35 USC § 103
In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis (i.e., changing from AIA to pre-AIA) for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.
Claim(s) 1, 4, 10, 11, 13, 15, 16, 19, 22, 24, 25, 27, and 29 is/are rejected under 35 U.S.C. 103 as being unpatentable over Che et al. (hereinafter Che), U.S. Patent Application Publication 2020/0249998, in view of Caldas et al. (hereinafter Caldas), “A design optimization tool based on a genetic algorithm,” and further in view of Portugal et al. (hereinafter Portugal), “A Study of Genetic Algorithms for Approximating the Longest Path in Generic Graphs.”
Regarding Claim 1, Che discloses a method comprising:
obtaining data representing an input computation graph, the input computation graph comprising a plurality of nodes that are connected by edges, the nodes representing operations and the edges representing dependencies between the operations [“Graph generator 211 can compile a source code for a machine-learning model or neural network model to generate a computation graph representing the source code…the graph generator 211 can generate a computation graph in a form of a Directed Acyclic Graph (DAG)…Nodes represent variables, weights, or computation operations, while edges represent data or tensor flowing from one node to another. An incoming edge to a node representing a computation operation is input data consumed by the computation operation, while an outgoing edge from the node represents output data produced by the computation operation” ¶33];
processing the data representing the input computation graph using a graph neural network having a plurality of network parameters, wherein the graph neural network has been trained to process the data representing the input computation graph in accordance with first values of the network parameters to generate one or more instance-specific proposal distributions for an optimization algorithm that schedules the input computation graph for execution across a plurality of hardware devices to optimize a performance metric for the hardware devices during the execution of the computation graph [“The optimization of the task allocation optimizer 215 is performed per a subset of the computation graph…the task allocation optimizer 215 may use a reinforcement learning algorithm to optimize both the execution order and device placement. The reinforcement learning algorithm used by the task allocation optimizer 215 will be explained referring to FIG. 5…In reinforcement learning, an agent 501 makes observations to an environment 502 and takes actions within the environment 502…The agent 501 can use a policy network to determine its actions. In FIG. 5, the policy network of the agent 501 is illustrated as a neural network including input layer, output layer, and one or more hidden layers. Consistent with embodiments of the present disclosure, any policy-based neural network can be used as the policy network for the agent 501…the policy network of the agent 501 may generate a probability distribution over all possible actions. An action can be taken according to this probability distribution, leading to a new state or task allocation model with a reward” ¶45-46; Fig. 5],
a candidate schedule [“Scheduler 210 is configured to schedule tasks” ¶29],
generating a schedule for the input computation graph that is predicted to optimize the performance metric [“Scheduler 210 is configured to schedule tasks with respect to execution order of operations and which operation is processed in which target device or which operation is processed in which processing element” ¶29] by performing the optimization algorithm in accordance with the one or more instance-specific proposal distributions generated by the graph neural network for the input computation graph [“the task allocation optimizer 215 may use a reinforcement learning algorithm to optimize both the execution order and device placement.” ¶45; “the policy network of the agent 501 may generate a probability distribution over all possible actions. An action can be taken according to this probability distribution” ¶46], and
using the instance-specific proposal distributions generated by the graph neural network by processing the data representing the input computation graph [“the policy network of the agent 501 may generate a probability distribution over all possible actions. An action can be taken according to this probability distribution” ¶46],
executing the input computation graph on the plurality of hardware devices by causing the plurality of hardware devices to perform the operations represented by the nodes in the input computation graph in accordance with the generated schedule [“The optimization of the task allocation optimizer 215 is performed per a subset of the computation graph…the task allocation optimizer 215 may use a reinforcement learning algorithm to optimize both the execution order and device placement. The reinforcement learning algorithm used by the task allocation optimizer 215 will be explained referring to FIG. 5…In reinforcement learning, an agent 501 makes observations to an environment 502 and takes actions within the environment 502…The agent 501 can use a policy network to determine its actions…the policy network of the agent 501 may generate a probability distribution over all possible actions. An action can be taken according to this probability distribution, leading to a new state or task allocation model with a reward” and pg. 6 [0049]: “For a sequence of nodes, a change in the sequence of nodes can be an action…That is, the agent 501 can take an action to change a target device to execute a certain operation represented by a node (e.g., FPGA to GPU)” ¶45-46].
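For illustration only (this sketch is not part of the record): the computation-graph structure Che describes in ¶33 is a DAG whose nodes are operations and whose edges are tensor dependencies, and a schedule must respect those dependencies. The node names, device names, and round-robin placement policy below are hypothetical placeholders, not Che's method.

```python
# Illustrative sketch: a DAG of operations scheduled across devices in an
# order that respects edge dependencies (Kahn's topological sort).
from collections import deque

def topological_schedule(nodes, edges, devices):
    """Assign each op a device (round-robin placeholder policy) and an
    execution order that never runs an op before its inputs are produced."""
    indegree = {n: 0 for n in nodes}
    succs = {n: [] for n in nodes}
    for src, dst in edges:
        succs[src].append(dst)
        indegree[dst] += 1
    ready = deque(n for n in nodes if indegree[n] == 0)
    schedule = []
    while ready:
        op = ready.popleft()
        schedule.append((op, devices[len(schedule) % len(devices)]))
        for nxt in succs[op]:
            indegree[nxt] -= 1
            if indegree[nxt] == 0:
                ready.append(nxt)
    return schedule

# Example: a matmul feeds an add, which feeds a relu.
sched = topological_schedule(
    nodes=["matmul", "add", "relu"],
    edges=[("matmul", "add"), ("add", "relu")],
    devices=["gpu0", "gpu1"],
)
```

Any dependency-respecting ordering is a valid candidate schedule; an optimizer such as Che's would search among such candidates for one minimizing the performance metric.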
However, Che fails to explicitly disclose wherein the optimization algorithm is a genetic algorithm, wherein the genetic algorithm repeatedly adds new chromosomes to a population of chromosomes each representing a candidate schedule, wherein each new chromosome is (i) a child chromosome generated by selecting two or more parent chromosomes and applying a crossover procedure given the two parent chromosomes or (ii) a mutated chromosome,
wherein performing the optimization algorithm comprises, for each new chromosome: (i) sampling the mutation from a set of possible mutations using the instance-specific proposal distributions generated by the graph neural network by processing the data representing the input computation graph, wherein the instance proposal distributions include the one or more independent beta distributions comprising the one or more node-specific probability distributions, or (ii) selecting the crossover procedure using the instance-specific proposal distributions generated by the graph neural network by processing the data representing the input computation graph, wherein the instance proposal distributions include node-specific elite biases, or (iii) both in place of using fixed distributions that are shared across computation graphs.
Caldas discloses wherein the optimization algorithm is a genetic algorithm [“goal-oriented design was the application of a search and optimization technique borrowed from the field of artificial intelligence, Genetic Algorithms” §1 ¶3], wherein the genetic algorithm repeatedly adds new chromosomes to a population of chromosomes each representing a candidate schedule [“a solution to a problem is an individual and the group of solutions existent at each stage is a population. Each time a new population of individuals is created, it is called a generation.” §4 ¶2], wherein each new chromosome is (i) a child chromosome generated by selecting two or more parent chromosomes and applying a crossover procedure given the two parent chromosomes [“Crossover implies that parts of two randomly chosen chromosomes will be swapped to create a new individual.” §4 ¶3] or (ii) a mutated chromosome [“Mutation involves randomly changing an allele in a solution to look for new points in the solution space.” §4 ¶3],
wherein performing the optimization algorithm comprises, for each new chromosome: (i) sampling the mutation from a set of possible mutations [“probability of a given solution being chosen for reproduction is proportional to the fitness of that solution.” §4 ¶3; “Mutation involves randomly changing an allele in a solution to look for new points in the solution space.” §4 ¶3; Note: the mutation is randomly sampled from all possible mutations] using the instance-specific proposal distributions generated by the graph neural network by processing the data representing the input computation graph, wherein the instance proposal distributions include the one or more independent beta distributions comprising the one or more node-specific probability distributions, or (ii) selecting the crossover procedure [“probability of a given solution being chosen for reproduction is proportional to the fitness of that solution.” §4 ¶3; “Crossover implies that parts of two randomly chosen chromosomes will be swapped to create a new individual.” §4 ¶3] using the instance-specific proposal distributions generated by the graph neural network by processing the data representing the input computation graph, wherein the instance proposal distributions include node-specific elite biases, or (iii) both in place of using fixed distributions that are shared across computation graphs.
It would have been obvious to one of ordinary skill in the art, before the effective filing date of the claimed invention and with the teachings of Che and Caldas before them, to modify the method of Che, which uses probability distributions to guide optimization, to incorporate the genetic-algorithm optimization of Caldas.
Given that genetic algorithms have proven highly effective in solving complex optimization problems, one of ordinary skill in the art would have been motivated to make this modification.
However, the combination of Che and Caldas fails to explicitly disclose and wherein the one or more instance-specific proposal distributions include (i) one or more independent beta distributions comprising one or more node-specific probability distributions for each of the plurality of nodes used by the genetic algorithm when constructing an initial population and generating each mutated chromosome, (ii) node-specific elite biases for each node used by the genetic algorithm when generating each child chromosome, or (iii) both;
wherein performing the optimization algorithm comprises, for each new chromosome: (i) sampling the mutation from a set of possible mutations using the instance-specific proposal distributions generated by the graph neural network by processing the data representing the input computation graph, wherein the instance proposal distributions include the one or more independent beta distributions comprising the one or more node-specific probability distributions, or (ii) selecting the crossover procedure using the instance-specific proposal distributions generated by the graph neural network by processing the data representing the input computation graph, wherein the instance proposal distributions include node-specific elite biases, or (iii) both in place of using fixed distributions that are shared across computation graphs.
Portugal discloses and wherein the one or more instance-specific proposal distributions include (i) one or more independent beta distributions comprising one or more node-specific probability distributions for each of the plurality of nodes used by the genetic algorithm when constructing an initial population and generating each mutated chromosome, (ii) node-specific elite biases for each node used by the genetic algorithm when generating each child chromosome, or (iii) both [“It is not clear what is the appropriate elite fraction to use for each approach. Results denote an apparent pattern where 1-5% groups perform better for smaller population sizes and 5-10% fractions are favourable otherwise. The choice of this value is important to avoid imposing too much selective pressure and a premature convergence to a local optimum” §IV ¶7];
wherein performing the optimization algorithm comprises, for each new chromosome: (i) sampling the mutation from a set of possible mutations using the instance-specific proposal distributions generated by the graph neural network by processing the data representing the input computation graph, wherein the instance proposal distributions include the one or more independent beta distributions comprising the one or more node-specific probability distributions, or (ii) selecting the crossover procedure using the instance-specific proposal distributions generated by the graph neural network by processing the data representing the input computation graph, wherein the instance proposal distributions include node-specific elite biases [“It is not clear what is the appropriate elite fraction to use for each approach. Results denote an apparent pattern where 1-5% groups perform better for smaller population sizes and 5-10% fractions are favourable otherwise. The choice of this value is important to avoid imposing too much selective pressure and a premature convergence to a local optimum” §IV ¶7], or (iii) both in place of using fixed distributions that are shared across computation graphs.
It would have been obvious to one of ordinary skill in the art, before the effective filing date of the claimed invention and with the teachings of Che, Caldas, and Portugal before them, to modify the combination to incorporate node-specific elite biases as taught by Portugal.
Given the advantage of improved convergence and performance, one of ordinary skill in the art would have been motivated to make this modification.
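For illustration only (this sketch is not part of the record): the claimed combination uses a genetic algorithm whose mutation choices are sampled from node-specific Beta proposal distributions rather than a fixed shared rate, and which retains an elite fraction as in Portugal §IV. The fitness function, graph size, and Beta parameters below are hypothetical placeholders standing in for the claimed graph neural network's outputs.

```python
# Illustrative sketch: a GA over device assignments where each node's
# mutation probability is drawn from a per-node Beta proposal distribution.
import random

random.seed(0)
NUM_NODES, NUM_DEVICES, POP, GENERATIONS = 6, 3, 20, 30

# Hypothetical per-node Beta(a, b) parameters; in the claim these would be
# instance-specific outputs of a graph neural network, not constants.
beta_params = [(2.0, 5.0)] * NUM_NODES

def fitness(chrom):
    # Placeholder objective: balance ops across devices (smaller spread = fitter).
    loads = [chrom.count(d) for d in range(NUM_DEVICES)]
    return -(max(loads) - min(loads))

def mutate(chrom):
    # Sample each node's mutation probability from its own Beta proposal.
    return [random.randrange(NUM_DEVICES)
            if random.random() < random.betavariate(*beta_params[i]) else g
            for i, g in enumerate(chrom)]

def crossover(p1, p2):
    # One-point crossover of two parent chromosomes.
    cut = random.randrange(1, NUM_NODES)
    return p1[:cut] + p2[cut:]

population = [[random.randrange(NUM_DEVICES) for _ in range(NUM_NODES)]
              for _ in range(POP)]
for _ in range(GENERATIONS):
    population.sort(key=fitness, reverse=True)
    elite = population[:POP // 4]           # elite fraction, cf. Portugal §IV
    children = [mutate(crossover(random.choice(elite), random.choice(elite)))
                for _ in range(POP - len(elite))]
    population = elite + children

best = max(population, key=fitness)
```

Replacing `beta_params` with per-node outputs of a trained network, and biasing parent selection with node-specific elite biases, would yield the instance-specific behavior the claim recites, as opposed to fixed distributions shared across computation graphs.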
Regarding Claim 4,
Che, Caldas, and Portugal disclose the method of claim 1.
Che further teaches wherein the graph neural network has been trained to generate instance-specific probability distributions that result in schedules that optimize the performance metric for executing the input computation graph across the plurality of devices (pg. 6 [0045]-[0046] teaches training a graph neural network to generate probability distribution that results in an optimized execution order (schedule); pg. 1 [0006]: “The policy network can be updated according to a reward determined by performance evaluation of the action in runtime environments for executing the computation graph. The reward can be determined based on execution delay or memory usage efficiency” teaches optimizing a performance metric, including execution delay or memory usage efficiency).
Regarding Claim 10,
Che, Caldas, and Portugal disclose the method of claim 1.
Che further teaches wherein the optimization algorithm generates an assignment that assigns each operation to a respective device from the plurality of devices (Fig. 5 and pg. 6 [0045]-[0046]: “The optimization of the task allocation optimizer 215 is performed per a subset of the computation graph…the task allocation optimizer 215 may use a reinforcement learning algorithm to optimize both the execution order and device placement. The reinforcement learning algorithm used by the task allocation optimizer 215 will be explained referring to FIG. 5” teach the reinforcement learning algorithm (optimization algorithm) generates optimized execution order and device placement).
Regarding Claim 11,
Che, Caldas, and Portugal disclose the method of claim 10.
Che further teaches wherein the optimization algorithm generates, for each device, a device schedule that defines an execution order for a plurality of operations that have been assigned to the device (pg. 6 [0049]: “For a sequence of nodes, a change in the sequence of nodes can be an action. For example, a new sequence of nodes [n13, n14, n15, n16, n17], which is different from the original [n13, n15, n14, n16, n17] and still meets the dependency requirement for the subset S21, can be chosen as an action. For a sequence of target devices, a target device change in at least one position of the inputted sequence of target devices can be an action” teaches the reinforcement learning algorithm (optimization algorithm) generates a sequence of nodes (corresponds to execution order of operations) for each target device).
Regarding Claim 13,
Che, Caldas, and Portugal disclose the method of claim 1.
Che further teaches wherein data is transmitted between one or more of the operations as tensors represented by edges in the computation graph, and the schedule identifies tensors which should be re-computed instead of stored during execution of the computation graph (pg. 4 [0033]: “Nodes represent variables, weights, or computation operations, while edges represent data or tensor flowing from one node to another. An incoming edge to a node representing a computation operation is input data consumed by the computation operation, while an outgoing edge from the node represents output data produced by the computation operation” teaches the data transmitted between operations represented by nodes are tensors; the transferring of tensors from one computation operation to another corresponds to determining that the tensors need further computation instead of storage).
Regarding Claim 15,
Che, Caldas, and Portugal disclose the method of claim 1.
Che further teaches wherein the input computation graph is a sub-block of another computation graph (pg. 5 [0043]: “Referring to FIG. 3, task allocation generator 214 is configured to generate one or more task allocation models for each subset of a computation graph” teaches the input computation graph can be a subset of another computation graph).
Regarding Claim 16,
Che, Caldas, and Portugal disclose the method of claim 1.
Che further teaches wherein the input computation graph is a graph representation of at least a portion of a neural network inference workload or a neural network training workload (pg. 4 [0033]: “graph generator 211 may transform a machine-learning model or neural network model written in high level language to generate a computation graph representing the machine-learning model or neural network model” and pg. 2 [0015]: “The disclosed embodiments also provide a method and apparatus for improving inference performance by minimizing end-to-end inference delay based on optimized task schedule and device placement” teach the input computation graph represents a neural network model and its inference tasks).
Regarding Claim 19,
Claim 19 recites analogous limitations to claim 1. Therefore, claim 19 is rejected based on the same rationale as claim 1.
Che further teaches a system comprising one or more computers and one or more storage devices storing instructions that when executed by the one or more computers cause the one or more computers to perform operations comprising (see pg. 8 [0065]).
Regarding Claim 22,
Claim 22 recites analogous limitations to claim 4. Therefore, claim 22 is rejected based on the same rationale as claim 4.
Regarding Claim 24,
Claim 24 recites analogous limitations to claim 10. Therefore, claim 24 is rejected based on the same rationale as claim 10.
Regarding Claim 25,
Claim 25 recites analogous limitations to claim 11. Therefore, claim 25 is rejected based on the same rationale as claim 11.
Regarding Claim 27,
Claim 27 recites analogous limitations to claim 13. Therefore, claim 27 is rejected based on the same rationale as claim 13.
Regarding Claim 29,
Claim 29 recites analogous limitations to claim 1. Therefore, claim 29 is rejected based on the same rationale as claim 1.
Che further teaches one or more non-transitory computer-readable storage media storing instructions that when executed by one or more computers cause the one or more computers to perform operations comprising (see pg. 8 [0065]).
Claims 12 and 26 are rejected under 35 U.S.C. 103 as being unpatentable over Che, Caldas, and Portugal, and further in view of Ma et al. (“Towards Efficient Large-Scale Graph Neural Network Computing”).
Regarding Claim 12,
Che, Caldas, and Portugal disclose the method of claim 1.
The combination of Che, Caldas, and Portugal does not appear to explicitly teach wherein the schedule identifies operations that should be fused into a single operation during execution of the input computation graph.
However, Ma et al. teaches wherein the schedule identifies operations that should be fused into a single operation during execution of the input computation graph (Section 3: “an optimization layer that produces a scheduling strategy for minimizing data movement between host and GPU device memory, and recognizes opportunities to fuse operations and remove redundant computations” teaches producing a schedule that recognizes when operations should be fused; also see Fig. 1).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to incorporate the limitation(s) above, as taught by Ma et al., into the disclosed invention of Che.
One of ordinary skill in the art would have been motivated to make this modification to “[recognize] opportunities to fuse operations and remove redundant computations” (Ma et al., Section 3).
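For illustration only (this sketch is not part of the record): the fusion opportunity Ma et al. describe can be pictured as collapsing chains of elementwise operations into one fused kernel before execution. The op names and the "elementwise" classification below are hypothetical, not Ma et al.'s implementation.

```python
# Illustrative sketch: greedily mark runs of consecutive elementwise ops
# for fusion into a single operation; non-elementwise ops stay singleton.
ELEMENTWISE = {"add", "mul", "relu", "sigmoid"}

def fuse_elementwise(ops):
    """Group a linear op sequence into fusion groups."""
    groups, current = [], []
    for op in ops:
        if op in ELEMENTWISE:
            current.append(op)          # extend the current fusable run
        else:
            if current:                 # close out any open fusable run
                groups.append(current)
                current = []
            groups.append([op])         # non-fusable op forms its own group
    if current:
        groups.append(current)
    return groups

groups = fuse_elementwise(["matmul", "add", "relu", "matmul", "sigmoid"])
# "add" and "relu" share one fused group; each "matmul" stands alone.
```

A schedule that carries such group annotations "identifies operations that should be fused into a single operation" in the sense of the claim.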
Regarding Claim 26,
Claim 26 recites analogous limitations to claim 12. Therefore, claim 26 is rejected based on the same rationale as claim 12.
Claims 14 and 28 are rejected under 35 U.S.C. 103 as being unpatentable over Che, Caldas, and Portugal, and further in view of Ross et al. (US 10,685,295 B1).
Regarding Claim 14,
Che, Caldas, and Portugal disclose the method of claim 1.
Che further teaches wherein data is transmitted between one or more of the operations as tensors represented by edges in the computation graph (pg. 4 [0033]: “Nodes represent variables, weights, or computation operations, while edges represent data or tensor flowing from one node to another. An incoming edge to a node representing a computation operation is input data consumed by the computation operation, while an outgoing edge from the node represents output data produced by the computation operation” teaches the data transmitted between operations represented by nodes are tensors; the tensors being transmitted are represented by edges).
Che does not appear to explicitly teach the schedule identifies priorities for transferring tensors between devices.
However, Ross et al. teaches the schedule identifies priorities for transferring tensors between devices (Col. 8 lines 46-97: “In some implementations, a user has access to remotely-located special purpose machine learning model processors. These processors can be provided to a user using a pricing model that is different from pricing models of customary processing devices…An example pricing model may price operations based on priority, with lower priority operations that can wait to complete being less expensive. Pricing may additionally or alternatively be based on raw quota or model execution operations scheduled over a long period of time” teaches the schedule identifies priorities of operations, which include transferring computation operation results between user and the remotely-located special purpose machine learning model processors; Col. 4 lines 13-34 teaches machine learning computations can include tensor computations).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to incorporate the limitation(s) above, as taught by Ross et al., into the disclosed invention of Che.
One of ordinary skill in the art would have been motivated to make this modification to solve the problem of unpredictable resource allocation “by providing a system that determines…the amount of resources that a machine learning model will use during execution on a special purpose machine learning model processor. The determination is based, in part, on the processing of deterministic metrics to facilitate the efficient scheduling of binaries for the machine learning models” (Ross et al., Col. 1 lines 46-52).
Regarding Claim 28,
Claim 28 recites analogous limitations to claim 14. Therefore, claim 28 is rejected based on the same rationale as claim 14.
Claims 2, 3, 5-9, 20, 21, and 23 are rejected under 35 U.S.C. 103 as being unpatentable over Che, Caldas, and Portugal, and further in view of Addanki et al. (“Placeto: Learning Generalizable Device Placement Algorithms for Distributed Machine Learning”).
Regarding Claim 2,
Che, Caldas, and Portugal disclose the method of claim 1.
Che does not appear to explicitly teach wherein the graph neural network has been trained on training data that includes data representing a plurality of training computation graphs.
However, Addanki et al. teaches wherein the graph neural network has been trained on training data that includes data representing a plurality of training computation graphs (Section 3.1: “We also evaluate on three synthetic datasets, each comprising of 32 graphs, spanning a wide range of graph sizes and structures…We randomly split these datasets for training and test purposes” teaches the training set includes multiple graphs).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to incorporate the limitation(s) above, as taught by Addanki et al., into the disclosed invention of Che.
One of ordinary skill in the art would have been motivated to make this modification to leverage “an RL-based approach for finding device placements to minimize training time of deep-learning models. By structuring the policy decisions as incremental placement improvement steps, and using graph embeddings to encode graph structure, Placeto is able to train efficiently and learns policies that generalize to unseen graphs” (Addanki et al., Section 5).
Regarding Claim 3,
Che, Caldas, Portugal, and Addanki et al. disclose the method of claim 2.
Addanki et al. further teaches wherein the input computation graph is not represented in the training data (Section 3.1: “We also evaluate on three synthetic datasets, each comprising of 32 graphs, spanning a wide range of graph sizes and structures…We randomly split these datasets for training and test purposes” teaches the testing set and training set contain different datasets of graphs, therefore the testing input graph would not be present in the training data).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to incorporate the limitation(s) above, as taught by Addanki et al., into the disclosed invention of Che.
One of ordinary skill in the art would have been motivated to make this modification to leverage “an RL-based approach for finding device placements to minimize training time of deep-learning models. By structuring the policy decisions as incremental placement improvement steps, and using graph embeddings to encode graph structure, Placeto is able to train efficiently and learns policies that generalize to unseen graphs” (Addanki et al., Section 5).
Regarding Claim 5,
Che, Caldas, and Portugal disclose the method of claim 1.
Che does not appear to explicitly teach wherein the performance metric measures one or more of a peak memory usage of the execution of the input computation graph or a running time of the input computation graph.
However, Addanki et al. teaches wherein the performance metric measures one or more of a peak memory usage of the execution of the input computation graph or a running time of the input computation graph (Section 2: “The goal of device placement is to find a placement π that minimizes ρ(G, π), the duration of G’s execution when its ops are placed according to π” teaches the performance metric includes execution duration (running time) of the graph).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to incorporate the limitation(s) above, as taught by Addanki et al., into the disclosed invention of Che.
One of ordinary skill in the art would have been motivated to make this modification to leverage “an RL-based approach for finding device placements to minimize training time of deep-learning models. By structuring the policy decisions as incremental placement improvement steps, and using graph embeddings to encode graph structure, Placeto is able to train efficiently and learns policies that generalize to unseen graphs” (Addanki et al., Section 5).
Regarding Claim 6,
Che, Caldas, and Portugal disclose the method of claim 1.
Che further teaches processing the training example in accordance with current values…to generate one or more training instance-specific proposal distributions for the optimization algorithm (Fig. 5 and pg. 6 [0045]-[0046]: “The optimization of the task allocation optimizer 215 is performed per a subset of the computation graph…the task allocation optimizer 215 may use a reinforcement learning algorithm to optimize both the execution order and device placement. The reinforcement learning algorithm used by the task allocation optimizer 215 will be explained referring to FIG. 5…In reinforcement learning, an agent 501 makes observations to an environment 502 and takes actions within the environment 502…The agent 501 can use a policy network to determine its actions. In FIG. 5, the policy network of the agent 501 is illustrated as a neural network including input layer, output layer, and one or more hidden layers. Consistent with embodiments of the present disclosure, any policy-based neural network can be used as the policy network for the agent 501…the policy network of the agent 501 may generate a probability distribution over all possible actions. An action can be taken according to this probability distribution, leading to a new state or task allocation model with a reward” teach processing the graph using a neural network (a neural network that can process graph data is considered a graph neural network) to generate a probability distribution over all possible actions (corresponds to generating one or more instance-specific proposal distributions) for task allocation optimizer 215 as represented by a reinforcement learning algorithm (corresponds to optimization algorithm), wherein the task allocation optimizer generates an optimized “execution order” (corresponds to optimized schedule for executing the operations across devices; pg. 4 [0033] teaches current values of neural network in the form of tensors);
generating a training schedule for the training computation graph represented by the training example by performing the optimization algorithm in accordance with the one or more training instance-specific proposal distributions (Fig. 5 and pg. 6 [0045]-[0046]: “The optimization of the task allocation optimizer 215 is performed per a subset of the computation graph…the task allocation optimizer 215 may use a reinforcement learning algorithm to optimize both the execution order and device placement. The reinforcement learning algorithm used by the task allocation optimizer 215 will be explained referring to FIG. 5…In reinforcement learning, an agent 501 makes observations to an environment 502 and takes actions within the environment 502…The agent 501 can use a policy network to determine its actions. In FIG. 5, the policy network of the agent 501 is illustrated as a neural network including input layer, output layer, and one or more hidden layers. Consistent with embodiments of the present disclosure, any policy-based neural network can be used as the policy network for the agent 501…the policy network of the agent 501 may generate a probability distribution over all possible actions. An action can be taken according to this probability distribution, leading to a new state or task allocation model with a reward” teach generating an execution order (corresponding to the schedule) by performing a reinforcement learning algorithm (corresponding to the optimization algorithm) in accordance with the probability distribution generated by the agent as represented by the neural network);
executing the training computation graph on the plurality of devices by causing the plurality of devices to perform the operations represented by the nodes in the input computation graph in accordance with the training schedule (pg. 6 [0045]-[0046]: “The optimization of the task allocation optimizer 215 is performed per a subset of the computation graph…the task allocation optimizer 215 may use a reinforcement learning algorithm to optimize both the execution order and device placement. The reinforcement learning algorithm used by the task allocation optimizer 215 will be explained referring to FIG. 5…In reinforcement learning, an agent 501 makes observations to an environment 502 and takes actions within the environment 502…The agent 501 can use a policy network to determine its actions…the policy network of the agent 501 may generate a probability distribution over all possible actions. An action can be taken according to this probability distribution, leading to a new state or task allocation model with a reward” and pg. 6 [0049]: “For a sequence of nodes, a change in the sequence of nodes can be an action…That is, the agent 501 can take an action to change a target device to execute a certain operation represented by a node (e.g., FPGA to GPU)” teach executing the operations represented by nodes of the computation graph according to an order (schedule) determined by the agent);
determining a performance metric for the execution of the training computation graph (pg. 6 [0045]-[0046] teaches training a graph neural network to generate a probability distribution that results in an optimized execution order (schedule); pg. 1 [0006]: “The policy network can be updated according to a reward determined by performance evaluation of the action in runtime environments for executing the computation graph. The reward can be determined based on execution delay or memory usage efficiency” teaches determining a performance metric, such as execution delay or memory usage efficiency);
generating a reward from the performance metric (pg. 6 [0045]-[0046] teaches training a graph neural network to generate a probability distribution that results in an optimized execution order (schedule); pg. 1 [0006]: “The policy network can be updated according to a reward determined by performance evaluation of the action in runtime environments for executing the computation graph. The reward can be determined based on execution delay or memory usage efficiency” teaches generating a reward from the performance metric).
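For technical context only, the reinforcement-learning loop described in the passages quoted above (a policy producing a probability distribution over candidate actions, an action sampled from that distribution, and a reward derived from a performance metric such as execution delay) can be sketched as follows. This is an illustrative sketch by the editor, not code from CHE et al. or from the claims; all names (`rl_step`, `score_fn`, `measure_delay`) are hypothetical.

```python
import math
import random

def softmax(scores):
    """Convert raw per-action scores into a probability distribution."""
    exps = [math.exp(s) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def sample(actions, probs, rng):
    """Sample one action according to the probability distribution."""
    r, acc = rng.random(), 0.0
    for action, p in zip(actions, probs):
        acc += p
        if r <= acc:
            return action
    return actions[-1]

def rl_step(schedule, score_fn, measure_delay, rng):
    """One step of the loop: enumerate candidate actions (here, swaps of
    adjacent operations in the execution order), sample one from the
    policy's distribution, apply it, and derive a reward from delay."""
    actions = [(i, i + 1) for i in range(len(schedule) - 1)]
    probs = softmax([score_fn(schedule, a) for a in actions])
    i, j = sample(actions, probs, rng)
    new_schedule = list(schedule)
    new_schedule[i], new_schedule[j] = new_schedule[j], new_schedule[i]
    reward = -measure_delay(new_schedule)  # shorter delay => higher reward
    return new_schedule, reward
```

In an actual system along the lines of the quoted disclosure, `score_fn` would come from the policy network's output for each candidate action and `measure_delay` from a runtime measurement; the reward would then drive a policy update.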
CHE et al. does not appear to explicitly teach further comprising: training the graph neural network on training data to determine the first values of the network parameters, wherein the training data comprises a plurality of training examples, each training example being data representing a different training computation graph, and wherein the training comprises, for each training example: processing the training example in accordance with current values of the network parameters…determining an update to the current values of the network parameters based on the reward using a reinforcement learning technique.
However, Addanki et al. teaches further comprising: training the graph neural network on training data to determine the first values of the network parameters, wherein the training data comprises a plurality of training examples, each training example being data representing a different training computation graph, and wherein the training comprises, for each training example (Section 3.1: “We also evaluate on three synthetic datasets, each comprising of 32 graphs, spanning a wide range of graph sizes and structures…We randomly split these datasets for training and test purposes” teaches the training set includes multiple graphs; Fig. 2 and Section 2.3: “The neural network design of Placeto’s graph embedding procedure and policy network allows the training parameters to be shared across episodes” teach training the RL agent (including a graph neural network) with training data to determine network parameters):
processing the training example in accordance with current values of the network parameters…(Fig. 2 and Section 2.3: “The neural network design of Placeto’s graph embedding procedure and policy network allows the training parameters to be shared across episodes” teach training the RL agent (including a graph neural network) with training data to determine network parameters)
…determining an update to the current values of the network parameters based on the reward using a reinforcement learning technique (Fig. 2 teaches using a reinforcement learning technique to update the RL agent (including the graph neural network and its network parameters) based on the reward).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to incorporate the limitation(s) above, as taught by Addanki et al., into the disclosed invention of CHE et al.
One of ordinary skill in the art would have been motivated to make this modification to leverage “an RL-based approach for finding device placements to minimize training time of deep-learning models. By structuring the policy decisions as incremental placement improvement steps, and using graph embeddings to encode graph structure, Placeto is able to train efficiently and learns policies that generalize to unseen graphs” (Addanki et al. Section 5).
Regarding Claim 7,
Che, Caldas, and Portugal disclose the method of claim 1.
CHE et al. further teaches processing the training example in accordance with current values…to generate one or more training instance-specific proposal distributions for the optimization algorithm (Fig. 5 and pg. 6 [0045]-[0046]: “The optimization of the task allocation optimizer 215 is performed per a subset of the computation graph…the task allocation optimizer 215 may use a reinforcement learning algorithm to optimize both the execution order and device placement. The reinforcement learning algorithm used by the task allocation optimizer 215 will be explained referring to FIG. 5…In reinforcement learning, an agent 501 makes observations to an environment 502 and takes actions within the environment 502…The agent 501 can use a policy network to determine its actions. In FIG. 5, the policy network of the agent 501 is illustrated as a neural network including input layer, output layer, and one or more hidden layers. Consistent with embodiments of the present disclosure, any policy-based neural network can be used as the policy network for the agent 501…the policy network of the agent 501 may generate a probability distribution over all possible actions. An action can be taken according to this probability distribution, leading to a new state or task allocation model with a reward” teach processing the graph using a neural network (a neural network that can process graph data is considered a graph neural network) to generate a probability distribution over all possible actions (corresponding to generating one or more instance-specific proposal distributions) for the task allocation optimizer 215, as represented by a reinforcement learning algorithm (corresponding to the optimization algorithm), wherein the task allocation optimizer generates an optimized “execution order” (corresponding to an optimized schedule for executing the operations across devices; pg. 4 [0033] teaches current values of the neural network in the form of tensors);
generating a training schedule for the training computation graph represented by the training example by performing the optimization algorithm in accordance with the one or more training instance-specific proposal distributions (Fig. 5 and pg. 6 [0045]-[0046]: “The optimization of the task allocation optimizer 215 is performed per a subset of the computation graph…the task allocation optimizer 215 may use a reinforcement learning algorithm to optimize both the execution order and device placement. The reinforcement learning algorithm used by the task allocation optimizer 215 will be explained referring to FIG. 5…In reinforcement learning, an agent 501 makes observations to an environment 502 and takes actions within the environment 502…The agent 501 can use a policy network to determine its actions. In FIG. 5, the policy network of the agent 501 is illustrated as a neural network including input layer, output layer, and one or more hidden layers. Consistent with embodiments of the present disclosure, any policy-based neural network can be used as the policy network for the agent 501…the policy network of the agent 501 may generate a probability distribution over all possible actions. An action can be taken according to this probability distribution, leading to a new state or task allocation model with a reward” teach generating an execution order (corresponding to the schedule) by performing a reinforcement learning algorithm (corresponding to the optimization algorithm) in accordance with the probability distribution generated by the agent as represented by the neural network);
determining a performance metric for the training schedule using a cost model that models a value of one or more properties of the training schedule (pg. 6-7 [0053]: “memory usage during execution of a computation graph can be obtained by applying liveness analysis…memory usage efficiency for a certain memory can be obtained by a ratio of a time period that the certain memory is in use (e.g., the memory is live) to a pre-set time period” teaches performance metric such as memory usage is determined using a liveness analysis (corresponds to cost model) that models when the memory is “live” (corresponds to modeling a value of a property of the schedule));
generating a reward from the performance metric (pg. 6 [0045]-[0046] teaches training a graph neural network to generate a probability distribution that results in an optimized execution order (schedule); pg. 1 [0006]: “The policy network can be updated according to a reward determined by performance evaluation of the action in runtime environments for executing the computation graph. The reward can be determined based on execution delay or memory usage efficiency” teaches generating a reward from the performance metric).
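For context on the liveness-style cost model referenced at pg. 6-7 [0053] (determining memory usage from when each tensor is "live"), a minimal illustrative sketch is given below. The function and its inputs (`peak_memory`, `sizes`, `consumers`) are the editor's assumptions for illustration, not material drawn from the cited reference.

```python
def peak_memory(order, sizes, consumers):
    """Estimate peak memory for an execution order via liveness analysis.

    order: list of op names in execution order.
    sizes: op -> size of the tensor that op produces.
    consumers: op -> list of ops that read that tensor.
    A tensor is live from the step that produces it until the last step
    that consumes it; the peak is the maximum total live size."""
    step = {op: t for t, op in enumerate(order)}
    # A tensor dies after its last consumer runs (or immediately after
    # its producing step, if nothing consumes it).
    death = {op: max([step[c] for c in consumers.get(op, [])] + [step[op]])
             for op in order}
    live, peak, freed = 0, 0, {}
    for t, op in enumerate(order):
        live += sizes[op]                 # tensor becomes live
        peak = max(peak, live)
        freed[death[op]] = freed.get(death[op], 0) + sizes[op]
        live -= freed.pop(t, 0)           # release tensors that die here
    return peak
```

Under this sketch, inputs to an operation remain live while it executes, so the peak reflects simultaneous liveness under the given execution order; a scheduler could compare this estimate across candidate orders without executing the graph.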
CHE et al. does not appear to explicitly teach further comprising: training the graph neural network on training data to determine the first values of the network parameters, wherein the training data comprises a plurality of training examples, each training example being data representing a different training computation graph, and wherein the training comprises, for each training example: processing the training example in accordance with current values of the network parameters…determining an update to the current values of the network parameters based on the reward using a reinforcement learning technique.
However, Addanki et al. teaches further comprising: training the graph neural network on training data to determine the first values of the network parameters, wherein the training data comprises a plurality of training examples, each training example being data representing a different training computation graph, and wherein the training comprises, for each training example (Section 3.1: “We also evaluate on three synthetic datasets, each comprising of 32 graphs, spanning a wide range of graph sizes and structures…We randomly split these datasets for training and test purposes” teaches the training set includes multiple graphs; Fig. 2 and Section 2.3: “The neural network design of Placeto’s graph embedding procedure and policy network allows the training parameters to be shared across episodes” teach training the RL agent (including a graph neural network) with training data to determine network parameters):
processing the training example in accordance with current values of the network parameters…(Fig. 2 and Section 2.3: “The neural network design of Placeto’s graph embedding procedure and policy network allows the training parameters to be shared across episodes” teach training the RL agent (including a graph neural network) with training data to determine network parameters)
…determining an update to the current values of the network parameters based on the reward using a reinforcement learning technique (Fig. 2 teaches using a reinforcement learning technique to update the RL agent (including the graph neural network and its network parameters) based on the reward).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to incorporate the limitation(s) above, as taught by Addanki et al., into the disclosed invention of CHE et al.
One of ordinary skill in the art would have been motivated to make this modification to leverage “an RL-based approach for finding device placements to minimize training time of deep-learning models. By structuring the policy decisions as incremental placement improvement steps, and using graph embeddings to encode graph structure, Placeto is able to train efficiently and learns policies that generalize to unseen graphs” (Addanki et al. Section 5).
Regarding Claim 8,
Che, Caldas, Portugal, and Addanki et al. disclose the method of claim 7.
Addanki et al. further teaches wherein the performance metric measures one or more of a peak memory usage or a running time of the execution of the training computation graph in accordance with the training schedule (Section 2: “The goal of device placement is to find a placement π that minimizes ρ(G, π), the duration of G’s execution when its ops are placed according to π” teaches the performance metric includes execution duration (running time) of the graph).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to incorporate the limitation(s) above, as taught by Addanki et al., into the disclosed invention of CHE et al. in view of Caldas and Portugal.
One of ordinary skill in the art would have been motivated to make this modification to leverage “an RL-based approach for finding device placements to minimize training time of deep-learning models. By structuring the policy decisions as incremental placement improvement steps, and using graph embeddings to encode graph structure, Placeto is able to train efficiently and learns policies that generalize to unseen graphs” (Addanki et al. Section 5).
Regarding Claim 9,
Che, Caldas, Portugal, and Addanki et al. disclose the method of claim 7.
CHE et al. further teaches wherein the reward is based on the performance metric and a baseline performance metric generated from the execution of the training computation graph in accordance with a baseline schedule (pg. 1 [0006]: “The policy network can be updated according to a reward determined by performance evaluation of the action in runtime environments for executing the computation graph. The reward can be determined based on execution delay or memory usage efficiency” teaches generating a reward from memory usage efficiency; pg. 6-7 [0053]: “memory usage efficiency for a certain memory can be obtained by a ratio of a time period that the certain memory is in use (e.g., the memory is live) to a pre-set time period” teaches memory usage efficiency is determined based on memory usage during a period (performance metric) when compared to memory usage in a pre-set time period (baseline performance metric)).
Regarding Claim 20,
Claim 20 recites analogous limitations to claim 2. Therefore, claim 20 is rejected based on the same rationale as claim 2.
Regarding Claim 21,
Claim 21 recites analogous limitations to claim 3. Therefore, claim 21 is rejected based on the same rationale as claim 3.
Regarding Claim 23,
Claim 23 recites analogous limitations to claim 5. Therefore, claim 23 is rejected based on the same rationale as claim 5.
Examiner’s Note
The Examiner respectfully requests that the Applicant, in preparing responses, fully consider the entirety of the reference(s) as potentially teaching all or part of the claimed invention. It is noted that REFERENCES ARE RELEVANT AS PRIOR ART FOR ALL THEY CONTAIN. “The use of patents as references is not limited to what the patentees describe as their own inventions or to the problems with which they are concerned. They are part of the literature of the art, relevant for all they contain.” In re Heck, 699 F.2d 1331, 1332-33, 216 USPQ 1038, 1039 (Fed. Cir. 1983) (quoting In re Lemelson, 397 F.2d 1006, 1009, 158 USPQ 275, 277 (CCPA 1968)). A reference may be relied upon for all that it would have reasonably suggested to one having ordinary skill in the art, including non-preferred embodiments (see MPEP 2123). The Examiner has cited particular locations in the reference(s), as applied to the claim(s) above, for the convenience of the Applicant. Although the specified citations are representative of the teachings of the art and are applied to the specific limitations within the individual claim(s), other passages and figures typically will apply as well.
Additionally, any claim amendments for any reason should include remarks indicating clear support in the originally filed specification.
Response to Arguments
Regarding the prior art rejections, Applicant's arguments with respect to the claims have been considered but are moot because they do not apply to the references as applied in the current rejection of the limitations. Additionally, the previous 112(b) rejection was withdrawn in view of the amendments.
Conclusion
Applicant's amendment necessitated the new ground(s) of rejection presented in this Office action. Accordingly, THIS ACTION IS MADE FINAL. See MPEP § 706.07(a). Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a).
A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action. In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any nonprovisional extension fee (37 CFR 1.17(a)) pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action. In no event, however, will the statutory period for reply expire later than SIX MONTHS from the mailing date of this final action.
Any inquiry concerning this communication or earlier communications from the examiner should be directed to ROBERT H BEJCEK II whose telephone number is (571)270-3610. The examiner can normally be reached Monday - Friday: 9:00am - 5:00pm.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Michelle T. Bechtold can be reached at (571) 431-0762. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.
/R.B./ Examiner, Art Unit 2148
/MICHELLE T BECHTOLD/ Supervisory Patent Examiner, Art Unit 2148