Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Claims 1-24 are presented for examination.
Claim Objections
Regarding Claim 19, the claim recites the limitation "the respective portion of the uncompiled code that is to bed executed by the computing device." This appears to be a typographical error: "bed" should read "be." Appropriate correction is required. For purposes of examination, the Examiner will interpret this limitation as "to be executed by the computing device."
Claim Interpretation
The following is a quotation of 35 U.S.C. 112(f):
(f) Element in Claim for a Combination. – An element in a claim for a combination may be expressed as a means or step for performing a specified function without the recital of structure, material, or acts in support thereof, and such claim shall be construed to cover the corresponding structure, material, or acts described in the specification and equivalents thereof.
The following is a quotation of pre-AIA 35 U.S.C. 112, sixth paragraph:
An element in a claim for a combination may be expressed as a means or step for performing a specified function without the recital of structure, material, or acts in support thereof, and such claim shall be construed to cover the corresponding structure, material, or acts described in the specification and equivalents thereof.
The claims in this application are given their broadest reasonable interpretation using the plain meaning of the claim language in light of the specification as it would be understood by one of ordinary skill in the art. The broadest reasonable interpretation of a claim element (also commonly referred to as a claim limitation) is limited by the description in the specification when 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph, is invoked.
As explained in MPEP § 2181, subsection I, claim limitations that meet the following three-prong test will be interpreted under 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph:
(A) the claim limitation uses the term “means” or “step” or a term used as a substitute for “means” that is a generic placeholder (also called a nonce term or a non-structural term having no specific structural meaning) for performing the claimed function;
(B) the term “means” or “step” or the generic placeholder is modified by functional language, typically, but not always linked by the transition word “for” (e.g., “means for”) or another linking word or phrase, such as “configured to” or “so that”; and
(C) the term “means” or “step” or the generic placeholder is not modified by sufficient structure, material, or acts for performing the claimed function.
Use of the word “means” (or “step”) in a claim with functional language creates a rebuttable presumption that the claim limitation is to be treated in accordance with 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph. The presumption that the claim limitation is interpreted under 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph, is rebutted when the claim limitation recites sufficient structure, material, or acts to entirely perform the recited function.
Absence of the word “means” (or “step”) in a claim creates a rebuttable presumption that the claim limitation is not to be treated in accordance with 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph. The presumption that the claim limitation is not interpreted under 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph, is rebutted when the claim limitation recites function without reciting sufficient structure, material or acts to entirely perform the recited function.
Claim limitations in this application that use the word “means” (or “step”) are being interpreted under 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph, except as otherwise indicated in an Office action. Conversely, claim limitations in this application that do not use the word “means” (or “step”) are not being interpreted under 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph, except as otherwise indicated in an Office action.
This application includes one or more claim limitations that do not use the word “means,” but are nonetheless being interpreted under 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph, because the claim limitation(s) uses a generic placeholder that is coupled with functional language without reciting sufficient structure to perform the recited function and the generic placeholder is not preceded by a structural modifier. Such claim limitation(s) is/are: “a compiler configured to…” in claim 19, “the compiler is configured to” in claims 22-23.
Because this/these claim limitation(s) is/are being interpreted under 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph, it/they is/are being interpreted to cover the corresponding structure described in the specification as performing the claimed function, and equivalents thereof.
If applicant does not intend to have this/these limitation(s) interpreted under 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph, applicant may: (1) amend the claim limitation(s) to avoid it/them being interpreted under 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph (e.g., by reciting sufficient structure to perform the claimed function); or (2) present a sufficient showing that the claim limitation(s) recite(s) sufficient structure to perform the claimed function so as to avoid it/them being interpreted under 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph.
Claim Rejections - 35 USC § 112
The following is a quotation of 35 U.S.C. 112(b):
(b) CONCLUSION.—The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the inventor or a joint inventor regards as the invention.
The following is a quotation of 35 U.S.C. 112 (pre-AIA), second paragraph:
The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the applicant regards as his invention.
Claims 9, 11, 16, 20, and 24 are rejected under 35 U.S.C. 112(b) or 35 U.S.C. 112 (pre-AIA), second paragraph, as being indefinite for failing to particularly point out and distinctly claim the subject matter which the inventor or a joint inventor (or for applications subject to pre-AIA 35 U.S.C. 112, the applicant), regards as the invention.
Regarding Claim 9, the claim recites "wherein the one or more operations are high level operations for training a machine-learning model." Claim 9 depends from Claim 7. However, Claim 7 recites "wherein a workload is defined for high level operations included in the uncompiled code" and does not introduce "one or more operations." The phrase "one or more operations" is first introduced in Claim 8 ("identifying one or more operations in the respective portion of uncompiled code"). Because Claim 9 depends from Claim 7 rather than Claim 8, the recitation of "the one or more operations" in Claim 9 lacks proper antecedent basis. For purposes of examination, the Examiner will interpret Claim 9 as if it depended from Claim 8, or alternatively as if it recited "wherein the high level operations are for training a machine-learning model."
Regarding Claim 11, the claim recites "the workload for the operation is updated dynamically based on the monitoring of the execution of operation on the processing device." The second instance of "operation" is missing the article "the." It is unclear whether this refers to the same "operation" previously recited or a different operation. For purposes of examination, the Examiner will interpret this as "the execution of the operation on the processing device."
Regarding Claim 16, the claim recites "wherein the workload is defined by at least one of: (a) floating-point operations per second (FLOP) utilization; (b) memory bandwidth utilization; or (c) any combination of (a) and (b)." Claim 16 depends from Claim 12, which depends from Claim 1. Claim 1 introduces "a respective workload for each of the plurality of processing devices" — i.e., multiple respective workloads (one per device). Claim 16 refers to "the workload" in the singular without specifying which of the multiple respective workloads is being referenced. It is unclear whether "the workload" refers to any one of the respective workloads, all of them, or a different workload entirely. For purposes of examination, the Examiner will interpret "the workload" as referring to any one of the respective workloads recited in Claim 1.
Regarding Claim 20, the claim recites "the compiler sets the voltage setting and power setting in the power state instruction according to a sampled workload of the plurality of computing devices." Claim 20 depends from Claim 19. Claim 19 introduces "a voltage setting and a frequency setting" but does not introduce a "power setting." The term "power setting" lacks antecedent basis in Claim 19 or any parent claim. It is unclear whether "power setting" is intended to refer to the previously recited "frequency setting," or whether it refers to a new, distinct setting not previously recited. For purposes of examination, the Examiner will interpret "power setting" as "frequency setting."
Regarding Claim 24, the preamble recites "A processing device of a distributed computing system comprising." Later in the body, the claim recites "a network interface configured to electronically communicate with a centralized computing device configured to operate a compiler and manage a distributed computing system." The use of the indefinite article "a" the second time suggests a new, different distributed computing system rather than referencing the one already introduced in the preamble. It is unclear whether the centralized computing device manages the same distributed computing system recited in the preamble or a different one. For purposes of examination, the Examiner will interpret the second recitation as "the distributed computing system."
Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.
Claims 1-3, 6-8, 10-17, and 19-24 are rejected under 35 U.S.C. 103 as being unpatentable over Paul (US 2023/0273832 A1) in view of Mohammad (US 2021/0124404 A1).
Regarding Claim 1, Paul (US 2023/0273832 A1) teaches
A computer-implemented method, comprising:
receiving uncompiled code for a computer program (Para [0016], "a compiler or performance simulator, which may be external to the VPU 100, may compile and analyze the neural network, layer-by-layer, to determine which layers are compute-intensive and which are memory-intensive") Examiner Comments: Paul teaches a compiler that receives a neural network model description (i.e., uncompiled code) and compiles and analyzes it for execution on the VPU accelerator, which corresponds to receiving uncompiled code for a computer program;
compiling the uncompiled code for the computer program to generate compiled code for the computer program (Para [0016], "a compiler or performance simulator ... may compile and analyze the neural network, layer-by-layer"; FIG. 3, Operation 300; Para [0020] "the layers of the neural network may be compiled by a compiler connected to, communicatively coupled to, or the like, the VPU of an ML accelerator") Examiner Comments: Paul teaches compiling the neural network description into a VPU binary image (compiled code) for execution on the accelerator;
retrieving a respective workload for each of the plurality of processing devices defining an analysis of processor utilization for the respective portion of the uncompiled code that is to be executed by the processing device (Para [0016], "The determination of whether the phases or layers of the neural network are compute-intensive or memory-intensive may be based on whether the performance of a layer is limited, bound, constrained, or the like, by the compute bandwidth or the memory bandwidth of the VPU IP. For example, the determination of whether the layers or phases are compute-intensive or memory-intensive may be based on the power consumption of the VPU 100, the DPU 110, or the CMX 112 as the layers of the neural network are executed"; see also FIG. 2 showing FLOPS profile 204 and memory profiles 200, 202) Examiner Comments: Paul teaches analyzing the computational and memory utilization characteristics (i.e., processor utilization) of each layer/portion of the uncompiled code to determine the workload, with the FLOPS profile and memory profiles constituting an analysis of processor utilization;
injecting a power state instruction into a respective portion of the compiled code corresponding to the respective portion of the uncompiled code, that is to be executed by each of the plurality of processing devices, wherein the power state instruction identifies a voltage setting and a frequency setting that the processing device is instructed to apply when executing the respective portion of the compiled code of the program code (Para [0011], "hooks inserted by the compilation tool to create a command image for the underlying accelerator may exploit the workload phases for better energy efficiency, enabling power-aware compilation"; Para [0028], "Operation 308 may include executing the neural network and dynamically adjusting the power to each layer. In an example, for run-time dynamic power-sharing between the compute and memory units or layers, the local power-management unit on the VPU (e.g., PMU 104 in FIG. 1) may adjust the voltage and frequency ratio at the compute and memory layers [V/F]_compute and [V/F]_CMX at runtime based at least in part on the determination of F_compute and F_CMX derived during compilation of the neural network. The determination of F_compute and F_CMX may be passed to the local PMU at runtime through the use of a special power-management instruction from the compiler.") Examiner Comments: Paul explicitly teaches that the compiler inserts a special power-management instruction (i.e., power state instruction) into the compiled code that sets voltage and frequency for the VPU accelerator at runtime based on the workload analysis of each layer.
Paul did not specifically teach
the compiled code that is targeted for distributed execution by a plurality of processing devices in a distributed computing system, wherein the compiling includes identifying, for each processing device of the plurality of processing devices, a respective portion of the uncompiled code that is to be executed by the processing device.
However, Mohammad (US 2021/0124404 A1) teaches
compiled code that is targeted for distributed execution by a plurality of processing devices in a distributed computing system (Abstract, "A scheme to improve performance of power-constrained computers, comprising a heterogeneous mix of compute elements, by dynamically reacting to changes in the switching capacitance that present workload induces in each heterogeneous compute element"; Para [0010], "An example of a heterogeneous mix of compute elements is a server blade comprising a mix of CPUs and GPUs. Two other examples include a server rack comprising a mix of CPU and GPU blades and an individual SoC (system-on-chip) comprising a mix of CPU cores, GPU cores, or accelerator cores") Examiner Comments: Mohammad teaches a distributed computing system comprising multiple heterogeneous processing devices (CPUs, GPUs, accelerator cores) in server blades and server racks, where workloads are distributed across the processing devices for execution;
wherein the compiling includes identifying, for each processing device of the plurality of processing devices, a respective portion of the uncompiled code that is to be executed by the processing device (Para [0014], "This MIMO control system monitors power consumption of each compute element under its control and maximizes each element's (e.g., a CPU) frequency subject to a constraint on the overall power consumed across all compute elements under control. To compute the maximum frequency, the embodiments dynamically learn the coefficients of a power-frequency model for a given workload and select the maximum frequency that stays within an overall input power limit") Examiner Comments: Mohammad's MIMO control system independently identifies and manages the workload assigned to each of the plurality of heterogeneous compute elements, where each compute element executes a respective portion of the overall workload, teaching identification of code portions for each processing device.
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine Paul's compiler-based power management technique with Mohammad's distributed workload-aware power limiting across multiple heterogeneous compute elements in order to extend the energy efficiency benefits of compiler-based per-layer DVFS to distributed computing environments comprising multiple heterogeneous processing devices, thereby improving overall system performance under power constraints (Mohammad, Abstract). Both Paul and Mohammad are directed to workload-aware power management for processing devices, and applying Paul's per-layer voltage/frequency adjustments to each device in Mohammad's distributed system would yield the predictable result of improved energy efficiency across the distributed computing system.
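For illustration only, the following Python sketch is a hypothetical construction of the claim 1 subject matter as mapped above; it is not code from Paul, Mohammad, or the application as filed. A compiler assigns each processing device a portion of code, consults that portion's workload profile, and injects a power state instruction carrying a voltage setting and a frequency setting at the head of the corresponding compiled portion.

```python
# Hypothetical sketch; all names and structures are illustrative assumptions.
from dataclasses import dataclass

@dataclass
class PowerStateInstruction:
    voltage_mv: int     # voltage setting the device is instructed to apply
    frequency_mhz: int  # frequency setting the device is instructed to apply

@dataclass
class CompiledPortion:
    device_id: int
    instructions: list  # compiled instructions for this device's portion

def simple_policy(profile):
    # Assumption: compute-bound portions get a higher V/F operating point.
    return (900, 1400) if profile["compute_bound"] else (750, 900)

def compile_with_power_states(portions, workload_profiles, policy):
    """Prepend to each device's compiled portion a power state instruction
    chosen from that portion's workload profile."""
    for portion in portions:
        v, f = policy(workload_profiles[portion.device_id])
        portion.instructions.insert(0, PowerStateInstruction(v, f))
    return portions

portions = [CompiledPortion(0, ["matmul"]), CompiledPortion(1, ["hbm_copy"])]
profiles = {0: {"compute_bound": True}, 1: {"compute_bound": False}}
print(compile_with_power_states(portions, profiles, simple_policy))
```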
Regarding Claim 2, Paul and Mohammad teach
The method of Claim 1.
Paul further teaches
wherein a first processing device and a second processing device of the plurality of processing devices are targeted to execute different portions of the compiled code with a same voltage setting and a same frequency setting according to the power state instruction (Para [0016], “a compiler or performance simulator, which may be external to the VPU 100, may compile and analyze the neural network, layer-by-layer, to determine which layers are compute-intensive and which are memory-intensive. The determination of whether the phases or layers of the neural network are compute-intensive or memory-intensive may be based on whether the performance of a layer is limited, bound, constrained, or the like, by the compute bandwidth or the memory bandwidth of the VPU IP. For example, the determination of whether the layers or phases are compute-intensive or memory-intensive may be based on the power consumption of the VPU 100, the DPU 110, or the CMX 112 as the layers of the neural network are executed. In an example, the compute bandwidth may be limited by the number of compute units, the frequency of operation of the compute units, and the utilization of the compute units. The memory bandwidth may be limited by the bandwidth between the CMX 112 and the compute units and/or the bandwidth from memory external to the VPU 100”) Examiner Comments: When two processing devices execute code portions with similar workload characteristics (e.g., both compute-intensive layers), Paul's system applies the same voltage and frequency settings to both devices.
Regarding Claim 3, Paul and Mohammad teach
The method of Claim 1.
Paul and Mohammad further teach
wherein a first processing device and a second processing device of the plurality of processing devices are targeted to execute different portions of the compiled code with different voltage and frequency settings according to the power state instruction (Paul, Para [0016], different layers have different compute vs. memory intensity, resulting in different voltage and frequency settings; Mohammad, Para [0023], "This MIMO control system monitors power consumption of each compute element under its control and maximizes each element's (e.g., a CPU) frequency subject to a constraint on the overall power consumed across all compute elements under control. To compute the maximum frequency, the embodiments dynamically learn the coefficients of a power-frequency model for a given workload and select the maximum frequency that stays within an overall input power limit") Examiner Comments: The combination teaches that different processing devices executing portions with different workload characteristics (e.g., one compute-intensive, one memory-intensive) receive different voltage/frequency settings.
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine Paul's compiler-based power management technique with Mohammad's distributed workload-aware power limiting across multiple heterogeneous compute elements in order to extend the energy efficiency benefits of compiler-based per-layer DVFS to distributed computing environments comprising multiple heterogeneous processing devices, thereby improving overall system performance under power constraints (Mohammad, Abstract). Both Paul and Mohammad are directed to workload-aware power management for processing devices, and applying Paul's per-layer voltage/frequency adjustments to each device in Mohammad's distributed system would yield the predictable result of improved energy efficiency across the distributed computing system.
Regarding Claim 6, Paul and Mohammad teach
The method of Claim 1.
Paul and Mohammad further teach
wherein a plurality of power state instructions are injected into the compiled code for execution by the plurality of processing devices in parallel (Paul, Para [0011], "hooks inserted by the compilation tool to create a command image for the underlying accelerator may exploit the workload phases") Examiner Comments: Paul's multiple per-layer power-management instructions combined with Mohammad's parallel execution across multiple devices teaches injecting a plurality of power state instructions for parallel execution.
Regarding Claim 7, Paul and Mohammad teach
The method of Claim 1.
Paul further teaches
wherein a workload is defined for high level operations included in the uncompiled code (Para [0016], "compile and analyze the neural network, layer-by-layer, to determine which layers are compute-intensive and which are memory-intensive") Examiner Comments: Paul's layer-by-layer analysis defines workloads at the operation/layer level of the neural network, where layers constitute the high level operations of the uncompiled code.
Regarding Claim 8, Paul and Mohammad teach
The method of Claim 7.
Paul further teaches
wherein retrieving the respective workload for each of the plurality of processing devices includes calculating the respective workload for a processing device by identifying one or more operations in the respective portion of uncompiled code set for execution by the processing device and retrieving a workload associated with each of the identified one or more operations (Para [0016], "The determination of whether the phases or layers of the neural network are compute-intensive or memory-intensive may be based on whether the performance of a layer is limited, bound, constrained, or the like, by the compute bandwidth or the memory bandwidth of the VPU IP") Examiner Comments: Paul teaches identifying individual layers (operations) in the neural network code and determining the associated workload (compute vs. memory bound characteristics) for each identified operation.
Regarding Claim 10, Paul and Mohammad teach
The method of Claim 7.
Paul and Mohammad further teach
monitoring an execution of an operation on a processing device of the plurality of processing devices (Paul, Para [0014], "Compiler-based static prediction may be combined with dynamic telemetry obtained from performance counters and temperature sensors or voltage droop circuit sensors. This may allow for more effective proactive prediction of the power sharing across existing architecture partitions"; Mohammad, Para [0014], "This MIMO control system monitors power consumption of each compute element under its control") Examiner Comments: Both Paul and Mohammad teach monitoring the execution of operations on processing devices using performance counters and telemetry;
determining the workload for the operation based on the monitoring of the execution of the operation (Mohammad, Abstract, "dynamically reacting to changes in the switching capacitance that present workload induces in each heterogeneous compute element and learning the coefficients of a power-frequency model for each compute element for the present workload") Examiner Comments: Mohammad teaches determining workload characteristics from monitored execution data for each compute element.
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine Paul's compiler-based power management technique with Mohammad's distributed workload-aware power limiting across multiple heterogeneous compute elements in order to extend the energy efficiency benefits of compiler-based per-layer DVFS to distributed computing environments comprising multiple heterogeneous processing devices, thereby improving overall system performance under power constraints (Mohammad, Abstract). Both Paul and Mohammad are directed to workload-aware power management for processing devices, and applying Paul's per-layer voltage/frequency adjustments to each device in Mohammad's distributed system would yield the predictable result of improved energy efficiency across the distributed computing system.
Regarding Claim 11, Paul and Mohammad teach
The method of Claim 10.
Mohammad further teaches
wherein the workload for the operation is updated dynamically based on the monitoring of the execution of operation on the processing device (Mohammad, Abstract, "The scheme rapidly re-learns coefficients of the power model and rapidly adapts the frequency as the workload's characteristics shift ensuring that compute elements run at the maximum frequency they can while not exceeding the input power limit") Examiner Comments: Mohammad explicitly teaches dynamically updating the workload model as execution characteristics change, enabling adaptive frequency control.
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine Paul's compiler-based power management technique with Mohammad's distributed workload-aware power limiting across multiple heterogeneous compute elements in order to extend the energy efficiency benefits of compiler-based per-layer DVFS to distributed computing environments comprising multiple heterogeneous processing devices, thereby improving overall system performance under power constraints (Mohammad, Abstract). Both Paul and Mohammad are directed to workload-aware power management for processing devices, and applying Paul's per-layer voltage/frequency adjustments to each device in Mohammad's distributed system would yield the predictable result of improved energy efficiency across the distributed computing system.
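For illustration only, a minimal Python sketch of the kind of dynamic updating Mohammad is cited for above, under the assumption that power is modeled as P ~ k*f^3 + P_static; the model form and all names are hypothetical and do not reproduce Mohammad's actual controller.

```python
# Hypothetical sketch: re-learn power-frequency coefficients from monitored
# samples, then pick the maximum frequency within an input power limit.
import numpy as np

def fit_power_model(freqs_ghz, powers_w):
    """Least-squares fit of P ~= k * f^3 + p_static."""
    A = np.column_stack([np.asarray(freqs_ghz) ** 3, np.ones(len(freqs_ghz))])
    (k, p_static), *_ = np.linalg.lstsq(A, np.asarray(powers_w), rcond=None)
    return k, p_static

def max_frequency_ghz(k, p_static, power_limit_w):
    """Largest f satisfying k * f^3 + p_static <= power_limit_w."""
    headroom = max(power_limit_w - p_static, 0.0)
    return (headroom / k) ** (1.0 / 3.0) if k > 0 else 0.0

# Monitored (frequency, power) samples for the present workload phase;
# as the workload's characteristics shift, new samples are folded in and
# the fit is repeated, dynamically updating the model.
k, p_static = fit_power_model([0.8, 1.2, 1.6, 2.0], [22.0, 38.0, 65.0, 105.0])
print(round(max_frequency_ghz(k, p_static, 80.0), 2), "GHz under an 80 W cap")
```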
Regarding Claim 12, Paul and Mohammad teach
The method of Claim 1.
Paul and Mohammad further teach
wherein injecting the power state instruction into the compiled code is further based on a cost model (Paul, Para [0024], "Operation 306 may include determining the optimal compute and memory power for each layer to send to the VPU. The determination may include a voltage and/or frequency to send to the DPU 110 and/or the CMX 112 discussed in FIG. 1, above. Returning to FIG. 3, once the compute and memory intensive layers have been identified in Operations 302 and 304, a determination may be made of the optimal power level or frequency for the DPU 110, which may lead to the lowest energy within a performance degradation threshold for execution of the neural network (e.g., five percent)"; Mohammad, Abstract, "learning the coefficients of a power-frequency model for each compute element for the present workload") Examiner Comments: Both Paul's optimization model for determining optimal power levels and Mohammad's learned power-frequency model constitute cost models used to determine power state settings.
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine Paul's compiler-based power management technique with Mohammad's distributed workload-aware power limiting across multiple heterogeneous compute elements in order to extend the energy efficiency benefits of compiler-based per-layer DVFS to distributed computing environments comprising multiple heterogeneous processing devices, thereby improving overall system performance under power constraints (Mohammad, Abstract). Both Paul and Mohammad are directed to workload-aware power management for processing devices, and applying Paul's per-layer voltage/frequency adjustments to each device in Mohammad's distributed system would yield the predictable result of improved energy efficiency across the distributed computing system.
Regarding Claim 13, Paul and Mohammad teach
The method of Claim 12.
Paul further teaches
predict locations in the compiled code to inject power state instructions (Paul, Para [0016], "a compiler-based proactive prediction of compute and memory bound, or intensive, phases or layers of neural network may be employed by the VPU. In such an example, a compiler or performance simulator, which may be external to the VPU 100, may compile and analyze the neural network, layer-by-layer, to determine which layers are compute-intensive and which are memory-intensive. The determination of whether the phases or layers of the neural network are compute-intensive or memory-intensive may be based on whether the performance of a layer is limited, bound, constrained, or the like, by the compute bandwidth or the memory bandwidth of the VPU IP. For example, the determination of whether the layers or phases are compute-intensive or memory-intensive may be based on the power consumption of the VPU 100, the DPU 110, or the CMX 112 as the layers of the neural network are executed. In an example, the compute bandwidth may be limited by the number of compute units, the frequency of operation of the compute units, and the utilization of the compute units. The memory bandwidth may be limited by the bandwidth between the CMX 112 and the compute units and/or the bandwidth from memory external to the VPU 100") Examiner Comments: Paul's layer-by-layer analysis predicts where in the compiled code to inject power state instructions;
predict a voltage setting and a frequency setting for each of the power state instructions (Paul, Para [0017], "Since architectural parameters such as the number and frequency of compute units and bus and memory bandwidth are defined during the design of the VPU, for any deep neural network or ML workload, it may be possible to fully estimate the execution latency and energy use or requirement of the layers of the neural network (e.g., each layer in the neural network, one or more respective layers of the neural network, or the like) and determine whether the performance of the layer (e.g., a particular layer) is limited by compute or memory performance. From the estimation, fractions of the total power budget available to the VPU 100 may be allocated to the compute and memory units. This fraction may be represented by a ratio of F_compute and F_CMX which may be passed to the hardware of the VPU 100 at runtime") Examiner Comments: Paul teaches predicting the voltage/frequency settings for the power state instructions based on the cost model.
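For illustration only, a hypothetical sketch of how a cost model might predict injection locations, in the spirit of Paul's layer-by-layer analysis as mapped above: each layer is classified as compute-bound or memory-bound from assumed architectural parameters, and a new power state instruction is predicted only where the classification changes. All parameters and figures are assumptions.

```python
# Hypothetical sketch; device parameters and layer figures are assumptions.
PEAK_FLOPS = 2.0e12    # assumed compute bandwidth (FLOP/s)
PEAK_MEM_BW = 1.0e11   # assumed memory bandwidth (bytes/s)

def classify(layer):
    compute_s = layer["flops"] / PEAK_FLOPS
    memory_s = layer["bytes"] / PEAK_MEM_BW
    return "compute" if compute_s >= memory_s else "memory"

def injection_points(layers):
    """Indices where bound-ness changes: the predicted locations at which
    to inject a new power state instruction."""
    points, prev = [], None
    for i, layer in enumerate(layers):
        kind = classify(layer)
        if kind != prev:
            points.append((i, kind))
            prev = kind
    return points

layers = [
    {"flops": 4e9, "bytes": 1e7},  # compute-bound
    {"flops": 1e8, "bytes": 5e8},  # memory-bound
    {"flops": 5e9, "bytes": 2e7},  # compute-bound again
]
print(injection_points(layers))  # [(0, 'compute'), (1, 'memory'), (2, 'compute')]
```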
Regarding Claim 14, Paul and Mohammad teach
The method of Claim 13.
Paul and Mohammad further teach
wherein the cost model is trained using monitored device traces of a processing device executing high level operations (Paul, Para [0014], "dynamic telemetry obtained from performance counters and temperature sensors"; Mohammad, Abstract, "learning the coefficients of a power-frequency model for each compute element for the present workload") Examiner Comments: Mohammad's power-frequency model is learned/trained using monitored device execution data (performance counters and telemetry), which constitutes device traces from processing devices executing operations.
Regarding Claim 15, Paul and Mohammad teach
The method of Claim 12.
Mohammad further teaches
wherein predictions made by the cost model are further based at least in part on utility costs (Mohammad, Abstract, "the scheme forecasts a maximum frequency that the compute element can run at without exceeding an input power limit for a given workload. The scheme rapidly re-learns coefficients of the power model and rapidly adapts the frequency as the workload's characteristics shift ensuring that compute elements run at the maximum frequency they can while not exceeding the input power limit") Examiner Comments: Mohammad teaches that the cost model operates subject to external power constraints (input power limits). The input power limit is an external constraint that can incorporate utility/energy cost considerations.
Regarding Claim 16, Paul and Mohammad teach
The method of Claim 12.
Paul further teaches
wherein the workload is defined by at least one of: (a) floating-point operations per second (FLOP) utilization; (b) memory bandwidth utilization; or (c) any combination of (a) and (b) (Paul, Para [0016], "whether the performance of a layer is limited, bound, constrained, or the like, by the compute bandwidth or the memory bandwidth of the VPU IP"; see also FIG. 2 showing FLOPS profile 204) Examiner Comments: Paul explicitly shows FLOPS utilization profiles (FLOP utilization) and memory bandwidth utilization as workload metrics in the analysis, directly teaching that the workload is defined by FLOP utilization, memory bandwidth utilization, or a combination thereof.
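For illustration only, a short worked sketch of the two claimed metrics; all figures are hypothetical. Each metric is simply achieved throughput over a device peak, so either one, or both together, can define the workload.

```python
# Hypothetical sketch; peaks and achieved figures are illustrative only.
def flop_utilization(achieved_flops_per_s, peak_flops_per_s):
    return achieved_flops_per_s / peak_flops_per_s

def mem_bw_utilization(achieved_bytes_per_s, peak_bytes_per_s):
    return achieved_bytes_per_s / peak_bytes_per_s

# A layer sustaining 1.2 TFLOP/s on a 2 TFLOP/s device while moving
# 90 GB/s against a 100 GB/s memory peak:
print(flop_utilization(1.2e12, 2.0e12))    # 0.6
print(mem_bw_utilization(9.0e10, 1.0e11))  # 0.9 -> memory bandwidth dominates
```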
Regarding Claim 17, Paul and Mohammad teach
The method of Claim 1.
Paul further teaches
wherein the computer program is for training a machine-learning model (Abstract, "A system for autonomous and proactive power management for energy efficient execution of machine learning workloads") Examiner Comments: Paul explicitly discloses execution of machine learning workloads, which encompasses training of machine-learning models.
Regarding Claim 19, the claim is a system claim corresponding to method Claim 1 and, therefore, is rejected for the same reasons set forth in the rejection of Claim 1.
Regarding Claim 20, Paul and Mohammad teach
The system of Claim 19.
Mohammad further teaches
wherein at least some of the compiled code is targeted to execute synchronously across the plurality of computing devices, and the compiler sets the voltage setting and power setting in the power state instruction according to a sampled workload of the plurality of computing devices (Mohammad, Para [0023], "This MIMO control system monitors power consumption of each compute element under its control and maximizes each element's ... frequency subject to a constraint on the overall power consumed across all compute elements under control") Examiner Comments: Mohammad's MIMO system monitors workload across all compute elements and sets frequency/voltage subject to a global constraint, which teaches setting power state instructions according to a sampled workload across synchronously-executing devices.
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine Paul's compiler-based power management technique with Mohammad's distributed workload-aware power limiting across multiple heterogeneous compute elements in order to extend the energy efficiency benefits of compiler-based per-layer DVFS to distributed computing environments comprising multiple heterogeneous processing devices, thereby improving overall system performance under power constraints (Mohammad, Abstract). Both Paul and Mohammad are directed to workload-aware power management for processing devices, and applying Paul's per-layer voltage/frequency adjustments to each device in Mohammad's distributed system would yield the predictable result of improved energy efficiency across the distributed computing system.
Regarding Claim 21, Paul and Mohammad teach
The system of Claim 19.
Paul and Mohammad further teach
wherein the compiler coordinates injection of a plurality of power state instructions into the compiled code for execution across the plurality of computing devices to improve performance and reduce energy usage (Paul, Abstract, "autonomous and proactive power management for energy efficient execution"; Mohammad, Abstract, "improve performance of power-constrained computers") Examiner Comments: Coordinating power state instructions across distributed devices to improve performance and reduce energy is the predictable result of combining Paul's compiler-based DVFS with Mohammad's distributed MIMO power management.
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine Paul's compiler-based power management technique with Mohammad's distributed workload-aware power limiting across multiple heterogeneous compute elements in order to extend the energy efficiency benefits of compiler-based per-layer DVFS to distributed computing environments comprising multiple heterogeneous processing devices, thereby improving overall system performance under power constraints (Mohammad, Abstract). Both Paul and Mohammad are directed to workload-aware power management for processing devices, and applying Paul's per-layer voltage/frequency adjustments to each device in Mohammad's distributed system would yield the predictable result of improved energy efficiency across the distributed computing system.
Regarding Claim 22, Paul and Mohammad teach
The system of Claim 19.
Paul further teaches
wherein the compiler is configured to inject power state instructions that increase a voltage and a frequency of one or more computing devices of the plurality of computing devices to improve performance (Para [0016], “a compiler-based proactive prediction of compute and memory bound, or intensive, phases or layers of neural network may be employed by the VPU. In such an example, a compiler or performance simulator, which may be external to the VPU 100, may compile and analyze the neural network, layer-by-layer, to determine which layers are compute-intensive and which are memory-intensive. The determination of whether the phases or layers of the neural network are compute-intensive or memory-intensive may be based on whether the performance of a layer is limited, bound, constrained, or the like, by the compute bandwidth or the memory bandwidth of the VPU IP. For example, the determination of whether the layers or phases are compute-intensive or memory-intensive may be based on the power consumption of the VPU 100, the DPU 110, or the CMX 112 as the layers of the neural network are executed. In an example, the compute bandwidth may be limited by the number of compute units, the frequency of operation of the compute units, and the utilization of the compute units. The memory bandwidth may be limited by the bandwidth between the CMX 112 and the compute units and/or the bandwidth from memory external to the VPU 100”) Examiner Comments: Paul teaches that for compute-bound operations, the system increases voltage and frequency to the DPU to improve execution performance.
Regarding Claim 23, Paul and Mohammad teach
The system of Claim 19.
Paul further teaches
wherein the compiler is configured to inject power state instructions that lower a voltage and a frequency of one or more computing devices of the plurality of computing devices to reduce power consumption (Para [0024], “This total energy may be considered a baseline total energy. Then, during a loop through all of the layers, for compute intensive layers, the DPU frequency may be reduced along the DPU’s V-F curve, while calculating power and energy repeatedly until the performance drop is close to the performance degradation threshold. For memory intensive layers, the DPU frequency may be reduced repeatedly until the DPU is able to complete the compute cycles before data transfer is completed in cache or memory”) Examiner Comments: Paul teaches that for memory-bound operations where the compute units are not fully utilized, the system reduces voltage and frequency to save power.
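For illustration only, a hypothetical sketch loosely following the passage quoted from Paul's Para [0024]: for a memory-intensive layer, the compute frequency is stepped down only so long as the compute still finishes no later than the data transfer, reducing power consumption without slowing the layer. The stepping loop and all parameters are assumptions.

```python
# Hypothetical sketch; frequency stepping and parameters are assumptions.
def lowest_safe_frequency_hz(flops, bytes_moved, f_max_hz, flops_per_cycle,
                             mem_bw_bytes_per_s, step_hz=50e6):
    transfer_s = bytes_moved / mem_bw_bytes_per_s   # fixed by memory bandwidth
    f = f_max_hz
    while f - step_hz > 0:
        compute_s = flops / (flops_per_cycle * (f - step_hz))
        if compute_s > transfer_s:  # next step would make compute the bottleneck
            break
        f -= step_hz
    return f

# A memory-bound layer: little compute, heavy data movement.
f = lowest_safe_frequency_hz(flops=1e8, bytes_moved=5e8, f_max_hz=1.4e9,
                             flops_per_cycle=256, mem_bw_bytes_per_s=1.0e11)
print(f / 1e6, "MHz")  # well below the 1400 MHz maximum
```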
Regarding Claim 24, Paul (US 2023/0273832 A1) teaches
A processing device of a distributed computing system comprising:
at least one processor (Abstract, "an apparatus such as system-on-chip (SoC) comprising an accelerator configurable to load and execute a neural network");
execute, with the at least one processor, the compiled code (Para [0042], "the accelerator executes the neural network") Examiner Comments: Paul teaches execution of compiled neural network code on the accelerator;
set a voltage and frequency of the at least one processor when a power state instruction is executed, wherein the power state instruction is injected by the compiler at the centralized computing device based on a workload assigned to a portion of uncompiled code for a respective portion of the compiled code, the workload defining an analysis of processor utilization for the portion of the uncompiled code that is to be executed by the processing device (Para [0028], "The determination of F_compute and F_CMX may be passed to the local PMU at runtime through the use of a special power-management instruction from the compiler"; Para [0008], "In the following examples, using a local power management unit (PMU) on the VPU that can dynamically change the voltage and/or the frequency of a signal to the VPU, while still operating within a fixed power budget from a central processing unit (CPU) level power-management unit can provide more autonomous power management and lower latency when executing the layers of the neural network") Examiner Comments: Paul teaches the compiler inserts a special power-management instruction that directs the local PMU to set voltage and frequency based on the compiler's workload analysis of processor utilization.
Paul did not specifically teach
a network interface configured to electronically communicate with a centralized computing device configured to operate a compiler and manage a distributed computing system;
wherein the processing device is configured to receive compiled code for a computer program from the compiler of the distributed computing system.
However, Mohammad (US 2021/0124404 A1) teaches
a network interface configured to electronically communicate with a centralized computing device configured to operate a compiler and manage a distributed computing system; receiving compiled code from the compiler of the distributed computing system (Para [0010], "a server rack comprising a mix of CPU and GPU blades," where the server blades are networked together and managed by a centralized control system) Examiner Comments: Mohammad's server rack architecture includes networked processing devices that communicate with and receive workloads from a centralized management and compilation system.
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine Paul's teaching with Mohammad's in order to extend the compiler-based power management to networked processing devices in a distributed computing system, thereby achieving system-wide energy efficiency and performance optimization across multiple heterogeneous compute elements (Mohammad, Abstract).
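For illustration only, the device-side counterpart of the sketch given for claim 1: a hypothetical execution loop in which encountering a power state instruction causes the local power-management unit to apply the carried voltage and frequency before the following instructions run. The instruction encoding and PMU interface are assumptions, not taken from Paul or Mohammad.

```python
# Hypothetical sketch; instruction encoding and PMU interface are assumptions.
class LocalPMU:
    def apply(self, voltage_mv, frequency_mhz):
        print(f"PMU: set {voltage_mv} mV / {frequency_mhz} MHz")

def run_compiled(instructions, pmu):
    for instr in instructions:
        if isinstance(instr, tuple) and instr[0] == "power_state":
            _, v, f = instr
            pmu.apply(v, f)  # set V/F for the portion that follows
        else:
            print(f"exec: {instr}")  # stand-in for ordinary execution

run_compiled([("power_state", 900, 1400), "matmul", "relu",
              ("power_state", 750, 900), "all_reduce"], LocalPMU())
```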
Claims 4, 9, and 18 are rejected under 35 U.S.C. 103 as being unpatentable over Paul (US 2023/0273832 A1) in view of Mohammad (US 2021/0124404 A1), further in view of Jasoliya (US 10699369 B2).
Regarding Claim 4, Paul and Mohammad teach
The method of Claim 1.
Paul and Mohammad did not specifically teach
wherein the uncompiled code defines high level operations for a computational graph defining a machine-learning model.
However, Jasoliya (US 10699369 B2) teaches
wherein the uncompiled code defines high level operations for a computational graph defining a machine-learning model (Col 7: ln 30-40, “graphics processor 300 includes a block image transfer (BLIT) engine 304 to perform two-dimensional (2D) rasterizer operations including, for example, bit-boundary block transfers. However, in one embodiment, 2D graphics operations are performed using one or more components of graphics processing engine (GPE) 310”) Examiner Comments: Jasoliya teaches DVFS for ML workloads with computational graphs.
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine Paul and Mohammad's teaching with Jasoliya's in order to apply the combined power-management technique to machine-learning code, because Jasoliya teaches that DVFS reduces the energy consumed by computation-intensive machine-learning training on distributed systems.
Regarding Claim 9, Paul and Mohammad teach
The method of Claim 7.
Paul and Mohammad did not specifically teach
wherein the one or more operations are high level operations for training a machine-learning model.
However, Jasoliya (US 10699369 B2) teaches
wherein the one or more operations are high level operations for training a machine-learning model (Col 11: ln 60-67, “In one embodiment the additional fixed function logic 516 can also include machine-learning acceleration logic, such as fixed function matrix multiplication logic, for implementations including optimizations for machine learning training or inferencing”).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine Paul and Mohammad's teaching with Jasoliya's in order to apply the combined power-management technique to machine-learning code, because Jasoliya teaches that DVFS reduces the energy consumed by computation-intensive machine-learning training on distributed systems.
Regarding Claim 18, Paul and Mohammad teach
The method of Claim 17.
Paul and Mohammad did not teach
wherein the machine-learning model is a large language model.
However, Jasoliya (US 10699369 B2) teaches
wherein the machine-learning model is a large language model (Col 11: ln 60-67, “In one embodiment the additional fixed function logic 516 can also include machine-learning acceleration logic, such as fixed function matrix multiplication logic, for implementations including optimizations for machine learning training or inferencing”).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine Paul and Mohammad's teaching with Jasoliya's in order to apply the combined power-management technique to machine-learning code, because Jasoliya teaches that DVFS reduces the energy consumed by computation-intensive machine-learning training on distributed systems.
Claim 5 is rejected under 35 U.S.C. 103 as being unpatentable over Paul (US 2023/0273832 A1) in view of Mohammad (US 2021/0124404 A1), further in view of Zhang (US 2023/0334292 A1).
Regarding Claim 5, Paul and Mohammad teach
The method of Claim 1.
Paul and Mohammad did not specifically teach
wherein the compiling is performed by an accelerated linear algebra (XLA) compiler.
However, Zhang (US 2023/0334292 A1) teaches
wherein the compiling is performed by an accelerated linear algebra (XLA) compiler (Para [0089], "The XLA compiler is a linear algebra compiler for a specific field, and can speed up operating of a TensorFlow model, possibly without changing source code").
It would have been obvious to a person having ordinary skill in the art before the effective filing date of the claimed invention to have combined Paul and Mohammad's teaching with Zhang's in order to compile the model with an XLA compiler, as Zhang teaches that the XLA compiler can speed up operation of a TensorFlow model by converting a neural network into a first computational graph and fusing multiple nodes of each parallel branch group to obtain a second, optimized computational graph (Zhang, Summary).
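For illustration only, the common public way to invoke XLA compilation of TensorFlow code, consistent with (but not drawn from) Zhang's statement that XLA can speed up a TensorFlow model possibly without changing source code; assumes a TensorFlow 2.x environment.

```python
# Hypothetical usage sketch; requires TensorFlow 2.x with XLA support.
import tensorflow as tf

@tf.function(jit_compile=True)  # compile this function with the XLA compiler
def dense_step(x, w):
    return tf.nn.relu(tf.matmul(x, w))

x = tf.random.normal((8, 16))
w = tf.random.normal((16, 4))
print(dense_step(x, w).shape)  # (8, 4); the body was lowered through XLA
```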
Conclusion
Any inquiry concerning this communication or earlier communications from the examiner should be directed to AMIR SOLTANZADEH whose telephone number is (571)272-3451. The examiner can normally be reached M-F, 9am - 5pm ET.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Wei Mui can be reached at (571) 272-3708. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.
/AMIR SOLTANZADEH/Examiner, Art Unit 2191 /WEI Y MUI/Supervisory Patent Examiner, Art Unit 2191