Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Information Disclosure Statement
An information disclosure statement (IDS) was submitted on 07/21/2023. The submission is in compliance with the provisions of 37 CFR 1.97. Accordingly, the information disclosure statement is being considered by the examiner.
Claim Rejections - 35 USC § 103
In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis (i.e., changing from AIA to pre-AIA) for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.
Claims 1-2, 6-8, 11-12, 16-18 and 20 are rejected under 35 U.S.C. 103 as being unpatentable over Bogdan (US 20210049465 A1) in view of Brothers (US 20160350645 A1) and Carter (US 8250395 B2).
Regarding claim 1, Bogdan teaches:
A method for controlling a processing device to execute an application that employs a neural network (NN), the processing device including a plurality of processing units arranged in a network-on-chip (NoC) to which the NN is mapped, the method comprising: (Claim 1. A method executed by a self-optimizing and self-programming computing system, the method comprising:)
obtaining compiler information, the compiler information including computing loads of the application on the processing units,… . ([0072] With reference to FIG. 1, a method executed by a self-optimizing and self-programming computing system is illustrated. The method includes a step of transforming a target application at compile time from low-level virtual machine intermediate representation instructions into an instruction dependency graph with one or more trained neural network classifiers. In this regard, the instruction dependency graph includes nodes that denote low-level virtual machine intermediate representation instructions, edges that denote data dependencies, and edge weights that denote an amount of data to be transferred between two dependent instructions. The one or more trained neural network classifiers are applied to identify first partitions in the instruction dependency graph having predefined programming features.[0078] The self-optimizing and self-programming computing system is further operable to profiles the instructions to determine data size as edge weights in the instruction dependency graph. It should be appreciated that this method for constructing the instruction dependency graphs combines static and dynamic program analysis as set forth below in more detail.; see also [0091-0110])
and enabling the processing units to perform their respective tasks of the application within their respective adjusted computing time. ([0072] Finally, at runtime, heterogeneous tasks are mapped to hardware by distributed intelligent schedulers to fully utilize hardware components in a heterogeneous hardware platform such that performance is maximized. It should also be appreciated that these tasks can be interdependent and can have different processing element affinities. Characteristically, the distributed intelligent schedulers mapping tasks onto either general CPUs, graphics processing units, or domain-specific programming elements. An exemplary heterogeneous hardware platform is a network on a chip. See also [0090, 0128-0132])
Bogdan does not appear to explicitly teach: the computing loads relating to a dataflow type of the NN.
However, Brothers teaches: In one embodiment,[0038] the NN engine may perform the necessary processing in any of a variety of different orders while maintaining all needed data in the internal memory or buffering. For example, the NN engine may generate portions of each output feature map for a tile in layer N+1 by reading and convolving all of the input feature maps defined by the corresponding tile of layer N and adding the results. After generating the corresponding tile of layer N+1, the data for the tile of layer N used to generate the tile of layer N+1 is no longer needed. Accordingly, the NN engine may recycle, delete, free, or overwrite the memory used to store the tile of layer N in order to store results (e.g., the corresponding tile) for layer N+2, and so on. The NN engine may continue to overwrite the intermediate data for a layer as newly generated intermediate data of next layers is generated as described. An example method is described in greater detail in connection with FIG. 5.
Accordingly, it would have been obvious to a person having ordinary skill in the art before the effective filing date of the claimed invention to combine the self-optimizing, compiler-driven workload and task mapping of Bogdan with the neural-network and tiling techniques of Brothers in order to perform processing of the tiles in any of a variety of different orders while maintaining all needed data in the internal memory or buffer ([0038]). This combination merely applies known techniques for neural network traversal and processing to yield predictable results.
Bogdan does not appear to explicitly teach: determining a scaling factor for computing time of each of the processing units based on the computing loads; adjusting the computing time of the processing units based on the scaling factors;
However, Carter teaches: col 12, line3-14 The DVFS control system then uses one or more values (the number of values is a programmable selection) from the active core count circular buffer and the slack core count circular buffer to determine a utilization slack value of data processing system (step 420). The DVFS control system computes the utilization slack value based on the sum of slack core counts divided by the sum of active core counts. In addition, the DVFS control system computes a new utilization metric to be a difference between a full utilization value and the utilization slack value, such as a new utilization metric that is equal to 100 minus the utilization slack value (step 422). Col9 line 31-47 DVFS control system 302 then computes a new utilization metric to be a difference between a full utilization value and the utilization slack value, such as a new utilization metric that is equal to 100 minus the utilization slack value. DVFS control system 302 then compares the new utilization metric value to identify whether the new utilization metric value is above or below a predetermined utilization threshold. If the new utilization metric value is above the predetermined utilization threshold, then DVFS control system 302 identifies data processing system 300 as being over utilized and increases the frequency of data processing system 300 by one step size. If the new utilization metric value is below the predetermined utilization threshold, then DVFS control system 302 identifies data processing system 300 as being under utilized and decreases the frequency of data processing system 300 by one step size; see also col 7, line 49 – col. 9, line 65.
Accordingly, it would have been obvious to a person having ordinary skill in the art before the effective filing date of the claimed invention to combine the self-optimizing, compiler-driven workload and task mapping of Bogdan with the neural-network and tiling techniques of Brothers and the dynamic voltage and frequency scaling of Carter in order to improve execution efficiency and resource utilization of workloads on a NoC platform. This combination merely applies known techniques for neural network traversal and power-performance scaling to a compiler-based workload management system, yielding predictable results.
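The utilization-slack decision that Carter is cited for can be sketched in a few lines. This is a minimal illustration of the quoted passage only; the function name, argument shapes, threshold, and step size are hypothetical, not taken from the reference:

```python
def dvfs_step(active_counts, slack_counts, freq, threshold=90.0, step=1):
    """Sketch of Carter's utilization-slack DVFS decision.

    utilization slack = sum(slack core counts) / sum(active core counts),
    expressed as a percentage; new metric = 100 - utilization slack.
    """
    slack = 100.0 * sum(slack_counts) / sum(active_counts)
    metric = 100.0 - slack
    if metric > threshold:   # over-utilized: raise frequency one step
        return freq + step
    if metric < threshold:   # under-utilized: lower frequency one step
        return freq - step
    return freq              # optimally utilized: no change
```

With counts sampled per core over a window, a system at full utilization (no slack) steps its frequency up by one, and an under-utilized one steps it down, matching the quoted behavior.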
Regarding claim 2, Bogdan, when combined with Carter, teaches:
The method of claim 1, wherein the scaling factor for the computing time of each of the processing units is determined at each synchronization stage of the NN based on the computing load on the processing unit and a critical computing load on one of the processing units at the synchronization stage.( [0079] In a variation, the distributed intelligent schedulers receive feedback from the heterogeneous hardware platform, the feedback including a next state and an immediate reward formulated in terms of types of tasks and availability of hardware components. As set forth below in more detail, the distributed intelligent schedulers observe a set of environment states S and executes a set of actions A such that actions represent mapping of tasks onto one type of programming elements in task pools and wherein a probability of selecting an action under a state is called policy π(s, a)=P(a.sub.t=a|s.sub.t, =s). Characteristically, the distributed intelligent schedulers give a negative reward to prevent an action of assigning a task to a busy processing element wherein if a target processing element is busy the task is migrated to a nearby processing element with a cost of network congestion and latency.)
Carter teaches: col 12, line3-14 The DVFS control system then uses one or more values (the number of values is a programmable selection) from the active core count circular buffer and the slack core count circular buffer to determine a utilization slack value of data processing system (step 420). The DVFS control system computes the utilization slack value based on the sum of slack core counts divided by the sum of active core counts. In addition, the DVFS control system computes a new utilization metric to be a difference between a full utilization value and the utilization slack value, such as a new utilization metric that is equal to 100 minus the utilization slack value (step 422). Col9 line 31-47 DVFS control system 302 then computes a new utilization metric to be a difference between a full utilization value and the utilization slack value, such as a new utilization metric that is equal to 100 minus the utilization slack value. DVFS control system 302 then compares the new utilization metric value to identify whether the new utilization metric value is above or below a predetermined utilization threshold. If the new utilization metric value is above the predetermined utilization threshold, then DVFS control system 302 identifies data processing system 300 as being over utilized and increases the frequency of data processing system 300 by one step size. If the new utilization metric value is below the predetermined utilization threshold, then DVFS control system 302 identifies data processing system 300 as being under utilized and decreases the frequency of data processing system 300 by one step size; see also col 7, line 49 – col. 9, line 65.
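The claimed per-stage scaling factor (each processing unit's computing load relative to the critical, i.e. largest, load at a synchronization stage) can be illustrated with a short sketch. The formula is an illustrative reading of the claim language, not taken from either reference:

```python
def stage_scaling_factors(loads):
    """At one synchronization stage, scale each processing unit's
    computing time by its load relative to the critical (largest)
    load, so all units finish the stage together.  The unit carrying
    the critical load keeps a factor of 1.0 (no slowdown)."""
    critical = max(loads)
    return [load / critical for load in loads]
```

Under this reading, lightly loaded units receive small factors and can be run slower (e.g., via DVFS) without extending the stage's overall latency.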
Regarding claim 6, Carter teaches:
The method of claim 1, wherein the computing time of the processing units is adjusted based on the scaling factors by employing dynamic voltage and frequency scaling (DVFS). (col 7, line 49-61 .FIG. 3 depicts an exemplary block diagram of a dynamic voltage and frequency scaling (DVFS) control system in accordance with an illustrative embodiment. Dynamic voltage and frequency scaling (DVFS) control system 302 may be implemented in data processing system 300 which comprises a plurality of processors 304 and may control the operational parameters of processors 304, such as frequency, voltage, or the like. Each of processors 304 may further comprise a plurality of processor cores 306. In order to compute a utilization associated with each of processor cores 306, each of processor cores 306 comprises core utilization sensor 308. With the illustrative embodiments, core utilization sensor 308 may be implemented using various methods.)
Same motivation as claim 1.
Regarding claim 7, Carter teaches:
The method of claim 6, wherein frequencies at which the processing units operate are adjusted based on the scaling factors. (col 9, line 35-55. DVFS control system 302 then compares the new utilization metric value to identify whether the new utilization metric value is above or below a predetermined utilization threshold. If the new utilization metric value is above the predetermined utilization threshold, then DVFS control system 302 identifies data processing system 300 as being over utilized and increases the frequency of data processing system 300 by one step size. If the new utilization metric value is below the predetermined utilization threshold, then DVFS control system 302 identifies data processing system 300 as being under utilized and decreases the frequency of data processing system 300 by one step size. If the new utilization metric value is equal to the predetermined utilization threshold, then DVFS control system 302 identifies data processing system 300 as being optimally utilized and performs no action on the frequency of data processing system 300. By lowering the frequency by just one step, all voltages of all chips in the system or control domain drop a little bit, spreading out the work so that all cores will remain without slack, and operating at a more energy efficient point.)
Same motivation as claim 1.
Regarding claim 8, Carter teaches:
The method of claim 6, wherein voltages applied to the processing units are adjusted based on the scaling factors. (col 7, line 49-61 .FIG. 3 depicts an exemplary block diagram of a dynamic voltage and frequency scaling (DVFS) control system in accordance with an illustrative embodiment. Dynamic voltage and frequency scaling (DVFS) control system 302 may be implemented in data processing system 300 which comprises a plurality of processors 304 and may control the operational parameters of processors 304, such as frequency, voltage, or the like. Each of processors 304 may further comprise a plurality of processor cores 306. In order to compute a utilization associated with each of processor cores 306, each of processor cores 306 comprises core utilization sensor 308. With the illustrative embodiments, core utilization sensor 308 may be implemented using various methods.)
Regarding claim 11, the claim recites similar limitations as corresponding claim 1 and is rejected for similar reasons as claim 1 using similar teachings and rationale.
Regarding claim 12, the claim recites similar limitations as corresponding claim 2 and is rejected for similar reasons as claim 2 using similar teachings and rationale.
Regarding claim 16, the claim recites similar limitations as corresponding claim 6 and is rejected for similar reasons as claim 6 using similar teachings and rationale.
Regarding claim 17, the claim recites similar limitations as corresponding claim 7 and is rejected for similar reasons as claim 7 using similar teachings and rationale.
Regarding claim 18, the claim recites similar limitations as corresponding claim 8 and is rejected for similar reasons as claim 8 using similar teachings and rationale.
Regarding claim 20, Bogdan teaches:
The apparatus of claim 11, wherein the processing units include deep learning accelerator (DLA) cores. ([0018] In another aspect, SOSPCS provides the following novel contributions: 1) to address programming flexibility and fully exploit the underlying heterogeneous PEs (CPUs, GPUs, and HWAs), SOSPCS automatically partitions target applications in one domain into interdependent tasks with different PE affinities. Some tasks such as FFT prefer executing on accelerators, some such as for-loops prefer executing on GPUs, whereas the rest (e.g., general-purpose code) execute on CPUs and 2) runtime mapping is based on distributed RL to self-optimize the optimal policy under the environment variations. [0053] “HWA” means hardware accelerator. [0074] In a variation, the predefined programming features represent specific types of programming methods (e.g., sequences of program instruction implementing a flow control statement) or specialized functions. Examples of a programming features include, but are not limited to loops for CPUs; loops for GPUs; neurons in a neural network, activation functions, matrix multiplications, vector multiplication, gradient descent, for-loops in a machine learning domain, fast Fourier transforms, matrix multiplication, and combinations thereof. In a refinement, fast Fourier transforms and matrix multiplication are mapped to hardware accelerators and “for-loops” are mapped to graphics processing units with remaining tasks are mapped to central processing units. Typically, the target application has at least two predefined programming features.)
Claims 3 and 13 are rejected under 35 U.S.C. 103 as being unpatentable over Bogdan (US 20210049465 A1) in view of Brothers (US 20160350645 A1) and Carter (US 8250395 B2), and further in view of Barijough, "Exploiting Approximations in Real-Time Scheduling".
Regarding claim 3, Bogdan does not appear to explicitly teach:
The method of claim 2, wherein the dataflow type is layer-by-layer tiling, the NN includes a plurality of layers each being partitioned into one or more tiles that correspond to the processing units, and the scaling factor for the computing time of each of the processing units is determined in a corresponding tile of a corresponding layer of the NN based on the computing load of the corresponding tile and a critical computing load of a critical tile of the corresponding layer.
However, Barijough teaches: On page 290, One well-known technique for MC scheduling on uniprocessors is EDF-VD (Earliest Deadline First with Virtual Deadlines) [6]. EDF-VD is similar to EDF [7] except that the implicit deadlines of all high-criticality tasks are scaled by a factor x ∈ (0, 1) in low-criticality mode. Also on page 318. Experiments and Results We evaluated the quality/latency-aware task graph mapping, scheduling, and budgeting approaches described in this chapter for distributed real-time execution of neural network inference applications in edge computing settings. We applied our approach to an image segmentation and two object detection neural networks from [25–27] implemented on top of the Darknet deep learning framework [26]. We distribute the convolutional neural networks by tiling, fusing, and executing layers in a map-reduce scheme similar to [28]. In our evaluation, we focus on communication budgeting by assuming fixed worst-case bounds for compute tasks. Budgeting such a distributed neural network task graph requires partitioning the total latency budget among each layer groups’ scatter and gather operations. Mapping of this graph entails assigning fused tile stacks of each layer group to one of the edge devices in the local network, where a master node coordinates all task distribution.
Accordingly, it would have been obvious to a person having ordinary skill in the art before the effective filing date of the claimed invention to combine the self-optimizing, compiler-driven workload and task mapping of Bogdan with Barijough's neural-network tiling and fusing technique with a scaling factor. One of ordinary skill in the art would have been motivated to make this combination to improve execution and load balancing of compiler-scheduled neural network workloads on a NoC, yielding predictable results consistent with known neural network techniques.
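The EDF-VD mechanism quoted from Barijough — implicit deadlines of all high-criticality tasks scaled by a factor x ∈ (0, 1) in low-criticality mode — can be sketched directly. The task tuple shape is a hypothetical illustration, not the reference's data structure:

```python
def virtual_deadlines(tasks, x):
    """EDF-VD sketch: in low-criticality mode, shrink the deadlines of
    high-criticality tasks by a factor x in (0, 1); low-criticality
    tasks keep their original deadlines.  Tasks are (deadline, is_hi)
    pairs; scheduling then proceeds earliest-deadline-first on the
    virtual deadlines."""
    assert 0 < x < 1, "scaling factor must lie in (0, 1)"
    return [(d * x if is_hi else d, is_hi) for d, is_hi in tasks]
```

Shrinking high-criticality deadlines gives those tasks earlier priority under EDF, reserving slack for a potential switch to high-criticality mode.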
Regarding claim 13, the claim recites similar limitations as corresponding claim 3 and is rejected for similar reasons as claim 3 using similar teachings and rationale.
Claims 4 and 14 are rejected under 35 U.S.C. 103 as being unpatentable over Bogdan (US 20210049465 A1) in view of Brothers (US 20160350645 A1) and Carter (US 8250395 B2), and further in view of Wang, "Accelerating Deep Learning Inference with Cross-Layer Data Reuse on GPUs".
Regarding claim 4, Bogdan does not appear to explicitly teach:
The method of claim 2, wherein the dataflow type is cross-layer tiling, the NN includes a plurality of layers each being partitioned into one or more tiles, each of the processing units processes corresponding fused partitioned tiles of two or more of the layers, and the scaling factor for the computing time of each of the processing units is determined in corresponding fused tiles at a corresponding synchronization stage of the NN based on the computing load of the corresponding fused tiles and a critical computing load of critical fused tiles at the corresponding synchronization stage. ([0072] In another embodiment, method 500 may be performed to process a first frustum of the neural network through a first plurality of adjacent layers. Method 500 may iterate to process each other frustum through the first plurality of adjacent layers. Method 500 may then be implemented again to process a first frustum of a second and different plurality of adjacent layers having a different partitioning than the first plurality of adjacent layers. Method 500 may be repeated to process the remaining frustums through the second plurality of adjacent layers with the different partitioning.)
However, Wang teaches: On page 2, e. We design a cross-layer data reuse optimization method, which inputs the compute graph of CNN layers and generates the source code for GPUs (Fig. 1). The fusion strategies include analyzing the input graph for fusion, tiling the data and parallelism on devices and optimizing the memory usage on multi-level memory hierarchy. On page 7, Considering data reuse across layers, the parallel model for the fused layers is restricted by the CNN layer parameters. The filter with large size, which is larger than 1, will cause redundant computing for data dependence. Figure 6 shows a tiling example for two fused neural network layers computation on 4 SMs. The SMs are parallel on GPUs, and shared memory of each block is isolate and private, which means that redundant data storage is necessary for on-chip data reuse. See also section 3.2 Tiling and Parallelism on page 7.
Accordingly, it would have been obvious to a person having ordinary skill in the art before the effective filing date of the claimed invention to combine the self-optimizing, compiler-driven workload and task mapping of Bogdan with Wang's cross-layer tiling and fusing technique. One of ordinary skill in the art would have been motivated to make this combination to improve efficiency and synchronization of neural network workloads in parallel processing, yielding predictable results consistent with GPU and NoC optimization techniques.
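The redundant computation Wang attributes to filters larger than 1×1 when fusing layers follows from the growth of the required input tile with each fused layer. A hedged sketch of that receptive-field arithmetic, assuming stride 1 and no padding (assumptions for illustration, not conditions stated in the reference):

```python
def fused_input_tile(output_tile, kernel_sizes):
    """Width of the input tile needed to produce `output_tile` output
    elements through a stack of fused conv layers (stride 1, no
    padding).  Each k x k layer adds k - 1 to the required extent,
    which is why k > 1 forces overlapping (redundant) halo compute
    between neighboring tiles."""
    size = output_tile
    for k in kernel_sizes:
        size += k - 1
    return size
```

For two fused 3×3 layers, a 4-wide output tile needs an 8-wide input tile, so adjacent tiles overlap by 4 elements per edge — the data-dependence redundancy Wang describes.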
Regarding claim 14, the claim recites similar limitations as corresponding claim 4 and is rejected for similar reasons as claim 4 using similar teachings and rationale.
Claims 5 and 15 are rejected under 35 U.S.C. 103 as being unpatentable over Bogdan (US 20210049465 A1) in view of Brothers (US 20160350645 A1) and Carter (US 8250395 B2), and further in view of Harlap, "Improving ML applications in shared computing environments".
Regarding claim 5, Bogdan does not appear to explicitly teach:
The method of claim 2, wherein the dataflow type is layer pipeline tiling, the NN includes a plurality of layers each being partitioned into one or more tiles, the processing units, one after another at each synchronization stage, process corresponding tiles of corresponding layers sequentially, and the scaling factor for the computing time of each of the processing units is determined in a corresponding tile of a corresponding layer of the NN based on the computing load of the corresponding tile and a critical computing load of a critical tile of the corresponding layer.
However, Harlap teaches: on page 62, PipeDream uses pipeline parallelism (PP), a new parallelization strategy that combines intra-batch parallelism with inter-batch parallelism. Pipeline-parallel computation involves partitioning the layers of a DNN model into multiple stages, where each stage consists of a consecutive set of layers in the model. Each stage is mapped to a separate GPU that performs the forward pass (and backward pass) for all layers in that stage. On page 63, PipeDream treats model training as a computation pipeline, with each worker executing a subset of the model as a stage. Like with any pipeline, the steady state throughput of the resulting pipeline is the throughput of the slowest stage. On page 66, PipeDream’s optimizer solves dynamic programming problems progressively from the lowest to the highest level. Intuitively, this can be thought of as trying to find the optimal partitioning within a server, and then using these partitions to split a model optimally across servers.
Accordingly, it would have been obvious to a person having ordinary skill in the art before the effective filing date of the claimed invention to combine the self-optimizing, compiler-driven workload and task mapping of Bogdan with Harlap's layer-pipeline tiling and parallelism. One of ordinary skill in the art would have been motivated to make this combination to improve resource utilization and load balancing for neural network workloads on different processing platforms.
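Harlap's observation that the steady-state throughput of the pipeline is the throughput of the slowest stage can be written down directly; the sketch below is illustrative:

```python
def pipeline_throughput(stage_times):
    """Steady-state throughput of a layer pipeline is bounded by its
    slowest stage: once the pipeline is full, one batch completes per
    max(stage_times) time units, regardless of the faster stages."""
    return 1.0 / max(stage_times)
```

This bound is why PipeDream's partitioner, as quoted, searches for stage splits that balance per-stage work.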
Regarding claim 15, the claim recites similar limitations as corresponding claim 5 and is rejected for similar reasons as claim 5 using similar teachings and rationale.
Claims 9, 10 and 19 are rejected under 35 U.S.C. 103 as being unpatentable over Bogdan (US 20210049465 A1) in view of Brothers (US 20160350645 A1) and Carter (US 8250395 B2), and further in view of Chen (US 11410027 B2).
Regarding claim 9, Bogdan teaches the elements of claim 1 as outlined above.
Bogdan does not appear to explicitly teach: calculating a sum of the computing loads on the processing units for each of the dataflow types; selecting one of the dataflow types based on the sums.
However, Chen teaches: col 10, line 7-29, Pipeline resource determiner 1222 then determines a pipeline number 1432 (“total_PCUs”) of the physical compute units and/or the physical memory units required to process a pipeline compute load of the fused operation unit graph 224 on the reconfigurable data processor 100. Stage Compute Load Turning to FIG. 14, for each of the operation units (“for node in fused_graph”) of the fused operation unit graph 224, the stage latency determiner 1232 performs resource determination 1400 by using a resource determination function (e.g., “get_graph_PCUs” 1402) to determine a specific stage compute processing time 1414 (“node_latency_with_one_PCU”) required to process a stage compute load 1424 (“node.get_flop( )”) of a respective one of the operation units of the fused operation unit graph 224 using only one physical compute unit and/or only one physical memory unit. The stage compute load 1424 (“node.get_flop( )”) of the respective one of the operation units, which means a total number of floating point operations (FLOP) required to execute the respective one of the operation units, is determined by its operation type, input dimensionality, and output dimensionality.
Accordingly, it would have been obvious to a person having ordinary skill in the art before the effective filing date of the claimed invention to combine the self-optimizing, compiler-driven workload and task mapping of Bogdan with Chen's load summation techniques for each dataflow type. One of ordinary skill in the art would have been motivated to make this combination to enable the system to select between different dataflows based on compute cost, thereby improving scheduling and resource utilization on NoC processing platforms. This constitutes the use of known NN load analysis techniques to enhance existing compiler-driven workload systems.
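The claimed selection among dataflow types by summed computing load, for which Chen's per-node FLOP accounting is cited, can be sketched as follows. The dictionary shape and the choice of minimum total load are assumptions for illustration, not details from Chen:

```python
def select_dataflow(loads_by_type):
    """Given per-processing-unit computing loads for each candidate
    dataflow type, sum the loads per type and pick the type with the
    smallest total compute cost."""
    totals = {t: sum(loads) for t, loads in loads_by_type.items()}
    return min(totals, key=totals.get)
```

For example, between a layer-by-layer mapping totaling 12 load units and a cross-layer mapping totaling 9, the sketch selects the cross-layer dataflow.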
Regarding claim 10, Carter teaches:
The method of claim 9, further comprising: determining a scaling factor for computing time of each of the processing units based on the computing loads; adjusting the computing time of the processing units based on the scaling factors; and enabling the processing units to perform their respective tasks of the application within their respective adjusted computing time. (col 12, line3-14 The DVFS control system then uses one or more values (the number of values is a programmable selection) from the active core count circular buffer and the slack core count circular buffer to determine a utilization slack value of data processing system (step 420). The DVFS control system computes the utilization slack value based on the sum of slack core counts divided by the sum of active core counts. In addition, the DVFS control system computes a new utilization metric to be a difference between a full utilization value and the utilization slack value, such as a new utilization metric that is equal to 100 minus the utilization slack value (step 422). Col9 line 31-47 DVFS control system 302 then computes a new utilization metric to be a difference between a full utilization value and the utilization slack value, such as a new utilization metric that is equal to 100 minus the utilization slack value. DVFS control system 302 then compares the new utilization metric value to identify whether the new utilization metric value is above or below a predetermined utilization threshold. If the new utilization metric value is above the predetermined utilization threshold, then DVFS control system 302 identifies data processing system 300 as being over utilized and increases the frequency of data processing system 300 by one step size. If the new utilization metric value is below the predetermined utilization threshold, then DVFS control system 302 identifies data processing system 300 as being under utilized and decreases the frequency of data processing system 300 by one step size.)
Regarding claim 19, the claim recites similar limitations as corresponding claim 9 and is rejected for similar reasons as claim 9 using similar teachings and rationale.
Conclusion
Any inquiry concerning this communication or earlier communications from the examiner should be directed to CARLOS A ESPANA whose telephone number is (703)756-1069. The examiner can normally be reached Monday - Friday, 8 a.m. - 5 p.m. EST.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, LEWIS BULLOCK JR can be reached at (571)272-3759. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.
/C.A.E./Examiner, Art Unit 2199
/LEWIS A BULLOCK JR/Supervisory Patent Examiner, Art Unit 2199