Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
DETAILED ACTION
Claim Rejections - 35 USC § 101
35 U.S.C. 101 reads as follows:
Whoever invents or discovers any new and useful process, machine, manufacture, or composition of matter, or any new and useful improvement thereof, may obtain a patent therefor, subject to the conditions and requirements of this title.
The “computer program product comprising a computer readable storage medium” recited in claim 15 is not construed as storage in the form of transitory signals per se, in keeping with ¶ 40 of the specification as filed.
Claim Rejections - 35 USC § 103
In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102 of this title, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.
The factual inquiries set forth in Graham v. John Deere Co., 383 U.S. 1, 148 USPQ 459 (1966), that are applied for establishing a background for determining obviousness under 35 U.S.C. 103 are summarized as follows:
1. Determining the scope and contents of the prior art.
2. Ascertaining the differences between the prior art and the claims at issue.
3. Resolving the level of ordinary skill in the pertinent art.
4. Considering objective evidence present in the application indicating obviousness or nonobviousness.
Claims 1-5, 7-12, and 14-19 are rejected under 35 U.S.C. 103 as being unpatentable over Dias (US 2019/0303211; hereinafter Di) in view of Narayanan, “PipeDream: Generalized Pipeline Parallelism for DNN Training” (copy provided by Examiner; copyright 2019; hereinafter Nar), and further in view of Qiao, “Pollux: Co-adaptive Cluster Scheduling for Goodput-Optimized Deep Learning” (copy provided by Examiner; copyright 2021; hereinafter Qia).
Regarding claim 1
Di teaches:
A computer-implemented method comprising:
training a model within a parallelized training environment comprising a set of training resources, the model split into a plurality of pieces (Di: Abstract; ¶ 53, 72; Fig 1: the system utilizes scaled and shared compute, such as from a cloud, operative with GPU processing, to extract features from a model and its source code and to determine metrics thereon to improve the model with respect to measurement metrics; the shared system is operative in a high performance computing environment such as upon a GPU, and a GPU is considered a massively parallel environment; see additionally “Estimating the WCET of GPU-Accelerated Applications Using Hybrid Analysis” (hereinafter WCET; provided by Examiner; considered to discuss subject matter well known in the art), which describes generating dynamic profiles for program execution by analyzing portions, slices, and/or segments of code with respect to execution time upon one or more GPUs in parallel environments),
comprising determining processing resources necessary for a particular piece of the plurality of pieces of source code (Di: ¶ 19, 20, 23; Fig 5: system metrics obtained by executing code, pieces thereof, etc. upon resources of a shared computer environment such as upon a GPU, such as for hybrid analysis thereof);
collecting baseline metrics for the parallelized training environment (Di: Abstract; ¶ 19, 23, 26, 53, 72: system generates prediction of necessary and/or sufficient allocation of resources with respect to service level requirements comprising particular time, performance, etc. metrics such as upon one or more GPUs clustered into a high performance environment),
the baseline metrics comprising at least a time cost of a source code execution for a number of cycles represented by portions of the source code (Di: Abstract; ¶ 19-22, 36, 53, 59, 61, 64: the training data consist of service level time and performance metrics for executing code on resources of the shared computing environment, wherein the execution time is considered representative of the worst case execution time in the form of a number of cycles of a code, portions thereof, etc., and wherein a determined worst case execution time is considered a maximum time cost to execute a portion of code and, in this case, a threshold for providing a verifiable optimization);
determining whether a current number of available training resources is an integer multiple of a minimum threshold of training resources (Di: ¶ 3, 23, 72, 81, 91: the system predicts, manages, adjusts, etc. an allocation of plural training resources, which is an integer multiple of one; however, the recited resources are considered discrete functional resources such as necessary compute nodes, cores, virtual machines, components, etc.; discrete resources are typically provided in discrete functional multiples at or in excess of a minimum threshold of one);
responsive to the available training resources being at or above the minimum threshold (Di: ¶ 3, 23, 72, 81, 91: that is, the system is operative with at least a singular discrete resource allocated) of training resources:
initializing one or more try-runs to evaluate allocation adjustments (Di: ¶ 27, 34, 70-73; Figs 3, 6, etc.: system predicts or adjusts resource allocation based on a current prediction, a first or current trial run; by utilizing error metrics to generate an augmented prediction and thereby iteratively generate allocation policies);
identifying a try-run having a highest improvement metric (Di: ¶ 27, 34, 70-73: the system adapts allocation by refining allocations with respect to dynamic metrics to better satisfy service level requirements, determining a preferred allocation by which allocation efficiency is harvested by a service provider); and
responsive to the highest improvement metric with respect to a threshold, updating a resource allocation for the parallelized training environment, wherein the updated resource allocation comprises an improved allocation of computational resources based on the set of training resources (id.: such as by saving, instantiating, etc. a determined allocation efficiency representing the particular allocation of a particular trial run with respect to the metric(s), wherein the threshold is considered satisfaction of a particular service level requirement).
Di strongly suggests but does not explicitly teach a system operative for training, and determining training performance of, a large language model (LLM); determining an integer multiple of a minimum threshold of training resources as discussed supra; thereby initializing one or more try-runs to evaluate vertical scaling (wherein “vertical scaling” refers to a scenario in which the pieces of a model are further split for distribution among computing resources) and initializing one or more try-runs to evaluate horizontal scaling (wherein “horizontal scaling” refers to a scenario in which model pieces are not further split but are instead copied (replicated) for parallel training among current and additional computing resources), such as based on a comparison determining resulting integer multiples of training resources with respect to the plural multiples and an explicit minimum threshold of training resources. Nor does Di explicitly discuss applying an improvement metric of an identified best try-run to a predetermined threshold such that, responsive to the highest improvement metric being greater than the predetermined threshold, a training pattern for the parallelized training environment is updated, wherein the updated training pattern comprises one or both of a vertical scaling and a horizontal scaling of the set of training resources.
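For clarity of the record only, the distinction between the recited “vertical scaling” and “horizontal scaling” may be illustrated by the following Examiner-provided sketch; all identifiers are hypothetical and are not drawn from the claims or from any cited reference:

```python
# Hypothetical illustration of the two recited scaling modes.
# "Vertical scaling": each existing model piece is split further among resources.
# "Horizontal scaling": pieces are not split further but are replicated onto
# additional resources for parallel training.

def vertical_scale(pieces, factor):
    """Split each piece into `factor` finer sub-pieces (finer partitioning)."""
    return [f"{p}.{i}" for p in pieces for i in range(factor)]

def horizontal_scale(pieces, replicas):
    """Replicate the unchanged set of pieces across `replicas` resource groups."""
    return [list(pieces) for _ in range(replicas)]

pieces = ["layers0-3", "layers4-7"]
print(vertical_scale(pieces, 2))    # 4 finer pieces
print(horizontal_scale(pieces, 2))  # 2 copies of the same 2 pieces
```

The sketch is offered solely to fix the meaning of the two terms as used in the rejection above; it does not represent any disclosed embodiment.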
In a related field of endeavor, Nar teaches a system and method for training a large neural model within a parallelized training environment comprising a set of training resources, the model split into a plurality of pieces, each training resource of the set of training resources training over one piece of the plurality of pieces (Nar: Abstract; § 1, 2.1: the system trains a DNN, parallelizing the training by using a “split over the available workers” to practice Intra-batch Parallelism and/or Hybrid Intra-batch Parallelism upon a model decomposed into plural layers or stages by automatically partitioning DNN layers among workers; as a DNN addressed to language modelling is considered a large model, Nar can be considered to discuss a large language model, if not necessarily a generative one);
collecting baseline metrics for the parallelized training environment, the baseline metrics comprising at least a time cost of a current training run per a number of cycles for the current training run (Nar: § 3, 3.1, 3.2: system performs profiling to collect timing and size metrics per layer for a profiled training run thereby determining total computation time per GPU for a particular layer(s) of the model to thereby optimize a “balanced pipeline,” or best case load balance);
determining whether a current number of available training resources is an integer multiple of a minimum threshold of training resources (Nar: Abstract; § 1: GPUs, workers, etc. comprise discrete resources, and therefore the recited “multiple workers… assigned to a given stage” comprises integer multiples of workers to provide parallelism for a particular stage, layer, group of layers, etc.);
responsive to the current number of available training resources being an integer multiple of the minimum threshold of training resources (Nar: § 3.3, Memory Overhead: such as a one input pipeline where “a model is divided across n workers, with each worker holding 1/n of the weights,” and/or an integer multiple thereof, such as an n input pipeline in which a model, or portions thereof, are processed in parallel):
initializing one or more try-runs to evaluate vertical scaling (Nar: § 1, 5.2, 3.3, 5.5: the system divides models among workers to thereby determine how to partition the model, or stages thereof, such as upon a GPU, such that portions, layers, etc. of the model are divided across n workers); and initializing one or more try-runs to evaluate horizontal scaling (Nar: § 1, 5.2, 3.3, 5.5: the system divides models among workers but additionally parallelizes the model, or portions thereof, onto a plurality of worker inputs, n inputs, where a plurality of workers receive the parallel data);
responsive to the current number of available training resources not being an integer multiple of the minimum threshold of training resources, initializing one or more try-runs that only evaluate vertical scaling (Nar: § 3.3, Memory Overhead: such as an implementation limited to a one input pipeline (integer multiple = 1) where “a model is divided across n workers, with each worker holding 1/n of the weights”);
identifying a try-run having a highest improvement metric (Nar: § 1, 3.1: the system “determines how to partition the operators of the DNN based on a short profiling run performed on a single GPU,” thereby determining three quantities “using a short (few minutes) profiling run” to collect timing and size metrics to choose an optimized or balanced pipeline by optimizing partitioning, scheduling, etc., and in this way optimizes sub-problems such that it minimizes the time taken by the slowest of the pipeline stages); and
responsive to the highest improvement metric being greater than a dynamic threshold, updating a training pattern for the parallelized training environment, wherein the updated training pattern comprises one or both of a vertical scaling and a horizontal scaling of the set of training resources (Nar: § 1, 3.1: the system converges upon an optimal solution over vertical and/or horizontal scaling, wherein previous values for such scaling serve as thresholds over which the model proceeds toward convergence).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the instant application to integrate the metrics and service level thresholds of Di into the variably assigned processing, clusters, GPUs, etc. of Nar for at least the purpose of optimizing resource efficiency for LLM training jobs as taught or suggested by Nar in concert with the Di taught devices and method; one of ordinary skill in the art would have expected only predictable results therefrom.
Di in view of Nar does not explicitly teach a system and method operable to determine a scalar-type “improvement metric” value and compare same with a predetermined threshold for the purpose of iterating try-runs under differing scaling parameters of training, measuring the improvement metric, and updating when the metric exceeds the threshold.
In a related field of endeavor, Qia teaches a system for determining a “goodput” value representative of system throughput with respect to system efficiency (Qia: Abstract) based on mapping configurations of a learning model (Qia: § 1: “We show that a model of a DL job’s goodput can be learned by observing its throughput and statistical behavior during training, and used for predicting the performance given different resource allocations and batch sizes”); by reassigning resources to improve cluster-wide “goodput” (Qia: Abstract), said improvement operative with respect to identifying a try-run having a highest improvement metric (Qia: § 1: the system operates by “predicting the performance given different resource allocations and batch sizes”; a highest “goodput” is a try-run with the best improvement metric); and, responsive to the highest improvement metric being greater than a predetermined threshold, updating a training pattern for the parallelized training environment, such as by updating based on an improved goodput model (Qia: Abstract; § 1, 3: the system optimizes job parameters and cluster parameters, dynamically re-assigning resources and tuning training parameters based thereon to jointly manage system level parameters in concert with training level parameters to arrive at job-based training patterns configuring one or more GPUs using elastic schedulers, variable resources, and variable batch sizes). It would have been obvious to one of ordinary skill in the art before the effective filing date of the instant application to optimize the training of an LLM based on the Di in view of Nar taught horizontal and vertical scaled processing of a plurality of model pieces across a set of training resources, such as by employing the Qia taught system and method, for maximizing such resources with respect to a goodput measure, wherein the measure operates or serves as a predetermined threshold used to generate a performance metric against which possible future configuration changes or improvements are measured, and for at least the purpose of increasing overall efficiency of training, or the efficiency of portions of training; one of ordinary skill in the art would have expected only predictable results therefrom.
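For clarity of the record only, the try-run selection logic as mapped above may be summarized in the following Examiner-provided sketch; the function names, values, and thresholds are hypothetical illustrations and are not drawn from the claims or from the cited references:

```python
# Hypothetical illustration of the recited try-run logic: when the available
# resource count is an integer multiple of the minimum threshold, both scaling
# modes are evaluated; otherwise only vertical scaling is evaluated. The
# try-run with the highest improvement metric updates the training pattern
# only if that metric exceeds a predetermined threshold.

def plan_try_runs(available, min_threshold):
    """Choose which scaling modes to try for the given resource count."""
    if available % min_threshold == 0:
        return ["vertical", "horizontal"]
    return ["vertical"]

def select_update(improvements, threshold):
    """Return the best try-run's pattern, or None when no improvement
    metric clears the predetermined threshold."""
    best = max(improvements, key=improvements.get)
    return best if improvements[best] > threshold else None

print(plan_try_runs(available=8, min_threshold=4))   # both modes tried
print(select_update({"vertical": 0.05, "horizontal": 0.12}, threshold=0.10))
```

Again, the sketch serves only to fix the construction applied in the rejection; it does not represent any disclosed embodiment of the instant application or of the cited art.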
Regarding claim 2
Di in view of Nar in view of Qia teaches or suggests:
The computer-implemented method of claim 1, wherein the training of the large language model initializes in response to the current number of training resources being at least equal to the minimum threshold of training resources (Di: Abstract: a prediction of necessary resources to satisfy a service level agreement is considered a minimum level of training resources); (Nar: § 1, 3, 3.1, 3.2: the model is partitioned and a “minibatch” profiling run is initialized and executed); (Qia: § 1: the system schedules a model to configure and execute in the presence of a sufficient combination of resources with respect to parameters for pending deep learning jobs). Examiner takes official notice that initializing training by determining sufficient resources with respect to a minimum threshold was well known in the art before the effective filing date of the instant invention and would have comprised an obvious inclusion for at least the purpose of initializing and executing a profiling run; one of ordinary skill in the art would have expected only predictable results therefrom.
Regarding claim 3
Di in view of Nar in view of Qia teaches or suggests:
The computer-implemented method of claim 1, further comprising determining a maximum threshold of training resources for training the large language model (Di: ¶ 78: a resource allocation is calculated which “substantially maximizes an expected utility”); (Nar: § 1: resources above a maximum, such as more GPUs, do not reduce iteration time: “A larger batch size enables higher utilization of more compute resources (e.g., more GPUs). But, even with an optimally-retuned learning rate, increasing the batch size often results in a decreased statistical efficiency”); (Qia: § 1: “goodput” is determined with respect to diminishing returns, accommodated by adapting when adding resources no longer increases the “goodput” efficiency measure). The claim is thus considered obvious over Di as modified by Nar and Qia as addressed in the base claim, as it would have been obvious to apply the further teaching of Di, Nar, and/or Qia to the modified device of Di, Nar, and Qia; one of ordinary skill in the art would have expected only predictable results therefrom.
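For clarity of the record only, the diminishing-returns reading applied above may be illustrated by the following Examiner-provided sketch, in which a maximum threshold of resources is the point past which adding resources yields no efficiency gain; all names and values are hypothetical:

```python
# Hypothetical illustration: determine a maximum useful resource count as the
# point where adding resources no longer improves an efficiency measure
# (cf. the diminishing returns discussed with respect to Nar and Qia).

def max_useful_resources(efficiency_by_count, min_gain=0.0):
    """Scan increasing resource counts; stop before the first count whose
    efficiency gain over the previous count does not exceed `min_gain`."""
    counts = sorted(efficiency_by_count)
    best = counts[0]
    for prev, curr in zip(counts, counts[1:]):
        if efficiency_by_count[curr] - efficiency_by_count[prev] <= min_gain:
            break
        best = curr
    return best

# Efficiency plateaus after 4 resources, so 8 resources add nothing.
measured = {1: 100.0, 2: 180.0, 4: 230.0, 8: 230.0}
print(max_useful_resources(measured))  # 4
```

The sketch reflects only the construction applied in the rejection of this claim, not any disclosed embodiment.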
Regarding claim 4
Di in view of Nar in view of Qia teaches or suggests:
The computer-implemented method of claim 1, wherein the baseline metrics comprise a first training efficiency metric when using a first number of training resources (Di: Abstract: such as with respect to satisfying a particular service level with a particular set of resources); (Nar: § 3, 3.1, 3.2: profiling runs determine performance measures under particular configurations of vertical and horizontal resource scaling); (Qia: Abstract; § 1, 3: “goodput” comprises an evaluation of training efficiency for a particular configuration, wherein each particular configuration provides a first, nth, etc. training efficiency metric based on the configuration, resources, etc.). The claim is thus considered obvious over Di as modified by Nar and Qia as addressed in the base claim, as it would have been obvious to apply the further teaching of Di, Nar, and/or Qia to the modified device of Di, Nar, and Qia; one of ordinary skill in the art would have expected only predictable results therefrom.
Regarding claim 5
Di in view of Nar in view of Qia teaches or suggests:
The computer-implemented method of claim 4, further comprising determining a second training efficiency metric for at least one try-run, wherein the second training efficiency metric is defined as a ratio between a number of samples used during the respective try-run and a total time of the try-run when using a second number of training resources (Di: Abstract; ¶ 6, 19, 28, etc.: code samples measured against service level agreements with respect to execution time); (Nar: § 5.2: learning rates per layer of executable code normalized as a ratio); (Qia: § 2.1: the system computes a ratio of time spent on execution of a code set or sample averaged across GPUs). The claim is thus considered obvious over Di as modified by Nar and Qia as addressed in the base claim, as it would have been obvious to apply the further teaching of Di, Nar, and/or Qia to the modified device of Di, Nar, and Qia; one of ordinary skill in the art would have expected only predictable results therefrom.
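For clarity of the record only, the recited ratio of claim 5 may be illustrated by the following Examiner-provided sketch; the function name and values are hypothetical:

```python
# Hypothetical illustration of claim 5's recited efficiency metric:
# the ratio of the number of samples used during a try-run to the
# total time of that try-run (i.e., samples per unit time).

def training_efficiency(num_samples, total_time_s):
    """Samples processed per second for a given try-run."""
    return num_samples / total_time_s

# e.g., 12,000 samples in 60 s yields 200 samples/s.
print(training_efficiency(12_000, 60.0))  # 200.0
```

The sketch serves only to fix the construction of the claimed ratio applied in the rejection.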
Regarding claim 7
Di in view of Nar in view of Qia teaches or suggests:
The computer-implemented method of claim 1, further comprising saving a current training state of the parallelized training environment prior to initializing a try-run for evaluating training efficiency metrics (Nar: Abstract; § 3, 3.1, 3.2, 3.3: the system stores weights with respect to the setting for a profiling minibatch try-run to generate a throughput measure for a particular setting). The claim is thus considered obvious over Di as modified by Nar and Qia as addressed in the base claim, as it would have been obvious to apply the further teaching of Di, Nar, and/or Qia to the modified device of Di, Nar, and Qia; one of ordinary skill in the art would have expected only predictable results therefrom.
Regarding claims 8, 15—the claims are considered to recite substantially similar subject matter to that of claim 1 and are similarly rejected.
Regarding claims 9, 16—the claims are considered to recite substantially similar subject matter to that of claim 2 and are similarly rejected.
Regarding claims 10, 17—the claims are considered to recite substantially similar subject matter to that of claim 3 and are similarly rejected.
Regarding claims 11, 18—the claims are considered to recite substantially similar subject matter to that of claim 4 and are similarly rejected.
Regarding claims 12, 19—the claims are considered to recite substantially similar subject matter to that of claim 5 and are similarly rejected.
Regarding claim 14—the claim is considered to recite substantially similar subject matter to that of claim 7 and is similarly rejected.
Allowable Subject Matter
Claims 6, 13, and 30 are objected to as being dependent upon a rejected base claim, but would be allowable if rewritten in independent form including all of the limitations of the base claim and any intervening claims.
Conclusion
Any inquiry concerning this communication or earlier communications from the examiner should be directed to PAUL C MCCORD whose telephone number is (571) 270-3701. The examiner can normally be reached 7:30-6:30 M-F.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, CAROLYN EDWARDS can be reached at (571) 270-7136. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.
/PAUL C MCCORD/Primary Examiner, Art Unit 2692
/CAROLYN R EDWARDS/Supervisory Patent Examiner, Art Unit 2692