Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Status of Claims
This action is responsive to the application filed on 12/11/2025.
Claims 1-20 are pending.
Claims 1-3, 5-13, and 15-20 have been amended.
Response to Arguments
Applicant’s arguments with respect to the rejections of claims 1-20 under 35 U.S.C. 101 have been fully considered and are persuasive. Therefore, the rejections set forth in the previous Office action have been withdrawn.
Applicant’s arguments with respect to the rejections of claims 1, 11, and 19 under 35 U.S.C. 103 have been fully considered but are not persuasive. Applicant argues that no reference teaches the amended limitation of claims 1, 11, and 19, which now recites “wherein adjusting the one or more parameters comprises combining a first gradient determined with respect to the shared plurality of parameters based on the first model unit and a second, different gradient determined with respect to the shared plurality of parameters based on the second model unit”, because Gong does not discuss gradient operations and Deng “does not disclose ‘combining’ gradients”. In view of the breadth of the claim language, the examiner respectfully disagrees.
Deng has been found to teach the amended limitation under the breadth of the claim language. Deng, sections 3-5.1, teaches training the weights of a stacked (shared) weight matrix of deep neural network layers (units), and then performing weight “fine-tuning” per layer (unit) for a certain number of iterations while monitoring the “gradient” between layer weights in order to tune the weights (combining).
See the 35 U.S.C. § 103 section below for the full mapping of the claim limitations necessitated by applicant’s amendments.
Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.
The factual inquiries set forth in Graham v. John Deere Co., 383 U.S. 1, 148 USPQ 459 (1966), that are applied for establishing a background for determining obviousness under 35 U.S.C. 103 are summarized as follows:
1. Determining the scope and contents of the prior art.
2. Ascertaining the differences between the prior art and the claims at issue.
3. Resolving the level of ordinary skill in the pertinent art.
4. Considering objective evidence present in the application indicating obviousness or nonobviousness.
This application currently names joint inventors. In considering patentability of the claims the examiner presumes that the subject matter of the various claims was commonly owned as of the effective filing date of the claimed invention(s) absent any evidence to the contrary. Applicant is advised of the obligation under 37 CFR 1.56 to point out the inventor and effective filing dates of each claim that was not commonly owned as of the effective filing date of the later invention in order for the examiner to consider the applicability of 35 U.S.C. 102(b)(2)(C) for any potential 35 U.S.C. 102(a)(2) prior art against the later invention.
Claims 1-20 are rejected under 35 U.S.C. 103 as being unpatentable over Gong et al. (“Efficient Training of BERT by Progressively Stacking”, 2019), hereinafter Gong, in view of Deng et al. (“Scalable Stacking and Learning for Building Deep Architectures”, 2015), hereinafter Deng.
Regarding claims 1, 11, and 19, Gong teaches a computer-implemented method for reducing computational costs of training a machine-learned model; and a computing system for reducing computational costs of training a machine-learned model, comprising: one or more processors; and one or more tangible, non-transitory computer-readable media storing computer-readable instructions that, when executed by the one or more processors, cause the one or more processors to perform operations (section 4 teaches performing the model training embodiments of the disclosure on a “computation environment” including “GPUs” with executable “code”, known to be included on a computer system with one or more memories), the operations comprising:
performing, by a computing system comprising one or more computing devices, a first plurality of training iterations with the machine-learned model to adjust one or more parameters of a shared plurality of parameters (sections 3 and 4.1-4.2 teach a GPU executing an “iterative training algorithm based on our stacking technique to train a deep BERT faster”, wherein “we first train a 3-layer BERT for 50,000 steps” for tuning “parameters”), wherein the machine-learned model comprises a first model unit comprising a first plurality of parameters tied to the shared plurality of parameters during the first plurality of training iterations and a second model unit comprising a second plurality of parameters tied to the shared plurality of parameters during the first plurality of training iterations (sections 3 and 4.1-4.2 teach the GPU training a BERT model (machine-learned model) wherein “we first train a 3-layer BERT for 50,000 steps” (iterations) for tuning “parameters” to be shared, with training progressing through progressive stacking (tied to the shared plurality of parameters) of model layers (model units) “with parameters” (each of the plurality of model units comprises a plurality of parameters). Section 1 further teaches “Once we have a shallow model, we can stack the shallow model into a deep model by sharing weight between the top self-attention layers and the bottom self-attention layers, and then fine-tune all the parameters.”), and
performing, by the computing system, a second plurality of training iterations with the machine-learned model to adjust one or more parameters of each of the first model unit and second model unit independent of the shared plurality of parameters (sections 3 and 4.1-4.2 teach, via a GPU, “we first train a 3-layer BERT for 50,000 steps (untying condition), stack it twice into a 6-layer BERT and then train this 6-layer BERT for 70,000 steps (second plurality of training iterations)” and so on, wherein the layers include “parameters” being trained according to the newly stacked separate layers (each of the first model unit and second model unit independent of the shared plurality of parameters). Further, “[w]hen fine-tuning models on downstream tasks (alternative second plurality of training iterations), we use the same hyperparameter search space as BERT for each down-stream task. We perform a hyperparameter search on the validation set of each task with our baseline model and apply the resulting hyperparameter to other models. We use a new set of random seeds that is different from the seeds for hyperparameter search to prevent over-fitting”).
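As a purely illustrative aside (not part of the claim mapping, and not code from either reference), the progressive-stacking schedule quoted above — train a shallow model, copy the trained layers on top of themselves, and continue training — can be sketched in a few lines of Python. The `train` and `stack` helpers and the toy weight update are hypothetical stand-ins.

```python
import copy

def train(layers, steps):
    """Toy stand-in for a pre-training phase: nudge each layer's weight."""
    for _ in range(steps):
        for layer in layers:
            layer["w"] += 0.001  # placeholder for a real gradient update
    return layers

def stack(layers):
    """Progressive stacking: copy the trained layers on top of themselves."""
    return layers + copy.deepcopy(layers)

# Mirror the quoted schedule: train a 3-layer model, stack to 6 layers,
# train again, stack to 12 layers, and train once more.
model = [{"w": 0.0} for _ in range(3)]
model = train(model, 50)          # shallow phase ("50,000 steps" in Gong)
model = train(stack(model), 70)   # 6-layer phase ("70,000 steps")
model = train(stack(model), 280)  # 12-layer phase ("280,000 steps")
```

Because stacking copies already-trained layers, every layer of the final 12-layer model starts the last phase with non-zero weights, which is the point of the technique.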
However, Gong does not explicitly teach and wherein adjusting the one or more parameters comprises combining a first gradient determined with respect to the shared plurality of parameters based on the first model unit and a second, different gradient determined with respect to the shared plurality of parameters based on the second model unit.
Deng teaches and wherein adjusting the one or more parameters comprises combining a first gradient determined with respect to the shared plurality of parameters based on the first model unit and a second, different gradient determined with respect to the shared plurality of parameters based on the second model unit (sections 3-5.1 teach training the weights of a stacked (shared) weight matrix of deep neural network layers (units), and then performing weight “fine-tuning” per layer (unit) for a certain number of iterations while monitoring the “gradient” between layer weights for tuning).
Further, Gong at least implies performing, by the computing system, a second plurality of training iterations with the machine-learned model to adjust one or more parameters of each of the first model unit and second model unit independent of the shared plurality of parameters (see the mappings above); however, Deng explicitly teaches performing…a second plurality of training iterations with the machine-learned model to adjust one or more parameters of each of the first model unit and second model unit independent of the shared plurality of parameters (sections 4-5.1 teach training the weights of a stacked weight matrix of deep neural network layers, and then performing weight “fine-tuning” per layer (first model unit and second model unit independent of the shared plurality of parameters) for a certain number of iterations (second plurality of training iterations) while monitoring the “gradient”).
Thus, it would have been obvious to one of ordinary skill in the art before the effective filing date of the invention to implement Deng’s teachings of DNN layer weight stacking in matrices for training and further fine-tuning while monitoring gradient calculations into Gong’s teaching of BERT layer weight stacking through copying trained layers and then fine-tuning weights, in order to increase training efficiency and accuracy through “parallel training on potentially very large data sets” (Deng, sections 4.1-4.2 and 7).
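As a purely illustrative aside, the claimed combining of per-unit gradients with respect to a shared parameter set can be sketched as follows. The linear “model units”, the mean squared-error loss, and the choice of summation as the combining rule are hypothetical, chosen for brevity; they are not drawn from Gong or Deng.

```python
import numpy as np

rng = np.random.default_rng(0)
w_shared = rng.normal(size=(4, 4))   # parameters tied to both model units
x = rng.normal(size=(8, 4))          # toy input batch
y = rng.normal(size=(8, 4))          # toy regression targets

def unit_grad(w, inp, target):
    """Gradient of a mean squared-error loss for one linear model unit,
    taken with respect to the shared weight matrix."""
    pred = inp @ w
    return 2.0 * inp.T @ (pred - target) / len(inp)

# Each unit applies the same tied weights but sees a different input,
# so the two gradients with respect to w_shared differ.
g1 = unit_grad(w_shared, x, y)               # first model unit
g2 = unit_grad(w_shared, x @ w_shared, y)    # second model unit (stacked input)

g_combined = g1 + g2           # combine the per-unit gradients
w_shared -= 0.01 * g_combined  # single update to the shared parameters
```

The sketch shows why the gradients are “different” even though they are taken with respect to the same tied parameters: each unit contributes its own gradient, and a single combined update is applied to the shared weights.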
Regarding claims 2, 12, and 20, the combination of Gong and Deng teach all the claim limitations of claims 1, 11, and 19 above; and further teach wherein: the shared plurality of parameters is a first shared plurality of parameters (Gong, sections 3 and 4.1-4.2 teach training a BERT model (machine-learned model) through progressive stacking (shared plurality of parameters is a first shared) of model layers “with parameters” (plurality of parameters). Section 1 further teaches “Once we have a shallow model, we can stack the shallow model into a deep model by sharing weight between the top self-attention layers and the bottom self-attention layers, and then fine-tune all the parameters.”); the machine-learned model comprises two model groups respectively comprising a first subset of model units and a second subset of model units of a plurality of model units of the machine-learned model, wherein the first subset comprises the first model unit and the second model unit, and wherein the parameters of each of the second subset of model units is tied to a second shared plurality of parameters during the first plurality of training iterations and the second plurality of training iterations (Gong, sections 3 and 4.1-4.2 teach the GPU training a BERT model (machine-learned model) through progressive stacking (tied to a shared plurality of first/second group parameters) of model layers (first/second subset model units) “with parameters” (parameters of each of the first/second subset of model units). Training starts with a compressed “3-layer BERT” (first/second model units) (first plurality of training iterations), and fine-tuning (alternative second plurality of training iterations) updates the layers to be stacked (second shared plurality of parameters) before training a “6-layer BERT for 70,000 steps (second plurality of training iterations)”.); and
performing the second plurality of training iterations comprises adjusting one or more of the second shared plurality of parameters (Gong, sections 3 and 4.1-4.2 teach via a GPU, “we first train a 3-layer BERT for 50,000 steps, stack it twice into a 6-layer BERT and then train this 6-layer BERT for 70,000 steps (second plurality of training iterations). In the final step, we stack the 6-layer BERT into a 12-layer BERT, and train the 12-layer BERT for 280,000 steps”, wherein the layers include “parameters” being trained according to the newly stacked separate layers. Here, it is interpreted that the “6-layer BERT” trains the previously shared parameters (as a “3-layer BERT”) (second plurality of training iterations…adjust…the second shared plurality of parameters) as independent parameters (second plurality of training iterations comprises adjusting one or more of the second shared plurality of parameters), then repeats the stacking process, thus making the 6-layer BERT the shared parameters (alternate second plurality of training iterations…adjust…the shared plurality of parameters) to be trained independently as a “12-layer BERT”.), wherein adjusting one or more of the second shared plurality of parameters comprises combining a third gradient determined with respect to the second shared plurality of parameters based on a third model unit of the second subset with a fourth, different gradient determined with respect to the second shared plurality of parameters based on a fourth model unit of the second subset (Deng, sections 4-5.1 teach repeatedly training a stacked weight matrix of deep neural network layer weights when stacking the network layers (second subset), and then performing weight “fine-tuning” per layer (third model unit and fourth model unit) for a certain number of iterations (second plurality of training iterations) while monitoring the “gradient” (third gradient…fourth, different gradient) between layer weights for tuning (combining)).
Gong and Deng are combinable for the same rationale as set forth above with respect to claims 1, 11, and 19.
Regarding claims 3 and 13, the combination of Gong and Deng teach all the claim limitations of claims 2 and 12 above; and further teach wherein the method further comprises:
performing, by the computing system, a third plurality of training iterations with the machine-learned model to adjust one or more parameters of at least one of the third model unit and the fourth model unit independent of the first shared plurality of parameters and the second shared plurality of parameters (Gong, sections 3 and 4.1-4.2 teach, via a GPU, “In the final step, we stack the 6-layer BERT into a 12-layer BERT, and train the 12-layer BERT for 280,000 steps (performing…a third plurality of training iterations)”, wherein the layers include “parameters” being trained according to the newly stacked separate layers (adjust…parameters…of the third model unit and the fourth model unit independent of the first shared plurality of parameters and the second shared plurality of parameters). Further, “[w]hen fine-tuning models on downstream tasks (alternative third plurality of training iterations…independent), we use the same hyperparameter search space as BERT for each down-stream task. We perform a hyperparameter search on the validation set of each task with our baseline model and apply the resulting hyperparameter to other models. We use a new set of random seeds that is different from the seeds for hyperparameter search to prevent over-fitting”).
Regarding claims 4 and 14, the combination of Gong and Deng teach all the claim limitations of claims 1 and 11 above; and further teach wherein performing, by the computing system, the second plurality of training iterations further adjusts one or more of the shared plurality of parameters (Gong, sections 3 and 4.1-4.2 teach via a GPU, “we first train a 3-layer BERT for 50,000 steps, stack it twice into a 6-layer BERT and then train this 6-layer BERT for 70,000 steps (second plurality of training iterations). In the final step, we stack the 6-layer BERT into a 12-layer BERT, and train the 12-layer BERT for 280,000 steps”, wherein the layers include “parameters” being trained according to the newly stacked separate layers. Here, it is interpreted that the “6-layer BERT” trains the previously shared parameters (as a “3-layer BERT”) independently, then repeats the stacking process, thus making the 6-layer BERT the shared parameters (second plurality of training iterations further adjusts one or more of the shared plurality of parameters) to be trained independently as a “12-layer BERT”.).
Regarding claims 5 and 15, the combination of Gong and Deng teach all the claim limitations of claims 1 and 11 above; and further teach: evaluating, by the computing system, one or more gradient statistics associated with at least one of the first plurality of training iterations (Deng, sections 4-5.1 teach training the weights of a stacked weight matrix of deep neural network layers, and then performing weight “fine-tuning” per layer for a certain number of iterations while monitoring the “gradient”), wherein the first plurality of training iterations is ended based at least in part on the one or more gradient statistics (Deng, sections 3-5.1 teach the stacked weight matrix of deep neural network layer (first model unit and the second model unit) weights being trained until an “optimization problem” is satisfied (ended) while monitoring the computed “gradient” (based at least in part on the one or more gradient statistics) between layer weights).
Gong and Deng are combinable for the same rationale as set forth above with respect to claims 1, 11, and 19.
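As a purely illustrative aside, ending a training phase based on a monitored gradient statistic — here, the gradient norm falling below a threshold — can be sketched as follows. The toy objective, learning rate, and tolerance are hypothetical and are not taken from either reference.

```python
import numpy as np

def grad(w):
    """Toy gradient of f(w) = ||w||^2 / 2, which is simply w."""
    return w

w = np.ones(5)
lr, max_iters, tol = 0.1, 1000, 1e-3
norms = []  # gradient statistic monitored across the phase

for step in range(max_iters):
    g = grad(w)
    norms.append(float(np.linalg.norm(g)))
    if norms[-1] < tol:   # end the phase once the statistic
        break             # falls below the threshold
    w -= lr * g
```

The phase terminates well before the iteration cap because the monitored statistic, not a fixed step count, controls when it ends.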
Regarding claims 6 and 16, the combination of Gong and Deng teach all the claim limitations of claims 1 and 11 above; and further teach wherein the first plurality of training iterations is ended responsive to the first plurality of training iterations exceeding a threshold number of training iterations (Gong, sections 3 and 4.1-4.3 teach via a GPU, “we first train a 3-layer BERT for 50,000 steps (first plurality of training iterations exceeds a threshold number of training iterations), stack it twice into a 6-layer BERT and then train this 6-layer BERT for 70,000 steps”, wherein the layers include “parameters” and the iteration times are tracked to be greater than the “threshold”).
Regarding claims 7 and 17, the combination of Gong and Deng teach all the claim limitations of claims 1 and 11 above; and further teach wherein the first model unit is adjacent to the second model unit (Gong, sections 3 and 4.1-4.2 teach connected (adjacent) 3 layer BERT model layers (units)).
Regarding claims 8 and 18, the combination of Gong and Deng teach all the claim limitations of claims 7 and 17 above; and further teach wherein the first plurality of training iterations is ended based at least in part on a correlation between gradients of at least the first model unit and the second model unit (Deng, sections 3-5.1 teach the stacked weight matrix of deep neural network layer (first model unit and the second model unit) weights being trained until an “optimization problem” is satisfied (untying condition) while monitoring the computed “gradient” (ended based at least in part on a correlation between gradients) between layer weights (first model unit and the second model unit)).
Gong and Deng are combinable for the same rationale as set forth above with respect to claims 1, 11, and 19.
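As a purely illustrative aside, a gradient-correlation signal of the kind recited in claims 8 and 18 can be sketched as a Pearson correlation between two units’ flattened gradients. The synthetic gradients and the 0.9 threshold are hypothetical; neither reference specifies this particular statistic.

```python
import numpy as np

def grad_corr(g1, g2):
    """Pearson correlation between two flattened gradient tensors."""
    a, b = g1.ravel() - g1.mean(), g2.ravel() - g2.mean()
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

rng = np.random.default_rng(1)
g_unit1 = rng.normal(size=(3, 3))                  # synthetic gradient, unit 1
g_unit2 = g_unit1 + 0.1 * rng.normal(size=(3, 3))  # nearly aligned gradient, unit 2

corr = grad_corr(g_unit1, g_unit2)
# A phase-ending rule might compare this statistic to a threshold,
# e.g. end the tied-training phase once corr exceeds 0.9.
untie = corr > 0.9
```

High correlation between the tied units’ gradients suggests the shared phase is contributing redundant updates, which is one rationale for ending it at that point.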
Regarding claim 9, the combination of Gong and Deng teach all the claim limitations of claim 1 above; and further teach wherein the first model unit and the second model unit share a model unit architecture (Gong, sections 3 and 4.1-4.2 teach connected BERT model layers (first model unit and the second model unit share a model unit architecture)).
Regarding claim 10, the combination of Gong and Deng teach all the claim limitations of claim 9 above; and further teach wherein the model unit architecture comprises a sequence of model layers (Gong, sections 3 and 4.1-4.2 teach connected BERT model layers (sequence of model layers)).
Conclusion
THIS ACTION IS MADE FINAL. Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a).
A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action. In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any extension fee pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action. In no event, however, will the statutory period for reply expire later than SIX MONTHS from the mailing date of this final action.
Any inquiry concerning this communication or earlier communications from the examiner should be directed to CLINT MULLINAX whose telephone number is 571-272-3241. The examiner can normally be reached on Mon - Fri 8:00-4:30 PT.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Alexey Shmatov can be reached on 571-270-3428. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of an application may be obtained from the Patent Application Information Retrieval (PAIR) system. Status information for published applications may be obtained from either Private PAIR or Public PAIR. Status information for unpublished applications is available through Private PAIR only. For more information about the PAIR system, see http://pair-direct.uspto.gov. Should you have questions on access to the Private PAIR system, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative or access to the automated information system, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.
/C.M./Examiner, Art Unit 2123
/ALEXEY SHMATOV/Supervisory Patent Examiner, Art Unit 2123