DETAILED ACTION
This office action is in response to amendments filed on 01/07/2026.
Claims 1-2 and 6-12 have been amended. Claims 1-15 are pending.
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Response to Arguments
Objections to the Specification:
In light of applicant’s amendment to the title (pg. 2), the objection to the specification has been withdrawn.
Rejections Under 35 U.S.C. 112:
In light of applicant’s amendments to the claims (pg. 3-8), claims 6-10 are no longer interpreted under 35 U.S.C. 112(f), and therefore the resulting rejections under 35 U.S.C. 112(a) and 35 U.S.C. 112(b) have been withdrawn.
Prior Art Rejections:
Applicant's arguments regarding the prior art rejections (pg. 11-12) have been fully considered but they are not persuasive.
Applicant argues that the amended independent claim limitation “the number of the second weight parameters for each second layer is a multiple of the number of the first weight parameters for each first layer” is not taught by any of the cited references. Applicant specifically argues that while Gong teaches copying parameters from a trained L-layer BERT model (i.e. first model) to a 2L-layer BERT model (i.e. second model), the number of parameters in each layer of Gong’s second model is the same as the number of parameters in each layer of the first model, not a multiple thereof. However, examiner respectfully notes that any integer is a multiple of itself (e.g. 10 is a multiple of 10 because 10 × 1 = 10), and therefore, the number of parameters in each layer being equal between the first and second models falls within the broadest reasonable interpretation of “the number of the second weight parameters for each second layer is a multiple of the number of the first weight parameters for each first layer”.
The prior art rejections have been updated to include the amended limitations and to clarify the reasoning given for the limitations that were not amended.
Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.
Claims 1, 2, 5-8, 11, 12, and 15 are rejected under 35 U.S.C. 103 as being unpatentable over
Yang et al. (hereinafter Yang), “Speeding up Deep Model Training by Sharing Weights and Then Unsharing” in view of
Gong et al. (hereinafter Gong), “Efficient Training of BERT by Progressively Stacking”.
Regarding Claim 1,
Yang teaches A method for model training, comprising:
training a first model to obtain a parameter set of the trained first model, wherein a plurality of first layers in the first model share same first weight parameters; (Pg. 1, section 1: “Correspondingly, our training method consists of two phases: In the first phase, a neural network is trained with weights shared across its repeated layers, to learn the commonly shared component across weights of different layers;” The neural network trained with weight sharing is the first model.)
training the second model to realize model convergence, wherein the first model and the second model have a same computation graph, and the number of the plurality of second layers is equal to or greater than the number of the plurality of first layers. (Pg. 1, section 1: “In the second phase, the weight-sharing constraint is released, and the network is trained till convergence, to learn a different weight for each layer based on the shared component.” The neural network further trained without weight sharing is the second model. Since the second model is obtained through additional training of the first model without any change to its architecture, the first and second models will necessarily have the same computation graph and number of layers.)
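For illustration only, the two-phase procedure Yang describes (training with weights shared across repeated layers, then releasing the constraint and training each layer independently until convergence) can be sketched as follows. This is a minimal sketch; the function and variable names are hypothetical and are not taken from Yang's implementation.

```python
import copy

def train_share_then_unshare(shared_layer, num_layers, train_step,
                             phase1_steps, phase2_steps):
    """Sketch of Yang's two-phase training.

    Phase 1: one parameter set is reused for every repeated layer
    (weight sharing), so any update to one layer updates them all.
    Phase 2: the weight-sharing constraint is released by giving each
    layer its own copy of the shared parameters, and training continues.
    """
    # Phase 1: the first model -- every layer aliases the same parameters.
    first_model = [shared_layer] * num_layers
    for _ in range(phase1_steps):
        train_step(first_model)
    # Phase 2: the second model -- same computation graph and layer count,
    # but each layer now owns an independent copy of the learned weights.
    second_model = [copy.deepcopy(shared_layer) for _ in range(num_layers)]
    for _ in range(phase2_steps):
        train_step(second_model)
    return second_model
```

Because the second model is built from independent copies of the phase-1 parameters without changing the architecture, the layer count and computation graph are unchanged, consistent with the mapping above.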
Yang does not appear to explicitly disclose copying the parameter set for a plurality of times to obtain second weight parameters of a plurality of second layers of a second model, wherein the number of the second weight parameters for each second layer is a multiple of the number of the first weight parameters for each first layer;
However, Gong teaches copying the parameter set for a plurality of times to obtain second weight parameters of a plurality of second layers of a second model, wherein the number of the second weight parameters for each second layer is a multiple of the number of the first weight parameters for each first layer; (Pg. 3, section 3: “As is shown in Figure 3, if we have a L-layer trained BERT, we can construct a 2L-layer BERT by copying its parameters: for i ≤ L, the i-th layer and the (i + L)-th layer of the constructed BERT have the same parameter of the i-th layer of the trained BERT.” Since parameters are copied by layer from the L-layer model (i.e. first model) to the 2L-layer model (i.e. second model), the number of parameters in each layer of the second model is the same as the number of parameters in each layer of the first model (i.e. the number of parameters in each second layer is a multiple of the number of parameters in each first layer, since any integer is a multiple of itself).)
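For illustration only, Gong's layer-copying construction quoted above can be sketched as follows, assuming each layer's parameters are held in a Python list. The function name is hypothetical and is not taken from Gong's implementation.

```python
def stack_copy(first_model_layers):
    """Sketch of Gong's progressive stacking: construct a 2L-layer model
    from an L-layer trained model, where the i-th layer and the (i+L)-th
    layer of the constructed model both receive the parameters of the
    i-th layer of the trained model."""
    L = len(first_model_layers)
    # Index i % L maps layer i and layer i + L to the same source layer.
    return [first_model_layers[i % L] for i in range(2 * L)]

# L = 2 layers, 2 parameters each; the constructed model has 2L = 4 layers,
# and each second-model layer has the same parameter count as the
# corresponding first-model layer (i.e. a x1 multiple of it).
first = [[0.1, 0.2], [0.3, 0.4]]
second = stack_copy(first)
```

Each second-model layer's parameter count equals that of its source first-model layer, which is the equal-count case the examiner maps to the "multiple" limitation.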
It would have been obvious to one of ordinary skill in the art before the effective filing date of the present application to combine Yang and Gong. Yang teaches efficient training of a BERT (Bidirectional Encoder Representation from Transformers) model by sharing weights between layers for pre-training and then unsharing weights for fine-tuning. Gong teaches efficient training of a BERT model by training a shallow model, copying its weights to initialize a deeper model, and then fine-tuning. One of ordinary skill would have motivation to combine Yang and Gong because, according to Gong, “the training time will be greatly reduced if we train deep models by stacking from a shallow one” (Gong, Pg. 4, section 3). “According to our results, we find first during pre-training, our proposed method is about 25% faster than several baselines to achieve the same validation accuracy. Second, our final model is competitive and even better than the baseline model on several downstream tasks” (Gong, pg. 2, section 1).
Regarding Claim 2, Yang and Gong teach The method of claim 1, as shown above.
Yang also teaches further comprising:
before training the first model, designating the plurality of first layers to share the first weight parameters in a model to be trained to obtain the first model; and (Pg. 1, section 1: “In the first phase, a neural network is trained with weights shared across its repeated layers, to learn the commonly shared component across weights of different layers;”)
before copying the parameter set for the plurality of times, designating the plurality of first layers not to share the first weight parameters in the first model to obtain the second model. (Pg. 1, section 1: “In the second phase, the weight-sharing constraint is released, and the network is trained till convergence, to learn a different weight for each layer based on the shared component.”)
Regarding Claim 5, Yang and Gong teach The method of claim 1, as shown above.
Yang also teaches wherein the computation graph has a transformer or Bidirectional Encoder Representations from Transformers (BERT) structure. (Pg. 1, section 1: “In particular, we are interested in speeding up the training of deep networks which are constructed by repeatedly stacking the same layer, with a special focus on the BERT model.”)
Claims 6-8 are product claims, containing substantially the same elements as method claims 1, 2, and 5, respectively. Yang and Gong teach the elements of claims 1, 2, and 5, as shown above.
Gong also teaches A non-transitory computer-readable medium storing a set of instructions that is executable by one or more processors and a plurality of model acceleration units of a server to cause the server to perform a method for model training (Pg. 5, section 4.1: “To fairly compare the speed of different algorithms, we train all models in the same computation environment with 4 NVIDIA Tesla P40 GPUs.” According to paragraph 0032 of the instant application, “Model acceleration units include various hardware execution units produced by different companies and dedicated to specific neural network models…” A computation environment necessarily includes a processor, and NVIDIA Tesla P40 GPUs are model acceleration units.)
Claims 11, 12, and 15 are system claims, containing substantially the same elements as method claims 1, 2, and 5, respectively. Yang and Gong teach the elements of claims 1, 2, and 5, as shown above.
Gong also teaches A server, comprising: one or more processors, and a plurality of model acceleration units, wherein the one or more processors and the plurality of model acceleration units are configured to execute a set of instructions (Pg. 5, section 4.1: “To fairly compare the speed of different algorithms, we train all models in the same computation environment with 4 NVIDIA Tesla P40 GPUs.” According to paragraph 0032 of the instant application, “Model acceleration units include various hardware execution units produced by different companies and dedicated to specific neural network models…” A computation environment necessarily includes a processor, and NVIDIA Tesla P40 GPUs are model acceleration units.)
Claims 3, 9, and 13 are rejected under 35 U.S.C. 103 as being unpatentable over Yang in view of Gong and further in view of
Lu et al. (hereinafter Lu), China Patent CN-113033801-A.
Regarding Claim 3, Yang and Gong teach The method of claim 1, as shown above.
Yang and Gong do not appear to explicitly disclose further comprising: after training the first model and before copying the parameter set for the plurality of times, determining whether an error between a result and an expected result of the first model satisfies a set condition; and copying the parameter set for the plurality of times in response to the set condition being satisfied.
However, Lu teaches further comprising: after training the first model and before copying the parameter set for the plurality of times, determining whether an error between a result and an expected result of the first model satisfies a set condition; and copying the parameter set for the plurality of times in response to the set condition being satisfied. ([0005-0009]: “According to one aspect of the present application, a pre-training method for a neural network model is provided, comprising: Get pre-training data; Inputting the pre-training data into an initial neural network model, and pre-training the initial neural network model in a first training method, wherein the multiple hidden layers in the first training method share one hidden layer parameter; Obtaining a loss value of the initial neural network model; If the loss value of the initial neural network model is less than a preset threshold, the initial neural network model is pre-trained in a second training method, wherein each of the multiple hidden layers in the second training method has a hidden layer parameter.” A loss value is a measure of error between a result and expected result. Following the first training method (i.e. training the first model), the loss value (i.e. error) is obtained, and if it is less than a threshold (i.e. if it satisfies a set condition), the second training method is performed (i.e. parameters are copied and the unshared weights are trained, as taught by Yang and Gong).)
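For illustration only, the loss-threshold gating Lu describes (obtaining the loss of the initial model and switching to the second training method only once the loss falls below a preset threshold) can be sketched as follows. This is a minimal sketch; the function names are hypothetical and are not taken from Lu.

```python
def pretrain_with_switch(compute_loss, phase1_step, phase2_step,
                         threshold, max_steps):
    """Sketch of Lu's gated two-phase pre-training: run the first
    (shared-parameter) training method, checking the loss value after
    each step; once the loss is below the preset threshold (the "set
    condition" on the error), proceed to the second (unshared) method."""
    for _ in range(max_steps):
        phase1_step()                 # shared-parameter training step
        if compute_loss() < threshold:
            break                     # set condition satisfied
    phase2_step()                     # copy parameters, train unshared weights
```

The condition check mirrors the claim mapping above: the loss is the error between the model's result and the expected result, and the copying step runs in response to the threshold condition being satisfied.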
It would have been obvious to one of ordinary skill in the art before the effective filing date of the present application to combine Yang, Gong, and Lu. Yang teaches efficient training of a BERT (Bidirectional Encoder Representation from Transformers) model by sharing weights between layers for pre-training and then unsharing weights for fine-tuning. Gong teaches efficient training of a BERT model by training a shallow model, copying its weights to initialize a deeper model, and then fine-tuning. Lu teaches training a neural network by sharing parameters during pre-training until a loss threshold is met, and then fine-tuning without parameter sharing. One of ordinary skill would have motivation to combine Yang, Gong, and Lu, “In order to further improve the convergence effect of the model while increasing the number of model parameters” (Lu, para. 0042).
Claim 9 is a product claim, containing substantially the same elements as method claim 3. Yang, Gong, and Lu teach the elements of claim 3, as shown above.
Claim 13 is a system claim, containing substantially the same elements as method claim 3. Yang, Gong, and Lu teach the elements of claim 3, as shown above.
Claims 4, 10, and 14 are rejected under 35 U.S.C. 103 as being unpatentable over Yang in view of Gong and further in view of
Ren et al. (hereinafter Ren), “ZeRO-Offload: Democratizing Billion-Scale Model Training”.
Regarding Claim 4, Yang and Gong teach The method of claim 1, as shown above.
Yang and Gong do not appear to explicitly disclose wherein training the first model and training the second model are performed on a server comprising a central processing unit and a plurality of graphics processing units, and a CPU offload mode is adopted for training the second model.
However, Ren teaches wherein training the first model and training the second model are performed on a server comprising a central processing unit and a plurality of graphics processing units, and a CPU offload mode is adopted for training the second model. (Pg. 1, Abstract: “ZeRO-Offload enables large model training by offloading data and compute to CPU. To preserve compute efficiency, it is designed to minimize the data movement to/from GPU, and reduce CPU compute time while maximizing memory savings on GPU. As a result, ZeRO-Offload can achieve 40 TFlops/GPU on a single NVIDIA V100 GPU for 10B parameter model compared to 30TF using PyTorch alone for a 1.4B parameter model, the largest that can be trained without running out of memory. ZeRO-Offload is also designed to scale on multiple-GPUs when available, offering near linear speedup on up to 128 GPUs.”)
It would have been obvious to one of ordinary skill in the art before the effective filing date of the present application to combine Yang, Gong, and Ren. Yang teaches efficient training of a BERT (Bidirectional Encoder Representation from Transformers) model by sharing weights between layers for pre-training and then unsharing weights for fine-tuning. Gong teaches efficient training of a BERT model by training a shallow model, copying its weights to initialize a deeper model, and then fine-tuning. Ren teaches a CPU offloading strategy to facilitate efficient large model training. One of ordinary skill would have motivation to combine Yang, Gong, and Ren because the second model produced by the unsharing and copying steps of Yang and Gong has a large number of parameters, and “ZeRO-Offload provides an optimal and the only optimal solution in maximizing memory saving while minimizing communication overhead and CPU compute overhead for large model training” (Ren, pg. 2, section 1).
Claim 10 is a product claim, containing substantially the same elements as method claim 4. Yang, Gong, and Ren teach the elements of claim 4, as shown above.
Claim 14 is a system claim, containing substantially the same elements as method claim 4. Yang, Gong, and Ren teach the elements of claim 4, as shown above.
Conclusion
Claims 1-15 are rejected.
THIS ACTION IS MADE FINAL. Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a).
A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action. In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any nonprovisional extension fee (37 CFR 1.17(a)) pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action. In no event, however, will the statutory period for reply expire later than SIX MONTHS from the mailing date of this final action.
Any inquiry concerning this communication or earlier communications from the examiner should be directed to BENJAMIN M ROHD whose telephone number is (571)272-6445. The examiner can normally be reached Mon-Thurs 8:00-6:00 EST.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Viker Lamardo can be reached at (571) 270-5871. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.
/B.M.R./Examiner, Art Unit 2147
/VIKER A LAMARDO/Supervisory Patent Examiner, Art Unit 2147