Notice of Pre-AIA or AIA Status
1. The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Claim Rejections - 35 USC § 103
2. In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis (i.e., changing from AIA to pre-AIA) for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.
The factual inquiries for establishing a background for determining obviousness under 35 U.S.C. 103 are summarized as follows:
1. Determining the scope and contents of the prior art.
2. Ascertaining the differences between the prior art and the claims at issue.
3. Resolving the level of ordinary skill in the pertinent art.
4. Considering objective evidence present in the application indicating obviousness or nonobviousness.
3. Claims 1-4, 7-11 and 14-18 are rejected under 35 U.S.C. 103 as being unpatentable over Lin et al., “Learning Language Specific Sub-network for Multilingual Machine Translation,” herein Lin, in view of Zhu et al., “To prune, or not to prune: exploring the efficacy of pruning for model compression,” herein Zhu.
Regarding Claim 1:
Lin discloses a method comprising:
determining a pruning mask for weights of a multilingual machine translation model based on a first pruning threshold, wherein the pruning mask includes at least one entry set to zero and at least one entry set to one (Lin: Section 3.2 discloses that a sub-network is indicated by a binary mask vector);
training the multilingual machine translation model, while applying the pruning mask to the multilingual machine translation model, for translation between a language pair based on training examples from a bilingual translation corpus (Lin: Sections 3.1-3.3 disclose that the masks applied to the network either retain a weight (entry 1) or abandon it (entry 0) depending on which specific language pair is being used; fine-tuning the base multilingual model on the specific language pairs, which makes important weights larger in magnitude, is also performed when the masks are produced);
training the multilingual machine translation model, while applying the updated pruning mask to the multilingual machine translation model (Lin: Section 3.1 discloses defining a multilingual data set where si and ti represent the source and target language, respectively; Section 3.3 further discloses that training is governed by the language-pair mask, i.e., training occurs while applying the mask in the sense that updates are restricted to the masked subnetwork; this is the second batch of training, which occurs after the masks have been produced).
Lin does not explicitly disclose:
updating the pruning mask based on a second pruning threshold;
or training the multilingual machine translation model, while applying the updated pruning mask to the multilingual machine translation model.
However, Zhu discloses:
updating the pruning mask based on a second pruning threshold (Zhu: Section 3 discloses that the binary weight masks are updated every Δt steps as the network is trained to gradually increase the sparsity; Section 2 additionally discloses a sparsity schedule, which determines the weight thresholds);
training the multilingual machine translation model, while applying the updated pruning mask to the multilingual machine translation model (Zhu: Section 3 discloses that the binary weight masks are updated every Δt steps as the network is trained).
Lin in view of Zhu are combinable because they are from the same field of endeavor, i.e., both disclose methods of training a pruning mask to be applied to a deep neural network. It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to take the multilingual application of Lin and to modify the training steps to include updating the pruning masks every few time steps during the training process. Lin already discloses the benefits of using a pruning mask in the abstract: “These jointly trained models often suffer from performance degradation on rich-resource language pairs,” i.e., a pruning mask will eliminate parameter interference. The motivation for applying updates to these masks later on is clearly laid out in Zhu, Section 1: “pruning away (forcing to zero) the less salient connections (parameters) in the neural network has been shown to reduce the number of nonzero parameters in the model with little to no loss in the final model quality,” i.e., a mask that is continuously updated during the training process can assist the goal of providing a single lightweight deployable model that does not require an excessive number of memory accesses in constrained hardware systems.
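For illustration of the thresholding technique at issue (a minimal sketch, not taken from Lin or Zhu; the function names are hypothetical), a binary pruning mask determined by a magnitude threshold and applied to the weights can be expressed as:

```python
import numpy as np

def make_pruning_mask(weights, threshold):
    """Binary mask: 1 where |weight| exceeds the threshold, else 0."""
    return (np.abs(weights) > threshold).astype(weights.dtype)

def apply_mask(weights, mask):
    """Zero out pruned weights; the product is what participates in training."""
    return weights * mask

w = np.array([0.05, -0.8, 0.3, -0.01])
mask = make_pruning_mask(w, threshold=0.1)  # small-magnitude weights are masked out
masked_w = apply_mask(w, mask)
```

In a training loop, the mask would be reapplied on each step so that updates are restricted to the unmasked subnetwork, consistent with the citation to Lin, Section 3.3 above.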
Regarding Claim 2:
The proposed combination of Lin in view of Zhu further discloses the method of claim 1, comprising:
determining the first pruning threshold based on a first pruning ratio that specifies a proportion of the weights of the multilingual machine translation model to be zeroed (Lin: Section 3.2 discloses that the lowest α percent of weights are pruned; this pruning ratio determines a magnitude threshold); and
determining the second pruning threshold based on a second pruning ratio that specifies a proportion of the weights of the multilingual machine translation model to be zeroed (Zhu: Section 3 discloses changing sparsity over time, which means there are successive thresholds as this sparsity changes: “Sparsity is increased from an initial sparsity value si … to a final sparsity value sf over a span of n pruning steps” and “the binary weight masks are updated every Δt steps”).
The rationale and motivation for combining Lin and Zhu are the same as those set forth in the rejection of claim 1 above.
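As a sketch of the ratio-to-threshold relationship discussed above (illustrative only; `ratio_to_threshold` is a hypothetical helper, not from the references), a pruning ratio can be converted to a magnitude threshold by taking the corresponding quantile of the absolute weight values:

```python
import numpy as np

def ratio_to_threshold(weights, pruning_ratio):
    """Magnitude cutoff such that roughly `pruning_ratio` of the weights
    fall below it and would therefore be zeroed."""
    return np.quantile(np.abs(weights), pruning_ratio)

w = np.array([0.05, -0.8, 0.3, -0.01])
t1 = ratio_to_threshold(w, 0.25)  # first, smaller pruning ratio
t2 = ratio_to_threshold(w, 0.50)  # second, larger pruning ratio
```

A larger pruning ratio yields a larger threshold, so successive ratios produce the successive thresholds described in the citation to Zhu above.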
Regarding Claim 3:
The proposed combination of Lin in view of Zhu further discloses the method of claim 2, wherein the first pruning ratio and the second pruning ratio are interpolated between zero and a target pruning ratio based on a count of iterations of training and updating the pruning mask (Zhu: Section 3 discloses that the mask is updated every Δt steps, with the sparsity interpolated from an initial value si to a target final pruning ratio sf over the span of pruning steps).
The rationale and motivation for combining Lin and Zhu are the same as those set forth in the rejection of claim 1 above.
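The gradual schedule relied upon above can be sketched as follows, assuming the cubic interpolation form described in Zhu, Section 3 (the function name is hypothetical; this is an illustrative restatement, not code from the reference):

```python
def sparsity_at(step, s_i, s_f, t0, n, delta_t):
    """Target sparsity interpolated from s_i to s_f over n pruning steps
    of length delta_t starting at step t0 (cf. Zhu, Section 3)."""
    frac = (step - t0) / (n * delta_t)
    frac = min(max(frac, 0.0), 1.0)  # clamp outside the pruning window
    return s_f + (s_i - s_f) * (1.0 - frac) ** 3

# Sparsity rises from s_i at the start of the schedule to s_f at its end.
start = sparsity_at(0, s_i=0.0, s_f=0.9, t0=0, n=10, delta_t=100)
end = sparsity_at(1000, s_i=0.0, s_f=0.9, t0=0, n=10, delta_t=100)
```

Each intermediate target sparsity implies a corresponding magnitude threshold, which is how the first and second pruning ratios of claim 3 arise from the count of iterations.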
Regarding Claim 4:
The proposed combination of Lin in view of Zhu further discloses the method of claim 1, wherein determining the pruning mask comprises:
comparing magnitudes of the weights of the multilingual machine translation model to the first pruning threshold to determine whether an entry of the pruning mask will be set to one or zero (Lin: discloses that fine-tuning will amplify the magnitude of the important weights; the mask is determined by ranking the weights in the fine-tuned model, pruning the lowest α percent, and setting the remaining parameter indices to 1).
Regarding Claim 7:
The combination of Lin in view of Zhu further discloses the method of claim 1, wherein the language pair is a first language pair, the pruning mask is a first pruning mask, and further comprising:
determining a second pruning mask for the weights of the multilingual machine translation model based on a third pruning threshold, wherein the second pruning mask includes at least one entry set to zero and at least one entry set to one (Zhu: Section 3 discloses a binary mask, which necessarily contains entries of 0 or 1; reaching a sparsity less than 100% necessarily leaves some weights participating (mask = 1) while others are zeroed (mask = 0));
training the multilingual machine translation model, while applying the second pruning mask to the multilingual machine translation model, for translation between a second language pair based on training examples from a second bilingual translation corpus (Zhu: Section 4.3 discloses an English-German bilingual translation corpus providing a language pair; the mask is applied to the weights in forward execution during training);
updating the second pruning mask based on a fourth pruning threshold (Zhu: Section 3 discloses updated sparsity targets dictated by the schedule, i.e., a new target st implies a new cutoff threshold and thus an updated mask); and
training the multilingual machine translation model, while applying the updated second pruning mask to the multilingual machine translation model, for translation between the second language pair based on training examples from the second bilingual translation corpus (Zhu: Section 3 discloses continued training with masks that are updated during training, in the same machine translation setting and dataset).
The rationale and motivation for combining Lin and Zhu are the same as those set forth in the rejection of claim 1 above.
Regarding Claim 8:
Lin discloses a system comprising: a processor, and a memory (Lin: discloses computing training quantities (gradients, loss-based measures, scores, and pruning decisions) for the network weights and performing training iterations, which are operations necessarily performed by one or more processors executing instructions stored in memory), wherein the memory stores instructions executable by the processor to:
determine a pruning mask for weights of a multilingual machine translation model based on a first pruning threshold, wherein the pruning mask includes at least one entry set to zero and at least one entry set to one (Lin: Section 3.2 discloses that a sub-network is indicated by a binary mask vector);
train the multilingual machine translation model, while applying the pruning mask to the multilingual machine translation model, for translation between a language pair based on training examples from a bilingual translation corpus (Lin: Sections 3.1-3.3 disclose that the masks applied to the network either retain a weight (entry 1) or abandon it (entry 0) depending on which specific language pair is being used; fine-tuning the base multilingual model on the specific language pairs, which makes important weights larger in magnitude, is also performed when the masks are produced);
train the multilingual machine translation model, while applying the updated pruning mask to the multilingual machine translation model (Lin: Section 3.1 discloses defining a multilingual data set where si and ti represent the source and target language, respectively; Section 3.3 further discloses that training is governed by the language-pair mask, i.e., training occurs while applying the mask in the sense that updates are restricted to the masked subnetwork; this is the second batch of training, which occurs after the masks have been produced).
Lin does not explicitly disclose:
update the pruning mask based on a second pruning threshold;
or train the multilingual machine translation model, while applying the updated pruning mask to the multilingual machine translation model.
However, Zhu discloses updating the pruning mask based on a second pruning threshold (Zhu: Section 3 discloses that the binary weight masks are updated every Δt steps as the network is trained to gradually increase the sparsity; Section 2 additionally discloses a sparsity schedule, which determines the weight thresholds);
and training the multilingual machine translation model, while applying the updated pruning mask to the multilingual machine translation model (Zhu: Section 3 discloses that the binary weight masks are updated every Δt steps as the network is trained).
Lin in view of Zhu are combinable because they are from the same field of endeavor, i.e., both disclose methods of training a pruning mask to be applied to a deep neural network. It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to take the multilingual application of Lin and to modify the training steps to include updating the pruning masks every few time steps during the training process. Lin already discloses the benefits of using a pruning mask in the abstract: “These jointly trained models often suffer from performance degradation on rich-resource language pairs,” i.e., a pruning mask will eliminate parameter interference. The motivation for applying updates to these masks later on is clearly laid out in Zhu, Section 1: “pruning away (forcing to zero) the less salient connections (parameters) in the neural network has been shown to reduce the number of nonzero parameters in the model with little to no loss in the final model quality,” i.e., a mask that is continuously updated during the training process can assist the goal of providing a single lightweight deployable model that does not require an excessive number of memory accesses in constrained hardware systems.
Regarding Claim 9:
Claim 9 has been analyzed with regards to claim 2 and is rejected for the same reasons of obviousness used above.
Regarding Claim 10:
Claim 10 has been analyzed with regards to claim 3 and is rejected for the same reasons of obviousness used above.
Regarding Claim 11:
Claim 11 has been analyzed with regards to claim 4 and is rejected for the same reasons of obviousness used above.
Regarding Claim 14:
Claim 14 has been analyzed with regards to claim 7 and is rejected for the same reasons of obviousness used above.
Regarding Claim 15:
Lin discloses a non-transitory computer-readable storage medium, comprising executable instructions that, when executed by a processor (Lin: discloses computing training quantities (gradients, loss-based measures, scores, and pruning decisions) for the network weights and performing training iterations, which are operations necessarily performed by one or more processors executing instructions stored in a memory that includes a non-transitory computer-readable medium), facilitate performance of operations, comprising:
determining a pruning mask for weights of a multilingual machine translation model based on a first pruning threshold, wherein the pruning mask includes at least one entry set to zero and at least one entry set to one (Lin: Section 3.2 discloses that a sub-network is indicated by a binary mask vector);
training the multilingual machine translation model, while applying the pruning mask to the multilingual machine translation model, for translation between a language pair based on training examples from a bilingual translation corpus (Lin: Sections 3.1-3.3 disclose that the masks applied to the network either retain a weight (entry 1) or abandon it (entry 0) depending on which specific language pair is being used; fine-tuning the base multilingual model on the specific language pairs, which makes important weights larger in magnitude, is also performed when the masks are produced);
training the multilingual machine translation model, while applying the updated pruning mask to the multilingual machine translation model (Lin: Section 3.1 discloses defining a multilingual data set where si and ti represent the source and target language, respectively; Section 3.3 further discloses that training is governed by the language-pair mask, i.e., training occurs while applying the mask in the sense that updates are restricted to the masked subnetwork; this is the second batch of training, which occurs after the masks have been produced).
Lin does not explicitly disclose:
updating the pruning mask based on a second pruning threshold;
or training the multilingual machine translation model, while applying the updated pruning mask to the multilingual machine translation model.
However, Zhu discloses updating the pruning mask based on a second pruning threshold (Zhu: Section 3 discloses that the binary weight masks are updated every Δt steps as the network is trained to gradually increase the sparsity; Section 2 additionally discloses a sparsity schedule, which determines the weight thresholds);
and training the multilingual machine translation model, while applying the updated pruning mask to the multilingual machine translation model (Zhu: Section 3 discloses that the binary weight masks are updated every Δt steps as the network is trained).
Lin in view of Zhu are combinable because they are from the same field of endeavor, i.e., both disclose methods of training a pruning mask to be applied to a deep neural network. It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to take the multilingual application of Lin and to modify the training steps to include updating the pruning masks every few time steps during the training process. Lin already discloses the benefits of using a pruning mask in the abstract: “These jointly trained models often suffer from performance degradation on rich-resource language pairs,” i.e., a pruning mask will eliminate parameter interference. The motivation for applying updates to these masks later on is clearly laid out in Zhu, Section 1: “pruning away (forcing to zero) the less salient connections (parameters) in the neural network has been shown to reduce the number of nonzero parameters in the model with little to no loss in the final model quality,” i.e., a mask that is continuously updated during the training process can assist the goal of providing a single lightweight deployable model that does not require an excessive number of memory accesses in constrained hardware systems.
Regarding Claim 16:
Claim 16 has been analyzed with regards to claim 2 and is rejected for the same reasons of obviousness used above.
Regarding Claim 17:
Claim 17 has been analyzed with regards to claim 3 and is rejected for the same reasons of obviousness used above.
Regarding Claim 18:
Claim 18 has been analyzed with regards to claim 4 and is rejected for the same reasons of obviousness used above.
4. Claims 5-6, 12-13 and 19-20 are rejected under 35 U.S.C. 103 as being unpatentable over Lin in view of Zhu, and further in view of Lee et al., “SNIP: Single-shot Network Pruning based on Connection Sensitivity,” herein Lee.
Regarding Claim 5:
The proposed combination of Lin in view of Zhu discloses the method of claim 1, but does not explicitly disclose wherein determining the pruning mask comprises:
determining respective scores for the weights of the multilingual machine translation model based on a training gradient for each weight;
and comparing the respective scores of the weights of the multilingual machine translation model to the first pruning threshold to determine whether an entry of the pruning mask will be set to one or zero.
However, Lee discloses:
determining respective scores for the weights of the multilingual machine translation model based on a training gradient for each weight (Lee: Section 4.1, paragraph 4, discloses a score per connection/weight derived from a training gradient; the respective score sj for each weight is based on the derivative with respect to cj, a gradient computed via a forward-backward pass);
and comparing the respective scores of the weights of the multilingual machine translation model to the first pruning threshold to determine whether an entry of the pruning mask will be set to one or zero (Lee: Section 4.1 uses sk as the cutoff value (a threshold derived from the ordered scores), then applies an indicator test to set cj to 1 or 0).
Lin teaches generating and using a pruning mask with a pruning criterion/threshold during training to zero selected weights and continuing training with the pruned/masked model. Zhu teaches a concrete, automated pruning framework that explicitly uses a binary mask applied to the weights during training and determines which weights participate during training. Lee discloses a specific way to assign a per-weight saliency score derived from training loss gradients, introducing a multiplicative connection variable as the basis for retaining or pruning connections. It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to determine scores for the weights using a training gradient for each weight in order to determine the pruning mask. The suggestion/motivation for doing so is stated in the abstract of Lee: “we introduce a saliency criterion based on connection sensitivity that identifies structurally important connections in the network for the given task. This eliminates the need for both pretraining and the complex pruning schedule while making it robust to architecture variations.”
Regarding Claim 6:
The proposed combination of Lin, Zhu and Lee further discloses the method of claim 5, wherein the respective score of a weight of the multilingual machine translation model is determined based on a product of the training gradient for the weight and the weight (Lee: Section 4.1 defines a binary mask variable cj applied multiplicatively to each weight wj via c ⊙ w. Lee then computes a per-connection score gj = ∂L(c ⊙ w; D) / ∂cj and uses this as a saliency score. Since each masked weight is cjwj, the chain rule gives gj = wj · ∂L/∂(cjwj), i.e., the score is determined by the product of each weight and its training gradient).
The rationale for combining Lin, Zhu and Lee is the same as that set forth in the rejection of claim 5 above. The suggestion/motivation for this further modification is that it furthers the ability to assess “the relevance of the retained connections as well as the effect of the network initialization and the dataset on the saliency score” (Lee, Section 1).
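The weight-times-gradient scoring discussed for claim 6 can be sketched as follows (illustrative only; the names are hypothetical, and the gradients are supplied directly rather than computed by backpropagation through an actual loss):

```python
import numpy as np

def connection_scores(weights, grads):
    """Per-weight saliency |wj * dL/d(cj*wj)|, i.e. the magnitude of
    dL/dcj for a multiplicative mask variable cj (cf. Lee, Section 4.1)."""
    return np.abs(weights * grads)

def top_k_mask(scores, k):
    """Keep the k highest-scoring connections (mask 1), prune the rest."""
    cutoff = np.sort(scores)[-k]  # k-th largest score acts as the threshold
    return (scores >= cutoff).astype(float)

w = np.array([0.5, -0.2, 0.9, 0.1])
g = np.array([0.1, -1.0, 0.01, 0.3])  # stand-in training gradients
s = connection_scores(w, g)
mask = top_k_mask(s, k=2)
```

Note that a large weight with a near-zero gradient (here the 0.9 entry) scores low under this criterion, which is the substantive difference from the pure magnitude thresholding mapped to Lin for claim 4.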
Regarding Claim 12:
Claim 12 has been analyzed with regards to claim 5 and is rejected for the same reasons of obviousness used above.
Regarding Claim 13:
Claim 13 has been analyzed with regards to claim 6 and is rejected for the same reasons of obviousness used above.
Regarding Claim 19:
Claim 19 has been analyzed with regards to claim 5 and is rejected for the same reasons of obviousness used above.
Regarding Claim 20:
Claim 20 has been analyzed with regards to claim 6 and is rejected for the same reasons of obviousness used above.
Conclusion
Any inquiry concerning this communication or earlier communications from the examiner should be directed to IAN SCOTT MCLEAN whose telephone number is (703)756-4599. The examiner can normally be reached "Monday - Friday 8:00-5:00 EST, off Every 2nd Friday".
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Hai Phan can be reached at (571) 272-6338. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.
/IAN SCOTT MCLEAN/Examiner, Art Unit 2654
/HAI PHAN/Supervisory Patent Examiner, Art Unit 2654