DETAILED ACTION
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Priority
Applicant's claim for the benefit of prior-filed U.S. Provisional Application No. 63/209,282, filed on June 10, 2021, is acknowledged.
Drawings
The drawings were received on 06/08/2022. These drawings are acceptable.
Information Disclosure Statement
The information disclosure statements (IDS) submitted on 12/07/2022 and on 06/08/2022 have been considered by the examiner.
Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.
Claims 1-20 are rejected under 35 U.S.C. 103 as being unpatentable over Lu et al. (US 20220343175, hereinafter “Lu”) in view of Lai et al. (US 20210182662, hereinafter “Lai”).
Regarding independent claim 1, Lu teaches a method of training a student model, the method comprising: (in [0004] Knowledge Distillation (KD) is a compression technique used to transfer the knowledge of a large trained neural network model (i.e. a neural network model with many learned parameters) to a smaller neural network model (i.e. a neural network model with fewer learned parameters than the large trained neural network model). KD utilizes the generalization ability of the larger trained neural network model (referred to as the “teacher model” or “teacher”) using the inference data output by the larger trained model as “soft targets”, which are used as a supervision signal for training a smaller neural network model (called the “student model” or “student”)…)
providing an input to a teacher model that is larger than the student model, wherein a layer of the teacher model outputs a first output vector; providing the input to the student model, wherein a layer of the student model outputs a second output vector; (in [0004] Knowledge Distillation (KD) is a compression technique used to transfer the knowledge of a large trained neural network model (i.e. a neural network model with many learned parameters) to a smaller neural network model (i.e. a neural network model with fewer learned parameters than the large trained neural network model). KD utilizes the generalization ability of the larger trained neural network model (referred to as the “teacher model” or “teacher”) [providing an input to a teacher model that is larger than the student model] using the inference data output by the larger trained model as “soft targets”[ wherein a layer of the teacher model outputs a first output vector], which are used as a supervision signal for training a smaller neural network model (called the “student model” or “student”) [providing the input to the student model]….; And in [0016] In some aspects, the present disclosure provides a device, comprising a processor and a memory. The memory has stored thereon instructions which, when executed by the processor, cause the device to perform a number of operations. A batch of training data comprising one or more labeled training data samples is obtained. Each labeled training data sample has a respective ground truth label. The batch of training data is processed, using a student model comprising a plurality of learnable parameters, to generate, for input data in each data sample in the batch of training data, a student prediction. For each labeled training data sample in the batch of training data, the student prediction and the ground truth label are processed to compute a respective ground truth loss. 
The batch of training data is processed, using a trained teacher model, to generate, for each labeled training data sample in the batch of training data, a teacher prediction. For each labeled data sample in the batch of training data, the student prediction [providing the input to the student model, wherein a layer of the student model outputs a second output vector] and the teacher prediction [providing an input to a teacher model that is larger than the student model, wherein a layer of the teacher model outputs a first output vector] are processed to compute a respective knowledge distillation loss…
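For context, the soft-target setup that Lu describes in [0004] and [0016] can be illustrated with a minimal sketch. The logits, labels, and the specific cross-entropy and KL-divergence loss definitions below are hypothetical illustrative choices, not Lu's disclosed implementation:

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

# Hypothetical logits for a batch of 2 training samples over 3 classes.
teacher_logits = np.array([[4.0, 1.0, 0.5],
                           [0.2, 3.5, 0.1]])  # teacher output, the "soft targets"
student_logits = np.array([[2.0, 1.5, 0.8],
                           [0.5, 2.0, 0.4]])
labels = np.array([0, 1])                     # ground-truth labels per sample

p_teacher = softmax(teacher_logits)
p_student = softmax(student_logits)

# Per-sample ground-truth (cross-entropy) loss against the hard labels.
ce_loss = -np.log(p_student[np.arange(len(labels)), labels])

# Per-sample knowledge-distillation loss: KL(teacher || student).
kd_loss = (p_teacher * (np.log(p_teacher) - np.log(p_student))).sum(axis=1)
```

Both per-sample losses are then available for the weighting and gradient descent steps that the citation goes on to describe.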
determining an importance value associated with each dimension of the first output vector based on gradients from the teacher model; (in [0015] In some aspects, the present disclosure provides a method for knowledge distillation. A batch of training data comprising one or more labeled training data samples is obtained. Each labeled training data sample has a respective ground truth label. The batch of training data is processed, using a student model comprising a plurality of learnable parameters, to generate, for input data in each data sample in the batch of training data, a student prediction. For each labeled training data sample in the batch of training data, the student prediction and the ground truth label are processed to compute a respective ground truth loss. The batch of training data is processed, using a trained teacher model [based on gradients from the teacher model], to generate, for each labeled training data sample in the batch of training data, a teacher prediction. For each labeled data sample in the batch of training data, the student prediction and the teacher prediction are processed to compute a respective knowledge distillation loss. A weighted loss is determined based on the knowledge distillation loss and ground truth loss [determining an importance value associated with each dimension of the first output vector based on gradients from the teacher model] for each labeled training data sample in the batch of training data. Gradient descent is performed on the student model using the weighted loss to identify an adjusted set of values for the plurality of learnable parameters of the student. The values of the plurality of learnable parameters of the student are adjusted to the adjusted set of values. 
And in [0002] Machine Learning (ML) is an artificial intelligence technique in which algorithms are used to build a model from sample data that is capable of being applied to input data to perform a specific inference task (i.e., making predictions or decisions based on new data) without being explicitly programmed to perform the specific inference task. Deep learning is one of the most successful and widely deployed machine learning algorithms... Training the neural network involves optimizing the learnable parameters of the neurons, typically using gradient-based optimization algorithms [based on gradients from the teacher model], to minimize a loss function… And in [0059] In the example of FIG. 3, the teacher 232, denoted as T(.Math.), is a relatively large neural network model for an inference task (i.e. a neural network including a large number of learnable parameters implementing a model for an inference task) which has been trained [based on gradients from the teacher model] to optimize the values of the learnable parameters of the large neural network model, and which is to be compressed using KD…)
and updating at least one parameter of the student model to minimize a difference between the second output vector and the first output vector based on the importance values. (in [0067] KL divergence is used to measure the difference between the student predictions (i.e. the student predicted logits 306) and the teacher predictions (i.e. the teacher predicted logits 310). Minimizing the KL divergence, and therefore the KD loss 314 [updating at least one parameter of the student model to minimize a difference between the second output vector and the first output vector based on the importance values.], by adjusting the values of the learnable parameters of the student [and updating at least one parameter of the student model to minimize a difference between the second output vector and the first output vector] 234 should result in the student 234 learning to output student inference data 34 that is close to the teacher inference data 24. And in [0069] At 412, the gradient descent module 218 performs a gradient descent operation on the student 234 using the weighted loss function L to identify an adjusted set of values of the learnable parameters of the student 234. A gradient descent operation may be performed using any appropriate technique known in the field of machine learning to adjust each learnable parameter of the student 234, for example using backpropagation to perform gradient descent on each of the learnable parameters of the student 234. 
The gradient descent operation performed by the gradient descent module 218 is intended to compute or estimate a partial derivative of the value of each of the learnable parameters of the student 234 [and updating at least one parameter of the student model to minimize a difference between the second output vector and the first output vector] with respect to the weighted loss function L [based on the importance values], using the chain rule as necessary to propagate the weighted loss 330 backward from the output nodes (e.g. an output layer of neurons) through the other nodes of the student 234. The adjusted values of the learnable parameters may be identified as values of the learned parameters that would result in a lower or minimized reweighted loss 330 with respect the current labeled training data sample x, or with respect to the entire batch of training data X.)
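The effect of the gradient descent operation described in [0067] and [0069] can be sketched numerically: for a distillation loss L = KL(p_t || softmax(s)), the partial derivative with respect to the student logits s is softmax(s) - p_t, so repeated descent steps drive the student output toward the teacher's. The logit values and learning rate below are hypothetical:

```python
import numpy as np

def softmax(z):
    z = z - z.max()  # stability shift
    e = np.exp(z)
    return e / e.sum()

def kl_divergence(p, q):
    return float((p * (np.log(p) - np.log(q))).sum())

teacher_logits = np.array([3.0, 1.0, 0.2])  # hypothetical, fixed trained teacher
student_logits = np.array([0.5, 0.5, 0.5])  # hypothetical student, to be adjusted
p_teacher = softmax(teacher_logits)

loss_before = kl_divergence(p_teacher, softmax(student_logits))
learning_rate = 0.5
for _ in range(50):
    grad = softmax(student_logits) - p_teacher        # dL/ds for L = KL(p_t || softmax(s))
    student_logits = student_logits - learning_rate * grad  # gradient descent step
loss_after = kl_divergence(p_teacher, softmax(student_logits))
```

Each step lowers the KD loss, mirroring Lu's adjustment of the student's learnable parameters toward values that minimize the loss.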
Lu teaches the knowledge distillation training process, as noted above. Lu does not, however, expressly teach the output information in vector form.
Lai does expressly teach the output information in vector form, in [0025] For example, a loss function generation module of the model training system generates a loss function L3 that is based on a comparison of the above discussed probability vectors Ps3 and Pt3 generated by the student and teacher models [providing an input to a teacher model that is larger than the student model, wherein a layer of the teacher model outputs a first output vector; providing the input to the student model, wherein a layer of the student model outputs a second output vector], respectively. An example loss function is L3=H (σ(t3/T), σ(s3/T)), where σ is the softmax function, t3 is a logit of the probability vector Pt3 […wherein a layer of the teacher model outputs a first output vector; providing the input to the student model, …], s3 is the logit of the probability vector Ps3 [… wherein a layer of the student model outputs a second output vector], T is a temperature hyperparameter of the teacher and/or the student models, and the function H(.) is a cross-entropy function. The logit function and this example loss function equation is discussed herein later in further detail.
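Lai's example loss L3=H(σ(t3/T), σ(s3/T)) from [0025] can be computed directly; the logit values and temperature below are hypothetical choices for illustration only:

```python
import numpy as np

def softmax(z):
    z = z - z.max()  # stability shift
    e = np.exp(z)
    return e / e.sum()

def cross_entropy(p, q):
    # H(p, q) = -sum_i p_i * log(q_i)
    return float(-(p * np.log(q)).sum())

T = 2.0                          # temperature hyperparameter
t3 = np.array([4.0, 1.0, 0.2])   # hypothetical teacher logits
s3 = np.array([2.5, 1.2, 0.3])   # hypothetical student logits

# L3 = H(sigma(t3 / T), sigma(s3 / T)) per Lai [0025].
L3 = cross_entropy(softmax(t3 / T), softmax(s3 / T))
```

Dividing the logits by a temperature T > 1 softens both probability vectors before they are compared, which is the role of the temperature hyperparameter in Lai's formulation.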
Lai and Lu are analogous art because both involve developing information retrieval and processing techniques using machine learning systems and algorithms.
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the teachings of the prior art, namely the training of a reduced-scale model using a full-scale model for executing machine learning tasks as disclosed by Lai, with the method of developing information retrieval and processing tasks using knowledge distillation techniques as disclosed by Lu.
One of ordinary skill in the art would have been motivated to combine the methods disclosed by Lai and Lu as noted above; doing so allows for the development of parameter tuning techniques that tune the parameters of a student/smaller model such that the student model mimics the behavior of a teacher/larger model (Lai, [0029]).
Regarding claim 2, the rejection of claim 1 is incorporated and Lu in combination with Lai teaches the method of claim 1, further comprising determining a weighted knowledge distillation (WKD) loss based on the first output vector, the second output vector and the importance values, wherein the parameters of the student model are updated based on the determined WKD loss. (in [0068] At 410, the reweighting module 220 determines a weighted loss [further comprising determining a weighted knowledge distillation (WKD) loss based on the first output vector, the second output vector and the importance values], shown here as reweighted loss 330, using a weighted loss function based on the knowledge distillation loss 314 [based on the first output vector, the second output vector and the importance values] and ground truth loss 312 for each respective labeled training data sample x in the batch of training data 302. In some embodiments, the reweighting module 220 determines the reweighted loss 330 by determining a knowledge distillation weight λ.sub.x.sup.KD for the respective labeled training data sample x, determining a ground truth weight λ.sub.x.sup.CE for the respective labeled training data sample x, and computing the reweighted loss 330 (denoted as L) as the sum of the knowledge distillation loss L.sub.KD(x) weighted by the knowledge distillation weight λ.sub.x.sup.KD, and the ground truth loss L.sub.CE(x) weighted by the ground truth weight λ.sub.x.sup.CE. The reweighted loss 330 may be computed as a mean across the entire batch of training data X (i.e. the batch of training data 302) using the following weighted loss function:… [0069] At 412, the gradient descent module 218 performs a gradient descent operation on the student 234 using the weighted loss function L to identify an adjusted set of values of the learnable parameters of the student 234... 
The adjusted values of the learnable parameters may be identified as values of the learned parameters that would result in a lower or minimized reweighted loss 330 with respect the current labeled training data sample x, or with respect to the entire batch of training data X… [0071] In some embodiments, the method 400 may be repeated one or more times with additional batches of training data obtained (e.g., stochastically) from the training dataset 240... )
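The reweighted loss of Lu [0068], the batch mean of the KD loss weighted by λ.sub.x.sup.KD plus the ground truth loss weighted by λ.sub.x.sup.CE, can be sketched with hypothetical per-sample losses and weights (the complementary-weight assumption below is illustrative, not part of Lu's disclosure):

```python
import numpy as np

# Hypothetical per-sample losses for a batch of 3 labeled training samples.
kd_losses = np.array([0.8, 0.3, 1.1])  # L_KD(x) per sample
ce_losses = np.array([0.5, 0.9, 0.2])  # L_CE(x) per sample

# Hypothetical per-sample weights (Lu's lambda_x^KD and lambda_x^CE);
# complementary weights are assumed here for simplicity.
lam_kd = np.array([0.7, 0.5, 0.9])
lam_ce = 1.0 - lam_kd

# Reweighted loss L: mean over the batch of the weighted per-sample sums.
per_sample = lam_kd * kd_losses + lam_ce * ce_losses
L = float(per_sample.mean())
```

This scalar L is then the quantity on which the gradient descent module of [0069] operates.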
Regarding claim 3, the rejection of claim 1 is incorporated and Lu in combination with Lai teaches the method of claim 1, wherein the importance values are determined based on a probability of a ground-truth class when the ground-truth class is known. (in [0042] L.sub.CE is therefore a function that is used to compute a Cross-Entropy (CE) loss between the ground-truth label of the labeled training data sample input into the teacher and the student and the output of the student, S.sub.θ(x), and L.sub.KD is a function that is used to compute a KD loss [wherein the importance values are determined based on a probability of a ground-truth class when the ground-truth class is known] based on the Kullback-Leibler (KL) divergence between the teacher prediction data 24 and the student prediction data 34. L.sub.KD may be defined such that the comparison between teacher prediction data 24 and student prediction data 34 is congruent. For example, the teacher prediction data 24 and student prediction data 34 used by L.sub.KD may be the respective models' logits (i.e. pre-normalized predictions), whereas the student prediction data 34 used by L.sub.CE may be the normalized predictions, such as a predicted probability distribution over a plurality of classes for a classification task [wherein the importance values are determined based on a probability of a ground-truth class when the ground-truth class is known], or the student's predicted label for the labeled training data sample x, such that the predicted probability distribution or student's predicted label can be compared to the ground truth label [when the ground-truth class is known] of the labeled training data sample x. 
Because the student's predicted probability distribution can be derived from the student's logits (by normalizing using a softmax function) [wherein the importance values are determined based on a probability of a ground-truth class when the ground-truth class is known], and the student's predicted label may be derived from the student's predicted probability distribution (by applying an argmax function), various types of comparison operations using different student outputs can be performed by properly defining L.sub.KD and L.sub.CE)
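The derivation chain Lu notes in [0042], from logits to a predicted probability distribution via softmax, and from that distribution to a predicted label via argmax, can be sketched as follows (logit values hypothetical):

```python
import numpy as np

def softmax(z):
    z = z - z.max()  # stability shift
    e = np.exp(z)
    return e / e.sum()

logits = np.array([1.2, 3.4, 0.5])       # hypothetical student logits (pre-normalized)

probs = softmax(logits)                   # predicted probability distribution
predicted_label = int(np.argmax(probs))   # student's predicted label
```

Defining L.sub.KD and L.sub.CE over the appropriate stage of this chain is what allows the congruent comparisons the citation describes.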
Regarding claim 4, the rejection of claim 3 is incorporated and Lu in combination with Lai teaches the method of claim 3, wherein the importance values are determined based on a last output vector of the teacher model that is output from a last layer of the teacher model. (in [0042] L.sub.CE is therefore a function that is used to compute a Cross-Entropy (CE) loss between the ground-truth label of the labeled training data sample input into the teacher and the student and the output of the student, S.sub.θ(x), and L.sub.KD is a function that is used to compute a KD loss based on the Kullback-Leibler (KL) divergence between the teacher prediction data 24 and the student prediction data 34 [wherein the importance values are determined based on a last output vector of the teacher model that is output from a last layer of the teacher model]. L.sub.KD may be defined such that the comparison between teacher prediction data 24 and student prediction data 34 is congruent. For example, the teacher prediction data 24 and student prediction data 34 used by L.sub.KD may be the respective models' logits (i.e. pre-normalized predictions)…)
Regarding claim 5, the rejection of claim 3 is incorporated and Lu in combination with Lai teaches the method of claim 3, wherein the first output vector of a first dimension and the second output vector of a second dimension are normalized to have a same dimension. (in [0042] L.sub.CE is therefore a function that is used to compute a Cross-Entropy (CE) loss between the ground-truth label of the labeled training data sample input into the teacher and the student and the output of the student, S.sub.θ(x), and L.sub.KD is a function that is used to compute a KD loss based on the Kullback-Leibler (KL) divergence between the teacher prediction data 24 and the student prediction data 34 [wherein the first output vector of a first dimension and the second output vector of a second dimension are normalized to have a same dimension]. L.sub.KD may be defined such that the comparison between teacher prediction data 24 and student prediction data 34 is congruent. For example, the teacher prediction data 24 and student prediction data 34 used by L.sub.KD may be the respective models' logits (i.e. pre-normalized predictions) [wherein the first output vector of a first dimension and the second output vector of a second dimension are normalized to have a same dimension] …)
Regarding claim 6, the rejection of claim 1 is incorporated and Lu in combination with Lai teaches the method of claim 1, wherein the layer of the teacher model corresponds to an intermediate layer of the teacher model. (in [0059] In the example of FIG. 3, the teacher 232, denoted as T(.Math.), is a relatively large neural network model for an inference task [wherein the layer of the teacher model corresponds to an intermediate layer of the teacher model, as a large neural network model that has a layer corresponding to one of the two or more intermediate layers between the input layer and the layer for making the inference] (i.e. a neural network including a large number of learnable parameters implementing a model for an inference task) which has been trained to optimize the values of the learnable parameters of the large neural network model, and which is to be compressed using KD. The student 234, denoted as S.sub.θ(.Math.), is a relatively smaller neural network model (i.e. a neural network including a smaller number of learnable parameters than the teacher and implementing a model for the inference task) which, once trained using KD, is to be deployed to a computing device having limited computing resources (e.g. memory and/or processing power) for inference (i.e. to output student inference data for new input data).)
Regarding claim 7, the rejection of claim 1 is incorporated and Lu in combination with Lai teaches the method of claim 1, wherein the layer of the teacher model corresponds to a last layer of the teacher model. (in [0059] In the example of FIG. 3, the teacher 232, denoted as T(.Math.), is a relatively large neural network model for an inference task [wherein the layer of the teacher model corresponds to a last layer of the teacher model, as a large neural network model that has a layer corresponding to the last layer for making the inference] (i.e. a neural network including a large number of learnable parameters implementing a model for an inference task) which has been trained to optimize the values of the learnable parameters of the large neural network model, and which is to be compressed using KD. The student 234, denoted as S.sub.θ(.Math.), is a relatively smaller neural network model (i.e. a neural network including a smaller number of learnable parameters than the teacher and implementing a model for the inference task) which, once trained using KD, is to be deployed to a computing device having limited computing resources (e.g. memory and/or processing power) for inference (i.e. to output student inference data for new input data).)
Regarding independent claim 8, Lu in combination with Lai teaches a system for training a student model, the system comprising: a memory storing instructions; and a processor configured to execute the instructions to: (in [0109] … Accordingly, the technical solution of the present disclosure may be embodied in the form of a software product. A suitable software product may be stored in a pre-recorded storage device or other similar non-volatile or non-transitory computer readable medium, including DVDs, CD-ROMs, USB flash disk, a removable hard disk, or other storage media, for example. The software product includes machine-executable instructions tangibly stored thereon that enable a processing device (e.g., a personal computer, a server, or a network device) to execute examples of the methods disclosed herein [the system comprising: a memory storing instructions; and a processor configured to execute the instructions to]; And in [0004] Knowledge Distillation (KD) is a compression technique used to transfer the knowledge of a large trained neural network model (i.e. a neural network model with many learned parameters) to a smaller neural network model (i.e. a neural network model with fewer learned parameters than the large trained neural network model). KD utilizes the generalization ability of the larger trained neural network model (referred to as the “teacher model” or “teacher”) using the inference data output by the larger trained model as “soft targets”, which are used as a supervision signal for training a smaller neural network model (called the “student model” or “student”)…)
Regarding the remaining limitations of claim 8, the limitations are similar to those in claim 1, and are thus rejected under the same rationale.
Regarding claims 9-14, the limitations are similar to those in claims 2-7, and are thus rejected under the same rationale.
Regarding independent claim 15, Lu in combination with Lai teaches a non-transitory computer-readable storage medium storing instructions that, when executed by at least one processor, cause the at least one processor to: (in [0109] … Accordingly, the technical solution of the present disclosure may be embodied in the form of a software product. A suitable software product may be stored in a pre-recorded storage device or other similar non-volatile or non-transitory computer readable medium [non-transitory computer-readable storage medium storing instructions that, …], including DVDs, CD-ROMs, USB flash disk, a removable hard disk, or other storage media, for example. The software product includes machine-executable instructions tangibly stored thereon that enable a processing device (e.g., a personal computer, a server, or a network device) to execute examples of the methods disclosed herein [non-transitory computer-readable storage medium storing instructions that, when executed by at least one processor, cause the at least one processor to]; And in [0004] Knowledge Distillation (KD) is a compression technique used to transfer the knowledge of a large trained neural network model (i.e. a neural network model with many learned parameters) to a smaller neural network model (i.e. a neural network model with fewer learned parameters than the large trained neural network model). KD utilizes the generalization ability of the larger trained neural network model (referred to as the “teacher model” or “teacher”) using the inference data output by the larger trained model as “soft targets”, which are used as a supervision signal for training a smaller neural network model (called the “student model” or “student”)…)
Regarding the remaining limitations of claim 15, the limitations are similar to those in claim 1, and are thus rejected under the same rationale.
Regarding claims 16-20, the limitations are similar to those in claims 2-6, and are thus rejected under the same rationale.
Conclusion
The prior art made of record and not relied upon is considered pertinent to applicant's disclosure.
Ping et al. (US 20190180732): teaches in [0042] Knowledge distillation is originally proposed for compressing large models to smaller ones. In deep learning, a smaller student network is distilled from the teacher network by minimizing the loss between their outputs (e.g., L2 or cross-entropy)…
Fukuda (US 20220188643): teaches in abstract method of training a student neural network is provided. The method includes feeding a data set including a plurality of input vectors into a teacher neural network to generate a plurality of output values, and converting two of the plurality of output values from the teacher neural network for two corresponding input vectors into two corresponding soft labels. The method further includes combining the two corresponding input vectors to form a synthesized data vector, and forming a masked soft label vector from the two corresponding soft labels. The method further includes feeding the synthesized data vector into the student neural network, using the masked soft label vector to determine an error for modifying weights of the student neural network, and modifying the weights of the student neural network. And in [0002] In artificial neural networks (ANN), “learning” occurs through a change in weights applied to the data inputs of each neuron in the neural network. An artificial neural network can have one or more layers of neurons depending on the neural network architecture. Training of the neural network can be conducted using training pairs, including input data and an expected output/result (i.e., hard labels). Training the neural network then involves feeding the training pairs into the neural network and generating a prediction about the output (i.e., soft labels). The resulting or predicted output can be compared to the expected output for each of the training pairs to determine the correctness or incorrectness of the predictions… [0022] In various embodiments, a complex deep neural network (i.e., the teacher network) can be trained using a complete dataset with hard targets/labels, where this can be conducted offline. The deep neural network can be a multilayer perceptron. A correspondence can be established between the intermediate outputs of the teacher network and the student network. 
The outputs from the teacher network can be used to backpropagate calculated error values through the student network, so that the student network can learn to replicate the behavior of the teacher network, rather than learn directly with the hard targets/labels. A teacher network can, thereby, effectively transfer its knowledge to student networks of a smaller size.
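One possible reading of the data-synthesis step in Fukuda's abstract can be sketched as follows; the averaging and masking operations below are assumptions chosen for illustration, since the abstract does not fix how the two input vectors are combined or how the masked soft label vector is formed:

```python
import numpy as np

# Hypothetical input vectors and their teacher-derived soft labels.
x1 = np.array([1.0, 0.0, 2.0])
x2 = np.array([0.0, 3.0, 1.0])
soft1 = np.array([0.7, 0.2, 0.1])
soft2 = np.array([0.1, 0.1, 0.8])

# Combine the two input vectors into a synthesized data vector
# (a simple average is assumed here).
synthesized = 0.5 * (x1 + x2)

# Form a masked soft label vector: keep only each sample's dominant
# class probability, zero the rest, then renormalize (assumed masking).
masked = np.zeros_like(soft1)
masked[np.argmax(soft1)] = soft1[np.argmax(soft1)]
masked[np.argmax(soft2)] = soft2[np.argmax(soft2)]
masked_soft_label = masked / masked.sum()
```

The synthesized vector would then be fed to the student network, with the masked soft label vector supplying the error signal for weight updates.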
Any inquiry concerning this communication or earlier communications from the examiner should be directed to OLUWATOSIN ALABI whose telephone number is (571)272-0516. The examiner can normally be reached Monday-Friday, 8:00am-5:00pm EST.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Michael Huntley can be reached at (303) 297-4307. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.
/OLUWATOSIN ALABI/ Primary Examiner, Art Unit 2129