DETAILED ACTION
This action is in response to the amendment filed on 12/08/2025, which has been entered in the above-identified application. Claims 1-20 are pending in the application and have been examined.
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
The status of the claims is as follows.
Claims 1, 3, 4, 6, 11, 12, 14 and 20 are amended. Claims 1-20 are currently pending.
Claim Rejections - 35 USC § 102
The following is a quotation of the appropriate paragraphs of 35 U.S.C. 102 that form the basis for the rejections under this section made in this Office action:
A person shall be entitled to a patent unless –
(a)(1) the claimed invention was patented, described in a printed publication, or in public use, on sale, or otherwise available to the public before the effective filing date of the claimed invention.
Claims 1-3, 5, 12, 13, 15, and 20 are rejected under 35 U.S.C. 102(a)(1) as being anticipated by Lu et al. (US20220343175A1, hereinafter “Lu”).
Regarding Claim 1,
Lu discloses a method of obtaining a key performance indicator (KPI) fast-adaptive artificial intelligence (AI) model, the method comprising: receiving first KPI preference setting information; obtaining a first AI model by training a first initial AI model based on the first KPI preference setting information; receiving second KPI preference setting information; obtaining a second AI model by training a second initial AI model based on the second KPI preference setting information (Lu [0007]; “The teacher 20 is pre-trained, and the student 30 is generally trained using the same training dataset used to train the teacher 20”
Lu [0053]; “The memory 208 may also store the student 234 and teacher 232, each of which may include values of a plurality of learnable parameters (“learnable parameter values”), as well as a plurality of values for hyperparameters (“hyperparameter values”) used to control the structure and operation of the model. Hyperparameter values are usually set prior to training and are not adjusted during training, in contrast to learnable parameter values, which are adjusted during training.”
Lu [0016]; “The memory has stored thereon instructions which, when executed by the processor, cause the device to perform a number of operations. A batch of training data comprising one or more labeled training data samples is obtained. Each labeled training data sample has a respective ground truth label. The batch of training data is processed, using a student model comprising a plurality of learnable parameters, to generate, for input data in each data sample in the batch of training data, a student prediction. For each labeled training data sample in the batch of training data, the student prediction and the ground truth label are processed to compute a respective ground truth loss. The batch of training data is processed, using a trained teacher model, to generate, for each labeled training data sample in the batch of training data, a teacher prediction. For each labeled data sample in the batch of training data, the student prediction and the teacher prediction are processed to compute a respective knowledge distillation loss” wherein the learnable parameter values and the hyperparameter values that control the structure and operation of the model read on obtaining the model in part through training dependent on the initialized hyperparameter values; and wherein the hyperparameter values of the student and teacher models being initialized prior to training read on receiving the first and second KPI preference setting information)
obtaining a distilled AI model by knowledge distillation based on the first AI model and the second AI model (Lu [0015]; “In some aspects, the present disclosure provides a method for knowledge distillation. A batch of training data comprising one or more labeled training data samples is obtained. Each labeled training data sample has a respective ground truth label. The batch of training data is processed, using a student model comprising a plurality of learnable parameters, to generate, for input data in each data sample in the batch of training data, a student prediction. For each labeled training data sample in the batch of training data, the student prediction and the ground truth label are processed to compute a respective ground truth loss. The batch of training data is processed, using a trained teacher model, to generate, for each labeled training data sample in the batch of training data, a teacher prediction. For each labeled data sample in the batch of training data, the student prediction and the teacher prediction are processed to compute a respective knowledge distillation loss. A weighted loss is determined based on the knowledge distillation loss and ground truth loss for each labeled training data sample in the batch of training data. Gradient descent is performed on the student model using the weighted loss to identify an adjusted set of values for the plurality of learnable parameters of the student. The values of the plurality of learnable parameters of the student are adjusted to the adjusted set of values” wherein the student model, updated through its learnable parameters adjusted by gradient descent on the loss derived from knowledge distillation, thus reads on a distilled AI model based on the teacher and student models)
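For context, the weighted-loss distillation step that Lu [0015] describes (ground truth loss plus knowledge distillation loss, combined by a weighted sum) can be sketched as follows. The function names, weight values, and epsilon constant are the examiner's illustrative choices and do not appear in Lu:

```python
import numpy as np

def softmax(z):
    # Numerically stable softmax over the last axis.
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def weighted_kd_loss(student_logits, teacher_logits, labels, w_gt=0.5, w_kd=0.5):
    """Weighted sum of a ground-truth cross-entropy loss and a knowledge
    distillation loss, in the general form Lu [0015] describes."""
    p_s = softmax(student_logits)
    p_t = softmax(teacher_logits)
    n = len(labels)
    # Ground truth loss: cross-entropy of student prediction vs. hard label.
    gt_loss = -np.log(p_s[np.arange(n), labels] + 1e-12).mean()
    # Knowledge distillation loss: KL(teacher || student) per sample.
    kd_loss = (p_t * (np.log(p_t + 1e-12) - np.log(p_s + 1e-12))).sum(axis=-1).mean()
    return w_gt * gt_loss + w_kd * kd_loss
```

Gradient descent on this weighted loss with respect to the student's learnable parameters then yields the adjusted parameter values of the distilled model.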
obtaining the KPI fast-adaptive AI model by meta learning based on the distilled AI model, the first KPI preference setting information and the second KPI preference setting information (Lu [Figure 6];
[Lu, FIG. 6 — reproduced as an image in the original]
Lu [0011]; “In some examples, the reweighting module may determine a ground truth weight to emphasize a ground truth label of a training data sample, and a knowledge distillation weight to emphasize the soft label from the teacher, based on user input (such as input from an expert with insight into the characteristics of a given batch of training data). In some examples, the reweighting module may determine the ground truth weight and the knowledge distillation weight using a meta-reweighting process. The meta-reweighting process computes the weights assigned to the ground truth loss and the knowledge distillation loss based on a meta-reweighting algorithm similar to the meta-reweighting algorithm described by M. Ren, W. Zeng, B. Yang, and R. Urtasun. “Learning to reweight examples for robust deep learning”. In ICML, 2018, arXiv:1803.09050 (hereinafter “Ren”). Meta-reweighting is a meta-learning technique that uses machine learning to assign weights to training data samples based on their gradient directions. To determine the weights in the Ren reference, a meta gradient descent step is performed on existing training weights to minimize the generalization loss on a validation set. This method has shown success in both the Natural Language Processing (NLP) and Computer Vision (CV) fields. However, the goal in the Ren reference is to weigh the relative contributions of different training data samples on model training. 
In contrast, the meta-reweighting process in the present disclosure weighs the relative contributions of the teacher's soft label and the training data's hard label for each training data sample.” wherein the meta-reweighting of the distilled AI model obtained through the teacher and student models, weighting the contributions of samples based on their hyperparameter-derived gradients, thus reads on meta learning of the distilled model based at least in part on the initial distilled AI model before re-weighting as well as the first and second KPI preference setting information (the hyperparameters used to facilitate training of the teacher and student models through knowledge distillation to obtain the distilled AI model))
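The meta-reweighting Lu [0011] attributes to Ren (a meta gradient descent step on the loss weights to minimize validation loss) can be illustrated with a toy scalar model. Everything here — the one-parameter model, squared-error losses, and learning rates — is the examiner's illustrative construction, not Lu's or Ren's actual implementation:

```python
import numpy as np

def meta_reweight_step(theta, x_tr, y_hard, y_soft, x_val, y_val,
                       w_kd=0.5, lr=0.1, meta_lr=0.05):
    """One meta-reweighting step on a toy model (prediction = theta * x):
    take a provisional student update under the current KD weight, then
    adjust the weight to reduce the validation loss, in the spirit of the
    Ren-style meta gradient step Lu [0011] summarizes."""
    g_gt = 2 * (theta * x_tr - y_hard) * x_tr   # gradient of hard-label loss
    g_kd = 2 * (theta * x_tr - y_soft) * x_tr   # gradient of teacher soft-label loss
    theta_new = theta - lr * ((1 - w_kd) * g_gt + w_kd * g_kd)
    # Meta-gradient of validation loss w.r.t. w_kd via the chain rule:
    # dL_val/dw = dL_val/dtheta' * dtheta'/dw, with dtheta'/dw = -lr*(g_kd - g_gt).
    g_val = 2 * (theta_new * x_val - y_val) * x_val
    dtheta_dw = -lr * (g_kd - g_gt)
    w_kd = np.clip(w_kd - meta_lr * g_val * dtheta_dw, 0.0, 1.0)
    return theta_new, w_kd
```

In this sketch, when the teacher's soft label agrees with the validation target better than the noisy hard label, the meta step increases the knowledge distillation weight, matching the behavior the quoted passage describes.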
Regarding Claim 2,
Lu teaches the method of Claim 1 (and thus the rejection of Claim 1 is incorporated). Lu further discloses applying the KPI fast-adaptive AI model to perform load balancing in a cellular communications system. As discussed above, Lu discloses the claimed “KPI fast-adaptive AI model.” The examiner finds that the specific application of the model is a recitation of intended use and thus is not afforded patentable weight.
Regarding Claim 3,
Lu teaches the method of Claim 1 (and thus the rejection of Claim 1 is incorporated). Lu further discloses obtaining the KPI fast-adaptive AI model comprises initializing the KPI fast-adaptive AI model with the distilled AI model by first setting parameters of the KPI fast-adaptive AI model to parameters of the distilled AI model (Lu [Figure 6];
[Lu, FIG. 6 — reproduced as an image in the original]
wherein the meta student initializes its parameters from the distilled student learned parameters in step 520)
Regarding Claim 5,
Lu teaches the method of Claim 1 (and thus the rejection of Claim 1 is incorporated). Lu further discloses wherein the obtaining the distilled AI model comprises using a distillation loss function (Lu [0006]; “FIG. 1 shows a typical configuration 10 for conventional KD. A teacher 20 is used to train a student 30. The teacher 20 receives input data 22 from a dataset used to train the student 30 and generates teacher inference data 24 based on the input data 22. The teacher inference data 24 is used as a soft target for supervision of the student 30, and the student 30 may be trained at least in part using a knowledge distillation loss function based on a comparison of the teacher inference data 24 to the student inference data 34 based on the same input data 22 provided to the teacher 20.”)
Claims 12, 13, and 15 recite a server to perform the method of Claims 1, 3 and 5. Thus, Claims 12, 13, and 15 are rejected for reasons set forth in the rejection of Claims 1, 3 and 5.
Claim 20 recites a non-transitory computer readable medium storing a program to perform the method of Claim 1. Thus, Claim 20 is rejected for reasons set forth in the rejection of Claim 1.
Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.
The factual inquiries for establishing a background for determining obviousness under 35 U.S.C. 103 are summarized as follows:
1. Determining the scope and contents of the prior art.
2. Ascertaining the differences between the prior art and the claims at issue.
3. Resolving the level of ordinary skill in the pertinent art.
4. Considering objective evidence present in the application indicating obviousness or nonobviousness.
Claims 4, 10, 11, 14, and 19 are rejected under 35 U.S.C. 103 as being unpatentable over Lu et al. (US20220343175A1, hereinafter “Lu”) in view of Vandenhende et al. (“Multi-task learning for dense prediction tasks” [2021], hereinafter “Vandenhende”).
Regarding Claim 4,
Lu teaches the method of Claim 3 (and thus the rejection of Claim 3 is incorporated). Lu fails to explicitly disclose but Vandenhende discloses wherein the obtaining the KPI fast-adaptive AI model by meta learning comprises performing, using the KPI fast-adaptive AI model, task adaptation for a first task associated with a first preference vector and a second task associated with a second preference vector to obtain a plurality of first task parameters and a plurality of second task parameters, wherein the first preference vector indicates a first weighting over a plurality of KPIs and the second preference vector indicates a second weighting over the plurality of KPIs (Vandenhende [Page 8 Section 3.1];
[Vandenhende, Page 8, Section 3.1 — reproduced as an image in the original]
wherein task balancing optimization in multi-task learning through a trained ML model reads on adapting tasks (balancing tasks) for multiple tasks associated with respective preference vectors (weights of the task) to obtain a plurality of first and second task parameters (balanced gradient magnitudes); wherein the preference vectors being weights of the model for a given task reads on the preference vectors indicating weights of the task over a plurality of loss KPIs)
collecting one or more first validation trajectories and one or more second validation trajectories, wherein the first task is associated with a first task policy and with first task parameters, and the second task is associated with a second task policy and with second task parameters (Vandenhende [Page 9 Section 3.1.5 Paragraph 2]; “ In MTL, a Pareto optimal solution is found when the following condition is satisfied: the loss for any task can be decreased without increasing the loss on any of the other tasks. A multiple gradient descent algorithm (MGDA) [65] was proposed in [34] to find a Pareto stationary point. In particular, the shared network weights are updated by finding a common direction among the task-specific gradients. As long as there is a common direction along which the task-specific losses can be decreased, we have not reached a Pareto optimal point yet. An advantage of this approach is that since the shared network weights are only updated along common directions of the task-specific gradients, conflicting gradients are avoided in the weight update step” wherein the task-specific losses determined through updates of the shared network weights reads on validation trajectories associated with a first and second task along with its parameters and respective loss minimization policies)
and updating a plurality of meta parameters of the KPI fast-adaptive AI model using the one or more first validation trajectories and the one or more second validation trajectories, wherein the first and the second tasks are AI models, and the first and the second validation trajectories are histories of the first and the second tasks performing in an environment (Vandenhende [Page 9 Section 3.1.5 Paragraph 2]; “In MTL, a Pareto optimal solution is found when the following condition is satisfied: the loss for any task can be decreased without increasing the loss on any of the other tasks. A multiple gradient descent algorithm (MGDA) [65] was proposed in [34] to find a Pareto stationary point. In particular, the shared network weights are updated by finding a common direction among the task-specific gradients. As long as there is a common direction along which the task-specific losses can be decreased, we have not reached a Pareto optimal point yet. An advantage of this approach is that since the shared network weights are only updated along common directions of the task-specific gradients, conflicting gradients are avoided in the weight update step” wherein updating the shared network weights toward Pareto optimality based on the task-specific losses reads on updating a plurality of meta parameters; wherein the tasks of the neural network read on AI models; and wherein the task-specific loss functions representative of the first and second task performance read on the first and second validation trajectories)
It would have been obvious to modify Lu’s distilled model, obtained through knowledge distillation of student and teacher models for load balancing in cellular communications systems, to incorporate Vandenhende’s method of learning the distilled model through parameter updates computed from task-adaptation validation trajectories. One would have been motivated to do so “to find a Pareto stationary point … As long as there is a common direction along which the task-specific losses can be decreased, we have not reached a Pareto optimal point yet” (Vandenhende [Page 9 Section 3.1.5 Paragraph 2]) and thus minimize the loss of the distilled model’s parameters by determining whether they are Pareto optimal.
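For the two-task case, the MGDA common-direction update quoted from Vandenhende (Sener and Koltun's min-norm formulation) has a closed form: the minimum-norm convex combination of the two task gradients. The following sketch is the examiner's illustration of that closed form, not code from either reference:

```python
import numpy as np

def mgda_two_task_direction(g1, g2):
    """Min-norm point in the convex hull of two task gradients.
    Minimizing ||a*g1 + (1-a)*g2||^2 over a in [0,1] gives
    a = clip((g2 - g1)·g2 / ||g1 - g2||^2, 0, 1)."""
    diff = g1 - g2
    denom = diff @ diff
    if denom == 0.0:
        alpha = 0.5  # identical gradients: any convex combination works
    else:
        alpha = np.clip((g2 - g1) @ g2 / denom, 0.0, 1.0)
    return alpha * g1 + (1 - alpha) * g2
```

When the returned direction is nonzero, stepping against it decreases both task-specific losses; a zero return corresponds to the Pareto stationary point the quoted passage describes, at which no common descent direction remains.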
Regarding Claim 10,
Lu teaches the method of Claim 1 (and thus the rejection of Claim 1 is incorporated). Lu fails to explicitly disclose but Vandenhende discloses fine tuning the KPI fast-adaptive AI model to approximate a Pareto front (Vandenhende [Page 9 Section 3.1.5]; “A global optimum for the multi-task optimization objective in Equation 4 is hard to find. Due the complex nature of this problem, a certain choice that improves performance for one task could lead to performance degradation for another task. The task balancing methods discussed beforehand try to tackle this problem by setting the task-specific weights in the loss according to some heuristic. Differently, Sener and Koltun [34] view MTL as a multi-objective optimization problem, with the overall goal of finding a Pareto optimal solution among all tasks. In MTL, a Pareto optimal solution is found when the following condition is satisfied: the loss for any task can be decreased without increasing the loss on any of the other tasks. A multiple gradient descent algorithm (MGDA) [65] was proposed in [34] to find a Pareto stationary point. In particular, the shared network weights are updated by finding a common direction among the task-specific gradients. As long as there is a common direction along which the task-specific losses can be decreased, we have not reached a Pareto optimal point yet” wherein the Pareto optimal solution comprising Pareto optimal points that the model is optimized for reads on fine tuning to approximate a Pareto front)
It would have been obvious to modify Lu’s distilled model, obtained through knowledge distillation of student and teacher models for load balancing in cellular communications systems, to incorporate Vandenhende’s method of updating the distilled model’s parameters to approximate a Pareto front. One would have been motivated to do so because “An advantage of this approach is that since the shared network weights are only updated along common directions of the task-specific gradients, conflicting gradients are avoided in the weight update step” (Vandenhende [Page 9 Section 3.1.5 Paragraph 2]).
Regarding Claim 11,
Lu teaches the method of Claim 4 (and thus the rejection of Claim 4 is incorporated). Lu fails to explicitly disclose but Vandenhende discloses performing the task adaptation comprises: sampling one or more first training trajectories using the KPI fast-adaptive AI model; updating the plurality of first task parameters of the first task policy based on the one or more first training trajectories; sampling one or more second training trajectories using the KPI fast-adaptive AI model; and updating the plurality of second task parameters of the second task policy based on one or more second training trajectories (Vandenhende [Page 9 Section 3.1.6]; “In Section 3.1, we described several methods for balancing the influence of each task when training a multi-task network”
Vandenhende [Page 8 Section 3.1.1]; “The loss functions L1, L2 belong to the first and second task respectively. By minimizing the loss L w.r.t. the noise parameters σ1, σ2, one can essentially balance the task specific losses during training. The optimization objective in Equation 6 can easily be extended to account for more than two tasks too. The noise parameters are updated through standard backpropagation during training” wherein the training loss functions of the first and second tasks being used for updates of the task balancing and associated parameter updates is performed)
collecting comprises: obtaining the one or more first validation trajectories using the KPI fast-adaptive AI model; and obtaining the one or more second validation trajectories using the KPI fast-adaptive AI model (Vandenhende [Page 9 Section 3.1.5 Paragraph 2]; “In MTL, a Pareto optimal solution is found when the following condition is satisfied: the loss for any task can be decreased without increasing the loss on any of the other tasks. A multiple gradient descent algorithm (MGDA) [65] was proposed in [34] to find a Pareto stationary point. In particular, the shared network weights are updated by finding a common direction among the task-specific gradients. As long as there is a common direction along which the task-specific losses can be decreased, we have not reached a Pareto optimal point yet. An advantage of this approach is that since the shared network weights are only updated along common directions of the task-specific gradients, conflicting gradients are avoided in the weight update step” wherein the task-specific losses determined through gradient descent (MGDA) read on validation trajectories associated with a first and second task along with its parameters and respective loss minimization policies)
It would have been obvious to modify Lu’s distilled model, obtained through knowledge distillation of student and teacher models for load balancing in cellular communications systems, to incorporate Vandenhende’s method of learning the distilled model through parameter updates computed from task-adaptation validation trajectories. One would have been motivated to do so “to find a Pareto stationary point … As long as there is a common direction along which the task-specific losses can be decreased, we have not reached a Pareto optimal point yet” (Vandenhende [Page 9 Section 3.1.5 Paragraph 2]) and thus minimize the loss of the distilled model’s parameters by determining whether they are Pareto optimal.
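The noise-parameter task balancing quoted above from Vandenhende Page 8 Section 3.1.1 (losses L1, L2 balanced by noise parameters σ1, σ2 updated through backpropagation) is commonly implemented as homoscedastic-uncertainty weighting in the style of Kendall et al. The exact form below is the examiner's assumption of that common formulation, since the equation itself is not quoted in the record:

```python
import numpy as np

def uncertainty_weighted_loss(l1, l2, log_sigma1, log_sigma2):
    """Combined two-task loss balanced by learnable noise parameters:
    each task loss is scaled by 1/(2*sigma_i^2), with a log-sigma
    regularizer preventing the sigmas from growing without bound.
    log_sigma parameters would be updated by standard backpropagation."""
    s1, s2 = np.exp(log_sigma1), np.exp(log_sigma2)
    return l1 / (2 * s1**2) + l2 / (2 * s2**2) + log_sigma1 + log_sigma2
```

A larger noise parameter for a task down-weights that task's contribution, which is the balancing behavior the quoted passage attributes to minimizing the loss with respect to σ1 and σ2.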
Claims 14 and 19 recite a server to perform the method of Claims 4 and 10. Thus, Claims 14 and 19 are rejected for reasons set forth in the rejection of Claims 4 and 10.
Claim 6 is rejected under 35 U.S.C. 103 as being unpatentable over Lu et al. (US20220343175A1, hereinafter “Lu”) in view of Li et al. (US20210390733A1, hereinafter “Li”) further in view of Vandenhende et al. (“Multi-task learning for dense prediction tasks” [2021], hereinafter “Vandenhende”) and further in view of Bellegarda (US20220383044A1).
Regarding Claim 6,
Lu teaches the method of Claim 5 (and thus the rejection of Claim 5 is incorporated). Lu already discloses training the … model, wherein the … model corresponds to a … teacher (Lu [0007]; “The teacher 20 is pre-trained, and the student 30 is generally trained using the same training dataset used to train the teacher 20”
Lu [0053]; “The memory 208 may also store the student 234 and teacher 232, each of which may include values of a plurality of learnable parameters (“learnable parameter values”), as well as a plurality of values for hyperparameters (“hyperparameter values”) used to control the structure and operation of the model. Hyperparameter values are usually set prior to training and are not adjusted during training, in contrast to learnable parameter values, which are adjusted during training.”
Lu [0016]; “The memory has stored thereon instructions which, when executed by the processor, cause the device to perform a number of operations. A batch of training data comprising one or more labeled training data samples is obtained. Each labeled training data sample has a respective ground truth label. The batch of training data is processed, using a student model comprising a plurality of learnable parameters, to generate, for input data in each data sample in the batch of training data, a student prediction. For each labeled training data sample in the batch of training data, the student prediction and the ground truth label are processed to compute a respective ground truth loss. The batch of training data is processed, using a trained teacher model, to generate, for each labeled training data sample in the batch of training data, a teacher prediction. For each labeled data sample in the batch of training data, the student prediction and the teacher prediction are processed to compute a respective knowledge distillation loss”).
Lu fails to explicitly disclose but Li discloses the first AI model, wherein the first AI model corresponds to a first teacher; second AI model, wherein the second AI model corresponds to a second teacher (Li [0044]; “The plurality of teacher neural networks 152T may receive any data representation of the video data 102 (e.g., the spatial input stream 122, the temporal input stream 124, the pose input stream 126, etc.) to predict the teacher output 154T. Optionally, each teacher neural network 152T may process a single input (e.g., the spatial input stream 122, the temporal input stream 124, and the pose input stream 126) or multiple inputs. In the example shown, a first teacher neural network 152T, 152Ta receives the temporal input stream 124 and determines a first teacher output 154Ta that predicts the activity based on the temporal input stream 124. In some examples, the first teacher neural network 152T is trained specifically to predict the activity using the temporal input stream 124. Using the first teacher output 154Ta and student output 154S, the neural network system 150 determines a first distillation loss 158a. Here, the first distillation loss 158a represents the cross-entropy loss between the logit of the student output 154S and the logit of the first teacher output 154Ta. That is, the first distillation loss represents the difference in the predicted activity probability distribution of the student output 154S and the first teacher output 154Ta.
Continuing with the same example, the neural network system 150 includes a second teacher neural network 152T, 152Tb that receives the pose input stream 126 and determines a second teacher output 154Tb that predicts the activity based on the pose input stream 126. The neural network system 150 determines a second distillation loss 158b using the second teacher output 154Tb and the student output 154S. Here, the second distillation loss 158b represents the cross-entropy loss between the logit of the student output 154S and the logit of the second teacher output 154Tb. In the example shown, only two teacher neural networks 152T are considered for the sake of clarity, however, implementations herein may include any number of teacher neural networks 152T. In some implementations, the plurality of teacher neural networks 152T may also include a teacher neural network 152T to process one or more spatial input streams 122, temporal input streams 124, pose input streams 126, and/or any combination thereof. For each of the teacher neural networks 152T the neural network system 150 determines a respective distillation loss 158.
In some implementations, the neural network system 150 uses the classification loss 156 and the distillation losses 158 to train the student neural networks 152S. That is, the distillation losses and the classification loss 156 are combined to create a total loss 159 that is provided as feedback to train the student neural network 152S”)
Lu reads on training the … model, wherein the … model corresponds to a … teacher. Lu does not read on training the first AI model, wherein the first AI model corresponds to a first teacher; training the second AI model, wherein the second AI model corresponds to a second teacher. However, Li discloses the first AI model, wherein the first AI model corresponds to a first teacher, and the second AI model, wherein the second AI model corresponds to a second teacher. By performing Lu’s knowledge distillation training using the multiple teacher AI models of Li, the combination discloses training the first AI model, wherein the first AI model corresponds to a first teacher; training the second AI model, wherein the second AI model corresponds to a second teacher.
It would have been obvious to modify Lu’s distilled model obtained through knowledge distillation of student and teacher models for load balancing in cellular communications systems to train the distilled model through Li’s plurality of teacher models and student model. One would have been motivated to do so because “each teacher neural network 152T may process a single input (e.g., the spatial input stream 122, the temporal input stream 124, and the pose input stream 126) or multiple inputs” (Li [0044]) allowing the student model to be trained by multimodal teacher inputs.
The combination of Lu/Li fails to explicitly disclose but Vandenhende discloses collecting a plurality of trajectories … (Vandenhende [Page 9 Section 3.1.5 Paragraph 2]; “ In MTL, a Pareto optimal solution is found when the following condition is satisfied: the loss for any task can be decreased without increasing the loss on any of the other tasks. A multiple gradient descent algorithm (MGDA) [65] was proposed in [34] to find a Pareto stationary point. In particular, the shared network weights are updated by finding a common direction among the task-specific gradients. As long as there is a common direction along which the task-specific losses can be decreased, we have not reached a Pareto optimal point yet. An advantage of this approach is that since the shared network weights are only updated along common directions of the task-specific gradients, conflicting gradients are avoided in the weight update step” wherein the task-specific losses determined through updates of the shared network weights reads on validation trajectories associated with a first and second task along with its parameters and respective loss minimization policies)
Vandenhende reads on collecting a plurality of trajectories. Vandenhende does not read on collecting a plurality of trajectories using the first teacher and the second teacher. However, Lu/Li discloses a first and a second teacher. By performing Vandenhende’s trajectory collection using the first and second teachers of Lu/Li, the combination of Lu/Li/Vandenhende discloses collecting a plurality of trajectories using the first teacher and the second teacher.
It would have been obvious to modify Lu/Li’s distilled model, obtained through knowledge distillation of student and teacher models for load balancing in cellular communications systems, to incorporate Vandenhende’s collection of loss functions representing a plurality of trajectories. One would have been motivated to do so because “each task’s influence on the network weight update can be controlled, either indirectly by adapting the task-specific weights wi in the loss, or directly by operating on the task-specific gradients ∂Li/∂Wsh.” (Vandenhende [Page 8 Section 3.1 Paragraph 3]).
The combination of Lu/Li/Vandenhende fails to explicitly disclose but Bellegarda discloses training the distilled policy to match state-dependent action probability distributions of the first teacher and the second teacher using the distillation loss function (Bellegarda [0245]; “first training model 812 may adjust the language model using knowledge distillation techniques, for example, by calibrating the first portion 804a of the language model to better match probabilities provided by second language model 806 (i.e., training the student model to match the “soft targets” of the teacher model). The amount of calibration may be based in part on the similarity score, such that a higher similarity score may result in a lesser calibration. Similarly, in some examples, adjusting the language model may include determining whether first output distribution 808 corresponds to an output distribution from a language model of a predetermined type (e.g., an output distribution from a teacher model), and adjusting the language model based on the determination. In particular, whether the language model is calibrated (or the amount of the calibration) may be based in part on whether first output distribution 808 corresponds to, or otherwise appears to be, an output distribution from a teacher model, for example” wherein training of the model through a similarity score (distillation loss function) representative of the difference between the student and teacher model probability densities is performed)
It would have been obvious to modify Lu/Li/Vandenhende’s distilled model obtained through knowledge distillation of student and multiple teacher models to perform Bellegarda’s method of training the distilled model to reflect the probability distributions of its teacher models. One would have been motivated to do so to allow the student model “to better match the reconstructed inputs provided by second language model [teacher model]” (Bellegarda [0245]).
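For context, the multi-teacher distillation arrangement described above, training a student to match the probability distributions of more than one teacher, can be sketched in brief pseudocode. The function names and the equal-weight blending of the two teachers are illustrative assumptions only; they are not drawn from Lu, Li, Vandenhende, or Bellegarda:

```python
import math

def softmax(logits):
    """Convert raw logits to a probability distribution."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

def multi_teacher_target(teacher1_logits, teacher2_logits, w1=0.5, w2=0.5):
    """Blend two teachers' action distributions into one soft target."""
    p1 = softmax(teacher1_logits)
    p2 = softmax(teacher2_logits)
    return [w1 * a + w2 * b for a, b in zip(p1, p2)]

def distillation_loss(student_logits, target_probs):
    """Cross-entropy between the blended teacher target and the
    student's predicted distribution; minimizing it trains the student
    to match the teachers' soft targets."""
    q = softmax(student_logits)
    return -sum(p * math.log(qi) for p, qi in zip(target_probs, q))
```

Minimizing `distillation_loss` over training data drives the student distribution toward the blended teacher distribution, which is the general mechanism the cited Bellegarda passage describes.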
Claims 7 and 9; 16 and 18 are rejected under 35 U.S.C. 103 as being unpatentable over Lu et al. (US20220343175A1, hereinafter “Lu”) in view of Haidar et al. (US20220335303A1, hereinafter “Haidar”).
Regarding Claim 7,
Lu teaches the method of Claim 5 (and thus the rejection of Claim 5 is incorporated). Lu fails to explicitly disclose but Haidar discloses wherein the distillation loss function expresses a Kullback- Leibler (KL) divergence loss (Haidar [0078] “FIG. 6 shows example sub-operations of operation 514 of the method 500 of FIG. 5. At 602, the student inference data 34 (i.e. the student predicted logits 306) and teacher inference data 24 (i.e. the teacher predicted logits 310) for each labeled training data sample x in the training batch 302 are processed by a knowledge distillation loss module 216 to compute a knowledge distillation loss 314 for the training batch 302 (denoted herein as X). The KD loss 314 between the student predicted logits 306 and the teacher predicted logits 310 may be defined based on Kullback-Leibler (KL) divergence as:
[media_image3.png: equation defining the KD loss L.sub.KD based on KL divergence]
wherein, as described above, τ is a temperature parameter that controls the concentration level of the distribution, softmax(.) is used to compute the predicted probability over classes, S.sub.θ(X) denotes the student predicted logits 306, and T(X) denotes the teacher predicted logits 310. For a regression task, as described above, L.sub.KD may be computed by using a mean-squared-error (MSE) loss function on the teacher predicted logits 310 and student predicted logits 306.”)
It would have been obvious for Lu’s distilled AI student and teacher models to use in part Haidar’s Kullback-Leibler (KL) divergence loss for its distillation loss function. One would have been motivated to do so because “training data is augmented by maximizing the output margin between the teacher and the student (i.e. the difference between the teacher inference data 24 and student inference data 34) using a Kullback-Leibler (KL) divergence loss applied to the logit representation outputs of the teacher and student (also referred to as a knowledge distillation loss)” (Haidar [0018]).
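The exact form of Haidar's KD loss appears only as an image (media_image3.png); the following is a conventional formulation of a temperature-scaled KL divergence distillation loss consistent with the quoted description (τ, softmax over logits, τ² scaling). It is a sketch under those assumptions, not a reproduction of Haidar's equation:

```python
import math

def softmax_t(logits, tau):
    """Temperature-scaled softmax; a higher tau softens the distribution."""
    m = max(x / tau for x in logits)
    exps = [math.exp(x / tau - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

def kd_kl_loss(teacher_logits, student_logits, tau=2.0):
    """KL(teacher || student) over temperature-softened distributions,
    scaled by tau^2 as is conventional so gradient magnitudes do not
    vanish as tau grows."""
    p = softmax_t(teacher_logits, tau)
    q = softmax_t(student_logits, tau)
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
    return tau * tau * kl
```

The loss is zero when the student's logits match the teacher's and positive otherwise, which is what drives the student toward the teacher's output distribution.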
Regarding Claim 9,
Lu teaches the method of Claim 5 (and thus the rejection of Claim 5 is incorporated). Lu fails to explicitly disclose but Haidar discloses wherein the distillation loss function expresses a mean-squared error loss (Haidar [0078] “FIG. 6 shows example sub-operations of operation 514 of the method 500 of FIG. 5. At 602, the student inference data 34 (i.e. the student predicted logits 306) and teacher inference data 24 (i.e. the teacher predicted logits 310) for each labeled training data sample x in the training batch 302 are processed by a knowledge distillation loss module 216 to compute a knowledge distillation loss 314 for the training batch 302 (denoted herein as X). The KD loss 314 between the student predicted logits 306 and the teacher predicted logits 310 may be defined based on Kullback-Leibler (KL) divergence as:
[media_image3.png: equation defining the KD loss L.sub.KD based on KL divergence]
wherein, as described above, τ is a temperature parameter that controls the concentration level of the distribution, softmax(.) is used to compute the predicted probability over classes, S.sub.θ(X) denotes the student predicted logits 306, and T(X) denotes the teacher predicted logits 310. For a regression task, as described above, L.sub.KD may be computed by using a mean-squared-error (MSE) loss function on the teacher predicted logits 310 and student predicted logits 306.”)
It would have been obvious for Lu’s distilled AI student and teacher models to use Haidar’s mean-squared error loss for its distillation loss function. One would have been motivated to do so because “It will be appreciated that other loss types, such as mean square error (MSE) loss, may be used in the context of other inference task types such as regression tasks” (Haidar [0069]).
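The MSE variant Haidar describes for regression tasks, applied directly to the teacher and student logits, can be sketched as follows; the function name is an assumption for illustration only:

```python
def kd_mse_loss(teacher_logits, student_logits):
    """Mean-squared error between teacher and student logits, the
    distillation loss form Haidar describes for regression tasks."""
    n = len(teacher_logits)
    return sum((t - s) ** 2 for t, s in zip(teacher_logits, student_logits)) / n
```

Unlike the KL form, this loss needs no softmax or temperature, which is why it is the natural choice when the task is regression rather than classification.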
Claims 16 and 18 recite a server to perform the method of Claims 7 and 9. Thus, Claims 16 and 18 are rejected for reasons set forth in the rejection of Claims 7 and 9.
Claims 8 and 17 are rejected under 35 U.S.C. 103 as being unpatentable over Lu et al. (US20220343175A1, hereinafter “Lu”) in view of Hua et al. (US20210383272A1, hereinafter “Hua”).
Regarding Claim 8,
Lu teaches the method of Claim 5 (and thus the rejection of Claim 5 is incorporated). Lu fails to explicitly disclose but Hua discloses wherein the distillation loss function expresses a negative log likelihood loss (Hua [0086];
[media_image4.png: equation from Hua [0086]]
Hua [0110];
[media_image5.png: equation from Hua [0110]]
wherein the distillation loss is calculated in part through -I(θ), which expresses a negative log-likelihood)
It would have been obvious for Lu’s distilled AI student and teacher models to use in part Hua’s negative log likelihood loss for expression of its distillation loss function. One would have been motivated to do so because “if values of parameters of weight matrix 406 are θ and the performance impacts of parameters are represented by p(X|θ) (the likelihood curves), then the values of respective θ for the parameters w should be around the respective optimal value θ.sub.0 … the model performance is likely maximized/optimized” (Hua [0080]) thus maximizing/optimizing model performance through the usage of negative log likelihoods.
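The negative log likelihood loss Hua discloses (whose equations appear only as images above) can be illustrated generically; this is a standard NLL over a predicted class distribution, offered as a sketch and not as a reproduction of Hua's -I(θ) formulation:

```python
import math

def nll_loss(probs, target_index):
    """Negative log likelihood of the target class under the model's
    predicted probability distribution; minimizing it maximizes the
    likelihood assigned to the correct class."""
    return -math.log(probs[target_index])
```

Minimizing this quantity is equivalent to maximizing the likelihood p(X|θ), consistent with the motivation quoted from Hua [0080].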
Claim 17 recites a server to perform the method of Claim 8. Thus, Claim 17 is rejected for reasons set forth in the rejection of Claim 8.
Response to Arguments
The Examiner acknowledges the Applicant’s amendments to Claims 1, 3, 4, 6, 11, 12, 14 and 20.
Applicant’s arguments filed December 8th, 2025, traversing the rejection of claims 1-20 under 35 U.S.C. § 112 have been fully considered, and are fully persuasive.
Applicant’s arguments filed December 8th, 2025, traversing the rejection of claims 1-20 under 35 U.S.C. § 101 have been fully considered, and are fully persuasive.
Applicant’s arguments filed December 8th, 2025, traversing the rejection of claims 1-20 under 35 U.S.C. § 102(a)(1) have been fully considered, but are moot because the new ground of rejection does not rely on any reference applied in the prior rejection of record for any teaching or matter specifically challenged in the argument.
Applicant’s arguments filed December 8th, 2025, traversing the rejection of claim 2 under 35 U.S.C. § 103 have been fully considered, but are moot because the new ground of rejection does not rely on any reference applied in the prior rejection of record for any teaching or matter specifically challenged in the argument.
Applicant’s arguments filed December 8th, 2025, traversing the rejection of claims 8 and 17 under 35 U.S.C. § 103 have been fully considered, but are not fully persuasive.
Applicant alleges, on page 21 of the Remarks, Hua is not available under § 102(a)(1), because the publication date of Hua (December 9, 2021) is later than the effective filing date of the present application (September 9, 2021, based on provisional application 63/242,417). Applicant further alleges that the present application and Hua were commonly assigned to Samsung Electronics Co., Ltd., at the time the present application was filed (September 6, 2022). Thus, in applicant’s view, Hua is not prior art under § 102(a)(2) due to the exception stated in the law at 35 USC § 102(b)(2)(C).
Examiner respectfully disagrees. Examiner asserts that provisional application 63/242,417 does not explicitly disclose a negative log likelihood. As priority to a provisional application is determined on a claim-by-claim basis, examiner asserts that the element of “distillation loss function expressing a negative log likelihood loss” is not present in the content of provisional application 63/242,417, and thus Claim 8’s effective filing date remains 09/06/2022. Consequently, Hua qualifies as prior art for the rejection of Claims 8 and 17 under 35 U.S.C. § 103.
The rejection of Claim 8 under 35 U.S.C. § 103 has been maintained. Similarly, the rejection of Claim 17 under 35 U.S.C. § 103 has been maintained.
Conclusion
Examiner brings applicant’s attention to the following references which are relevant to aspects of the invention:
Fukuda et al. (US20220180206A1, hereinafter “Fukuda”) – relevant to applying the KPI fast-adaptive AI model to perform load balancing in a cellular communications system.
Applicant’s amendment necessitated the new ground(s) of rejection presented in this Office action. Accordingly, THIS ACTION IS MADE FINAL. See MPEP § 706.07(a). Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a).
A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action. In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any nonprovisional extension fee (37 CFR 1.17(a)) pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action. In no event, however, will the statutory period for reply expire later than SIX MONTHS from the mailing date of this final action.
Any inquiry concerning this communication or earlier communications from the examiner should be directed to JONATHAN J KIM whose telephone number is (571)272-0523. The examiner can normally be reached 8-6.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Matt Ell, can be reached on (571) 270-3264. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.
/JONATHAN J KIM/Examiner, Art Unit 2141
/MATTHEW ELL/Supervisory Patent Examiner, Art Unit 2141