Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Claim Objections
Claims 1, 11, and 20 are objected to because of the following informalities:
Claims 1, 11, and 20 recite:
wherein the soothing factor is adjusted over the plurality of first training phase epochs to reduce a smoothing effect on the generated smoothed TNN model outputs;
The term “soothing” appears in the specification and in each of the independent claims; however, the specification does not describe, define, or otherwise provide written description support for a “soothing” factor or operation. The surrounding disclosure instead consistently discusses loss functions and smoothing-type techniques.
Accordingly, “soothing” will be treated as an apparent typographical error for “smoothing.”
Appropriate correction is required.
Claims 11 and 12 are objected to because of the following informalities:
Claim 11 appears to have been structured in a manner similar to claim 1; however, as presented, claim 11 appears to be missing two limitations that are instead recited in claim 12. This suggests that a formatting error may have occurred when structuring claim 11, resulting in certain limitations being separated into claim 12.
Appropriate correction is required.
Claim Rejections - 35 USC § 101
35 U.S.C. 101 reads as follows:
Whoever invents or discovers any new and useful process, machine, manufacture, or composition of matter, or any new and useful improvement thereof, may obtain a patent therefor, subject to the conditions and requirements of this title.
Claims 1-20 are rejected under 35 U.S.C. 101 because the claimed invention is directed to a judicial exception (i.e., a law of nature, a natural phenomenon, or an abstract idea) without significantly more.
101 Subject Matter Eligibility Analysis
Step 1: Claims 1-20 are within the four statutory categories (a process, machine, manufacture or composition of matter).
Claims 1-10 are directed to a method consisting of a series of steps, meaning that it is directed to the statutory category of process. Claims 11-20 are directed to storage mediums and processors which are machines.
Step 2A Prong One, Step 2A Prong Two, and Step 2B Analysis:
Step 2A Prong One asks if the claim recites a judicial exception (abstract idea, law of nature, or natural phenomenon). If the claim recites a judicial exception, analysis proceeds to Step 2A Prong Two, which asks if the claim recites additional elements that integrate the abstract idea into a practical application. If the claim does not integrate the judicial exception, analysis proceeds to Step 2B, which asks if the claim amounts to significantly more than the judicial exception. If the claim does not amount to significantly more than the judicial exception, the claim is not eligible subject matter under 35 U.S.C. 101.
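The order of the three inquiries described above can be sketched as a short decision function. This is only an illustration of the flow as stated in this Office action, assuming Step 1 has already been satisfied; the function name and argument names are hypothetical and are not taken from the MPEP.

```python
def eligibility_outcome(recites_exception: bool,
                        integrates_practical_application: bool,
                        significantly_more: bool) -> str:
    """Return where the Step 2A / Step 2B analysis ends (Step 1 assumed satisfied)."""
    # Step 2A Prong One: does the claim recite a judicial exception?
    if not recites_exception:
        return "eligible at Step 2A Prong One"
    # Step 2A Prong Two: do additional elements integrate it into a practical application?
    if integrates_practical_application:
        return "eligible at Step 2A Prong Two"
    # Step 2B: does the claim amount to significantly more than the exception?
    if significantly_more:
        return "eligible at Step 2B"
    return "not eligible under 35 U.S.C. 101"
```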
None of the claims represent an improvement to technology.
Regarding claim 1, the following claim elements are abstract ideas:
computing SNN model outputs for the plurality of input data samples (This is an abstract idea of a mathematical concept and mental process. The limitation recites computing model outputs by applying mathematical operations to input data samples according to defined rules or parameters. Such calculations are mathematical in nature and involve applying formulas to inputs to obtain corresponding outputs, which can be carried out in the human mind or with basic computing tools such as a calculator and pen and paper. Accordingly, this limitation falls within the mathematical concepts and mental processes groupings of abstract ideas. See MPEP 2106.04(a)(2)(I) and 2106.04(a)(2)(III).);
applying a smoothing factor to the teacher neural network (TNN) model outputs to generate smoothed TNN model outputs (This is an abstract idea of a mathematical concept and mental process. The limitation recites applying a smoothing factor to model outputs, which involves mathematical calculations such as scaling, averaging, or weighting numerical values according to a defined factor. Such calculations are mathematical in nature and involve applying formulas to numerical outputs, which can be carried out in the human mind or with the aid of basic computing tools such as a calculator and with pen and paper.);
computing a first loss based on the SNN model outputs and the smoothed TNN model outputs (This is an abstract idea of a mathematical concept and a mental process. The limitation recites computing a loss value based on comparing model outputs, which involves mathematical calculations such as differences, distances, or error functions applied to numerical values. Such calculations are mathematical in nature and involve applying formulas to the outputs to obtain a loss value, which can be carried out in the human mind or with the aid of basic computing tools such as a calculator and with pen and paper.); and
computing an updated set of the SNN model parameters with an objective of reducing the first loss in a following first training phase epoch (This is an abstract idea of a mathematical concept and a mental process. The limitation recites computing updated model parameters based on an objective of reducing a loss value, which involves mathematical calculations such as adjusting numerical values according to defined formulas or rules to minimize an error measure. Such calculations are mathematical in nature and involve applying mathematical relationships to update parameters, which can be carried out in the human mind or with the aid of basic computing tools such as a calculator and pen and paper.),
wherein the soothing factor is adjusted over the plurality of first training phase epochs to reduce a smoothing effect on the generated smoothed TNN model outputs (This is an abstract idea of a mathematical concept and mental process. The limitation recites adjusting a numerical factor over training epochs based on its effect on model outputs, which involves mathematical calculations such as modifying a parameter value to change the degree of smoothing applied to numerical outputs. Such calculations are mathematical in nature and involve evaluating and adjusting numerical values according to the desired outcome, which can be carried out in the human mind or with the aid of basic computing tools such as a calculator and pen and paper.);
computing SNN model outputs for the plurality of input data samples from the SNN model (This is an abstract idea of mathematical concepts and a mental process. The limitation recites computing model outputs by applying mathematical operations to input data samples according to defined rules or parameters. Such calculations are mathematical in nature and involve applying formulas to numerical inputs to obtain corresponding outputs, which can be carried out in the human mind or with the aid of basic computing tools such as a calculator and pen and paper.);
computing a second loss based on the SNN model outputs and a set of predefined expected outputs for the plurality of input data samples (This is an abstract idea of a mathematical concept and a mental process. The limitation recites computing a loss value by comparing model outputs to predefined expected outputs, which involves mathematical calculations such as differences, error measures, or distance functions applied to numerical values. Such calculations are mathematical in nature and involve applying formulas to the outputs and expected values to obtain a loss value, which can be carried out in the human mind or with the aid of basic computing tools such as a calculator and with pen and paper.); and
computing an updated set of the SNN model parameters with an objective of reducing the second loss in a following second training phase (This is an abstract idea of a mathematical concept and a mental process. The limitation recites computing updated model parameters based on an objective of reducing a loss value, which involves mathematical calculations such as adjusting numerical parameter values according to defined formulas or rules to minimize an error measure. Such calculations are mathematical in nature and involve applying mathematical relationships to update parameters, which can be carried out in the human mind or with the aid of basic computing tools such as a calculator or with pen and paper.),
selecting a final set of SNN model parameters from the updated sets of SNN model parameters computed during the second training phase (This is an abstract idea of a mental process. The limitation recites selecting one set of parameters from among multiple updated sets based on judgement or comparison. Such selection involves evaluation and decision-making that can be performed in the human mind, for example by reviewing candidate parameters sets and choosing one set according to a criterion. Accordingly, this limitation falls within the mental process grouping of abstract ideas.).
The following claim elements are additional elements which, taken alone or in combination with the other elements, do not integrate the judicial exception into a practical application nor amount to significantly more than the judicial exception:
student neural network (This is a high-level recitation of generic computer components for performing the abstract idea. See MPEP 2106.05.)
(SNN) model that is configured by a set of SNN model parameters to generate outputs in respect of input data samples (This limitation merely recites applying the abstract idea using a generic neural network model and does not add a meaningful limitation. See MPEP 2106.05(f).),
obtaining respective teacher neural network (TNN) model outputs for a plurality of input data samples (The limitation amounts to a generic data operation of obtaining stored outputs, i.e., storing and retrieving information from memory, which has been recognized by the courts as a well-understood, routine, and conventional activity. See MPEP 2106.05(d)(II)(iv).);
performing a first training phase of the SNN model, the first training phase comprising training the SNN model over a plurality of first training phase epochs, each first training phase epoch comprising (This limitation merely recites training a model over multiple iterations, which is an instruction to apply the abstract idea and does not provide meaningful limitation. See MPEP 2106.05(f).)
performing a second training phase of the SNN model, the second training phase comprising initializing the SNN model with a set of SNN model parameters selected from the updated sets of SNN model parameters computed during the first training phase, the second training phase of the SNN model being performed over a plurality of second training phase epochs, each second training phase epoch comprising (This limitation recites initializing and training a model over multiple iterations, which is an instruction to apply the abstract idea and does not provide a meaningful limitation).
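For orientation only, the two training phases recited in claim 1 can be sketched as follows. This is a minimal illustration, not the applicant's implementation. It assumes (none of this is stated in the claim) a tiny softmax student with one weight row per class, that the smoothing factor scales the teacher logits before a softmax, that the factor follows the t / t_max schedule of claim 2, that both losses are cross entropies, that the second phase starts from the last first-phase parameter set, and that the final set is the second-phase set with the lowest second loss. All function and variable names are hypothetical.

```python
import math


def softmax(logits):
    """Convert a list of logits into probabilities."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def student_probs(weights, x):
    """Hypothetical student: one weight row per class, followed by a softmax."""
    return softmax([sum(w * xi for w, xi in zip(row, x)) for row in weights])


def loss(target, probs):
    """Cross entropy of the student probabilities against a target distribution."""
    return -sum(t * math.log(p) for t, p in zip(target, probs) if t > 0)


def run_epoch(weights, inputs, targets, lr):
    """Compute an updated parameter set with the objective of reducing the loss."""
    updated = [row[:] for row in weights]
    for x, target in zip(inputs, targets):
        probs = student_probs(updated, x)
        for c, row in enumerate(updated):
            for i, xi in enumerate(x):
                # Gradient of the cross entropy with respect to weight (c, i).
                row[i] -= lr * (probs[c] - target[c]) * xi
    return updated


def train_student(inputs, teacher_logits, labels, t_max=5, phase2_epochs=5, lr=0.5):
    n_classes = len(teacher_logits[0])
    weights = [[0.0] * len(inputs[0]) for _ in range(n_classes)]

    # First phase: fit smoothed teacher outputs; the smoothing weakens as t grows.
    for t in range(1, t_max + 1):
        factor = t / t_max  # assumed schedule; 1.0 (no smoothing) at t == t_max
        smoothed = [softmax([factor * z for z in row]) for row in teacher_logits]
        weights = run_epoch(weights, inputs, smoothed, lr)

    # Second phase: start from the first-phase parameters, fit the expected outputs.
    expected = [[1.0 if c == y else 0.0 for c in range(n_classes)] for y in labels]
    candidates = []
    for _ in range(phase2_epochs):
        weights = run_epoch(weights, inputs, expected, lr)
        mean_loss = sum(
            loss(e, student_probs(weights, x)) for x, e in zip(inputs, expected)
        ) / len(inputs)
        candidates.append((mean_loss, weights))

    # Final parameters: the second-phase candidate with the lowest second loss.
    return min(candidates, key=lambda pair: pair[0])[1]
```

Calling `train_student` with two one-hot inputs, teacher logits such as `[[4.0, 0.0], [0.0, 4.0]]`, and labels `[0, 1]` returns a weight table whose predictions match the labels.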
Regarding claim 2, the rejection of claim 1 is incorporated herein. Further, claim 2 recites the following abstract ideas:
wherein in each epoch of the first training phase the smoothing factor is computed as smoothing factor $= t / t_{max}$, where $t_{max}$ is a constant and a value of $t$ is incremented in each subsequent first training phase epoch (This is an abstract idea of mathematical concepts. The limitation recites computing a smoothing factor using a mathematical formula that involves division of a variable by a constant and incrementing a variable across epochs. This limitation is directed to a mathematical relationship and calculation, independent of any particular technological implementation.).
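As a brief illustration of the formula recited in claim 2, the sketch below evaluates the factor for each epoch. It assumes that t starts at 1 and is incremented by 1 until it reaches t_max; the claim itself only states that t is incremented in each subsequent epoch. The function name is hypothetical.

```python
def smoothing_factor(t: int, t_max: int) -> float:
    """Smoothing factor t / t_max for epoch t (assumed range 1..t_max)."""
    if not 1 <= t <= t_max:
        raise ValueError("t is assumed to lie between 1 and t_max")
    return t / t_max


# With t_max = 4 the factor rises toward 1 as the epochs advance.
schedule = [smoothing_factor(t, 4) for t in range(1, 5)]
print(schedule)  # [0.25, 0.5, 0.75, 1.0]
```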
Regarding claim 3, the rejection of claim 1 is incorporated herein. Further, claim 3 recites the following abstract ideas:
wherein the first loss corresponds to a divergence between the SNN model outputs and the smoothed TNN model outputs (This is an abstract idea of mathematical concepts. The limitation recites defining a loss as a divergence between two sets of numerical outputs, which is a mathematical relationship used to quantify differences between values. Such divergences are mathematical constructs expressed through formulas and calculations.).
Regarding claim 4, the rejection of claim 3 is incorporated herein. Further, claim 4 recites the following abstract ideas:
wherein the first loss corresponds to a Kullback-Leibler divergence between the SNN model outputs and the smoothed TNN model outputs (This is an abstract idea of mathematical concepts. The limitation recites a Kullback-Leibler divergence, which is a specific mathematical formula used to measure divergence between probability distributions. As such, this limitation is directed to a mathematical relationship and calculation.).
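For reference, the Kullback-Leibler divergence between two discrete probability distributions can be computed as in the sketch below. The direction shown (smoothed teacher distribution as the first argument, student distribution as the second) is an assumption; the claim does not specify the order of the arguments. The function name is hypothetical.

```python
import math


def kl_divergence(p, q):
    """KL(p || q) for discrete distributions; q is assumed strictly positive."""
    # Terms with p_i == 0 contribute nothing, by the usual convention.
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
```

The divergence is zero when both distributions are identical and positive otherwise.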
Regarding claim 5, the rejection of claim 1 is incorporated herein. Further, claim 5 recites the following abstract ideas:
wherein the second loss corresponds to a divergence between the SNN model outputs and the set of predefined expected outputs (This is an abstract idea of mathematical concepts and a mental process. The limitation recites defining a loss as a divergence between two sets of numerical values, which involves mathematical calculations that quantify differences between outputs and expected values. Such calculations are mathematical in nature and involve comparing numerical values according to defined formulas, which can be carried out in the human mind or with the aid of basic computing tools such as a calculator and pen and paper.).
Regarding claim 6, the rejection of claim 5 is incorporated herein. Further, claim 6 recites the following abstract ideas:
wherein the second loss is computed based on a cross entropy loss function (This is an abstract idea of mathematical concepts. The limitation recites computing a loss using a cross entropy loss function, which is a mathematical formula used to quantify differences between predicted values and expected values.).
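For reference, a cross entropy loss between an expected distribution and a predicted distribution can be computed as in the sketch below. It assumes one-hot or probabilistic expected outputs and strictly positive predicted probabilities; the function name is hypothetical.

```python
import math


def cross_entropy(expected, predicted):
    """Cross entropy of predicted probabilities against expected outputs."""
    # Only classes with non-zero expected probability contribute to the sum.
    return -sum(e * math.log(p) for e, p in zip(expected, predicted) if e > 0)
```

With a one-hot expected output, the loss reduces to the negative logarithm of the probability assigned to the expected class.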
Regarding claim 7, the rejection of claim 1 is incorporated herein. Further, claim 7 recites the following abstract ideas:
for each first training phase epoch, determining if the computed updated set of the SNN model parameters improves a performance of the SNN model relative to updated sets of SNN model parameters previously computed during the first training phase in respect of a development dataset that includes a set of development data samples and respective expected outputs (This is an abstract idea of a mental process. The limitation recites evaluating and comparing model performance across different parameter sets to determine whether performance has improved. Such evaluation involves judgement and comparison of results against expected outputs, which can be performed in the human mind. Accordingly, this limitation falls within the mental process grouping of abstract ideas.).
The following claim elements are additional elements which, taken alone or in combination with the other elements, do not integrate the judicial exception into a practical application nor amount to significantly more than the judicial exception:
when the computed updated set of the SNN model parameters does improve the performance, update the SNN model parameters to the computed updated set of the SNN model parameters prior to a next first training phase epoch (The step of “updating the SNN model parameters” merely recites updating stored parameter values based on a determination and amounts to an instruction to apply the abstract idea, without adding meaningful limitation.)
Regarding claim 8, the rejection of claim 7 is incorporated herein. Further, claim 8 recites the following additional elements, which taken alone or in combination with other elements, do not integrate the judicial exception into a practical application nor amount to significantly more than the judicial exception:
when the computed updated set of the SNN model parameters does improve the performance, update the SNN model parameters to the computed updated set of the SNN model parameters prior to a next first training phase epoch (This limitation merely recites using a particular set of SNN model parameters to initialize a training phase, which amounts to an instruction to apply the abstract idea and does not provide a meaningful limitation.).
Regarding claim 9, the rejection of claim 7 is incorporated herein. Further, claim 9 recites the following abstract ideas:
for each second training phase epoch, determining if the computed updated set of the SNN model parameters improves a performance of the SNN model relative to updated sets of SNN model parameters previously computed during the second training phase in respect of the development dataset (This is an abstract idea of a mental process. The limitation recites determining whether performance has improved by comparing results associated with different parameter sets. Such comparison and evaluation involves judgement based on observed outcomes and expected outputs, which can be performed in the human mind or with the aid of basic computing tools such as a calculator and pen and paper.),
The following claim elements are additional elements which, taken alone or in combination with the other elements, do not integrate the judicial exception into a practical application nor amount to significantly more than the judicial exception:
when the computed updated set of the SNN model parameters does improve the performance, update the SNN model parameters to the computed updated set of the SNN model parameters prior to a next epoch (The step of “updating the SNN model parameters” merely recites updating stored parameter values and amounts to an instruction to apply the abstract idea, without providing a meaningful limitation.)
Regarding claim 10, the rejection of claim 9 is incorporated herein. Further, claim 10 recites the following additional elements, which taken alone or in combination with other elements, do not integrate the judicial exception into a practical application nor amount to significantly more than the judicial exception:
wherein the final set of SNN model is the updated set of SNN model parameters computed during the second training phase that best improves the performance of the SNN model during the second training phase (This limitation merely recites using a particular set of model parameters as final parameters, which amounts to an instruction to apply the abstract idea and does not provide a meaningful limitation.).
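The performance comparison recited in claims 7 through 10 can be pictured as a keep-the-best loop over candidate parameter sets. The sketch below is only an illustration: `keep_best` and `evaluate` are hypothetical names, a higher score is assumed to mean better performance on the development dataset, and the scores in the usage lines are made-up values, not results from the application or the cited art.

```python
def keep_best(candidates, evaluate):
    """Return the candidate with the highest development-set score, and that score."""
    best_params, best_score = None, float("-inf")
    for params in candidates:
        score = evaluate(params)
        # Replace the stored parameters only when performance improves.
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


# Usage with made-up development-set accuracies per epoch.
dev_accuracy = {"epoch1": 0.71, "epoch2": 0.78, "epoch3": 0.75}
best, score = keep_best(["epoch1", "epoch2", "epoch3"], dev_accuracy.get)
print(best, score)  # epoch2 0.78
```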
Regarding claim 11, the following claim elements are abstract ideas:
computing SNN model outputs for the plurality of input data samples (This is an abstract idea of a mathematical concept and mental process. The limitation recites computing model outputs by applying mathematical operations to input data samples according to defined rules or parameters. Such calculations are mathematical in nature and involve applying formulas to inputs to obtain corresponding outputs, which can be carried out in the human mind or with basic computing tools such as a calculator and pen and paper. Accordingly, this limitation falls within the mathematical concepts and mental processes groupings of abstract ideas. See MPEP 2106.04(a)(2)(I) and 2106.04(a)(2)(III).);
applying a smoothing factor to the teacher neural network (TNN) model outputs to generate smoothed TNN model outputs (This is an abstract idea of a mathematical concept and mental process. The limitation recites applying a smoothing factor to model outputs, which involves mathematical calculations such as scaling, averaging, or weighting numerical values according to a defined factor. Such calculations are mathematical in nature and involve applying formulas to numerical outputs, which can be carried out in the human mind or with the aid of basic computing tools such as a calculator and with pen and paper.);
computing a first loss based on the SNN model outputs and the smoothed TNN model outputs (This is an abstract idea of a mathematical concept and a mental process. The limitation recites computing a loss value based on comparing model outputs, which involves mathematical calculations such as differences, distances, or error functions applied to numerical values. Such calculations are mathematical in nature and involve applying formulas to the outputs to obtain a loss value, which can be carried out in the human mind or with the aid of basic computing tools such as a calculator and with pen and paper.); and
computing an updated set of the SNN model parameters with an objective of reducing the first loss in a following first training phase epoch (This is an abstract idea of a mathematical concept and a mental process. The limitation recites computing updated model parameters based on an objective of reducing a loss value, which involves mathematical calculations such as adjusting numerical values according to defined formulas or rules to minimize an error measure. Such calculations are mathematical in nature and involve applying mathematical relationships to update parameters, which can be carried out in the human mind or with the aid of basic computing tools such as a calculator and pen and paper.),
wherein the soothing factor is adjusted over the plurality of first training phase epochs to reduce a smoothing effect on the generated smoothed TNN model outputs (This is an abstract idea of a mathematical concept and mental process. The limitation recites adjusting a numerical factor over training epochs based on its effect on model outputs, which involves mathematical calculations such as modifying a parameter value to change the degree of smoothing applied to numerical outputs. Such calculations are mathematical in nature and involve evaluating and adjusting numerical values according to the desired outcome, which can be carried out in the human mind or with the aid of basic computing tools such as a calculator and pen and paper.);
computing SNN model outputs for the plurality of input data samples from the SNN model (This is an abstract idea of mathematical concepts and a mental process. The limitation recites computing model outputs by applying mathematical operations to input data samples according to defined rules or parameters. Such calculations are mathematical in nature and involve applying formulas to numerical inputs to obtain corresponding outputs, which can be carried out in the human mind or with the aid of basic computing tools such as a calculator and pen and paper.);
computing a second loss based on the SNN model outputs and a set of predefined expected outputs for the plurality of input data samples (This is an abstract idea of a mathematical concept and a mental process. The limitation recites computing a loss value by comparing model outputs to predefined expected outputs, which involves mathematical calculations such as differences, error measures, or distance functions applied to numerical values. Such calculations are mathematical in nature and involve applying formulas to the outputs and expected values to obtain a loss value, which can be carried out in the human mind or with the aid of basic computing tools such as a calculator and with pen and paper.); and
The following claim elements are additional elements which, taken alone or in combination with the other elements, do not integrate the judicial exception into a practical application nor amount to significantly more than the judicial exception:
student neural network (This is a high-level recitation of generic computer components for performing the abstract idea. See MPEP 2106.05.)
one or more processers and a non-transitory storage medium storing software instructions that (This is a high-level recitation of generic computer components for performing the abstract idea. See MPEP 2106.05.),
obtaining respective teacher neural network (TNN) model outputs for a plurality of input data samples (The limitation amounts to a generic data operation of obtaining stored outputs, i.e., storing and retrieving information from memory, which has been recognized by the courts as a well-understood, routine, and conventional activity. See MPEP 2106.05(d)(II)(iv).);
performing a first training phase of the SNN model, the first training phase comprising training the SNN model over a plurality of first training phase epochs, each first training phase epoch comprising (This limitation merely recites training a model over multiple iterations, which is an instruction to apply the abstract idea and does not provide meaningful limitation. See MPEP 2106.05(f).)
performing a second training phase of the SNN model, the second training phase comprising initializing the SNN model with a set of SNN model parameters selected from the updated sets of SNN model parameters computed during the first training phase, the second training phase of the SNN model being performed over a plurality of second training phase epochs, each second training phase epoch comprising (This limitation recites initializing and training a model over multiple iterations, which is an instruction to apply the abstract idea and does not provide a meaningful limitation).
Regarding claim 12, the rejection of claim 11 is incorporated herein. Claim 12 further recites the following abstract ideas:
computing an updated set of the SNN model parameters with an objective of reducing the second loss in a following second training phase (This is an abstract idea of a mathematical concept and a mental process. The limitation recites computing updated model parameters based on an objective of reducing a loss value, which involves mathematical calculations such as adjusting numerical parameter values according to defined formulas or rules to minimize an error measure. Such calculations are mathematical in nature and involve applying mathematical relationships to update parameters, which can be carried out in the human mind or with the aid of basic computing tools such as a calculator or with pen and paper.),
selecting a final set of SNN model parameters from the updated sets of SNN model parameters computed during the second training phase (This is an abstract idea of a mental process. The limitation recites selecting one set of parameters from among multiple updated sets based on judgement or comparison. Such selection involves evaluation and decision-making that can be performed in the human mind, for example by reviewing candidate parameters sets and choosing one set according to a criterion. Accordingly, this limitation falls within the mental process grouping of abstract ideas.).
wherein in each epoch of the first training phase the smoothing factor is computed as smoothing factor $= t / t_{max}$, where $t_{max}$ is a constant and a value of $t$ is incremented in each subsequent first training phase epoch (This is an abstract idea of mathematical concepts. The limitation recites computing a smoothing factor using a mathematical formula that involves division of a variable by a constant and incrementing a variable across epochs. This limitation is directed to a mathematical relationship and calculation, independent of any particular technological implementation.).
Regarding claim 13, the rejection of claim 11 is incorporated herein. The claim recites similar limitations corresponding to claim 3. Therefore, the same subject matter analysis that was utilized for claim 3, as described above, is equally applicable to claim 13.
Regarding claim 14, the rejection of claim 13 is incorporated herein. The claim recites similar limitations corresponding to claim 4. Therefore, the same subject matter analysis that was utilized for claim 4, as described above, is equally applicable to claim 14.
Regarding claim 15, the rejection of claim 11 is incorporated herein. The claim recites similar limitations corresponding to claim 5. Therefore, the same subject matter analysis that was utilized for claim 5, as described above, is equally applicable to claim 15.
Regarding claim 16, the rejection of claim 15 is incorporated herein. The claim recites similar limitations corresponding to claim 6. Therefore, the same subject matter analysis that was utilized for claim 6, as described above, is equally applicable to claim 16.
Regarding claim 17, the rejection of claim 11 is incorporated herein. The claim recites similar limitations corresponding to claim 7. Therefore, the same subject matter analysis that was utilized for claim 7, as described above, is equally applicable to claim 17.
Regarding claim 18, the rejection of claim 17 is incorporated herein. The claim recites similar limitations corresponding to claim 8. Therefore, the same subject matter analysis that was utilized for claim 8, as described above, is equally applicable to claim 18.
Regarding claim 19, the rejection of claim 17 is incorporated herein. Claim 19 further recites the following abstract ideas:
for each second training phase epoch, determining if the computed updated set of the SNN model parameters improves a performance of the SNN model relative to updated sets of SNN model parameters previously computed during the second training phase in respect of the development dataset (This is an abstract idea of a mental process. The limitation recites determining whether performance has improved by comparing results associated with different parameter sets. Such comparison and evaluation involves judgement based on observed outcomes and expected outputs, which can be performed in the human mind or with the aid of basic computing tools such as a calculator and pen and paper.),
The following claim elements are additional elements which, taken alone or in combination with the other elements, do not integrate the judicial exception into a practical application nor amount to significantly more than the judicial exception:
when the computed updated set of the SNN model parameters does improve the performance, update the SNN model parameters to the computed updated set of the SNN model parameters prior to a next epoch (The step of “updating the SNN model parameters” merely recites updating stored parameter values and amounts to an instruction to apply the abstract idea, without providing a meaningful limitation.)
wherein the final set of SNN model is the updated set of SNN model parameters computed during the second training phase that best improves the performance of the SNN model during the second training phase (This limitation merely recites using a particular set of model parameters as final parameters, which amounts to an instruction to apply the abstract idea and does not provide a meaningful limitation.).
Regarding claim 20, claim 20 recites similar method steps corresponding to claim 1, implemented in the form of a non-transitory computer-readable medium storing instructions that, when executed by one or more processors, perform the recited steps. Accordingly, the same subject matter analysis applied to claim 1, as described above, is equally applicable to claim 20, and claim 20 is rejected for similar reasons. The recited non-transitory computer-readable medium and one or more processors merely constitute generic computer components for carrying out the recited method steps and do not amount to anything significantly more.
Claim Rejections - 35 USC § 103
In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis (i.e., changing from AIA to pre-AIA) for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.
Claims 1-20 are rejected under 35 U.S.C. 103 as being unpatentable over Yuan et al. (NPL: “Revisit Knowledge Distillation: A Teacher-Free Framework” (Published: July 2020)) in view of Kim (Pub. No.: US 20200125927 A1 (Filed: 2018)).
Regarding claim 1, Yuan teaches the following limitations:
A method of training a student neural network (SNN) model that is configured by a set of SNN model parameters to generate outputs in respect of input data samples (Yuan, [Abstract] “Knowledge Distillation (KD) aims to distill the knowledge of a cumbersome teacher model into a lightweight student model.” And [Introduction] “We therefore argue that the similarity information between categories cannot fully explain the dark knowledge in KD, and the soft targets from the teacher model indeed provide effective regularization for the student model, which are equally or even more important.” – training a student model by using soft targets from the teacher model to regularize the student model.), comprising
obtaining respective teacher neural network (TNN) model outputs for a plurality of input data samples (Yuan, [Introduction] “the dark knowledge in KD, and the soft targets from the teacher model indeed provide effective regularization for the student model, which are equally or even more important” – because soft targets are the outputs generated by the teacher model for training inputs. Using soft targets necessarily requires obtaining the teacher model’s outputs corresponding to those input data samples so they can be applied during training of the student model.);
performing a first training phase of the SNN model, the first training phase comprising training the SNN model over a plurality of first training phase epochs, each first training phase epoch comprising: computing SNN model outputs for the plurality of input data samples (Yuan, [page 5, section 3] “Given a neural network $S$ to train, we first give loss function of LSR for $S$. For each training example $x$, $S$ outputs the probability of each label $k \in \{1 \ldots K\}$: $p(k|x) = \mathrm{softmax}(z_k)$” – discloses that the student neural network computes output values for each input data sample by mathematically evaluating a softmax function to produce label probabilities, which directly corresponds to computing SNN model outputs for a plurality of input data samples during training.)
applying a smoothing factor to the teacher neural network (TNN) model outputs to generate smoothed TNN model outputs (Yuan, [page 6, section 3] “the output prediction of the teacher network is $p^t_\tau(k) = \mathrm{softmax}(z^t_k)$ … where $z^t$ is the output logits of the teacher network and $\tau$ is the temperature to soften $p^t(k)$ (written as $p^t_\tau(k)$ after softened).” – teaches applying a smoothing factor to the teacher network outputs, which directly corresponds to generating smoothed TNN model outputs.);
computing a first loss based on the SNN model outputs and the smoothed TNN model outputs (Yuan, [page 6, section 3] “The idea behind knowledge distillation is to let the student (the model S) mimic the teacher by minimizing the cross-entropy loss and KL divergence between the predictions of student and teacher as $L_{KD} = (1-\alpha) H(q, p) + \alpha D_{KL}(p^t_\tau, p_\tau)$” – defines a loss function that is computed using the student model outputs p and the smoothed teacher model outputs, where the smoothing is applied via the temperature parameter $\tau$.); and
computing an updated set of the SNN model parameters with an objective of reducing the first loss in a following first training phase epoch (Yuan, [page 5, section 3] “Given a neural network $S$ to train, we first give loss function of LSR for $S$. For each training example $x$, $S$ outputs the probability of each label $k \in \{1 \ldots K\}$: $p(k|x) = \mathrm{softmax}(z_k)$, where $z_i$ is the logit of the neural network $S$ … The model $S$ can be trained by minimizing the cross-entropy loss: $H(q, p) = -\sum_{k=1}^{K} q(k) \log(p(k))$.” – discloses training the student neural network S by minimizing a defined loss function. It is inherent in minimizing a loss during training that the parameters of the student neural network are updated so that the loss is reduced in subsequent training epochs.),
performing a second training phase of the SNN model, the second training phase comprising initializing the SNN model with a set of SNN model parameters selected from the updated sets of SNN model parameters computed during the first training phase, the second training phase of the SNN model being performed over a plurality of second training phase epochs, each second training phase epoch comprising (Yuan, [page 7, section 4] “Specifically, we first train the student model in the normal way to obtain a pre-trained model, which is then used as the teacher to train itself by transferring soft targets as in Eq. (4).” – discloses that the student neural network is first trained to obtain a pre-trained set of model parameters, and that those learned parameters are then reused to initialize a subsequent training process in which the model is further trained.)
computing SNN model outputs for the plurality of input data samples from the SNN model (Yuan, [page 7, section 4] “where $p$, $p^t_\tau$ are the output probability of $S$ and $S^t$ respectively” – the student model S produces output probabilities p, which are computed by evaluating the SNN on input data samples.);
computing a second loss based on the SNN model outputs and a set of predefined expected outputs for the plurality of input data samples; and computing an updated set of the SNN model parameters with an objective of reducing the second loss in a following second training phase (Yuan, [page 7, section 4] “The loss function of Tf-KD$_{self}$ to train model $S$ is $L_{self} = (1-\alpha) H(q, p) + \alpha D_{KL}(p^t_\tau, p_\tau)$, where $p$, $p^t_\tau$ are the output probability of $S$ and $S^t$ respectively, $\tau$ is the temperature and $\alpha$ is the weight to balance the two terms.” – discloses computing a loss function using p, which represents the outputs of the student neural network, and q, which represents predefined expected outputs (ground-truth labels) in the cross-entropy term $H(q, p)$. Accordingly, this disclosure teaches computing a second loss based on the SNN model outputs and a set of predefined expected outputs for a plurality of input data samples.)
computing an updated set of the SNN model parameters with an objective of reducing the second loss in a following second training phase (Yuan, [page 7, section 4] “then we try to minimize the KL divergence … The loss function of Tf-KD$_{self}$ to train model $S$ is $L_{self} = (1-\alpha) H(q, p) + \alpha D_{KL}(p^t_\tau, p_\tau)$” – discloses training the student neural network by minimizing a defined loss function. Training a neural network by minimizing a loss function inherently requires computing updated model parameters to reduce the loss, since the loss depends on model outputs, which in turn depend on model parameters. Accordingly, the disclosure inherently teaches computing an updated set of SNN model parameters with an objective of reducing the loss in a following training phase epoch.)
selecting a final set of SNN model parameters from the updated sets of SNN model parameters computed during the second training phase (Yuan, [page 8, section 5.1] “Fig. 4 (a) shows the test accuracy of the six models. It can be seen that our Tf-KD$_{self}$ consistently outperforms the baselines. For example, as a powerful model with 34.52M parameters, ResNeXt29 improves itself by 1.05% with self-training (Fig. 4(b)).” – discloses evaluating the student model after completion of the Tf-KD$_{self}$ self-training process. Reporting “test accuracy” values and quantified performance improvements inherently requires that training updates have ceased and that a final set of SNN model parameters is fixed and used for inference on a test dataset. Under BRI, the act of evaluating and reporting a final test accuracy necessarily involves selecting a final set of SNN model parameters from among the updated parameter sets computed during the second training phase. The explicit attribution of these results to “self-training” identifies the selected parameters as being computed during the second training phase.).
However, Yuan does not teach but Yuan in view of Kim teaches the following limitation:
wherein the smoothing factor is adjusted over the plurality of first training phase epochs to reduce a smoothing effect on the generated smoothed TNN model outputs (Yuan, [page 2, introduction] “For KD, by combining the teacher’s soft targets with the one-hot ground truth label, we find that KD is a learned LSR where the smoothing distribution of KD is from a teacher model but the smoothing distribution of LSR is manually designed”; Kim, paragraph [0012] “The determining of the loss function may include determining the loss function by applying a first factor to the error rate between the recognition result of the teacher model and the recognition result of the student model, wherein the first factor may be controlled so that a contribution of the teacher model to training of the student model decreases in response to an increase in a training epoch of the student model.” – Yuan teaches that knowledge distillation is a form of learned label smoothing in which the smoothing distribution is derived from teacher model outputs. Kim teaches applying a factor to the teacher-student loss term and controlling the factor so that the contribution of the teacher model decreases in response to an increase in training epochs. Under BRI, the claimed “smoothing factor” reads on Kim’s epoch-controlled factor applied to the teacher-derived term. As the teacher’s contribution decreases, the influence of the teacher-derived smoothing distribution necessarily decreases, thereby reducing the smoothing effect on the generated smoothed outputs.);
Accordingly, it would have been obvious to a person of ordinary skill in the art, before the effective filing date of the claimed invention, having Yuan and Kim before them, to adjust a factor applied to teacher model outputs during knowledge distillation training such that the contribution of the teacher model decreases over training epochs. Yuan teaches that knowledge distillation corresponds to a form of label smoothing in which a teacher model and its outputs provide a smoothing distribution for training a student model. Kim teaches applying a factor to a loss associated with teacher-student outputs and controlling that factor so that the contribution rate of the teacher model to training the student model is decreased as training progresses, particularly when a low error rate indicates that the student model has sufficiently learned from the teacher. One would have been motivated to make such a combination in order to reduce reliance on teacher-provided outputs once the student model is sufficiently trained, thereby allowing the student model to transition toward learning from ground-truth targets and improving completion and convergence of the training process (Kim, paragraph [0070]).
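For illustration only, the loss quoted from Yuan in the claim 1 analysis, $L_{KD} = (1-\alpha) H(q, p) + \alpha D_{KL}(p^t_\tau, p_\tau)$, can be sketched in Python as follows. This is a minimal sketch and not Yuan's, Kim's, or Applicant's implementation. The function names are hypothetical, and dividing the logits by $\tau$ inside the softmax is assumed here as the conventional form of temperature softening, because the quoted passage elides the full expression.

```python
import math


def softmax(logits, tau=1.0):
    """Temperature-scaled softmax; a larger tau gives a flatter distribution."""
    scaled = [z / tau for z in logits]  # assumed form: logits divided by tau
    m = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(q, p):
    """H(q, p) = -sum_k q(k) log p(k); terms with q(k) = 0 contribute nothing."""
    return -sum(qk * math.log(pk) for qk, pk in zip(q, p) if qk > 0)


def kl_divergence(a, b):
    """D_KL(a, b) = sum_k a(k) log(a(k) / b(k))."""
    return sum(ak * math.log(ak / bk) for ak, bk in zip(a, b) if ak > 0)


def kd_loss(student_logits, teacher_logits, q, alpha, tau):
    """(1 - alpha) * H(q, p) + alpha * D_KL(p_tau^t, p_tau)."""
    p = softmax(student_logits)            # student output p
    p_tau = softmax(student_logits, tau)   # softened student output
    pt_tau = softmax(teacher_logits, tau)  # softened teacher output
    return (1 - alpha) * cross_entropy(q, p) + alpha * kl_divergence(pt_tau, p_tau)
```

In this sketch, raising `tau` flattens the softened teacher distribution, and when the teacher and student logits are identical the KL term vanishes, leaving only the weighted cross-entropy term against the ground-truth labels `q`.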
Regarding claim 2, Yuan in view of Kim teaches all the elements of claim 1, therefore is rejected for the same reasons as those presented for claim 1. Yuan in view of Kim further teaches:
wherein in each epoch of the first training phase the smoothing factor is computed as smoothing factor $= \frac{t}{t_{max}}$, where $t_{max}$ is a constant and a value of $t$ is incremented in each subsequent first training phase epoch (Kim, paragraph [0093] “In operation 740, the model training apparatus determines whether a training epoch t is less than a maximum epoch. For example, when the training epoch t is determined to be less than the maximum epoch, operation 750 is performed.” [0095] “In operation 760, the model training apparatus increments the training epoch t by “1.” Also, the process reverts to operation 730 to train the student model $\theta_S$.” – Kim teaches a training process in which a current training epoch t is incremented by one in each iteration and training proceeds until a predefined maximum epoch is reached. This establishes a bounded, linearly increasing epoch variable. Expressing that linear progression as the ratio of the current epoch to the maximum epoch ($t/t_{max}$) is the standard mathematical implementation of such an epoch-based linear schedule. While Kim teaches computing a factor based on linear epoch progression, Yuan teaches applying such a factor as a smoothing factor to control loss behavior during training.).
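As an illustration of the ratio recited in claim 2, the following Python sketch computes $t/t_{max}$ for each epoch of a bounded training loop. It is a sketch under stated assumptions, not Kim's or Applicant's implementation: the function name `smoothing_factor`, the choice to start counting at $t = 1$, and the example value $t_{max} = 5$ are hypothetical.

```python
def smoothing_factor(t, t_max):
    """Return the ratio t / t_max for the current epoch t and constant t_max."""
    if t_max <= 0:
        raise ValueError("t_max must be positive")
    return t / t_max


# Hypothetical run: t is incremented by one per epoch until t_max is reached.
T_MAX = 5
schedule = [smoothing_factor(t, T_MAX) for t in range(1, T_MAX + 1)]
```

The resulting values increase linearly with the epoch count and reach 1 at the final epoch. How the ratio is then applied to the teacher outputs is not shown here.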
Regarding claim 3, Yuan in view of Kim teaches all the elements of claim 1, therefore is rejected for the same reasons as those presented for claim 1. Yuan in view of Kim further teaches:
wherein the first loss corresponds to a divergence between the SNN model outputs and the smoothed TNN model outputs (Yuan, [page 6, section 3] “For knowledge distillation, the teacher-student learning mechanism is applied to improve the performance of the student… The idea behind knowledge distillation is to let the student (the model S) mimic the teacher by minimizing the cross-entropy loss and KL divergence between the predictions of student and teacher as $L_{KD} = (1-\alpha) H(q, p) + \alpha D_{KL}(p^t_\tau, p_\tau)$ … the teacher network and $\tau$ is the temperature to soften … (written as $p^t_\tau(k)$ after softened).” – Yuan teaches computing Kullback-Leibler divergence between the student model output distribution and the temperature-softened teacher model output distribution.).
Regarding claim 4, Yuan in view of Kim teaches all the elements of claim 3, therefore is rejected for the same reasons as those presented for claim 3. Yuan in view of Kim further teaches:
wherein the first loss corresponds to a Kullback-Leibler divergence between the SNN model outputs and the smoothed TNN model outputs (Yuan, [page 6, section 3] “For knowledge distillation, the teacher-student learning mechanism is applied to improve the performance of the student… The idea behind knowledge distillation is to let the student (the model S) mimic the teacher by minimizing the cross-entropy loss and KL divergence between the predictions of student and teacher as $L_{KD} = (1-\alpha) H(q, p) + \alpha D_{KL}(p^t_\tau, p_\tau)$ … the teacher network and $\tau$ is the temperature to soften … (written as $p^t_\tau(k)$ after softened).” – Yuan teaches computing Kullback-Leibler divergence between the student model output distribution and the temperature-softened teacher model output distribution.).
Regarding claim 5, Yuan in view of Kim teaches all the elements of claim 1, therefore is rejected for the same reasons as those presented for claim 1. Yuan in view of Kim further teaches:
wherein the second loss corresponds to a divergence between the SNN model outputs and the set of predefined expected outputs (Yuan, [page 5, section 3] “The model S can be trained by minimizing the cross-entropy loss: $H(q, p) = -\sum_{k=1}^{K} q(k) \log(p(k))$” [page 12, Appendix A] “where q is the distribution of ground truth, p is the output distribution of student model,” – Yuan teaches computing a cross-entropy loss between student outputs and ground-truth labels, which corresponds to a divergence between the SNN outputs and predefined expected outputs.).
Regarding claim 6, Yuan in view of Kim teaches all the elements of claim 5, therefore is rejected for the same reasons as those presented for claim 5. Yuan in view of Kim further teaches:
wherein the second loss is computed based on a cross entropy loss function (Yuan, [page 5, section 3] “The model S can be trained by minimizing the cross-entropy loss: $H(q, p) = -\sum_{k=1}^{K} q(k) \log(p(k))$” [page 12, Appendix A] “where q is the distribution of ground truth, p is the output distribution of student model,” – Yuan teaches computing a cross-entropy loss between student outputs and ground-truth labels, which corresponds to a divergence between the SNN outputs and predefined expected outputs.).
Regarding claim 7, Yuan in view of Kim teaches all the elements of claim 1, therefore is rejected for the same reasons as those presented for claim 1. Yuan in view of Kim further teaches:
for each first training phase epoch, determining if the computed updated set of the SNN model parameters improves a performance of the SNN model relative to updated sets of SNN model parameters previously computed during the first training phase in respect of a development dataset that includes a set of development data samples and respective expected outputs, and when the computed updated set of the SNN model parameters does improve the performance, update the SNN model parameters to the computed updated set of the SNN model parameters prior to a next first training phase epoch (Yuan, [pages 8-9, section 5.1] Yuan describes training student models over multiple training epochs (“trained for 200 epochs”) and evaluating performance using “test accuracy,” further stating that the authors “report the mean of the best results” obtained during training, rather than merely reporting final-epoch performance. Reporting the “best results” from a multi-epoch training process necessarily entails determining whether a given set of model parameters improves performance relative to parameter sets computed in earlier epochs. Further, under BRI, identifying and reporting a “best” result distinct from the final result necessarily requires updating and retaining the specific SNN model parameters before they are overwritten by a subsequent training epoch. Accordingly, Yuan inherently teaches the conditional determination and parameter update required by the claim.).
Regarding claim 8, Yuan in view of Kim teaches all the elements of claim 7, therefore is rejected for the same reasons as those presented for claim 7. Yuan in view of Kim further teaches:
wherein the set of SNN model parameters used to initialize the SNN model for the second training phase is the updated set of SNN model parameters computed during the first training phase that best improves the performance of the SNN model during the first training phase (Yuan, [pages 8-9, section 5.1] Yuan discloses training a student model over multiple epochs with performance evaluated throughout training, explicitly stating that the authors “report the mean of the best results” obtained during the run. Under BRI, the identification of these “best results” necessarily corresponds to a specific set of model parameters existing at the epoch at which the performance was achieved, as model accuracy and loss are direct functions of the parameter state. Further, Yuan’s selection of the “best results” reflects a technical preference for that parameter configuration over others produced during training. A person of ordinary skill in the art would recognize that identifying the best-performing parameter set serves the technical purpose of using that parameter for subsequent operations, rather than discarding it in favor of an inferior parameter state. In standard machine learning optimization, tracking a “best” model state inherently implies preserving and utilizing that state for subsequent phases, such as validation, fine-tuning, or deployment. Accordingly, Yuan inherently teaches initializing subsequent use of the model with the set of parameters computed during the first training phase that best improves model performance, thereby meeting the claim limitation.).
Regarding claim 9, Yuan in view of Kim teaches all the elements of claim 7, therefore is rejected for the same reasons as those presented for claim 7. Yuan in view of Kim further teaches:
for each second training phase epoch, determining if the computed updated set of the SNN model parameters improves a performance of the SNN model relative to updated sets of SNN model parameters previously computed during the second training phase in respect of the development dataset, and when the computed updated set of the SNN model parameters does improve the performance, update the SNN model parameters to the computed updated set of the SNN model parameters prior to a next epoch (Yuan, [pages 8-9, section 5.1] Yuan discloses training a student model over multiple epochs with performance evaluated throughout training, explicitly stating that the authors “report the mean of the best results” obtained during the run. To report the “best results,” Yuan inherently performs a comparison operation throughout training (e.g., at evaluation intervals such as epochs). Specifically, Yuan must determine whether the parameter set from a current epoch achieves improved performance (e.g., higher accuracy) relative to parameter sets computed in earlier epochs. Identifying and reporting the “best” result among many training iterations necessarily relies on this determination of whether a computed updated set improves performance relative to previous sets. Further, upon determining that a current parameter set improves performance over the previous best, Yuan inherently updates the stored designation of the SNN model parameters to reflect this new optimal state prior to proceeding to the next epoch. Under BRI, “updating the SNN parameters” encompasses updating the parameter configuration retained as the “best model” or “current optimal state.” Yuan’s disclosure of tracking and reporting the best-performing iteration confirms that this conditional update process occurs, whereby the system maintains and updates an optimal parameter state when, and only when, improved performance is achieved. 
Accordingly, Yuan discloses evaluating model performance during training and conditionally updating the designated optimal parameter set when improvement occurs, thereby meeting the limitation.).
Regarding claim 10, Yuan in view of Kim teaches all the elements of claim 9, therefore is rejected for the same reasons as those presented for claim 9. Yuan in view of Kim further teaches:
wherein the final set of SNN model is the updated set of SNN model parameters computed during the second training phase that best improves the performance of the SNN model during the second training phase (Yuan, [pages 8-9, section 5.1] Yuan discloses training the student model over multiple epochs with performance evaluated throughout training, explicitly stating that the authors “report the mean of the best results” obtained during the run. This disclosure describes a protocol in which the system identifies the specific iteration where performance was maximized, rather than defaulting to the final chronological epoch. Under BRI, the claimed “parameters computed during the second training phase that best improves the performance” corresponds to the specific parameter set associated with these reported “best results.” Because a model’s performance result is a direct function of its parameter state at the moment, Yuan’s identification of the best result indicates that the system distinguishes the parameter set that achieved optimal performance from other sets computed during the phase. Therefore, by selecting and reporting the “best results,” Yuan teaches a process where the relevant parameters are those that yielded the highest performance improvement during the phase, consistent with the claim limitation.).
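The conditional update discussed for claims 7-10 (retain a parameter set only when it improves measured performance, then use the best retained set) can be illustrated with a short Python sketch. This is a hypothetical illustration and is not code disclosed by Yuan or Applicant: `track_best`, the `evaluate` callback, and the dictionary representation of a parameter set are assumptions made for the example.

```python
import copy


def track_best(epoch_params, evaluate):
    """Return (best_params, best_score) over a sequence of per-epoch parameter sets.

    `evaluate` is a hypothetical callback that scores a parameter set on a
    held-out development dataset; a higher score means better performance.
    """
    best_params, best_score = None, float("-inf")
    for params in epoch_params:
        score = evaluate(params)
        # Update the retained parameters only when performance improves.
        if score > best_score:
            best_params, best_score = copy.deepcopy(params), score
    return best_params, best_score


# Hypothetical usage: three epochs, scored by closeness of "w" to 0.5.
history = [{"w": 0.1}, {"w": 0.5}, {"w": 0.3}]
best, score = track_best(history, lambda p: 1 - abs(p["w"] - 0.5))
```

In this example the second parameter set is retained as the best, even though a later epoch produced a different set. The same retained set could then initialize a subsequent training phase or serve as the final set.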
Regarding claim 11, Yuan teaches the following limitations:
obtaining respective teacher neural network (TNN) model outputs for a plurality of input data samples (Yuan, [Introduction] “the dark knowledge in KD, and the soft targets from the teacher model indeed provide effective regularization for the student model, which are equally or even more important” – because soft targets are the outputs generated by the teacher model for training inputs. Using soft targets necessarily requires obtaining the teacher model’s outputs corresponding to those input data samples so they can be applied during training of the student model.);
performing a first training phase of the SNN model, the first training phase comprising training the SNN model over a plurality of first training phase epochs, each first training phase epoch comprising: computing SNN model outputs for the plurality of input data samples (Yuan, [page 5, section 3] “Given a neural network $S$ to train, we first give loss function of LSR for $S$. For each training example $x$, $S$ outputs the probability of each label $k \in \{1 \ldots K\}$: $p(k|x) = \mathrm{softmax}(z_k)$” – discloses that the student neural network computes output values for each input data sample by mathematically evaluating a softmax function to produce label probabilities, which directly corresponds to computing SNN model outputs for a plurality of input data samples during training.)
applying a smoothing factor to the teacher neural network (TNN) model outputs to generate smoothed TNN model outputs (Yuan, [page 6, section 3] “the output prediction of the teacher network is $p^t_\tau(k) = \mathrm{softmax}(z^t_k)$ … where $z^t$ is the output logits of the teacher network and $\tau$ is the temperature to soften $p^t(k)$ (written as $p^t_\tau(k)$ after softened).” – teaches applying a smoothing factor to the teacher network outputs, which directly corresponds to generating smoothed TNN model outputs.);
computing a first loss based on the SNN model outputs and the smoothed TNN model outputs (Yuan, [page 6, section 3] “The idea behind knowledge distillation is to let the student (the model S) mimic the teacher by minimizing the cross-entropy loss and KL divergence between the predictions of student and teacher as $L_{KD} = (1-\alpha) H(q, p) + \alpha D_{KL}(p^t_\tau, p_\tau)$” – defines a loss function that is computed using the student model outputs p and the smoothed teacher model outputs, where the smoothing is applied via the temperature parameter $\tau$.); and
computing an updated set of the SNN model parameters with an objective of reducing the first loss in a following first training phase epoch (Yuan, [page 5, section 3] “Given a neural network $S$ to train, we first give loss function of LSR for $S$. For each training example $x$, $S$ outputs the probability of each label $k \in \{1 \ldots K\}$: $p(k|x) = \mathrm{softmax}(z_k)$, where $z_i$ is the logit of the neural network $S$ … The model $S$ can be trained by minimizing the cross-entropy loss: $H(q, p) = -\sum_{k=1}^{K} q(k) \log(p(k))$.” – discloses training the student neural network S by minimizing a defined loss function. It is inherent in minimizing a loss during training that the parameters of the student neural network are updated so that the loss is reduced in subsequent training epochs.),
performing a second training phase of the SNN model, the second training phase comprising initializing the SNN model with a set of SNN model parameters selected from the updated sets of SNN model parameters computed during the first training phase, the second training phase of the SNN model being performed over a plurality of second training phase epochs, each second training phase epoch comprising (Yuan, [page 7, section 4] “Specifically, we first train the student model in the normal way to obtain a pre-trained model, which is then used as the teacher to train itself by transferring soft targets as in Eq. (4).” – discloses that the student neural network is first trained to obtain a pre-trained set of model parameters, and that those learned parameters are then reused to initialize a subsequent training process in which the model is further trained.)
computing SNN model outputs for the plurality of input data samples from the SNN model (Yuan, [page 7, section 4] “where $p$, $p^t_\tau$ are the output probability of $S$ and $S^t$ respectively” – the student model S produces output probabilities p, which are computed by evaluating the SNN on input data samples.);
computing a second loss based on the SNN model outputs and a set of predefined expected outputs for the plurality of input data samples; and computing an updated set of the SNN model parameters with an objective of reducing the second loss in a following second training phase (Yuan, [page 7, section 4] “The loss function of Tf-KD$_{self}$ to train model $S$ is $L_{self} = (1-\alpha) H(q, p) + \alpha D_{KL}(p^t_\tau, p_\tau)$, where $p$, $p^t_\tau$ are the output probability of $S$ and $S^t$ respectively, $\tau$ is the temperature and $\alpha$ is the weight to balance the two terms.” – discloses computing a loss function using p, which represents the outputs of the student neural network, and q, which represents predefined expected outputs (ground-truth labels) in the cross-entropy term $H(q, p)$. Accordingly, this disclosure teaches computing a second loss based on the SNN model outputs and a set of predefined expected outputs for a plurality of input data samples.)
However, Yuan does not teach but Yuan in view of Kim teaches the following limitation:
A system for training a student neural network model, comprising one or more processers and a non-transitory storage medium storing software instructions that, when executed by the one or more processors, configure the system to perform a method comprising (Kim, paragraph [0114] “ Referring to FIG. 10, the model training apparatus 1000 includes a processor 1010 and a memory 1020. The model training apparatus 1000 is an apparatus configured to train a student model for a data recognition, and is implemented as, for example, a single processor or multi-processor.” [0131] “The instructions or software to control computing hardware… Examples of a non-transitory computer-readable storage medium include read-only memory (ROM), random-access programmable read only memory (PROM)..”):
wherein the smoothing factor is adjusted over the plurality of first training phase epochs to reduce a smoothing effect on the generated smoothed TNN model outputs (Yuan, [page 2, introduction] “For KD, by combining the teacher’s soft targets with the one-hot ground truth label, we find that KD is a learned LSR where the smoothing distribution of KD is from a teacher model but the smoothing distribution of LSR is manually designed”; Kim, paragraph [0012] “The determining of the loss function may include determining the loss function by applying a first factor to the error rate between the recognition result of the teacher model and the recognition result of the student model, wherein the first factor may be controlled so that a contribution of the teacher model to training of the student model decreases in response to an increase in a training epoch of the student model.” – Yuan teaches that knowledge distillation is a form of learned label smoothing in which the smoothing distribution is derived from teacher model outputs. Kim teaches applying a factor to the teacher-student loss term and controlling the factor so that the contribution of the teacher model decreases in response to an increase in training epochs. Under BRI, the claimed “smoothing factor” reads on Kim’s epoch-controlled factor applied to the teacher-derived term. As the teacher’s contribution decreases, the influence of the teacher-derived smoothing distribution necessarily decreases, thereby reducing the smoothing effect on the generated smoothed outputs.);
Accordingly, it would have been obvious to a person of ordinary skill in the art, before the effective filing date of the claimed invention, having Yuan and Kim before them, to adjust a factor applied to teacher model outputs during knowledge distillation training such that the contribution of the teacher model decreases over training epochs. Yuan teaches that knowledge distillation corresponds to a form of label smoothing in which a teacher model and its outputs provide a smoothing distribution for training a student model. Kim teaches applying a factor to a loss associated with teacher-student outputs and controlling that factor so that the contribution rate of the teacher model to training the student model is decreased as training progresses, particularly when a low error rate indicates that the student model has sufficiently learned from the teacher. One would have been motivated to make such a combination in order to reduce reliance on teacher-provided outputs once the student model is sufficiently trained, thereby allowing the student model to transition toward learning from ground-truth targets and improving completion and convergence of the training process (Kim, paragraph [0070]).
Regarding claim 12, Yuan in view of Kim teaches all the elements of claim 11; claim 12 is therefore rejected for the same reasons as those presented for claim 11. Yuan in view of Kim further teaches:
computing an updated set of the SNN model parameters with an objective of reducing the second loss in a following second training phase (Yuan, [page 7, section 4] “then we try to minimize the KL divergence…The loss function of $\text{Tf-KD}_{self}$ to train model $S$ is $L_{self} = (1-\alpha)H(q,p) + \alpha D_{KL}(p^t_\tau, p_\tau)$” – discloses training the student neural network by minimizing a defined loss function. Training a neural network by minimizing a loss function inherently requires computing updated model parameters to reduce the loss, since the loss depends on model outputs, which in turn depend on model parameters. Accordingly, the disclosure inherently teaches computing an updated set of SNN model parameters with an objective of reducing the loss in a following training phase epoch.)
selecting a final set of SNN model parameters from the updated sets of SNN model parameters computed during the second training phase (Yuan, [page 8, section 5.1] “Fig. 4 (a) shows the test accuracy of the six models. It can be seen that our $\text{Tf-KD}_{self}$ consistently outperforms the baselines. For example, as a powerful model with 34.52M parameters, ResNeXt29 improves itself by 1.05% with self-training (Fig. 4(b)).” – discloses evaluating the student model after completion of the $\text{Tf-KD}_{self}$ self-training process. Reporting “test accuracy” values and quantified performance improvements inherently requires that training updates have ceased and that a final set of SNN model parameters is fixed and used for inference on a test dataset. Under BRI, the act of evaluating and reporting a final test accuracy necessarily involves selecting a final set of SNN model parameters from among the updated parameter sets computed during the second training phase. The explicit attribution of these results to “self-training” identifies the selected parameters as being computed during the second training phase.).
wherein in each epoch of the first training phase the smoothing factor is computed as smoothing factor $= t / t_{max}$, where $t_{max}$ is a constant and a value of $t$ is incremented in each subsequent first training phase epoch (Kim, paragraph [0093] “In operation 740, the model training apparatus determines whether a training epoch t is less than a maximum epoch. For example, when the training epoch t is determined to be less than the maximum epoch, operation 750 is performed.” [0095] “In operation 760, the model training apparatus increments the training epoch t by “1.” Also, the process reverts to operation 730 to train the student model $\theta_S$.” – Kim teaches a training process in which a current training epoch t is incremented by one in each iteration and training proceeds until a predefined maximum epoch is reached. This establishes a bounded, linearly increasing epoch variable. Expressing that linear progression as the ratio of the current epoch to the maximum epoch ($t / t_{max}$) is the standard mathematical implementation of such an epoch-based linear schedule. While Kim teaches computing a factor based on linear epoch progression, Yuan teaches applying such a factor as a smoothing factor to control loss behavior during training.).
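The ratio of the current epoch to the maximum epoch discussed above can be written out as a short Python sketch. The function name `smoothing_factor`, the choice to start counting at epoch 1, and the value `t_max = 4` are assumptions for illustration only; neither reference nor the claim language quoted here specifies them.

```python
def smoothing_factor(t: int, t_max: int) -> float:
    """Linear schedule: current epoch t divided by the constant t_max."""
    if t_max <= 0:
        raise ValueError("t_max must be positive")
    return t / t_max


t_max = 4  # assumed constant
# t is incremented by one in each subsequent epoch, as in Kim's operation 760.
schedule = [smoothing_factor(t, t_max) for t in range(1, t_max + 1)]
print(schedule)  # [0.25, 0.5, 0.75, 1.0]
```

The schedule is bounded and increases by a fixed step of 1/t_max per epoch, which is the property the rejection relies on.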
Regarding claim 13, Yuan in view of Kim teaches all the elements of claim 11; claim 13 is therefore rejected for the same reasons as those presented for claim 11. The claim recites limitations similar to those of claim 3 and is rejected for similar reasons as claim 3 using similar teachings and rationale.
Regarding claim 14, Yuan in view of Kim teaches all the elements of claim 13; claim 14 is therefore rejected for the same reasons as those presented for claim 13. The claim recites limitations similar to those of claim 4 and is rejected for similar reasons as claim 4 using similar teachings and rationale.
Regarding claim 15, Yuan in view of Kim teaches all the elements of claim 11; claim 15 is therefore rejected for the same reasons as those presented for claim 11. The claim recites limitations similar to those of claim 5 and is rejected for similar reasons as claim 5 using similar teachings and rationale.
Regarding claim 16, Yuan in view of Kim teaches all the elements of claim 15; claim 16 is therefore rejected for the same reasons as those presented for claim 15. The claim recites limitations similar to those of claim 6 and is rejected for similar reasons as claim 6 using similar teachings and rationale.
Regarding claim 17, Yuan in view of Kim teaches all the elements of claim 11; claim 17 is therefore rejected for the same reasons as those presented for claim 11. The claim recites limitations similar to those of claim 7 and is rejected for similar reasons as claim 7 using similar teachings and rationale.
Regarding claim 18, Yuan in view of Kim teaches all the elements of claim 17; claim 18 is therefore rejected for the same reasons as those presented for claim 17. The claim recites limitations similar to those of claim 8 and is rejected for similar reasons as claim 8 using similar teachings and rationale.
Regarding claim 19, Yuan in view of Kim teaches all the elements of claim 17; claim 19 is therefore rejected for the same reasons as those presented for claim 17. Yuan in view of Kim further teaches:
for each second training phase epoch, determining if the computed updated set of the SNN model parameters improves a performance of the SNN model relative to updated sets of SNN model parameters previously computed during the second training phase in respect of the development dataset, and when the computed updated set of the SNN model parameters does improve the performance, update the SNN model parameters to the computed updated set of the SNN model parameters prior to a next epoch, and wherein the final set of SNN model is the updated set of SNN model parameters computed during the second training phase that best improves the performance of the SNN model during the second training phase (Yuan, [page 8-9] Yuan discloses training a student model over multiple epochs with performance evaluated throughout training, explicitly stating that the authors “report the mean of the best results” obtained during the run. To identify and report the “best results,” Yuan inherently compares model performance across epochs during the second training phase and determines whether a parameter set from a current epoch improves performance relative to parameter sets computed in earlier epochs. When improved performance is identified, Yuan inherently updates the designated SNN model parameter state to reflect this new optimal configuration prior to proceeding to the next epoch. Under the BRI, maintaining and updating the “best model” state constitutes updating the SNN model parameters when performance improves. Further, because Yuan reports the “best results” obtained during training rather than results from a particular epoch, the final set of SNN model parameters corresponds to the updated parameter configuration associated with the highest performance achieved during the second training phase. Under BRI, this parameter configuration constitutes the final set of SNN model parameters. 
Accordingly, Yuan discloses per-epoch performance comparison, conditional updating of SNN model parameters upon performance improvement, and selection of the final model as the parameter set that best improves performance during the second training phase, thereby meeting the claim limitation.).
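The per-epoch comparison and best-model selection described for claim 19 can be sketched as follows. This is a minimal illustration under stated assumptions, not code from Yuan: the function name `select_best_parameters`, the representation of a parameter set as a list of floats, and the scoring function standing in for performance on a development dataset are all hypothetical.

```python
from typing import Callable, List, Tuple


def select_best_parameters(
    candidates: List[List[float]],
    evaluate: Callable[[List[float]], float],
) -> Tuple[List[float], float]:
    """Walk through per-epoch parameter sets and keep the one with the
    highest development-set score seen so far."""
    if not candidates:
        raise ValueError("at least one candidate parameter set is required")
    best_params = candidates[0]
    best_score = evaluate(best_params)
    for params in candidates[1:]:
        score = evaluate(params)
        # Update the retained parameters only when performance improves.
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


# Hypothetical per-epoch parameter sets and a made-up score that peaks at 1.0.
epoch_params = [[0.0], [0.8], [1.5]]
dev_score = lambda params: -(params[0] - 1.0) ** 2
best, score = select_best_parameters(epoch_params, dev_score)
print(best)  # [0.8]
```

The retained parameter set after the last epoch is the final set, matching the selection step recited in the claim.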
Regarding claim 20, Yuan teaches the following limitations:
obtaining respective teacher neural network (TNN) model outputs for a plurality of input data samples (Yuan, [Introduction] “the dark knowledge in KD, and the soft targets from the teacher model indeed provide effective regularization for the student model, which are equally or even more important” – soft targets are the outputs generated by the teacher model for training inputs. Using soft targets therefore necessarily requires obtaining the teacher model’s outputs corresponding to those input data samples so they can be applied during training of the student model.);
performing a first training phase of the SNN model, the first training phase comprising training the SNN model over a plurality of first training phase epochs, each first training phase epoch comprising: computing SNN model outputs for the plurality of input data samples (Yuan, [page 5, section 3] “Given a neural network $S$ to train, we first give loss function of LSR for $S$. For each training example $x$, $S$ outputs the probability of each label $k$ …: $p(k|x) = \mathrm{softmax}(z_k)$” – discloses that the student neural network computes output values for each input data sample by mathematically evaluating a softmax function to produce label probabilities, which directly corresponds to computing SNN model outputs for a plurality of input data samples during training.)
applying a smoothing factor to the teacher neural network (TNN) model outputs to generate smoothed TNN model outputs (Yuan, [page 6, section 3] “the output prediction of the teacher network is $p^t_\tau(k) = \mathrm{softmax}(z^t_k)$ … where $z^t$ is the output logits of the teacher network and $\tau$ is the temperature to soften $p^t(k)$ (written as $p^t_\tau(k)$ after softened).” – teaches applying a smoothing factor to the teacher network outputs, which directly corresponds to generating smoothed TNN model outputs.);
computing a first loss based on the SNN model outputs and the smoothed TNN model outputs (Yuan, [page 6, section 3] “The idea behind knowledge distillation is to let the student (the model S) mimic the teacher by minimizing the cross-entropy loss and KL divergence between the predictions of student and teacher as $L_{KD} = (1-\alpha)H(q,p) + \alpha D_{KL}(p^t_\tau, p_\tau)$” – defines a loss function that is computed using the student model outputs $p$ and the smoothed teacher model outputs, where the smoothing is applied via the temperature parameter $\tau$.); and
computing an updated set of the SNN model parameters with an objective of reducing the first loss in a following first training phase epoch (Yuan, [page 5, section 3] “Given a neural network $S$ to train, we first give loss function of LSR for $S$. For each training example $x$, $S$ outputs the probability of each label $k \in \{1 \ldots K\}$: $p(k|x) = \mathrm{softmax}(z_k)$, where $z_i$ is the logit of the neural network $S$ …The model $S$ can be trained by minimizing the cross-entropy loss: $H(q,p) = -\sum_{k=1}^{K} q(k) \log p(k)$.” – discloses training the student neural network S by minimizing a defined loss function. It is inherent in minimizing a loss during training that the parameters of the student neural network are updated so that the loss is reduced in subsequent training epochs.),
performing a second training phase of the SNN model, the second training phase comprising initializing the SNN model with a set of SNN model parameters selected from the updated sets of SNN model parameters computed during the first training phase, the second training phase of the SNN model being performed over a plurality of second training phase epochs, each second training phase epoch comprising (Yuan, [page 7, section 4] “Specifically, we first train the student model in the normal way to obtain a pre-trained model, which is then used as the teacher to train itself by transferring soft targets as in Eq. (4).” – discloses that the student neural network is first trained to obtain a pre-trained set of model parameters, and that those learned parameters are then reused to initialize a subsequent training process in which the model is further trained.)
computing SNN model outputs for the plurality of input data samples from the SNN model (Yuan, [page 7, section 4] “where $p$, $p^t_\tau$ are the output probability of $S$ and $S^t$ respectively” – the student model S produces output probabilities p, which are computed by evaluating the SNN on input data samples.);
computing a second loss based on the SNN model outputs and a set of predefined expected outputs for the plurality of input data samples; and computing an updated set of the SNN model parameters with an objective of reducing the second loss in a following second training phase (Yuan, [page 7, section 4] “The loss function of $\text{Tf-KD}_{self}$ to train model $S$ is $L_{self} = (1-\alpha)H(q,p) + \alpha D_{KL}(p^t_\tau, p_\tau)$, where $p$, $p^t_\tau$ are the output probability of $S$ and $S^t$ respectively, $\tau$ is the temperature and $\alpha$ is the weight to balance the two terms.” – discloses computing a loss function using p, which represents the outputs of the student neural network, and q, which represents predefined expected outputs (ground-truth labels) in the cross-entropy term $H(q,p)$. Accordingly, this disclosure teaches computing a second loss based on the SNN model outputs and a set of predefined expected outputs for a plurality of input data samples.)
computing an updated set of the SNN model parameters with an objective of reducing the second loss in a following second training phase (Yuan, [page 7, section 4] “then we try to minimize the KL divergence…The loss function of $\text{Tf-KD}_{self}$ to train model $S$ is $L_{self} = (1-\alpha)H(q,p) + \alpha D_{KL}(p^t_\tau, p_\tau)$” – discloses training the student neural network by minimizing a defined loss function. Training a neural network by minimizing a loss function inherently requires computing updated model parameters to reduce the loss, since the loss depends on model outputs, which in turn depend on model parameters. Accordingly, the disclosure inherently teaches computing an updated set of SNN model parameters with an objective of reducing the loss in a following training phase epoch.)
selecting a final set of SNN model parameters from the updated sets of SNN model parameters computed during the second training phase (Yuan, [page 8, section 5.1] “Fig. 4 (a) shows the test accuracy of the six models. It can be seen that our $\text{Tf-KD}_{self}$ consistently outperforms the baselines. For example, as a powerful model with 34.52M parameters, ResNeXt29 improves itself by 1.05% with self-training (Fig. 4(b)).” – discloses evaluating the student model after completion of the $\text{Tf-KD}_{self}$ self-training process. Reporting “test accuracy” values and quantified performance improvements inherently requires that training updates have ceased and that a final set of SNN model parameters is fixed and used for inference on a test dataset. Under BRI, the act of evaluating and reporting a final test accuracy necessarily involves selecting a final set of SNN model parameters from among the updated parameter sets computed during the second training phase. The explicit attribution of these results to “self-training” identifies the selected parameters as being computed during the second training phase.).
However, Yuan does not teach, but Yuan in view of Kim teaches, the following limitations:
A non-transitory computer readable medium storing software instructions that, when executed by the one or more processors, configure the one or more processors to perform a method comprising (Kim, paragraph [0131] “The instructions or software to control computing hardware, for example, one or more processors or computers, to implement the hardware components and perform the methods as described above, and any associated data, data files, and data structures, may be recorded, stored, or fixed in or on one or more non-transitory computer-readable storage media. Examples of a non-transitory computer-readable storage medium include read-only memory (ROM), random-access programmable read only memory (PROM)…”):
wherein the smoothing factor is adjusted over the plurality of first training phase epochs to reduce a smoothing effect on the generated smoothed TNN model outputs (Yuan, [page 2, introduction] “For KD, by combining the teacher’s soft targets with the one-hot ground truth label, we find that KD is a learned LSR where the smoothing distribution of KD is from a teacher model but the smoothing distribution of LSR is manually designed”; Kim, paragraph [0012] “The determining of the loss function may include determining the loss function by applying a first factor to the error rate between the recognition result of the teacher model and the recognition result of the student model, wherein the first factor may be controlled so that a contribution of the teacher model to training of the student model decreases in response to an increase in a training epoch of the student model.” – Yuan teaches that knowledge distillation is a form of learned label smoothing in which the smoothing distribution is derived from teacher model outputs. Kim teaches applying a factor to the teacher-student loss term and controlling the factor so that the contribution of the teacher model decreases in response to an increase in training epochs. Under BRI, the claimed “smoothing factor” reads on Kim’s epoch-controlled factor applied to the teacher-derived term. As the teacher’s contribution decreases, the influence of the teacher-derived smoothing distribution necessarily decreases, thereby reducing the smoothing effect on the generated smoothed outputs.);
Accordingly, it would have been obvious to a person of ordinary skill in the art, before the effective filing date of the claimed invention, having Yuan and Kim before them, to adjust a factor applied to teacher model outputs during knowledge distillation training such that the contribution of the teacher model decreases over training epochs. Yuan teaches that knowledge distillation corresponds to a form of label smoothing in which a teacher model and its outputs provide a smoothing distribution for training a student model. Kim teaches applying a factor to a loss associated with teacher-student outputs and controlling that factor so that the contribution rate of the teacher model to training the student model is decreased as training progresses, particularly when a low error rate indicates that the student model has sufficiently learned from the teacher. One would have been motivated to make such a combination in order to reduce reliance on teacher-provided outputs once the student model is sufficiently trained, thereby allowing the student model to transition toward learning from ground-truth targets and promoting completion and convergence of the training process (Kim, paragraph [0070]).
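For reference, the temperature-softened outputs and the distillation loss cited from Yuan in the claim 20 mapping can be sketched in plain Python. This is an illustrative sketch only: the function names, the made-up logits, and the values of the weight alpha and temperature tau are assumptions, and the code follows the loss form as quoted in this action, (1 − α)·H(q, p) + α·D_KL(p^t_τ, p_τ), rather than any implementation by the authors.

```python
import math
from typing import List


def softmax(logits: List[float], tau: float = 1.0) -> List[float]:
    """Softmax of logits divided by temperature tau; larger tau flattens the output."""
    if tau <= 0:
        raise ValueError("tau must be positive")
    scaled = [z / tau for z in logits]
    m = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(q: List[float], p: List[float]) -> float:
    """H(q, p) = -sum_k q(k) log p(k), skipping terms where q(k) is zero."""
    return -sum(qk * math.log(pk) for qk, pk in zip(q, p) if qk > 0)


def kl_divergence(a: List[float], b: List[float]) -> float:
    """D_KL(a, b) = sum_k a(k) log(a(k) / b(k))."""
    return sum(ak * math.log(ak / bk) for ak, bk in zip(a, b) if ak > 0)


def kd_loss(student_logits: List[float], teacher_logits: List[float],
            q: List[float], alpha: float, tau: float) -> float:
    """(1 - alpha) * H(q, p) + alpha * D_KL(p_t_tau, p_tau)."""
    p = softmax(student_logits)               # student output, temperature 1
    p_tau = softmax(student_logits, tau)      # softened student output
    p_t_tau = softmax(teacher_logits, tau)    # softened (smoothed) teacher output
    return (1 - alpha) * cross_entropy(q, p) + alpha * kl_divergence(p_t_tau, p_tau)


# Made-up logits and a one-hot ground-truth label for a three-class example.
student, teacher, label = [1.0, 0.5, 0.0], [3.0, 1.0, 0.0], [1.0, 0.0, 0.0]
print(softmax(teacher, 1.0), softmax(teacher, 4.0))
print(kd_loss(student, teacher, label, alpha=0.5, tau=4.0))
```

Lowering alpha over training epochs in this sketch would shift the loss from the teacher-derived KL term toward the ground-truth cross-entropy term, which is the behavior the combination with Kim is relied upon to teach.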
Conclusion
Any inquiry concerning this communication or earlier communications from the examiner should be directed to Daravanh Phakousonh whose telephone number is (571)272-6324. The examiner can normally be reached Monday through Thursday, 7 AM to 5 PM, and every other Friday, 7 AM to 4 PM.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Li B Zhen can be reached at 571-272-3768. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.
/Daravanh Phakousonh/Examiner, Art Unit 2121
/James D. Rutten/Primary Examiner, Art Unit 2121