Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Continued Examination Under 37 CFR 1.114
A request for continued examination under 37 CFR 1.114, including the fee set forth in 37 CFR 1.17(e), was filed in this application after final rejection. Since this application is eligible for continued examination under 37 CFR 1.114, and the fee set forth in 37 CFR 1.17(e) has been timely paid, the finality of the previous Office action has been withdrawn pursuant to 37 CFR 1.114. Applicant's submission filed on February 26, 2026 has been entered.
Remarks
This Office Action is in response to applicant’s amendment filed on February 26, 2026, under which claims 1-3, 5-9 and 11-15 are pending and under consideration.
Response to Arguments
Applicant’s amendments have overcome the previous prior art rejections. However, upon further consideration, new grounds of rejection have been made in this action. Applicant’s arguments directed to the prior art, as well as its arguments directed to the § 112(f) interpretation, are addressed in detail below.
§ 112(f) Interpretation
In regards to the means-plus-function interpretation of “input device,” applicant argues:
The BRI in view of the specification makes it clear, for example, that the "input device" should be understood as structure. The specification describes a wide variety of potential input devices which are broader than the specific example of the microphone or camera that the Office Action indicates from [0042].
…
It is submitted that it is clear to skilled person that an input device would include any suitable device by which second processing system input data could be generated, including microphones, camera, vibration sensors, pressure sensors, temperature sensors, motion sensors, and so forth. Specifically, the input device is not a generic placeholder for structure but is instead a structural component arranged to capture data which is to be processed in the second processing system, and the specific input device used in any given implementation is dependent on the type of data which the previously-trained neural network (PTNN) and student neural network (SNN) are configured to process.
…
For at least the foregoing reasons it is respectfully asserted that claim 15 should not be interpreted under 35 USC 112(f) and it is hereby requested that this interpretation be withdrawn.
(Applicant’s response, pages 9-10).
Applicant’s arguments are not persuasive for the following reasons. The fact that a term (here, “input device”) broadly covers different types of structures disclosed in the specification does not mean that the term is limited to those specific structures for purposes of determining whether a term invokes § 112(f). To the contrary, means-plus-function terms can ordinarily cover a large number of structures described in the specification.
The specification merely provides various examples of “input device,” but does not set forth a special definition for this term, nor does the specification limit “device” to those examples mentioned in applicant’s response, or even sensors in general. For example, paragraph [0083] refers to “input devices configured to sense or receive,” where the term “sense or receive” clearly indicates that “input device” is not limited to sensors.
Applicant’s response suggests that “input device” should have a narrower definition that requires a certain specific structure, such as a hardware sensor. However, as discussed above, neither the claim nor the specification limits “input device” to a sensor, nor does the claim require the input device to have any specific physical structure. In general, the term “device” itself is not limited to physical hardware devices but instead is a nonce word (see MPEP § 2181) that can cover software devices, and the limitation of “in communication with” includes software communications as well as hardware communications.
Furthermore, the fact that the “input device” is used with a neural network does not narrow the input device to a specific structure, since a neural network can accept data generically as input, and does not require any specific means for generating such data.
Therefore, the term “input device” is still interpreted as a means-plus-function limitation under § 112(f). If the applicant wishes to avoid means-plus-function interpretation, the applicant could delete this term or use different claim language.
Prior Art Rejections
Applicant argues:
As noted in the Office Action with respect to claim 4, Li fails to disclose that the first processing system and the second processing system are separate, distinct, systems. The claims, as amended, specify that the second processing system is different to, and remote from, the first processing system. This is in line with the specific examples described in the application in which the first processing system is a server-based processing system, while the second processing system is implemented in smaller mobile devices, such as personal computers or the like.
(Applicant’s response, page 11).
To address the new limitations that cover the concepts discussed above, the Examiner now applies Sharifi et al. (US 20230036764 A1). Therefore, applicant’s arguments are moot under the new ground of rejection.
In regards to Hinton, which is still cited in the new grounds of rejection, applicant argues:
With regards to the rejection of claim 4, the Office Action argues that the use of different values of a temperature parameter when training the previously-trained neural network (PTNN) and when optimizing, or training, the student neural network (SNN) is disclosed in Hinton. Applicant respectfully disagrees.
As cited in the Office Action, Hinton discusses the use of a special high temperature when performing training for model distillation. However, nothing in Hinton teaches or suggests that a temperature value used when training a cumbersome, larger, model is lower than a temperature value used when training a smaller, distilled, model.
While Hinton does appear to discuss that a temperature value used when training the distilled model is higher than a regular temperature value, used during inference, Hinton does not disclose that a temperature value used to train the cumbersome model is smaller than a temperature value used to train the distilled model. In fact, Hinton does not appear to discuss the temperature value used to train the cumbersome model at all. The only distinction that Hinton appears to discuss with regards to the use of higher or lower temperature values is that higher values are used when training the distilled model, and lower values may be used after training, in other words when using the distilled model for inference.
See for example, section 2, paragraph 2 of Hinton:
"In the simplest form of distillation, knowledge is transferred to the distilled model by training it on a transfer set and using a soft target distribution for each case in the transfer set that is produced by using the cumbersome model with a high temperature in its softmax. The same high temperature is used when training the distilled model, but after it has been trained it uses a temperature of 1." (Emphasis added).
As such, the claimed relationship between the temperature value used when training a previously-trained neural network (PTNN) and the temperature value used when optimizing a student neural network (SNN), is not disclosed in Hinton, or indeed in any of the cited documents. Applicant respectfully submits that this specific relationship between the two values of the temperature parameter would not be obvious to the person skilled in the art.
(Applicant’s response, page 11).
In response, the Examiner has cited a new reference, Aggarwal, to address the temperature limitations of the claim.
For clarity of record and to explain the Examiner’s reasoning, the Examiner notes that although applicant is correct in stating that Hinton does not explicitly teach the training process for the teacher model, the Examiner considers the temperature of 1 to be the implied normal condition for training the teacher model. It is also well known in the art that the standard softmax function is:
q_i = exp(z_i) / Σ_j exp(z_j)
Hinton’s distillation process uses a special formulation of the softmax function where the temperature T is an explicit parameter:
q_i = exp(z_i / T) / Σ_j exp(z_j / T)
This formula reduces to the standard softmax function when T = 1.
When training the student model in knowledge distillation, T is set to a high value greater than 1. However, unlike the student model, the teacher model is trained normally. Thus, in the absence of this special circumstance of using T > 1, it is readily understood that T in a normal training process (such as the training of the teacher model) would typically be set to 1.
Therefore, the Examiner submits that by teaching the special case of T > 1 for the student model, Hinton also reasonably suggests a lower temperature for the teacher. However, since applicant raised the issue of whether Hinton’s teachings alone are sufficient, the Examiner has cited a new reference, Aggarwal, to explicitly teach training with the standard softmax (i.e., where T = 1).
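For clarity of record, the effect of the temperature parameter discussed above can be illustrated with a short numerical sketch (illustrative only; the function and variable names below are the Examiner’s own and do not appear in any cited reference). At T = 1 the formula reduces to the standard softmax, and a higher T lowers the maximum class probability, i.e., the classification confidence:

```python
import math

def softmax_with_temperature(logits, T=1.0):
    """Temperature-scaled softmax: q_i = exp(z_i / T) / sum_j exp(z_j / T)."""
    exps = [math.exp(z / T) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

logits = [3.0, 1.0, 0.2]

# T = 1 reduces to the standard softmax (the implied normal training condition).
standard = softmax_with_temperature(logits, T=1.0)

# T > 1 produces a "softer" distribution: the maximum probability
# (the classification confidence) is lower than at T = 1.
softened = softmax_with_temperature(logits, T=5.0)
```

As the sketch shows, leaving T at its default of 1 during ordinary training (as for the teacher model) yields the standard softmax, while the elevated T used during distillation lowers the confidence of the resulting distribution.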
Claim Objections
Claims 1 and 15 are objected to because of the following informalities:
Claim 1 should be amended as shown below in order to make it clear (see the requirement of 37 CFR 1.71(a) for “full, clear, concise, and exact terms”) that the terms in the body refer to the same items recited in the preamble:
A computer-implemented method of optimising a student neural network (SNN), based on a previously-trained neural network (PTNN) trained on first data (FD) using a first processing system (FPS), the method comprising:
obtaining, from [[a]]the first processing system (FPS), at a second processing system (SPS), [[a]]the previously-trained neural network (PTNN), where the previously-trained neural network (PTNN) was trained using the first data (FD) by the first processing system (FPS) using a first value of a temperature parameter, the temperature parameter controlling a classification confidence of the previously-trained neural network (PTNN), wherein the second processing system (SPS) is different to and remote from the first processing system (FPS)…
Note that the applicant uses “the” for the first instance of the term “student neural network (SNN)” in the 3rd paragraph of the claim. Therefore, the article “the” should also be used at the locations shown above for other terms. Otherwise, the claim is unclear as to whether the terms in the preamble and the terms in the first paragraph are the same or different elements.
Claim 15 should be amended as follows for the same reasons given above in regards to the corresponding language of claim 1 (i.e., to avoid confusion of whether the identical terms in the preamble and body of the claim are the same):
obtain, from the first processing system (FPS), [[a]]the previously-trained neural network (PTNN), where the previously-trained neural network (PTNN) was trained using the first data (FD) by the first processing system (FPS) using a first value of a temperature parameter
Appropriate correction is required. For purposes of examination, the claims have been interpreted in the manner of the suggested revision.
Claim Interpretation
The following is a quotation of 35 U.S.C. 112(f):
(f) Element in Claim for a Combination. – An element in a claim for a combination may be expressed as a means or step for performing a specified function without the recital of structure, material, or acts in support thereof, and such claim shall be construed to cover the corresponding structure, material, or acts described in the specification and equivalents thereof.
The following is a quotation of pre-AIA 35 U.S.C. 112, sixth paragraph:
An element in a claim for a combination may be expressed as a means or step for performing a specified function without the recital of structure, material, or acts in support thereof, and such claim shall be construed to cover the corresponding structure, material, or acts described in the specification and equivalents thereof.
The claims in this application are given their broadest reasonable interpretation using the plain meaning of the claim language in light of the specification as it would be understood by one of ordinary skill in the art. The broadest reasonable interpretation of a claim element (also commonly referred to as a claim limitation) is limited by the description in the specification when 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph, is invoked.
As explained in MPEP § 2181, subsection I, claim limitations that meet the following three-prong test will be interpreted under 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph:
(A) the claim limitation uses the term “means” or “step” or a term used as a substitute for “means” that is a generic placeholder (also called a nonce term or a non-structural term having no specific structural meaning) for performing the claimed function;
(B) the term “means” or “step” or the generic placeholder is modified by functional language, typically, but not always linked by the transition word “for” (e.g., “means for”) or another linking word or phrase, such as “configured to” or “so that”; and
(C) the term “means” or “step” or the generic placeholder is not modified by sufficient structure, material, or acts for performing the claimed function.
Use of the word “means” (or “step”) in a claim with functional language creates a rebuttable presumption that the claim limitation is to be treated in accordance with 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph. The presumption that the claim limitation is interpreted under 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph, is rebutted when the claim limitation recites sufficient structure, material, or acts to entirely perform the recited function.
Absence of the word “means” (or “step”) in a claim creates a rebuttable presumption that the claim limitation is not to be treated in accordance with 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph. The presumption that the claim limitation is not interpreted under 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph, is rebutted when the claim limitation recites function without reciting sufficient structure, material or acts to entirely perform the recited function.
Claim limitations in this application that use the word “means” (or “step”) are being interpreted under 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph, except as otherwise indicated in an Office action. Conversely, claim limitations in this application that do not use the word “means” (or “step”) are not being interpreted under 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph, except as otherwise indicated in an Office action.
This application includes one or more claim limitations that do not use the word “means,” but are nonetheless being interpreted under 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph, because the claim limitation(s) uses a generic placeholder that is coupled with functional language without reciting sufficient structure to perform the recited function and the generic placeholder is not preceded by a structural modifier.
Claim 15 recites the following limitation that invokes § 112(f):
“an input device in communication with the second processing system…generate second processing system input data (SPSID) from the input device”
Here, “device” is considered to be a generic placeholder, while “in communication with the second processing system…generate second processing system input data (SPSID) from” is functional language coupled to this term.
Because this/these claim limitation(s) is/are being interpreted under 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph, it/they is/are being interpreted to cover the corresponding structure described in the specification as performing the claimed function, and equivalents thereof.
Support is found in paragraph [0042], which teaches examples of a microphone and a camera as input devices. Paragraph [0082] lists further examples (“input devices configured to sense or receive other types of data, including optical, vibration, pressure, temperature, motion is also contemplated”).
If applicant does not intend to have this/these limitation(s) interpreted under 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph, applicant may: (1) amend the claim limitation(s) to avoid it/them being interpreted under 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph (e.g., by reciting sufficient structure to perform the claimed function); or (2) present a sufficient showing that the claim limitation(s) recite(s) sufficient structure to perform the claimed function so as to avoid it/them being interpreted under 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph.
Claim Rejections - 35 USC § 103
In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis (i.e., changing from AIA to pre-AIA) for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.
1. Claims 1, 3, and 12-15 are rejected under 35 U.S.C. 103 as being unpatentable over Li et al. (US 2016/0078339 A1) (“Li”) in view of Sharifi et al. (US 20230036764 A1) (“Sharifi”), Aggarwal, Neural Networks and Deep Learning, Springer 2018, pp. 1-17 and 108-118 (“Aggarwal”), and Hinton et al., “Distilling the Knowledge in a Neural Network,” arXiv:1503.02531v1 [stat.ML] 9 Mar 2015 (“Hinton”).
Note: Sharifi is newly applied but was made of record in the previous action.
As to claim 1, Li teaches a computer-implemented method of optimising a student neural network (SNN), based on a previously-trained neural network (PTNN) trained on first data (FD) […], [[0030]: “initialization component 124 receives from accessing component 122 a fully trained teacher DNN of size NT, which is already trained according to techniques known by one skilled in the art.” [0050]: “the determined teacher DNN is initialized (which may be performed using initialization component 124 of FIG. 1) and trained (which may be performed using training component 126 of FIG. 1)….In one embodiment where the teacher DNN is trained in step 510, labeled or transcribed data may be used according to techniques known in the art of DNN model training.” That is, the teacher DNN corresponds to a “previously-trained neural network (PTNN)”, and is trained using component 126 of the DNN model generator 120 shown in FIG. 1 (see also [0027]).] the method comprising:
obtaining […] at a second processing system (SPS), a previously-trained neural network (PTNN), where the previously trained neural network (PTNN) was trained using first data (FD) […]
generating, at the second processing system (SPS) [[0020]: “a system architecture suitable for implementing an embodiment of the invention and designated generally as system 100.” [0021]: “It should be understood that any number of data sources, storage components or data stores, client devices and DNN model generators may be employed within the system 100 within the scope of the present invention. Each may comprise a single device or multiple devices cooperating in a distributed environment. For instance, the DNN model generator 120 may be provided via multiple computing devices or components arranged in a distributed environment that collectively provide the functionality described herein.” The instant claim language does not require any structural relationship or structural distinction between the first and second processing systems, nor does it state that the two systems are separate or different from one another. Furthermore, system 100 in Li may be regarded as having a second processing system in the form of components that are used to perform operations of the student model.], second processing system input data (SPSID) from an input device of the second processing system; [Li, [0043]: “In particular, for each iteration, a small piece of unlabeled (or un-transcribed) data 310 is provided to both student DNN 301 and teacher DNN 302.” In regards to “an input device,” [0022] teaches: “the unlabeled data in data source(s) 108 is provided by one or more deployment-feedback loops, as described above. For example, usage data from spoken search queries performed on search engines may be provided as un-transcribed data.” The data source is connected to a network.
See [0021]: “Among other components not shown, system 100 includes network 110 communicatively coupled to one or more data source(s) 108, storage 106, client devices 102 and 104, and DNN model generator 120… Network 110 may include, without limitation, one or more local area networks (LANs) and/or wide area networks (WANs). Such networking environments are commonplace in offices, enterprise-wide computer networks, intranets, and the Internet…” The types of networks described here imply a hardware interface for communications. Thus, the data sources 108 and the network 110 are collectively understood to have a communications interface (input device) that is in communication with the model generator 120, such that the input data is obtained “from” such an interface. Alternatively or additionally, this limitation is also met by an input device on (and in communication with) the client device. See [0023]: “the client device is capable of receiving input data such as audio and image information usable by a DNN system described herein that is operating in the device. For example the client device may have a microphone or line-in for receiving audio information, a camera for receiving video or image information, or a communication component (e.g. Wi-Fi functionality) for receiving such information from another source, such as the Internet or a data source 108.” Note that from [0022] the content in data source is from deployment feedback, i.e., from a client device of the deployment, and that the client device is part of the system.]
using the second processing system (SPS) to generate second processing system output data (SPSOD) in response to inputting the second processing system input data (SPSID) to a student neural network; [Li, [0043]: “In particular, for each iteration, a small piece of unlabeled (or un-transcribed) data 310 is provided to both student DNN 301 and teacher DNN 302. Using forward propagation, the posterior distribution (output distribution 351 and 352) is determined.” That is, referring to FIG. 3, output 351 corresponds to output data (SPSOD) in response to inputting the data 310 into the student neural network 301. See also [0057].]
using the second processing system (SPS) to generate reference output data (ROD) from the previously-trained neural network (PTNN) in response to inputting second data (SD) to the previously-trained neural network (PTNN); [[0043]: “In particular, for each iteration, a small piece of unlabeled (or un-transcribed) data 310 is provided to both student DNN 301 and teacher DNN 302. Using forward propagation, the posterior distribution (output distribution 351 and 352) is determined.” That is, referring to FIG. 3, data 310 corresponds to second data input to the teacher network to generate the output 352 (“reference output data”) from the teacher network.], wherein the second data (SD) is a subset of the second processing system input data (SPSID) for use in optimising the student neural network (SNN); [Li, [0057]: “At step 540, using a subset of the un-labeled training data received at step 530, the output distribution for the teacher DNN and the output distribution for the student DNN are determined.”] and
optimising the student neural network (SNN) for processing the second data (SD) with the second processing system (SPS), by using the second processing system (SPS) to adjust a plurality of parameters of the student neural network (SNN) [[0043]: “The error signal may be calculated by determining the KL divergence between distributions 351 and 352, or by using regression, or other suitable technique, and may be determined using evaluating component 128 of FIG. 1…For example, as shown at 370, using back propagation the weights of student DNN 301 are updated using the error signal.”] such that a difference (DIFF) between the reference output data (ROD), and second output data (SOD) generated by the student neural network (SNN) in response to inputting the second data (SD) to the student neural network (SNN), satisfies a stopping criterion […]. [Satisfaction of a stopping criterion is taught in the form of no further convergence, or which can be based on a threshold. See [0036]: “In particular, in an embodiment, evaluating component 128 evaluates the output distributions of the student and teacher DNNs, determines the difference (which may be determined as an error signal) between the outputs and also determines whether the student is continuing to improve or whether the student is no longer improving (i.e. the student output distribution shows no further trend towards convergence with the teacher output).” [0049]: “At a high level, one embodiment of method 500 iteratively optimizes the student DNN, based on the difference between its output and the teacher's output, until it converges with the teacher DNN.” [0037]: “evaluating component 128 apply a threshold to determine convergence of the teacher DNN and student DNN output distributions. Where the threshold is not satisfied, iteration may continue, thereby further training the student to approximate the teacher. 
Where the threshold is satisfied, then convergence is determined (indicating the student output distribution is sufficiently close enough to the teacher DNN's output distribution) and the student DNN may be considered trained and further may be deployed on a client device or computer system.” See also [0044], [0060] for similar teachings.]
Li does not explicitly teach:
(1) The previously-trained neural network (PTNN) being trained “using/by a first processing system (FPS)” (as recited in the preamble and first element of the claim) and being obtained “from a first processing system (FPS)” where the two processing systems are defined by the limitation “wherein the second processing system (SPS) is different to and remote from the first processing system (FPS).”
(2) The limitations of the PTNN being trained “using a first value of a temperature parameter, the temperature parameter controlling a classification confidence of the previously-trained neural network (PTNN).”
(3) “wherein the student neural network (SNN) is optimised using a second value for a temperature parameter of the student neural network (SNN), the second value being higher than the first value such that a classification confidence of the optimised student neural network (SNN) is lower than the classification confidence of the previously-trained neural network (PTNN).”
Sharifi teaches a previously-trained neural network (PTNN) trained “using/by a first processing system (FPS)” and obtained “from a first processing system (FPS)” “wherein the second processing system (SPS) is different to and remote from the first processing system (FPS).” [[0066]: “The teacher machine-learned model 202 can be deployed to the user computing device 102 (e.g., from the server computing system 130 and/or training computing system 150).” [0054]: “In particular, the model trainer 160 can train a teacher model 140 based on a set of training data 162. The training data 162 can include, for example, labeled and/or unlabeled training examples. The teacher model 140 can be deployed to the user computing device 102. The user computing device 102 can locally train the student model(s) 122 based on the teacher model(s) 120.” That is, referring to FIG. 1A, the combination of the training computing system 150 and server computing system 130 collectively corresponds to a first processing system (FPS) that trains the model and is remote from the user computing device 102, which receives the teacher model when it is deployed to the user computing device. See also [0046] and [0051]-[0053].]
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have combined the teachings of Li with the teachings of Sharifi by implementing a server and client system such that the previously-trained neural network (PTNN) is trained “using/by a first processing system (FPS)” (as recited in the preamble and first element of the claim) and is obtained “from a first processing system (FPS),” “wherein the second processing system (SPS) is different to and remote from the first processing system (FPS).” The motivation would have been to enable a central server to aggregate data and centrally train machine learning models for use by particular client devices (see Sharifi, [0003]: “In particular, in some scenarios, data can be uploaded from user computing devices to the server computing device. The server computing device can train various machine-learned models on the centrally collected data and then evaluate the trained models.”), while maintaining data security (see Sharifi, [0038]: “by performing distillation of the machine learning model on the edge device, data security can be enhanced by preventing third parties having access to user sensitive data during distillation of a personalized student model.”).
The combination of references thus far does not teach the remaining limitations (2) and (3) listed above.
Aggarwal teaches training “using a first value of a temperature parameter, the temperature parameter controlling a classification confidence of the previously-trained neural network (PTNN).” [Page 14 and equation 1.12 teach the standard softmax function, which has the form

Φ(v)_i = exp(v_i) / Σ_{j=1}^{k} exp(v_j)
as shown in the equation, which can also be equivalently written as
q_i = exp(z_i) / Σ_j exp(z_j)
in a manner consistent with the notation in Hinton (see Hinton, equation 1). Furthermore, § 3.2.5.1 (page 118) teaches the use of the softmax in training. The standard softmax function, by definition, uses a temperature parameter value of 1 (as evidenced by Hinton, discussed below); this temperature parameter therefore, by definition, controls a classification confidence of the previously-trained neural network (PTNN), the unscaled logits (i.e., a temperature of 1) being the baseline input into the softmax.]
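The notational equivalence noted above (Aggarwal’s equation 1.12 versus Hinton’s equation (1) at T = 1) can also be confirmed numerically; the following sketch is illustrative only, with function and variable names chosen for this record rather than taken from either reference:

```python
import math

def aggarwal_softmax(v):
    # Aggarwal, eq. 1.12: Phi(v)_i = exp(v_i) / sum_{j=1..k} exp(v_j)
    exps = [math.exp(x) for x in v]
    total = sum(exps)
    return [e / total for e in exps]

def hinton_softmax(z, T=1.0):
    # Hinton, eq. (1): q_i = exp(z_i / T) / sum_j exp(z_j / T)
    exps = [math.exp(x / T) for x in z]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 0.5, -1.0]
a = aggarwal_softmax(logits)
h = hinton_softmax(logits, T=1.0)
# With T = 1, the two formulations are term-by-term identical.
```

This numerical check reflects the Examiner’s position that the standard softmax is simply the T = 1 special case of Hinton’s temperature-scaled softmax.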
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have combined the teachings of the references combined thus far with the teachings of Aggarwal by using a standard softmax function for the structure and thus the training of the PTNN such that the PTNN is trained “using a first value of a temperature parameter, the temperature parameter controlling a classification confidence of the previously-trained neural network (PTNN).” The motivation would have been to use a function that is capable of outputting probabilities for multiple categorical classifications, as suggested by Aggarwal (page 14, Fig. 1.9 and its caption: “An example of multiple outputs for categorical classification with the use of a softmax layer”; page 14, § 1.2.1.4: “Note that the three outputs correspond to the probabilities of the three classes, and they convert the three outputs of the final hidden layer into probabilities with the softmax function.”).
The combination of references thus far does not teach the limitation that “wherein the student neural network (SNN) is optimised using a second value for a temperature parameter of the student neural network (SNN), the second value being higher than the first value such that a classification confidence of the optimised student neural network (SNN) is lower than the classification confidence of the previously-trained neural network (PTNN).”
Hinton, which teaches knowledge distillation (see title), teaches the use of a temperature value and teaches the above limitations. In particular, Hinton teaches “wherein the student neural network (SNN) is optimised using a second value for a temperature parameter of the student neural network (SNN)” [In general, the concept of a temperature is taught in § 2: “Neural networks typically produce class probabilities by using a “softmax” output layer that converts the logit, zi, computed for each class into a probability, qi, by comparing zi with the other logits.
q_i = exp(z_i / T) / Σ_j exp(z_j / T)    (1)

where T is a temperature that is normally set to 1. Using a higher value for T produces a softer probability distribution over classes.” Note that T affects the softmax output, and thus the classification confidence by definition of the softmax shown in the equation in [0023].] “the second value being higher than the first value such that a classification confidence of the optimised student neural network (SNN) is lower than the classification confidence of the previously-trained neural network (PTNN)” [§ 2, paragraphs 1-2: “…Using a higher value for T produces a softer probability distribution over classes. In the simplest form of distillation, knowledge is transferred to the distilled model by training it on a transfer set and using a soft target distribution for each case in the transfer set that is produced by using the cumbersome model with a high temperature in its softmax. The same high temperature is used when training the distilled model, but after it has been trained it uses a temperature of 1.” § 1, second-to-last paragraph: “Our more general solution, called “distillation”, is to raise the temperature of the final softmax until the cumbersome model produces a suitably soft set of targets. We then use the same high temperature when training the small model to match these soft targets.” That is, during the distillation process, a high temperature is used for training the student model by being applied to both the outputs of the teacher (cumbersome) and the loss function of student (distilled) models. In regards to the limitation of this value being “lower than the classification confidence of the previously-trained neural network (PTNN),” Aggarwal teaches the standard softmax function of
q_i = exp(z_i) / Σ_j exp(z_j)
that is used in training a neural network (in the case of the combination references thus far, the training of the teacher/PTNN), and Hinton teaches that knowledge distillation is a special case in which the student/SNN is trained using modified softmax function
q_i = exp(z_i / T) / Σ_j exp(z_j / T)
where the temperature is T > 1 such that “Using a higher value for T produces a softer probability distribution over classes” (i.e., “a classification confidence of the optimised student neural network (SNN) is lower”). Therefore, the combination of generic training of the teacher/PTNN using the standard softmax function and the distillation training of the student/SNN using the modified softmax with T > 1 results in the instant limitation.]
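The effect relied upon above, namely that a higher value of T produces a softer distribution with a lower peak probability (i.e., lower classification confidence), can be checked numerically with a short sketch (illustrative only; the logit values are hypothetical):

```python
import math

def softmax(logits, T):
    # Hinton's equation (1): q_i = exp(z_i / T) / sum_j exp(z_j / T)
    exps = [math.exp(z / T) for z in logits]
    s = sum(exps)
    return [e / s for e in exps]

logits = [4.0, 1.0, 0.5]
teacher = softmax(logits, T=1.0)  # teacher/PTNN: standard softmax, T = 1
student = softmax(logits, T=4.0)  # student/SNN during distillation: T > 1

# The higher temperature yields a softer probability distribution:
# the peak probability (classification confidence) of the T = 4
# output is lower than that of the T = 1 output.
assert max(student) < max(teacher)
```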
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have combined the teachings of Li with the teachings of Hinton by implementing the technique of Hinton in using a temperature parameter during the knowledge distillation process, so as to arrive at the limitations of the instant dependent claim. The motivation would have been to soften the targets of the teacher model, so as to provide more information per training case such that the student model can be trained on less data and at a higher learning rate, as suggested by Hinton (§ 1, paragraph 4: “When the soft targets have high entropy, they provide much more information per training case than hard targets and much less variance in the gradient between training cases, so the small model can often be trained on much less data than the original cumbersome model and using a much higher learning rate.”).
As to claim 3, the combination of Li, Sharifi, Aggarwal, and Hinton teaches the computer-implemented method according to claim 1, wherein the plurality of parameters comprises a plurality of weights (w0 . . . j) connecting a plurality of neurons (N0...i) in the student neural network (SNN), and a plurality of biases (B) of activation functions (F(S)) controlling outputs (Y) of the neurons (N0…i), [Li, [0065]: “to update the parameters or node weights of the student DNN, which may be performed using back propagation.” Li, [0030]: “An example DNN model suitable for use as a student DNN or is described in connection to FIG. 2.” Li, [0039]: “With reference to FIG. 2, the input and output of DNN model 201 are denoted as x and o (210 and 250 of FIG, 2), respectively. Denote the input vector at layer l (220 of FIG. 2) as vl (with v0=x), the weight matrix as Wl, and bias vector as al. Then for a DNN with L hidden layers (240 of FIG. 2), the output of the l-th hidden layer is: Vl+1=σ(z(vl)), 0≦l<L where z(vl)=wlvl+al and σ(x)=1/(1+ex) is the sigmoid function applied element-wise.” Note that the sigmoid function here is an activation function.] and wherein the: optimising a student neural network (SNN) for processing the second data (SD) with the second processing system (SPS), by using the second processing system (SPS) to adjust a plurality of parameters of the student neural network (SNN) such that a difference (DIFF) between the reference output data (ROD), and second output data (SOD) generated by the student neural network (SNN) in response to inputting the second data (SD) to the student neural network (SNN), satisfies a stopping criterion, [The Examiner notes that this part of the claim merely refers to the limitations already recited in the parent claim and taught by the cited reference.] comprises:
iteratively adjusting the weights (w0...j) and the biases (B) of the student neural network (SNN) until the difference (DIFF) between the reference output data (ROD), and the second output data (SOD), is less than a predetermined value. [[0034]: “and repeats this cycle until the output distributions converge (or are otherwise sufficiently close)”; [0037]: “embodiments of evaluating component 128 determine whether to complete another iteration (for example, another iteration comprising: updating the student DNN based on the error, passing un-labeled data through the student and teacher DNNs, and evaluating their output distributions)… evaluating component 128 apply a threshold to determine convergence of the teacher DNN and student DNN output distributions. Where the threshold is not satisfied, iteration may continue, thereby further training the student to approximate the teacher. Where the threshold is satisfied, then convergence is determined (indicating the student output distribution is sufficiently close enough to the teacher DNN's output distribution) and the student DNN may be considered trained and further may be deployed on a client device or computer system.” Here, the threshold corresponds to a “predetermined value.”]
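The iterate-until-threshold behavior described in Li's [0034] and [0037] can be sketched, for illustration only (the one-parameter model, learning rate, and threshold below are hypothetical and not Li's implementation):

```python
# Hypothetical one-parameter "student" whose weight and bias are
# iteratively adjusted until the difference between its output and a
# reference output falls below a predetermined threshold, mirroring
# Li's convergence test.
def train_until_converged(reference_output, threshold=1e-3, lr=0.1):
    weight, bias = 0.0, 0.0          # parameters being iteratively adjusted
    x = 1.0                          # fixed training input
    while True:
        student_output = weight * x + bias
        diff = student_output - reference_output
        if abs(diff) < threshold:    # stopping criterion: DIFF < predetermined value
            return weight, bias
        # simple gradient step on the squared-error loss
        weight -= lr * 2 * diff * x
        bias -= lr * 2 * diff

w, b = train_until_converged(reference_output=3.0)
assert abs((w * 1.0 + b) - 3.0) < 1e-3  # final difference is below the threshold
```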
As to claim 12, the combination of Li, Sharifi, Aggarwal, and Hinton teaches the computer-implemented method according to claim 1, further comprising:
using the second processing system to generate test output data (TOD) from the student neural network in response to test input data (TID), the test input data (TID) having corresponding expected output data (EOD) that is expected from the student neural network in response to inputting the test input data to the student neural network; [Li, [0043]: “In particular, for each iteration, a small piece of unlabeled (or un-transcribed) data 310 is provided to both student DNN 301 and teacher DNN 302. Using forward propagation, the posterior distribution (output distribution 351 and 352) is determined.” That is, referring to FIG. 3, data 310 corresponds to the test input data, and the output from the teacher model 302 corresponds to the expected output data. Note that the instant claim does not define “test input data” in a manner different from a training input.] and further comprising:
constraining the optimising a student neural network (SNN) for processing the second data (SD) with the second processing system (SPS), such that a difference between the generated test output data (TOD), and the expected output data (EOD), is less than a second predetermined value. [Li, [0036]: “In particular, in an embodiment, evaluating component 128 evaluates the output distributions of the student and teacher DNNs, determines the difference (which may be determined as an error signal) between the outputs and also determines whether the student is continuing to improve or whether the student is no longer improving (i.e. the student output distribution shows no further trend towards convergence with the teacher output).” Li, [0049]: “At a high level, one embodiment of method 500 iteratively optimizes the student DNN, based on the difference between its output and the teacher's output, until it converges with the teacher DNN.” Li, [0037]: “evaluating component 128 apply a threshold to determine convergence of the teacher DNN and student DNN output distributions. Where the threshold is not satisfied, iteration may continue, thereby further training the student to approximate the teacher. Where the threshold is satisfied, then convergence is determined (indicating the student output distribution is sufficiently close enough to the teacher DNN's output distribution) and the student DNN may be considered trained and further may be deployed on a client device or computer system.” See also Li, [0044], [0060] for similar teachings. That is, the “threshold” described here corresponds to a second predetermined value. The training outcome of the student network is being “constrained” such that its performance satisfies the performance criterion measured by this threshold.]
As to claim 13, the combination of Li, Sharifi, Aggarwal, and Hinton teaches the computer-implemented method according to claim 1, wherein the first processing system (FPS) is a cloud-based processing system or a server-based processing system or a mainframe-based processing system, [Li, [0026]: “storage 106 may be embodied as one or more information stores, including memory on client device 102 or 104, DNN model generator 120, or in the cloud.” Li, [0028]: “DNN model generator 120 and its components 122, 124, 126, and 128 may be embodied as a set of compiled computer instructions or functions, program modules, computer software services, or an arrangement of processes carried out on one or more computer systems, such as computing device 700, described in connection to FIG. 7, for example.” Li, [0080]: “Distinction is not made between such categories as “workstation,” “server,” “laptop,” “handheld device,” etc., as all are contemplated within the scope of FIG. 1 and with reference to “computing device.”” That is, a cloud-based and server-based system is taught.] and/or wherein the second processing system (SPS) is an on-device-based processing system or a mobile device-based processing system. [Since the instant claim recites alternative expressions delimited by “and/or,” the second wherein clause does not need to be met by the prior art when the first wherein clause is met.]
As to claim 14, the combination of Li, Sharifi, Aggarwal, and Hinton teaches a non-transitory computer-readable storage medium comprising instructions which when executed on a processor cause the processor to carry out the method according to claim 1. [Li, [0086]: “one or more computer-readable media having computer-executable instructions embodied thereon that, when executed by a computing system having a processor and memory…” Li, [0081]: “Computing device 700 typically includes a variety of computer-readable media. Computer-readable media can be any available media that can be accessed by computing device 700 and includes both volatile and nonvolatile media, removable and non-removable media… Computer storage media does not comprise signals per se.”]
As to claim 15, this claim is directed to a system for performing operations that are the same or substantially the same as those of claim 1. Therefore, the rejection made to claim 1 is applied to claim 15.
Furthermore, Li teaches the system components of a system (SY) for optimising a student neural network (SNN)… [[0020]: “a block diagram is provided showing aspects of one example of a system architecture suitable for implementing an embodiment of the invention and designated generally as system 100.” The remaining limitations in this part of the claim, namely the SNN, PTNN, FD, and FPS, are taught for the reasons given for the corresponding limitations in claim 1.] the system (SY) comprising:
a second processing system (SPS) comprising one or more processors (PROC); [[0080]: “With reference to FIG. 7, computing device 700 includes a bus 710 that directly or indirectly couples the following devices: memory 712, one or more processors 714”]
an input device in communication with the second processing system (SPS); [This limitation is taught for the reasons discussed for the “input device” limitation of claim 1.]
a memory (MEM) in communication with the one or more processors (PROC) of the second processing system (SPS), the memory comprising instructions, which when executed by the one or more processors (PROC) of the second processing system (SPS), cause the second processing system (SPS) to… [[0026]: “Storage 106 generally stores information including data, computer instructions (e.g. software program instructions, routines, or services), and/or models used in embodiments of the invention described herein… storage 106 may be embodied as one or more information stores, including memory on client device 102 or 104, DNN model generator 120, or in the cloud.” [0086]: “one or more computer-readable media having computer-executable instructions embodied thereon that, when executed by a computing system having a processor and memory…”]
2. Claim 2 is rejected under 35 U.S.C. 103 as being unpatentable over Li in view of Sharifi, Aggarwal, and Hinton, and further in view of Routray et al. (US 2020/0342339 A1) (“Routray”).
As to claim 2, the combination of Li, Sharifi, Aggarwal, and Hinton teaches the computer-implemented method according to claim 1, further comprising: receiving, with the second processing system (SPS), second processing system input data (SPSID); [[0043]: “In particular, for each iteration, a small piece of unlabeled (or un-transcribed) data 310 is provided to both student DNN 301 and teacher DNN 302.”], but does not explicitly teach the remaining limitations of the instant dependent claim.
Routray teaches the remaining limitations of “using the second processing system (SPS) to identify a subset of the second processing system input data (SPSID) to use as the second data (SD);” [[0051]: “The illustrative embodiments provide a mechanism to select a subset of enterprise data that may be annotated for use with a cognitive system to train the cognitive system for achieving the purposes of the enterprise…”] and wherein: “identifying a subset of the second processing system input data (SPSID) to use as the second data (SD), comprises: sampling the second processing system input data (SPSID), and including the sampled second processing system input data in the subset if the sampled second processing system input data increases a diversity metric of the subset.” [[0051]: “…This selection utilizes a diversity based selection of portions of training data based on a statistical distribution of the data set where the diversity based selection over-samples portions of data at tail ends of the distribution (minor classifications) while under-sampling portions of data that are more prominently represented in the distribution (base classifications).” See also [0053].]
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have combined the teachings of the references combined thus far with the teachings of Routray by implementing the diversity-based selection method to identify a subset for use as the second data, so as to arrive at the limitations of the instant dependent claim. The motivation for doing so would have been to select a subset of data that have features representing all possible combinations of features (see Routray, [0037]: “to select a subset of entries in the data set, e.g., cases that have features that would represent all possible combinations of the features in correlation with the desired determination.”).
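A diversity-based inclusion test of the kind recited in the claim can be sketched as follows, for illustration only (the data, labels, and diversity metric below are hypothetical and far simpler than Routray's statistical-distribution approach):

```python
# Illustrative sketch: a sampled item is included in the subset only
# if adding it increases a diversity metric of the subset. Here the
# metric is simply the count of distinct labels in the subset.
def diversity(subset):
    return len({label for _, label in subset})

def select_subset(stream):
    subset = []
    for item in stream:
        # include the sample only if it increases the diversity metric
        if diversity(subset + [item]) > diversity(subset):
            subset.append(item)
    return subset

data = [(0.1, "base"), (0.2, "base"), (0.9, "tail"), (0.3, "base"), (0.8, "rare")]
chosen = select_subset(data)
# Repeated "base" samples are skipped; each distinct label is kept once.
assert [label for _, label in chosen] == ["base", "tail", "rare"]
```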
3. Claims 5-9 are rejected under 35 U.S.C. 103 as being unpatentable over Li in view of Sharifi, Aggarwal, and Hinton, and further in view of Bao et al., “Using Distillation to Improve Network Performance after Pruning and Quantization,” MLMI 2019, September 18–20, 2019, Jakarta, Indonesia (“Bao”).
As to claim 5, the combination of Li, Sharifi, Aggarwal, and Hinton teaches the computer-implemented method according to claim 3, wherein the student neural network comprises the plurality of neurons (N0...i), [Li, [0030]: “An example DNN model suitable for use as a student DNN or is described in connection to FIG. 2.” Li, [0039]: “With reference to FIG. 2, the input and output of DNN model 201 are denoted as x and o (210 and 250 of FIG, 2), respectively. Denote the input vector at layer l (220 of FIG. 2) as vl (with v0=x), the weight matrix as Wl…” Note that Li, FIG. 2 teaches neurons and weight connections in student neural network 301.], but does not teach the method further comprising the list of alternatives recited in the instant dependent claim 5.
Bao, which pertains to machine learning model compression (see title), teaches “using the second processing system (SPS) to prune the optimised student neural network by removing one or more neurons (N0...i) from the optimised student neural network; and/or using the second processing system (SPS) to prune the optimised student neural network by removing one or more connections defined by the weights (w0...j) from the optimised student neural network; and/or” [Abstract: “This paper will construct a deep neural network model compression framework based on weight pruning, weight quantization and knowledge distillation.” § 2.1, paragraph 1: “Pruning removes redundant or unimportant parameters from deep neural networks to reduce model parameter storage space and prevents network from over-fitting. Based on whether the whole node or filter is deleted at one time, parameter pruning can be subdivided into structured and unstructured pruning (also known as coarse-grained pruning and fine-grained pruning). Unstructured pruning considers every element in each filter, eliminating unimportant parameters in the filter, while structured pruning directly considers deleting the entire filter.” Note that removing nodes/filters corresponds to pruning a neuron, while removing parameters corresponds to weight pruning. See also Bao FIG. 2, which teaches “remove the least important neuron” as part of the pruning step conducted in conjunction with quantization.] “using the second processing system (SPS) to quantize the optimised student neural network by reducing a precision of the weights (w0...j) of the optimised student neural network;” [§ 2.2, paragraph 2: “Quantization is another important method of neural network parameter compression. Its core idea is to reduce the number of bits per weight to compress the original network.” Specifically, the quantization is performed on the student neural network, as described in § 3, paragraph 3: “Secondly, We need to quantize[19] the pruned student model. The core operation of quantization is to select the appropriate scaling function.”] “and/or using the second processing system (SPS) to cluster the weights of the optimised student neural network.” [This part of the claim recites an alternative in the list of alternatives, and thus does not need to be met when the other alternatives are met.]
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have combined the teachings of the references combined thus far with the teachings of Bao by implementing the compression method of Bao involving pruning and quantization in the method of Li (as modified thus far), so as to arrive at the limitations of the instant dependent claim (with respect to at least one of the first three items in the alternative expression). The motivation for doing so would have been to enable a neural network to be compressed in a manner that reduces complexity while maintaining most of the performance (see Bao, § 4, last paragraph: “reduces the computational complexity, reduces the memory storage space of the model, and does not lose much accuracy”).
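Bao's core idea of reducing the number of bits per weight can be sketched, for illustration only (the uniform scaling function below is a hypothetical choice, not Bao's specific implementation):

```python
# Illustrative uniform quantization: map each weight onto one of
# 2**bits evenly spaced levels between the minimum and maximum
# weight, reducing the precision with which weights are represented.
def quantize(weights, bits=4):
    lo, hi = min(weights), max(weights)
    levels = (1 << bits) - 1
    scale = (hi - lo) / levels if hi != lo else 1.0
    return [lo + round((w - lo) / scale) * scale for w in weights]

weights = [0.013, -0.42, 0.37, 0.11, -0.05]
q = quantize(weights, bits=4)

# Each quantized weight stays within half a quantization step of the
# original, and at most 2**4 = 16 distinct values remain.
step = (max(weights) - min(weights)) / 15
assert all(abs(a - b) <= step / 2 + 1e-12 for a, b in zip(weights, q))
assert len(set(q)) <= 16
```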
As to claim 6, Li teaches the computer-implemented method according to claim 1, wherein the student neural network comprises a plurality of neurons (N0…i), and wherein the plurality of parameters comprises a plurality of weights (w0...j) connecting the plurality of neurons (N0…i) in the student neural network, […] [Li, [0030]: “An example DNN model suitable for use as a student DNN or is described in connection to FIG. 2.” Li, [0039]: “With reference to FIG. 2, the input and output of DNN model 201 are denoted as x and o (210 and 250 of FIG, 2), respectively. Denote the input vector at layer l (220 of FIG. 2) as vl (with v0=x), the weight matrix as Wl…” Note that Li, FIG. 2 teaches neurons and weight connections in student neural network 301.], but does not teach the remaining limitations of the instant claim.
Bao, which pertains to machine learning model distillation and quantization (see title), teaches the above limitations. In particular, Bao teaches “wherein the: optimising a student neural network (SNN) for processing the second data (SD) with the second processing system (SPS), by using the second processing system (SPS) to adjust a plurality of parameters of the student neural network (SNN) such that a difference between the reference output data (ROD), and second output data (SOD) generated by the student neural network (SNN) in response to inputting the second data (SD) to the student neural network (SNN), satisfies a stopping criterion, comprises: reducing a precision of the weights (w0...j) such that the difference between the reference output data (ROD), and the second output data (SOD), remains less than a predetermined limit;” [Abstract: “This paper will construct a deep neural network model compression framework based on weight pruning, weight quantization and knowledge distillation.” § 2.2, paragraph 2: “Quantization is another important method of neural network parameter compression. Its core idea is to reduce the number of bits per weight to compress the original network.” Specifically, the quantization is performed on the student neural network, as described in § 3, paragraph 3: “Secondly, We need to quantize[19] the pruned student model. The core operation of quantization is to select the appropriate scaling function.” In regards to the limitation that “the difference…remains less than a predetermined limit,” the “difference” is already taught by the reference Li, and Bao teaches an analogous concept in the form of the distillation loss. See FIG. 3, which teaches “minimizing distillation loss.” As the training process increases the performance of the student model, the distillation loss decreases over time, so as to remain, at some point, lower than an earlier distillation loss value, which reads on the limitation of “a predetermined limit.”] and/or: removing neurons (N0...i) and/or connections defined by the weights (w0...j) such that the difference between the reference output data (ROD), and the second output data (SOD), remains less than the predetermined limit. [Since the instant claim recites an alternative expression denoted by “and/or,” the alternative expression as a whole is already met by the first alternative. Nonetheless, Bao also teaches this in FIG. 2, which teaches “remove the least important neuron” as part of the pruning step conducted in conjunction with quantization.]
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have combined the teachings of the references combined thus far with the teachings of Bao by implementing the compression method of Bao involving pruning and quantization in the method of Li (as modified thus far), so as to arrive at the limitations of the instant dependent claim. The motivation for doing so would have been to enable a neural network to be compressed in a manner that reduces complexity while maintaining most of the performance (see Bao, § 4, last paragraph: “reduces the computational complexity, reduces the memory storage space of the model, and does not lose much accuracy”).
As to claim 7, Li teaches the computer-implemented method according to claim 1, wherein the plurality of parameters comprises a plurality of weights (w0…j) connecting a plurality of neurons (N0…i) in the student neural network (SNN); [Li, [0030]: “An example DNN model suitable for use as a student DNN or is described in connection to FIG. 2.” Li, [0039]: “With reference to FIG. 2, the input and output of DNN model 201 are denoted as x and o (210 and 250 of FIG, 2), respectively. Denote the input vector at layer l (220 of FIG. 2) as vl (with v0=x), the weight matrix as Wl, and bias vector as al.”] and wherein the previously-trained neural network (PTNN) comprises a plurality of weights connecting a plurality of neurons in the previously-trained neural network (PTNN), [Li, [0042]: “the invention teacher DNN 302 comprises a trained DNN model, which may be trained according to standard techniques known to one of ordinary skill in the art (such as the technique described in connection to FIG. 2).” See also FIG. 3, which shows the nodes and weight connections of the teacher model 302.] […].
The combination of references thus far does not explicitly teach “wherein the weights of the student neural network (w0...j) are represented with a lower precision than the weights of the previously-trained neural network (PTNN).”
Bao, which pertains to machine learning model compression (see title), teaches the above limitations. In particular, Bao teaches “wherein the weights of the student neural network (w0...j) are represented with a lower precision than the weights of the previously-trained neural network (PTNN).” [Abstract: “This paper will construct a deep neural network model compression framework based on weight pruning, weight quantization and knowledge distillation.” § 2.2, paragraph 2: “Quantization is another important method of neural network parameter compression. Its core idea is to reduce the number of bits per weight to compress the original network.” Specifically, the quantization is performed on the student neural network, as described in § 3, paragraph 3: “Secondly, We need to quantize[19] the pruned student model. The core operation of quantization is to select the appropriate scaling function.” Since quantization reduces the number of bits per weight of the student model, the weights of the quantized student neural network are represented with a lower precision than the weights of the teacher, i.e., the previously-trained neural network (PTNN).]
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have combined the teachings of the references combined thus far with the teachings of Bao by implementing the compression method of Bao involving pruning and quantization in the method of Li (as modified thus far), so as to arrive at the limitations of the instant dependent claim. The motivation for doing so would have been to enable a neural network to be compressed in a manner that reduces complexity while maintaining most of the performance (see Bao, § 4, last paragraph: “reduces the computational complexity, reduces the memory storage space of the model, and does not lose much accuracy”).
As to claim 8, the combination of Li and Bao teaches the computer-implemented method according to claim 7, as set forth above.
Bao further teaches “wherein the student neural network (SNN) is provided by performing a quantization process on the previously-trained neural network (PTNN), and wherein the quantization process comprises providing the weights (w0…j) of the student neural network (SNN) by reducing a precision of the weights of the previously-trained neural network (PTNN) such that the weights of the student neural network (SNN) are represented with a lower precision than the weights of the previously-trained neural network (PTNN).” [Abstract: “Firstly, the model will be double coarse-grained compression with pruning and quantization, then the original network will be used as the teacher network to guide the compressed student network.” As shown in FIG. 1, the student network is formed by pruning and quantizing the teacher network. As discussed above, quantization results in a reduction of precision by reducing the number of bits per weight (see § 2.2).]
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have combined the teachings of the references combined thus far with the teachings of Bao, including the teachings discussed above for the instant claim, so as to arrive at the limitations of the instant dependent claim. Since the teachings of Bao cited above are part of the techniques already discussed in the rejection of the parent dependent claim 7, the motivation for doing so is the same as the motivation given in the rejection of the parent dependent claim 7.
As to claim 9, the combination of Li and Bao teaches the computer-implemented method according to claim 8, as set forth above.
Li further teaches “further comprising: using the second processing system (SPS) […] to provide the student neural network (SNN), prior to optimising the student neural network (SNN) for processing the second data (SD) with the second processing system (SPS).” [[0052]: “At step 520 a second DNN model is initialized. The second DNN model serves as a “student DNN” for learning from the teacher DNN determined in step 510.” As shown in FIG. 5, step 520 is performed prior to the use of training data in the subsequent steps 540-560.]
Bao further teaches “to perform the quantization process on the previously-trained neural network (PTNN)” in a manner that is to provide the student neural network (SNN) [Abstract: “Firstly, the model will be double coarse-grained compression with pruning and quantization, then the original network will be used as the teacher network to guide the compressed student network.” As shown in FIG. 1, the student network is formed by pruning and quantizing the teacher network. That is, the quantization is performed on the teacher neural network in order to obtain the student neural network.]
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have combined the teachings of the references combined thus far with the teachings of Bao, including the teachings discussed above for the instant claim, so as to arrive at the limitations of the instant dependent claim. Since the teachings of Bao cited above are part of the techniques already discussed in the rejection of the parent dependent claim 7, the motivation for doing so is the same as the motivation given in the rejection of the parent dependent claim 7.
4. Claim 11 is rejected under 35 U.S.C. 103 as being unpatentable over Li in view of Sharifi, Aggarwal, and Hinton, and further in view of Xia et al. (US 10,810,491 B1) (“Xia”).
As to claim 11, the combination of Li, Sharifi, Aggarwal, and Hinton teaches the computer-implemented method according to claim 1, […], and wherein the: optimising a student neural network (SNN) for processing the second data (SD) with the second processing system (SPS), by using the second processing system (SPS) to adjust a plurality of parameters of the student neural network (SNN) such that a difference between the reference output data (ROD), and second output data (SOD) generated by the student neural network (SNN) in response to inputting the second data (SD) to the student neural network (SNN), satisfies a stopping criterion, is performed subsequently in time to the: using the second processing system (SPS) to generate second processing system output data (SPSOD) in response to inputting second processing system input data (SPSID) to the student neural network (SNN). [Li, FIG. 5 teaches that the process of optimizing the student model (i.e., steps 540-560) is performed iteratively. Therefore, the claimed sequence of operations is met because a later set of iterations of the training process is performed after an earlier set of iterations in which the output data is generated. In other words, this limitation is met by the iterative manner of the process shown in FIG. 5 of Li.]
The combination of references thus far does not explicitly teach “wherein the second processing system output data (SPSOD) is provided to a user, and substantially in real-time.”
Xia, which pertains to “real-time visualization of machine learning models” (title), teaches “wherein the second processing system output data (SPSOD) is provided to a user, and substantially in real-time.” [Col. 5, lines 42-45: “FIG. 1 illustrates an example system environment in which real time visualizations of various characteristics of complex machine learning models may be provided to clients, according to at least some embodiments.” See also col. 4, lines 25-29: “Metrics which can be used to compare different concurrently-trained model variants may be generated and displayed using a dynamically updated easy-to-understand visualization interface (e.g., a web-based console or graphical user interface) in various embodiments. The visualizations may be provided to clients while the models are still being trained.” Note that the visualizations may include outputs of a model, analogous to the output data of the instant claim. See, e.g., col. 17, lines 25-30: “FIG. 9 illustrates example low-dimensional mappings of machine learning model outputs which may be provided by a visualization tool, according to at least some embodiments.”]
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have combined the teachings of the references combined thus far with the teachings of Xia by implementing the visualization technique of Xia for the second processing system output data such that the second processing system output data (SPSOD) is provided to a user, and substantially in real-time. The motivation would have been to provide visualization that enables tuning and debugging of complex machine learning models. See Xia, col. 2, lines 56-58: “Various embodiments of methods and apparatus for generating visualizations enabling tuning and debugging of complex multi-layer machine learning models are described.”
Conclusion
The prior art made of record and not relied upon is considered pertinent to applicant's disclosure. The following references illustrate the state of the art.
Mirzadeh et al., “Improved Knowledge Distillation via Teacher Assistant,” arXiv:1902.03393v2 [cs.LG] 17 Dec 2019 teaches conventional techniques in knowledge distillation.
Shridhar et al. (US 2021/0279595 A1) teaches that T=1 is the conventional softmax (see paragraph 86).
Any inquiry concerning this communication or earlier communications from the examiner should be directed to YAO DAVID HUANG whose telephone number is (571)270-1764. The examiner can normally be reached Monday - Friday 9:00 am - 5:30 pm.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Miranda Huang can be reached at (571) 270-7092. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.
/Y.D.H./Examiner, Art Unit 2124
/MIRANDA M HUANG/Supervisory Patent Examiner, Art Unit 2124