DETAILED ACTION
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Response to Remarks
Remarks page 5, Applicant contends:
Amended claim 7 no longer invokes 112f interpretation.
Response:
Applicant’s arguments with respect to claim(s) 7 have been considered but are moot because the new ground of rejection contains elements that have not been previously examined or does not rely on any reference applied in the prior rejection of record for any teaching or matter specifically challenged in the argument.
Remarks page 6, Applicant contends:
Amended claim 7 does not invoke 112f and recites clear structural elements.
Response:
Applicant’s arguments with respect to claim(s) 7 have been considered but are moot because the new ground of rejection contains elements that have not been previously examined or does not rely on any reference applied in the prior rejection of record for any teaching or matter specifically challenged in the argument.
Remarks pages 7-8, Applicant contends:
Claim 1 is amended to better distinguish from Hinton.
Response:
Applicant’s arguments with respect to claim(s) 1 have been considered but are moot because the new ground of rejection contains elements that have not been previously examined or does not rely on any reference applied in the prior rejection of record for any teaching or matter specifically challenged in the argument.
Remarks page 9, Applicant contends:
Hinton does not teach inputting the same user interaction data into the two models, performing knowledge tracing, or reducing model size.
Response:
Hinton teaches the inputting of the same data into two models, wherein one of the models is smaller than the other to create a reduced size model (knowledge distillation).
[Hinton Introduction page 1]: “For tasks like speech and object recognition, training must extract structure from very large, highly redundant datasets but it does not need to operate in real time and it can use a huge amount of computation… The cumbersome model could be an ensemble of separately trained models or a single very large model trained with a very strong regularizer such as dropout [9]. Once the cumbersome model has been trained, we can then use a different kind of training, which we call “distillation” to transfer the knowledge [knowledge distillation] from the cumbersome model [model 1] to a small model [model 2] that is more suitable for deployment.”
Support for the same data being input into both models is given in more detail in the claim 1 mapping below. Hinton also notes that the original training data can be reused during the transfer to the small model, where the small model is optimized to match the output of the cumbersome (larger) model.
[Hinton Introduction page 2]: “The transfer set that is used to train the small model could consist entirely of unlabeled data [1] or we could use the original training set. We have found that using the original training set works well, especially if we add a small term to the objective function that encourages the small model to predict the true targets as well as matching the soft targets provided by the cumbersome model.”
Applicant’s arguments with respect to claim(s) 1 involving knowledge tracing have been considered but are moot because the new ground of rejection contains elements that have not been previously examined or does not rely on any reference applied in the prior rejection of record for any teaching or matter specifically challenged in the argument.
Remarks page 10, Applicant contends:
Hinton does not teach or suggest the combined features of amended claim 1.
Response:
Hinton alone is not expected to teach all of the elements of amended claim 1, because claim 1 (as amended) now contains elements previously recited in claims that were rejected under 35 U.S.C. 103. As a result, a 103 rejection that applies the teachings used against those rolled-up claims is expected for the elements of amended claim 1.
Knowledge distillation is noted to be an element of Hinton ([Hinton Introduction page 1]: “Once the cumbersome model has been trained, we can then use a different kind of training, which we call “distillation” to transfer the knowledge [knowledge distillation] from the cumbersome model to a small model that is more suitable for deployment.”).
Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.
The factual inquiries for establishing a background for determining obviousness under 35 U.S.C. 103 are summarized as follows:
1. Determining the scope and contents of the prior art.
2. Ascertaining the differences between the prior art and the claims at issue.
3. Resolving the level of ordinary skill in the pertinent art.
4. Considering objective evidence present in the application indicating obviousness or nonobviousness.
Claims 1, 4, 6, 7, 10, and 12 are rejected under 35 U.S.C. 103 as being unpatentable over Hinton et al (“Distilling the Knowledge in a Neural Network”), referred to as Hinton in this document, and further in view of Zhao (US 20240412075 A1), referred to as Zhao in this document, and further in view of Brown et al (US 20160217701 A1), referred to as Brown in this document.
Regarding Claim 1:
Hinton teaches:
A method of reducing a size of an artificial intelligence model by an electronic device
[Hinton 5.1 JFT Dataset page 5]: “This training used two types of parallelism [2]. First, there were many replicas of the neural net running on different sets of cores and processing different mini-batches from the training set. Each replica computes the average gradient on its current mini-batch and sends this gradient to a sharded parameter server which sends back new values for the parameters. These new values reflect all of the gradients received by the parameter server [A method of reducing a size of an artificial intelligence model by an electronic device (This is to map the electronic device limitation. The artificial intelligence is mapped along with the mapping of “a first model”.)] since the last time it sent parameters to the replica. Second, each replica is spread over multiple cores by putting different subsets of the neurons on each core.”
Inputting, by the processor, an input value including interaction information on whether a user answers a question correctly to a first model trained to perform a user knowledge tracing (KT) task
[Hinton Introduction page 1]: “For tasks like speech and object recognition, training must extract structure from very large, highly redundant datasets but it does not need to operate in real time and it can use a huge amount of computation… The cumbersome model could be an ensemble of separately trained models or a single very large model trained [Inputting, by the processor, an input value including interaction information on whether a user answers a question correctly to a first model trained to perform a user knowledge tracing (KT) task] with a very strong regularizer such as dropout [9]. Once the cumbersome model has been trained, we can then use a different kind of training, which we call “distillation” to transfer the knowledge from the cumbersome model to a small model that is more suitable for deployment.”
obtaining, by the processor, an output value of the first model based on the input value including the interaction information by using the first model, wherein the output value of the first model indicates a probability value that the user answers the question correctly
[Hinton Introduction page 2]: “An obvious way to transfer the generalization ability of the cumbersome model to a small model is to use the class probabilities produced by the cumbersome model as “soft targets” [obtaining, by the processor, an output value of the first model based on the input value including the interaction information by using the first model, wherein the output value of the first model indicates a probability value that the user answers the question correctly] for training the small model.”
Inputting, by the processor, the input value including the interaction information on whether the user answers the question correctly to a second model
and training, by the processor, the second model based on the output value of the first model by using a loss function based on the output value of the first model and an output value of the second model
wherein the first model is the artificial intelligence model larger in size than the second model
[Hinton Introduction page 2]: “An obvious way to transfer the generalization ability of the cumbersome model to a small model [wherein the first model is the artificial intelligence model larger in size than the second model] is to use the class probabilities produced by the cumbersome model as “soft targets” [and training, by the processor, the second model based on the output value of the first model by using a loss function based on the output value of the first model and an output value of the second model] for training the small model. For this transfer stage, we could use the same training set [Inputting, by the processor, the input value including the interaction information on whether the user answers the question correctly to a second model] or a separate “transfer” set. When the cumbersome model is a large ensemble of simpler models, we can use an arithmetic or geometric mean of their individual predictive distributions as the soft targets. When the soft targets have high entropy, they provide much more information per training case than hard targets and much less variance in the gradient between training cases, so the small model can often be trained on much less data than the original cumbersome model and using a much higher learning rate.”
Support for a loss function directing the small model’s output to match that of the larger model is given in [Hinton Introduction page 2], where the objective function is noted to encourage the small model to match the soft targets of the cumbersome model.
[Hinton Introduction page 2]: “The transfer set that is used to train the small model could consist entirely of unlabeled data [1] or we could use the original training set. We have found that using the original training set works well, especially if we add a small term to the objective function that encourages the small model to predict the true targets as well as matching the soft targets provided by the cumbersome model.”
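For illustration only, and not as part of the cited reference, the objective described in the quote above can be sketched as a weighted sum of a soft-target term and a small true-target term. The sketch assumes a single binary output (a probability that an answer is correct); the function names and the `hard_weight` value are hypothetical and are not taken from Hinton.

```python
import math


def binary_cross_entropy(target: float, prediction: float) -> float:
    """Cross-entropy between a target probability and a predicted probability."""
    eps = 1e-12  # guard against log(0)
    p = min(max(prediction, eps), 1.0 - eps)
    return -(target * math.log(p) + (1.0 - target) * math.log(1.0 - p))


def distillation_loss(teacher_prob: float, student_prob: float,
                      true_label: float, hard_weight: float = 0.1) -> float:
    """Weighted sum of a soft-target term and a true-target term.

    teacher_prob: output of the large (cumbersome) model, used as the soft target.
    student_prob: output of the small model for the same input.
    true_label:   1.0 if the answer was correct, 0.0 otherwise.
    hard_weight:  assumed small weight on the true-target term (hypothetical value).
    """
    soft_term = binary_cross_entropy(teacher_prob, student_prob)
    hard_term = binary_cross_entropy(true_label, student_prob)
    return (1.0 - hard_weight) * soft_term + hard_weight * hard_term
```

Under these assumptions, the loss is smaller when the small model's output is close to the large model's output for the same input, which is the behavior the objective function is described as encouraging.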
Hinton does not explicitly teach:
including a processor and a memory storing computer-readable instructions executable by the processor, the method comprising
Inputting, by the processor, an input value including interaction information on whether a user answers a question correctly to a first model trained to perform a user knowledge tracing (KT) task
obtaining, by the processor, an output value of the first model based on the input value including the interaction information by using the first model, wherein the output value of the first model indicates a probability value that the user answers the question correctly
Inputting, by the processor, the input value including the interaction information on whether the user answers the question correctly to a second model
Zhao teaches:
including a processor and a memory storing computer-readable instructions executable by the processor, the method comprising
[Zhao 0005]: “In another aspect of the present disclosure, an apparatus for compressing an artificial neural network is provided. The apparatus includes a memory and one or more processors [including a processor] coupled to the memory [a memory storing computer-readable instructions executable by the processor, the method comprising]. The processor(s) are configured to determine an architecture of a teacher model.”
One of ordinary skill in the art, prior to the effective filing date, would have been motivated to combine Hinton and Zhao. Hinton and Zhao are in the same field of endeavor of machine learning. One of ordinary skill in the art would have been motivated to combine Hinton and Zhao in order to incorporate the use of computer parts and a user terminal. The utilization of computer parts allows an implementation of the method in a computer, such as a server ([Zhao 0005]: “In another aspect of the present disclosure, an apparatus for compressing an artificial neural network is provided. The apparatus includes a memory and one or more processors coupled to the memory. The processor(s) are configured to determine an architecture of a teacher model.”). A user terminal allows user control of, or user input into, the device, giving a user control of aspects of the system such as transferring data to and from the device ([Zhao 0133]: “Further, it should be appreciated that modules and/or other appropriate means for performing the methods and techniques described herein can be downloaded and/or otherwise obtained by a user terminal and/or base station as applicable. For example, such a device can be coupled to a server to facilitate the transfer of means for performing the methods described herein.”).
Brown teaches:
Inputting, by the processor, an input value including interaction information on whether a user answers a question correctly to a first model
obtaining, by the processor, an output value of the first model based on the input value including the interaction information by using the first model, wherein the output value of the first model indicates a probability value that the user answers the question correctly
Inputting, by the processor, the input value including the interaction information on whether the user answers the question correctly to a second model
[Brown 0008]: “A response to the question is received and the assessment agent evaluates the response to generate an observable. The observable comprises information related to the response, which may be information representing whether the student has supplied a correct response [Inputting, by the processor, an input value including interaction information on whether a user answers a question correctly to a first model trained to perform a user knowledge tracing (KT) task] [Inputting, by the processor, the input value including the interaction information on whether the user answers the question correctly to a second model] or an incorrect response. A posterior estimate of the student's ability is then calculated by incorporating the observable into an ability model that models the student's ability. The student's ability may comprise the probability [obtaining, by the processor, an output value of the first model based on the input value including the interaction information by using the first model, wherein the output value of the first model indicates a probability value that the user answers the question correctly] that the student will provide a correct response to the question. This posterior estimate may then be compared with the difficulty of the question to determine whether the student has acquired a skill or mastered the material. The difficulty of the question may comprise the probability that a plurality of students will provide a correct response to the question.”
trained to perform a user knowledge tracing (KT) task
[Brown Abstract]: "In one embodiment, a method for analyzing the learning of a student [trained to perform a user knowledge tracing (KT) task] includes administering, by an assessment agent, a task to a student, the task comprising a question having an associated difficulty. The assessment agent receives a response to the question from the student and evaluates the response to generate an observable, the observable comprising information related to the response."
Support for treating the analysis of a student’s learning as a form of knowledge tracing is given by the current application, which indicates that such a task tracks a student’s improvement or mastery of knowledge elements ([Current Application line 16 page 13]: "For example, KT may model a student's knowledge state to track each individual's master state improvement in a domain under test. Before deep learning became popular, as a statistical model, item response theory (IRT) (Gonz'alez-Brenes, Huang, and Brusilovsky 2014; Khajah et al. 2014; Yudelson, Koedinger, and Gordon 2013; Pel'anek 2017; Gervet et al. 2020) and Bayesian knowledge tracing (BKT) were used to assess students' mastery of knowledge elements.").
One of ordinary skill in the art, prior to the effective filing date, would have been motivated to combine Hinton and Brown. Hinton and Brown are in the same field of endeavor of machine learning. One of ordinary skill in the art would have been motivated to combine Hinton and Brown to utilize information related to knowledge tracing with knowledge distillation for analyzing information about answers and predicting a student’s ability to get an answer correct in order to improve aspects related to online learning ([Brown 0005]: “On-line learning may be improved by further analysis and characterization of the learning process. For example, formal characterization of learning has been previously explored, in particular, using item response theory (IRT). IRT supposes that the probability of a correct response to an item on a test is a mathematical function of person and item parameters, such as intelligence and difficulty. Formal statistical analyses of student responses that apply IRT have been used to construct scales of learning, as well as to design and calibrate standardized tests. However, IRT approaches depend critically on static statistical models and analyses which lend themselves to analyses of student learning only when a test is complete, as opposed to during the test itself. Further, IRT approaches only define student ability and question difficulty indirectly. Accordingly, there is a need for improvements in on-line education and learning”).
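For illustration only, the IRT premise quoted above (the probability of a correct response as a function of person and item parameters) can be sketched with one commonly used form, the one-parameter logistic (Rasch) model. The quoted passage does not state which IRT form is meant, so the choice of this model and the names `ability` and `difficulty` are assumptions.

```python
import math


def rasch_probability(ability: float, difficulty: float) -> float:
    """Probability of a correct response under the one-parameter logistic model.

    ability:    person parameter (assumed name).
    difficulty: item parameter (assumed name).
    """
    return 1.0 / (1.0 + math.exp(-(ability - difficulty)))
```

Under this assumed form, the probability is one half when ability equals difficulty and rises as ability exceeds difficulty.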
Regarding Claim 4:
The method of claim 1 is taught by Hinton, Zhao, and Brown.
Hinton teaches:
wherein the training of the second model is performed after a label of an output value of the second model is set to a label of the output value of the first model
[Hinton Introduction page 2]: “Our more general solution, called “distillation”, is to raise the temperature of the final softmax until the cumbersome model produces a suitably soft set of targets. We then use the same high temperature when training the small model to match these soft targets [wherein the training of the second model is performed after a label of an output value of the second model is set to a label of the output value of the first model]. We show later that matching the logits of the cumbersome model is actually a special case of distillation.”
Here, matching the soft targets is seen as showing that the labels of the second model’s output are the labels of the first model’s output: matching a target means the labels match, so the labels would be the same. Further support for the idea that the cumbersome model (first model) and the small model (second model) use the same labels is given in:
[Hinton Introduction page 1]: “The cumbersome model could be an ensemble of separately trained models or a single very large model trained with a very strong regularizer such as dropout [9]. Once the cumbersome model has been trained, we can then use a different kind of training, which we call “distillation” to transfer the knowledge from the cumbersome model to a small model that is more suitable for deployment.”
The quote notes that the small model (second model) is more suitable for deployment than the cumbersome model (first model), which means the small model replicates the outputs of the cumbersome model in order to replace it.
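For illustration only, and not as part of the cited reference, the effect of raising the softmax temperature described in the Hinton quote for this claim can be sketched as follows. The logit values are hypothetical and chosen only to show that a higher temperature produces a softer set of targets.

```python
import math


def softmax_with_temperature(logits, temperature=1.0):
    """Softmax over logits divided by a temperature; higher values soften the output."""
    scaled = [z / temperature for z in logits]
    largest = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(s - largest) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical logits from a large model for three classes.
teacher_logits = [4.0, 1.0, 0.0]
sharp_targets = softmax_with_temperature(teacher_logits, temperature=1.0)
soft_targets = softmax_with_temperature(teacher_logits, temperature=5.0)
```

In this sketch the largest probability in `soft_targets` is lower than in `sharp_targets`, so the softened distribution carries more information about the non-maximal classes for the small model to match.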
Regarding Claim 6:
The method of claim 4 is taught by Hinton, Zhao, and Brown.
Brown teaches:
providing a service for the KT using the second model
[Brown Abstract]: "In one embodiment, a method for analyzing the learning of a student [providing a service for the KT using the second model] includes administering, by an assessment agent, a task to a student, the task comprising a question having an associated difficulty. The assessment agent receives a response to the question from the student and evaluates the response to generate an observable, the observable comprising information related to the response."
The idea of providing a task or service using the second or distilled model is noted in Hinton.
[Hinton Introduction page 1]: “For tasks like speech and object recognition, training must extract structure from very large, highly redundant datasets but it does not need to operate in real time and it can use a huge amount of computation… The cumbersome model could be an ensemble of separately trained models or a single very large model trained with a very strong regularizer such as dropout [9]. Once the cumbersome model has been trained, we can then use a different kind of training, which we call “distillation” to transfer the knowledge from the cumbersome model to a small model that is more suitable for deployment.”
The motivation to combine with Brown is the same as the motivation in claim 1.
Regarding Claim 7:
Claim 7 is analogous to claim 1.
Regarding Claim 10:
The device of claim 7 is taught by Hinton, Zhao, and Brown.
Hinton teaches:
wherein the training of the second model is performed after a label of an output value of the second model is set to a label of the output value of the first model
[Hinton Introduction page 2]: “Our more general solution, called “distillation”, is to raise the temperature of the final softmax until the cumbersome model produces a suitably soft set of targets. We then use the same high temperature when training the small model to match these soft targets [wherein the training of the second model is performed after a label of an output value of the second model is set to a label of the output value of the first model]. We show later that matching the logits of the cumbersome model is actually a special case of distillation.”
Here, matching the soft targets is seen as showing that the labels of the second model’s output are the labels of the first model’s output: matching a target means the labels match, so the labels would be the same. Further support for the idea that the cumbersome model (first model) and the small model (second model) use the same labels is given in:
[Hinton Introduction page 1]: “The cumbersome model could be an ensemble of separately trained models or a single very large model trained with a very strong regularizer such as dropout [9]. Once the cumbersome model has been trained, we can then use a different kind of training, which we call “distillation” to transfer the knowledge from the cumbersome model to a small model that is more suitable for deployment.”
The quote notes that the small model (second model) is more suitable for deployment than the cumbersome model (first model), which means the small model replicates the outputs of the cumbersome model in order to replace it.
Regarding Claim 12:
The device of claim 10 is taught by Hinton, Zhao, and Brown.
Claim 12 is analogous to claim 6 aside from what is taught below.
Zhao teaches:
through a terminal
[Zhao 0133]: “Further, it should be appreciated that modules and/or other appropriate means for performing the methods and techniques described herein can be downloaded and/or otherwise obtained by a user terminal [through a terminal] and/or base station as applicable. For example, such a device can be coupled to a server to facilitate the transfer of means for performing the methods described herein. Alternatively, various methods described herein can be provided via storage means (e.g., RAM, ROM, a physical storage medium such as a compact disc (CD) or floppy disk, etc.), such that a user terminal and/or base station can obtain the various methods upon coupling or providing the storage means to the device. Moreover, any other suitable technique for providing the methods and techniques described herein to a device can be utilized.”
The motivation to utilize a user terminal is the same as the motivation used to combine with Zhao in claim 7.
Conclusion
The prior art made of record and not relied upon is considered pertinent to applicant's disclosure.
Lathrop et al (US 20200074874 A1) is considered pertinent art, as Lathrop et al discusses the use of machine learning to predict a student’s answers, and the probability of a student answering a question correctly is within the current disclosure.
Yan et al (US 11200497 B1) is considered pertinent art, as Yan et al discusses using knowledge distillation to transfer a model trained to perform a particular task into a smaller distilled model that can perform the same task.
Applicant's amendment necessitated the new ground(s) of rejection presented in this Office action. Accordingly, THIS ACTION IS MADE FINAL. See MPEP § 706.07(a). Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a).
A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action. In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any nonprovisional extension fee (37 CFR 1.17(a)) pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action. In no event, however, will the statutory period for reply expire later than SIX MONTHS from the mailing date of this final action.
Any inquiry concerning this communication or earlier communications from the examiner should be directed to CHRISTOPHER D DEVORE whose telephone number is (703)756-1234. The examiner can normally be reached Monday-Friday 7:30 am - 5 pm EST.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Michael J Huntley, can be reached at (303) 297-4307. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.
/C.D.D./Examiner, Art Unit 2129
/MICHAEL J HUNTLEY/Supervisory Patent Examiner, Art Unit 2129