DETAILED ACTION
1. The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
2. Applicant’s submission filed 08 October 2025 [hereinafter Response] has been entered, where:
Claim 2 has been amended.
Claims 3, 10, and 14 have been cancelled.
Claims 1, 2, 4-9, 11-13, and 15-18 are pending.
Claims 1, 2, 4-9, 11-13, and 15-18 are rejected.
Claim Rejections - 35 U.S.C. § 101
3. Examiner WITHDRAWS the rejection under 35 U.S.C. § 101 as to Applicant’s claims 1, 2, 4-9, 11-13, and 15-18.
Claim Rejections - 35 U.S.C. § 103
4. The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.
5. The factual inquiries for establishing a background for determining obviousness under 35 U.S.C. 103 are summarized as follows:
1. Determining the scope and contents of the prior art.
2. Ascertaining the differences between the prior art and the claims at issue.
3. Resolving the level of ordinary skill in the pertinent art.
4. Considering objective evidence present in the application indicating obviousness or nonobviousness.
6. This application currently names joint inventors. In considering patentability of the claims the examiner presumes that the subject matter of the various claims was commonly owned as of the effective filing date of the claimed invention(s) absent any evidence to the contrary. Applicant is advised of the obligation under 37 CFR 1.56 to point out the inventor and effective filing dates of each claim that was not commonly owned as of the effective filing date of the later invention in order for the examiner to consider the applicability of 35 U.S.C. 102(b)(2)(C) for any potential 35 U.S.C. 102(a)(2) prior art against the later invention.
7. Claims 1, 2, 4-9, 11-13, and 15-18 are rejected under 35 U.S.C. 103 as being unpatentable over US Published Application 20210004676 to Jaderberg et al. [hereinafter Jaderberg] in view of US Published Application 20210182660 to Amirguliyev et al. [hereinafter Amirguliyev] and Oehmcke et al., “Knowledge Sharing for Population Based Neural Network Training,” Springer (2018) [hereinafter Oehmcke].
Regarding claims 1 and 12, Jaderberg teaches [a] process for evolving and providing access to a teacher neural network model trained on private data for use in evolving a child neural network model for solving a predetermined problem or making a prediction (Jaderberg ¶ 0039 teaches [t]he method (that is, process) enables neural network training to be carried out more efficiently and to produce a better trained neural network), of claims 1 and 12, the process comprising:
evolving a teacher neural network model in accordance with one or more domain factors (Jaderberg ¶ 0119 teaches a supervised learning task is specified as the machine learning task for the system (that is, the “supervised learning task” is one or more domain factors); e.g., Jaderberg teaches that these one or more domain factors include a deep reinforcement machine learning task (Jaderberg ¶ 0111), a machine translation learning task (Jaderberg ¶ 0119), and a set of training operations for a population of pairs of candidate and discriminator neural networks of a GAN task (Jaderberg ¶ 0126)), the evolving including:
(i) creating by a first subsystem, including a first server, a first population of candidate teacher neural network models and assigning a unique candidate identifier to each of the candidate teacher neural network models in the first population (Jaderberg, Fig. 1, teaches an initial population P that assigns a unique candidate identifier to each of the candidate teacher neural network models in the first population [Examiner annotations in dashed-line text boxes]:
[Annotated figure: media_image1.png (Jaderberg, Fig. 1; greyscale, 635 × 556)]
Jaderberg ¶ 0045 teaches example population based neural network training system (that is, creating by a first subsystem, including a first server, a first population of candidate teacher neural network models and assigning a unique candidate identifier to each of the candidate teacher neural network models in the first population); Jaderberg ¶ 0149 teaches [i]mplementations of the subject matter described in this specification can be implemented in a computing system that includes a back end component, e.g., as a data server) (that is, a first subsystem, including a first server); see also Jaderberg ¶ 0142, which teaches an “engine” is a software-based system, subsystem, or process that is programmed to perform one or more specific functions. Generally, an engine will be implemented as one or more software modules or components, installed on one or more computers in one or more locations. In some cases, one or more computers will be dedicated to a particular engine; in other cases, multiple engines can be installed and running on the same computer or computers);
(ii) transmitting the first population of candidate teacher neural network models with assigned candidate identifiers, to a second subsystem, including a second server (Jaderberg ¶ 0039 teaches that the training is such that it may be performed by a distributed system and the described approach only requires values of parameters, hyperparameters, and quality measures to be communicated between candidates (that is, the “distributed system” provides the second subsystem, including a second server); Jaderberg ¶ 0072 teaches [a] population based neural network training system 100 benefits from local optimization by executing, asynchronously and in parallel, an iterative training process for each candidate neural network 120A-N in the population (that is, “local optimization” accessing the population by transmitting the first population of candidate teacher models)), . . . ;
(iii) training by the second subsystem the first population of candidate teacher neural network models (Jaderberg ¶ 0005 teaches [d]uring the training of the neural network, the system maintains a plurality of candidate neural networks (that is, training . . . the first population of candidate teacher models); Jaderberg ¶ 0039 teaches that the training is such that it may be performed by a distributed system and the described approach only requires values of parameters, hyperparameters, and quality measures to be communicated between candidates (that is, the “distributed system” provides the second subsystem)) . . . ;
(iv) determining by the second subsystem a set of performance metrics for each of the candidate teacher neural network models (Jaderberg ¶ 0051 teaches [d]uring training, the network parameters, the hyperparameters, and the quality measure for a candidate neural network are updated in accordance with training operations, including an iterative training process (that is, “during training, . . . the quality measure . . . [is] updated” is determining by the second subsystem a set of performance metrics for each of the candidate teacher models)), wherein the set of performance metrics is indicative of a fitness of each of the candidate teacher models for solving the predetermined problem or making a prediction (Jaderberg ¶ 0111 teaches “a reinforcement learning task is specified as the machine learning task for the system. Each candidate neural network receives inputs and generates outputs that conform to a deep reinforcement learning task [(that is, “task” is for solving the predetermined problem or making a prediction)]”; Jaderberg ¶ 0113 teaches “eval(•): [t]he system updates the quality measure of a candidate neural network based on the mean value of a pre-determined number of previous episodic rewards (e.g., 10 episodic rewards). Specifically, the quality measure of the candidate neural network is the mean value of the predetermined number of previous episodic rewards. The candidate neural network with the highest mean episodic reward [(that is, each of the candidate teacher models)] has the highest quality measure and is considered the ‘best’ in terms of measured fitness [(that is, the set of performance metrics is indicative of a fitness of each of the candidate teacher models for solving the predetermined problem or making a prediction)]”) . . .
[Examiner notes that the plain meaning of a “metric” is a standard for measurement or qualitative assessment, and that the broadest reasonable interpretation of “the set of performance metrics” encompasses such assessments with regard to a “fitness of each of the candidate teacher models” for the intended purpose, which is not inconsistent with the Applicant’s disclosure. (MPEP § 2111). Accordingly, the claim limitations cover the teachings of Jaderberg, as set out above in detail];
(v) providing the set of performance metrics for each of the candidate teacher neural network models in accordance with assigned candidate identifier to the first subsystem (Jaderberg ¶ 0054 teaches that [t]he quality measure for a candidate neural network is a measure of the performance of the candidate neural network on the particular machine learning task (that is, providing the set of performance metrics for each of the candidate teacher models in accordance with assigned candidate identifier to the first subsystem));
(vi) creating a next population of candidate teacher neural network models from the sets of performance metrics for each of the candidate teacher neural network models from the second subsystem (Jaderberg ¶ 0009 teaches the system determines a new quality measure for the candidate neural network in accordance with the new values of the network parameters for the candidate neural network and the new values of the hyperparameters for the candidate neural network and updates the maintained data for the candidate neural network to specify the new values of the hyperparameters, the new values of the network parameters, and the new quality measure; Jaderberg ¶ 0010 teaches [a]fter repeatedly performing the set of training operations, the system selects the trained values of the network parameters from the parameter values in the maintained data based on the maintained quality measures for the candidate neural networks after the training operations have repeatedly been performed (that is, a population of “updated candidate neural networks” is creating a next population of candidate teacher models from the sets of performance metrics for each of the candidate teacher models from the second subsystem));
(vii) repeating steps (ii) to (vi) over multiple generations until an optimized candidate teacher neural network model is determined in accordance with one or more conditions (Jaderberg ¶ 0096 teaches “[t]he system will continue the iterative training process for the candidate neural network until either the termination criteria is satisfied (and the system repeats the process 200 for the candidate neural network) or performance criteria is satisfied to indicate to the system to stop training [(that is, “the iterative training process” is repeating . . . over multiple generations until an optimized candidate teacher neural network model is determined)]”; Jaderberg ¶ 0136 teaches [w]hen training is over, the system generates data specifying a trained neural network by selecting the candidate neural network from the population with the highest quality measure (shown in line 16 in TABLE 1) (that is, a best candidate teacher model is determined in accordance with a predetermined condition); Jaderberg ¶ 0136 teaches the system checks if the termination criteria are satisfied (shown as a conditional statement in line 6 in TABLE 1 executing ready(p, t, P)) (that is, the “termination criteria” is a predetermined condition)); and
providing access to the optimized candidate teacher neural network model trained on the first secure data set in a commonly accessible location (Jaderberg ¶ 0049 teaches system 100 can output (e.g., by outputting to a user device or by storing in memory accessible to the system) the trained values of the network parameters of the trained neural network 150 for later use in processing inputs using the trained neural network 150 (that is, providing access to the best candidate teacher model in a commonly accessible location); Jaderberg ¶ 0066 teaches [an] optimal candidate neural network selected is sometimes referred to in this specification as the “best” candidate neural network 120A-N in the population. A candidate neural network that has a higher quality measure than another candidate neural network is considered “better” than the other candidate neural network (that is, the best candidate teacher model)), . . . .
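[For illustration only, the following minimal Python sketch paraphrases the population based training loop mapped above: candidates carry unique identifiers, a quality measure (here, the mean of the last N episodic rewards) serves as the fitness, lower-ranked candidates copy and perturb the parameters and hyperparameters of higher-ranked candidates over repeated generations, and the best candidate is selected when training terminates. The sketch is not drawn from any cited reference, and every identifier in it (Candidate, train_step, evolve) is hypothetical.]

# Illustrative sketch only; not Jaderberg's implementation.
import copy
import random
from dataclasses import dataclass, field

@dataclass
class Candidate:
    cid: int                                   # unique candidate identifier
    params: dict                               # network parameters (placeholder)
    hparams: dict                              # hyperparameters, e.g. learning rate
    rewards: list = field(default_factory=list)

    def fitness(self, n: int = 10) -> float:
        """Quality measure: mean of the last n episodic rewards."""
        recent = self.rewards[-n:]
        return sum(recent) / len(recent) if recent else float("-inf")

def train_step(candidate: Candidate) -> None:
    """Placeholder for one local training iteration on a worker."""
    candidate.rewards.append(random.random())  # stand-in for an episodic reward

def evolve(population: list, generations: int = 50) -> Candidate:
    """Repeat train / evaluate / exploit / explore until the generation budget ends."""
    for _ in range(generations):
        for candidate in population:
            train_step(candidate)
        ranked = sorted(population, key=Candidate.fitness, reverse=True)
        cutoff = max(1, len(ranked) // 5)
        for loser in ranked[-cutoff:]:         # exploit: copy from a top candidate
            winner = random.choice(ranked[:cutoff])
            loser.params = copy.deepcopy(winner.params)
            loser.hparams = copy.deepcopy(winner.hparams)
            loser.hparams["lr"] *= random.choice([0.8, 1.2])   # explore: perturb
    return max(population, key=Candidate.fitness)              # select the "best"

population = [Candidate(cid=i, params={}, hparams={"lr": 1e-3}) for i in range(8)]
best_candidate = evolve(population)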
Though Jaderberg teaches a population based neural network training system that benefits from local optimization, Jaderberg does not explicitly teach –
* * *
[(ii) transmitting the first population of candidate teacher models] . . . , wherein the first server and the second server are separated by a first firewall which blocks access to a first secure data set;
[(iii) training] . . . against a first secure data set located behind the first firewall;
[(iv) determining] . . . , wherein . . . the set of performance metrics does not include secure data from the second secure data set;
* * *
But Amirguliyev teaches -
* * *
[(ii) transmitting the first population of candidate teacher models] . . . , wherein the first server and the second server are separated by a first firewall which blocks access to a first secure data set (Amirguliyev, Fig. 2, teaches a firewall separating servers [Examiner annotations in dashed-line text boxes]:
[Annotated figure: media_image2.png (Amirguliyev, Fig. 2; greyscale, 718 × 664)]
Amirguliyev ¶ 0066 teaches “the local area network 212 has a firewall that prevents or controls access to at least devices on the master local area network 212. The master device 210 also has a storage device 260 to store parameter data for an instantiated first version of a neural network model 262 [(that is, the first server and the second server are separated by a first firewall which blocks access to a first secure data set)]”);
[(iii) training] . . . against a first secure data set located behind the first firewall (Amirguliyev ¶ 0068 teaches “Following training on data from the slave data source 230, the slave device 220 stores an updated set of parameters in the storage device 272 and use these to generate a second configuration data (CD2) 290 that is sent to the master device 210. The master device 210 receives the second configuration data 290 and uses it to update the parameter data stored in the storage device 260 [(that is, [(iii) training] . . . against a first secure data set located behind the first firewall)]”);
[(iv) determining] . . . , wherein . . . the set of performance metrics does not include secure data from the second secure data set (Amirguliyev ¶ 0100 teaches “[o]ne or more performance metrics may be determined and compared to predefined thresholds; performance below a threshold may lead to exclusion or down-weighting of a particular instance [(that is, via the firewall, it is inherent that the set of performance metrics does not include secure data from the second secure data set)]”);
* * *
Jaderberg and Amirguliyev are from the same or similar field of endeavor. Jaderberg teaches population based training of neural networks. Amirguliyev teaches distributed training of a neural network model where a master device has a first version of the neural network model and a slave device is communicatively coupled to a first data source and the master device, and the first data source is inaccessible by the master device. Thus, it would have been obvious to a person having ordinary skill in the art as of the effective filing date of the Applicant’s invention to modify Jaderberg pertaining to population-based training with distributed training of a neural network model having a firewall of Amirguliyev. The motivation to do so is to provide “an iterative exchangeable teacher-student configuration enables asynchronous data exchange to improve both versions of the neural network model, without exchanging any private information or requiring access to secure data systems.” (Amirguliyev ¶ 0008).
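[For illustration only, the following Python sketch shows the data-isolation idea the combination relies on: a worker behind a firewall trains on its local secure data set and returns only configuration data (parameters) and a performance metric, so no secure records cross the boundary. The sketch is an assumption-laden paraphrase, not Amirguliyev’s implementation or API; SecureWorker, MasterDevice, and round_trip are hypothetical names.]

# Illustrative sketch only; only configuration data and a metric cross the boundary.
class SecureWorker:
    """Runs behind the firewall; its secure data set is never exported."""

    def __init__(self, secure_dataset: list):
        self._secure_dataset = secure_dataset          # stays local

    def train(self, configuration_data: dict) -> dict:
        params = dict(configuration_data)              # instantiate model from CD1
        for _record in self._secure_dataset:           # local training loop (placeholder)
            params["step"] = params.get("step", 0) + 1
        metric = 0.9                                   # placeholder quality measure
        return {"params": params, "metric": metric}    # CD2: parameters and metric only

class MasterDevice:
    """Outside the firewall; sees configuration data and metrics, never raw records."""

    def __init__(self):
        self.params = {"step": 0}

    def round_trip(self, worker: SecureWorker) -> None:
        cd2 = worker.train(self.params)                # send CD1, receive CD2
        assert "records" not in cd2                    # nothing from the secure set
        self.params = cd2["params"]

master = MasterDevice()
master.round_trip(SecureWorker(secure_dataset=[{"x": 1}, {"x": 2}]))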
Though Jaderberg and Amirguliyev teach the features of population-based training in a distributed environment with a firewall, the combination of Jaderberg and Amirguliyev does not explicitly teach –
* * *
[providing access] . . . , wherein the access does not include access to secure data from the first secure data set, and further wherein the optimized candidate teacher neural network model operates on one or more additional datasets to train one or more optimized candidate student neural network models to solve the predetermined problem or make the prediction, wherein the one or more optimized candidate student neural network model is smaller than the optimized candidate teacher neural network model.
But Oehmcke teaches –
* * *
[providing access] . . . , wherein the access does not include access to secure data from the first secure data set, and further wherein the optimized candidate teacher neural network model operates on one or more additional datasets to train one or more optimized candidate student neural network models (Oehmcke at p. 259, “1. Introduction,” second paragraph, teaches a novel extension to [population based training (PBT)] by enabling knowledge sharing across generations. We adapt a knowledge distilling strategy, which is inspired by Hinton et al. [10], where the knowledge about the training data of the best individuals is stored separately and fed back to all individuals via the loss function) to solve the predetermined problem or make the prediction (Oehmcke teaches extensions to population-based training with knowledge distilling in Algorithm 1 (Examiner annotations in dashed-line text boxes):
[Annotated figure: media_image3.png (Oehmcke, Algorithm 1; greyscale, 543 × 799)]
Oehmcke at p. 260, “Knowledge Sharing,” first paragraph, teaches extensions to the [population-based training] with knowledge distilling [as] highlighted with [shaded boxes] in Algorithm 1. The teacher output T = {t1, . . . , tn} ∈ ℝc is initialized with the one-hot-encoded class targets of the true training targets Ytrain (Line 1)). During the evolutionary process, the best models are allowed to contribute to T through the teach-function (Line 13). We implement this teach-function by replacing 20% of the teacher output T with the predicted probability if the individual is from the upper 20% of the population P regarding the fitness p (that is, the one or more additional datasets by the best candidate teacher model). Depending on the population size, we are able to replace the original targets from Y in a few generations and introduce updates from generations continuously (that is, “generations” is evolving); Oehmcke at p. 261, “2.2 Knowledge Sharing,” second & third paragraph, teaches [the] combination of cross entropy and Kullback-Leibler divergence ensures that the models can learn the true labels, while also utilizing the already acquired knowledge of the population. The trade-off parameter α is added to the hyperparameter settings h. To compare the output distributions of the teacher and the individuals (that is, the candidate student models), we employ the Kullback–Leibler divergence DKL inspired by other distilling approaches [23]. The one-hot encoding of the true target as initialization results in a loss function equal to only using cross entropy since the Kullback-Leibler divergence is approximately equal to the cross entropy when all-except-one class probabilities are zero. There are similarities to generative adversarial networks (GANs) [7], where the generator is similar to the teacher (that is, the teacher model) and discriminator is similar to the student (that is, the student model)), wherein the one or more optimized candidate student neural network model is smaller than the optimized candidate teacher neural network model (Oehmcke at p. 265, “4.1 Distilling Knowledge,” first paragraph, teaches “They trained a complex model and let it be the teacher for a simpler model, the student. The student model is then able to achieve nearly the same performance as the complex one, which was not possible when training the simple model without the teacher).
Jaderberg, Amirguliyev, and Oehmcke are from the same or similar field of endeavor. Jaderberg teaches population based training of neural networks. Amirguliyev teaches distributed training of a neural network model where a master device has a first version of the neural network model and a slave device is communicatively coupled to a first data source and the master device, and the first data source is inaccessible by the master device. Oehmcke teaches extensions to population-based training to provide knowledge distilling (or sharing). Thus, it would have been obvious to a person having ordinary skill in the art as of the effective filing date of the Applicant’s invention to modify the combination of Jaderberg and Amirguliyev pertaining to distributed neural network training having a firewall with the population-based training extended to knowledge distillation of Oehmcke.
The motivation to do so is for a novel knowledge distilling scheme where only the best individuals of the population are allowed to share part of their knowledge about the training data with the whole population to embrace the idea of randomness between the models, rather than avoiding it, because the resulting diversity of models is important for the population’s evolution. (Oehmcke, Abstract).
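[For illustration only, the following Python sketch shows one plausible form of the Oehmcke loss discussed above: a convex combination of cross entropy on the true labels and the Kullback-Leibler divergence to the stored teacher output, weighted by the trade-off parameter α. The exact formula is not reproduced in the cited excerpt, so this form is an assumption; distill_loss is a hypothetical name.]

# Illustrative sketch only; the exact formula is an assumption, not Oehmcke's code.
import numpy as np

def distill_loss(y_true: np.ndarray, teacher: np.ndarray,
                 student: np.ndarray, alpha: float) -> float:
    """y_true: one-hot targets; teacher, student: class-probability arrays of shape (n, c)."""
    eps = 1e-12
    ce = -np.sum(y_true * np.log(student + eps), axis=1)                      # learn true labels
    kl = np.sum(teacher * np.log((teacher + eps) / (student + eps)), axis=1)  # teacher knowledge
    return float(np.mean((1.0 - alpha) * ce + alpha * kl))

# While the teacher output T is still the one-hot encoding of the true targets,
# the KL term equals the cross entropy, so the loss reduces to plain cross
# entropy, consistent with Oehmcke's initialization remark quoted above.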
[Examiner notes that the term "firewall" recited in Applicant's claims is interpreted to be a well-known network structure.1]
Regarding claim 8, Jaderberg teaches [a] process for evolving and providing access to a teacher neural network model trained on private data for use in evolving a child neural network model for solving a predetermined problem or making a prediction (Jaderberg ¶ 0039 teaches [t]he method (that is, process) enables neural network training to be carried out more efficiently and to produce a better trained neural network), the process comprising:
creating and training a teacher neural network model by a first system including a first server (Jaderberg ¶ 0005 teaches [d]uring the training of the neural network, the system maintains a plurality of candidate neural networks (that is, creating and training a teacher model); Jaderberg ¶ 0039 teaches that the training is such that it may be performed by a distributed system and the described approach only requires values of parameters, hyperparameters, and quality measures to be communicated between candidates (that is, the “distributed system” provides the first system including a first server)), . . . ;
determining by the first system a set of performance metrics for each of the teacher neural network model (Jaderberg ¶ 0051 teaches [d]uring training, the network parameters, the hyperparameters, and the quality measure for a candidate neural network are updated in accordance with training operations, including an iterative training process (that is, “during training, . . . the quality measure . . . [is] updated” is determining by the first system a set of performance metrics for each of the teacher model)), . . . ;
providing access to the trained teacher neural network model in a commonly accessible area (Jaderberg ¶ 0049 teaches system 100 can output (e.g., by outputting to a user device or by storing in memory accessible to the system) the trained values of the network parameters of the trained neural network 150 for later use in processing inputs using the trained neural network 150 (that is, providing access to the trained teacher model in a commonly accessible area)), . . . ;
evolving a . . . neural network model in accordance with one or more domain factors (Jaderberg ¶ 0055 teaches system 100 trains each candidate neural network 120A-N by repeatedly performing iterations of an iterative training process to determine updated network parameters for the respective candidate neural network (that is, the “repeatedly performing iterations” is evolving a . . . model); Jaderberg ¶ 0119 teaches a supervised learning task is specified as the machine learning task for the system (that is, the “supervised learning task” is one or more domain factors); e.g., Jaderberg teaches that these one or more domain factors include a deep reinforcement machine learning task (Jaderberg ¶ 0111), a machine translation learning task (Jaderberg ¶ 0119), and a set of training operations for a population of pairs of candidate and discriminator neural networks of a GAN task (Jaderberg ¶ 0126)), the evolving including:
(i) creating by a second subsystem, including a second server, a first population of candidate . . . neural network models and assigning a unique candidate identifier to each of the candidate . . . neural network models in the first population (Jaderberg, Fig. 1, teaches an initial population P that assigns a unique candidate identifier to each of the candidate student models in the first population [Examiner annotations in dashed-line text boxes]:
[Annotated figure: media_image4.png (Jaderberg, Fig. 1; greyscale, 624 × 559)]
Jaderberg ¶ 0045 teaches example population based neural network training system (that is, creating by a second subsystem, including a second server, a first population of candidate . . . models and assigning a unique candidate identifier to each of the candidate . . . models in the first population); Jaderberg ¶ 0149 teaches [i]mplementations of the subject matter described in this specification can be implemented in a computing system that includes a back end component, e.g., as a data server) (that is, a second subsystem, including a second server); see also Jaderberg ¶ 0142, which teaches an “engine” is a software-based system, subsystem, or process that is programmed to perform one or more specific functions. Generally, an engine will be implemented as one or more software modules or components, installed on one or more computers in one or more locations. In some cases, one or more computers will be dedicated to a particular engine; in other cases, multiple engines can be installed and running on the same computer or computers);
(ii) transmitting the first population of candidate student neural network models with assigned candidate identifiers, to a third subsystem, including a third server (Jaderberg ¶ 0039 teaches that the training is such that it may be performed by a distributed system and the described approach only requires values of parameters, hyperparameters, and quality measures to be communicated between candidates (that is, the “distributed system” provides the third subsystem, including a third server); Jaderberg ¶ 0072 teaches [a] population based neural network training system 100 benefits from local optimization by executing, asynchronously and in parallel, an iterative training process for each candidate neural network 120A-N in the population (that is, “local optimization” accessing the population by transmitting the first population of candidate . . . models)), . . . ;
(iii) training by the third subsystem the first population of candidate . . . neural network models (Jaderberg ¶ 0005 teaches [d]uring the training of the neural network, the system maintains a plurality of candidate neural networks (that is, training . . . the first population of candidate . . . models); Jaderberg ¶ 0039 teaches that the training is such that it may be performed by a distributed system and the described approach only requires values of parameters, hyperparameters, and quality measures to be communicated between candidates (that is, the “distributed system” provides the third subsystem)) . . . ;
(iv) determining by the third subsystem a set of performance metrics for each of the candidate student neural network models (Jaderberg ¶ 0051 teaches [d]uring training, the network parameters, the hyperparameters, and the quality measure for a candidate neural network are updated in accordance with training operations, including an iterative training process (that is, “during training, . . . the quality measure . . . [is] updated” is determining by the second subsystem a set of performance metrics for each of the candidate teacher models)), wherein the set of performance metrics is indicative of a fitness of each of the candidate teacher models for solving the predetermined problem or making a prediction (Jaderberg ¶ 0111 teaches “a reinforcement learning task is specified as the machine learning task for the system. Each candidate neural network receives inputs and generates outputs that conform to a deep reinforcement learning task [(that is, “task” is for solving the predetermined problem or making a prediction)]”; Jaderberg ¶ 0113 teaches “eval(•): [t]he system updates the quality measure of a candidate neural network based on the mean value of a pre-determined number of previous episodic rewards (e.g., 10 episodic rewards). Specifically, the quality measure of the candidate neural network is the mean value of the predetermined number of previous episodic rewards. The candidate neural network with the highest mean episodic reward [(that is, each of the candidate teacher models)] has the highest quality measure and is considered the ‘best’ in terms of measured fitness [(that is, the set of performance metrics is indicative of a fitness of each of the candidate teacher models for solving the predetermined problem or making a prediction)]”) . . . ;
(v) providing the set of performance metrics for each of the candidate . . . neural network models in accordance with assigned candidate identifier to the second subsystem (Jaderberg ¶ 0054 teaches that [t]he quality measure for a candidate neural network is a measure of the performance of the candidate neural network on the particular machine learning task (that is, providing the set of performance metrics for each of the candidate . . . models in accordance with assigned candidate identifier to the second subsystem));
(vi) creating a next population of candidate student neural network models from the sets of performance metrics for each of the candidate . . . neural network models from the third subsystem (Jaderberg ¶ 0009 teaches the system determines a new quality measure for the candidate neural network in accordance with the new values of the network parameters for the candidate neural network and the new values of the hyperparameters for the candidate neural network and updates the maintained data for the candidate neural network to specify the new values of the hyperparameters, the new values of the network parameters, and the new quality measure; Jaderberg ¶ 0010 teaches [a]fter repeatedly performing the set of training operations, the system selects the trained values of the network parameters from the parameter values in the maintained data based on the maintained quality measures for the candidate neural networks after the training operations have repeatedly been performed (that is, a population of “updated candidate neural networks” is creating a next population of candidate student models from the sets of performance metrics for each of the candidate . . . models from the third subsystem));
(vii) repeating steps (ii) to (vi) for multiple generations until an optimized candidate . . . neural network model is determined in accordance with one or more predetermined conditions (Jaderberg ¶ 0136 teaches [w]hen training is over, the system generates data specifying a trained neural network by selecting the candidate neural network from the population with the highest quality measure (shown in line 16 in TABLE 1) (that is, a best candidate . . . model is determined in accordance with a predetermined condition); Jaderberg ¶ 0136 teaches the system checks if the termination criteria are satisfied (shown as a conditional statement in line 6 in TABLE 1 executing ready(p, t, P)) (that is, the “termination criteria” is a predetermined condition));
providing access to the optimized candidate . . . neural network model in the commonly accessible area (Jaderberg ¶ 0049 teaches system 100 can output (e.g., by outputting to a user device or by storing in memory accessible to the system) the trained values of the network parameters of the trained neural network 150 for later use in processing inputs using the trained neural network 150 (that is, providing access to the best candidate. . . model in the commonly accessible area)); Jaderberg ¶ 0066 teaches [an] optimal candidate neural network selected is sometimes referred to in this specification as the “best” candidate neural network 120A-N in the population. A candidate neural network that has a higher quality measure than another candidate neural network is considered “better” than the other candidate neural network (that is, the best candidate . . . model)), . . . ; and
* * *
Though Jaderberg teaches a population based neural network training system that benefits from local optimization, Jaderberg does not explicitly teach –
* * *
[creating and training] . . . , wherein the teacher neural network model is trained on a first secure data set;
[determining] . . . , wherein the set of performance metrics does not include secure data from the first secure data set;
[providing access] . . . , wherein the first subsystem and the commonly accessible area are separated by a first firewall which blocks access to the first secure data set;
* * *
[(ii) transmitting] . . . , wherein the second server and the third server are separated by a second firewall which blocks access to a second secure data set;
[(iii) training] . . . against the second secure data set located behind the second firewall;
[(iv) determining] . . . , wherein the set of performance metrics does not include secure data from the second secure data set;
* * *
[providing access] . . . , wherein the access does not include access to secure data from the first or second secure data sets, and further wherein the second subsystem and the commonly accessible area are separated by a third firewall; and
* * *
But Amirguliyev teaches -
* * *
[creating and training] . . . , wherein the teacher neural network model is trained on a first secure data set (Amirguliyev ¶ 0067 teaches “the master device 210 is configured to instantiate the first version of the neural network model 262 [(that is, the teacher neural network)] using the parameter data from the storage device 260. This includes a process similar to that described for the second version of the neural network model 170 in the example of FIG. 1. The first version of the neural network model 262 may be instantiated using the first configuration data (CD1) 280 or similar data. The master device 210 is then configured to train the first version of the neural network model 262 using data from the master data source 264 [(that is, wherein the teacher neural network model is trained on a first secure data set)]”);
[determining] . . . , wherein the set of performance metrics does not include secure data from the first secure data set (Amirguliyev ¶ 0067 teaches “The first version of the neural network model 262 may be instantiated using the first configuration data (CD1) 280 or similar data [(that is, the “configuration data (CD1)” is wherein the set of performance metrics does not include secure data from the first secure data set)]”);
[providing access] . . . , wherein the first subsystem and the commonly accessible area are separated by a first firewall which blocks access to the first secure data set (Amirguliyev, Fig. 2, teaches a firewall separating servers [Examiner annotations in dashed-line text boxes]:
[Annotated figure: media_image2.png (Amirguliyev, Fig. 2; greyscale, 718 × 664)]
Amirguliyev ¶ 0066 teaches “the local area network 212 has a firewall that prevents or controls access to at least devices on the master local area network 212. The master device 210 also has a storage device 260 to store parameter data for an instantiated first version of a neural network model 262 [(that is, wherein the first subsystem and the commonly accessible area are separated by a first firewall which blocks access to the first secure data set)]”);
* * *
[(ii) transmitting] . . . , wherein the second server and the third server are separated by a second firewall which blocks access to a second secure data set (Amirguliyev, Fig. 2, teaches a firewall separating servers [Examiner annotations in dashed-line text boxes]:
[Annotated figure: media_image2.png (Amirguliyev, Fig. 2; greyscale, 718 × 664)]
Amirguliyev ¶ 0066 teaches “the local area network 212 has a firewall that prevents or controls access to at least devices on the master local area network 212. The master device 210 also has a storage device 260 to store parameter data for an instantiated first version of a neural network model 262 [(that is, the second server and the third server are separated by a second firewall which blocks access to a second secure data set)]”);
[(iii) training] . . . against the second secure data set located behind the second firewall (Amirguliyev ¶ 0068 teaches “Following training on data from the slave data source 230, the slave device 220 stores an updated set of parameters in the storage device 272 and use these to generate a second configuration data (CD2) 290 that is sent to the master device 210. The master device 210 receives the second configuration data 290 and uses it to update the parameter data stored in the storage device 260 [(that is, [(iii) training] . . . against the second secure data set located behind the second firewall)]”);
[(iv) determining] . . . , wherein . . . the set of performance metrics does not include secure data from the second secure data set (Amirguliyev ¶ 0100 teaches “[o]ne or more performance metrics may be determined and compared to predefined thresholds; performance below a threshold may lead to exclusion or down-weighting of a particular instance [(that is, via the firewall, it is inherent that the set of performance metrics does not include secure data from the second secure data set)]”);
* * *
[providing access] . . . , wherein the access does not include access to secure data from the first or second secure data sets, and further wherein the second subsystem and the commonly accessible area are separated by a third firewall (Amirguliyev, Fig. 7, teaches a distributed training system 700 [Examiner annotations in dashed-line text boxes]
[Annotated figure: media_image5.png (Amirguliyev, Fig. 7; greyscale, 714 × 520)]
Amirguliyev ¶ 0088 teaches “a plurality of slave devices 722, 724, and 726 via one or more communication networks 750 . . . communicatively coupled to a respective slave data source 732, 734, and 736, which as per FIGS. 1 and 2 may be inaccessible by the master device 710, e.g. may be located on respective private networks 742, 744, and 746. The slave data sources 732, 734 and 736 may also be inaccessible by other slave devices, e.g. the slave device 724 may not be able to access the slave data source 732 or the slave data source 736. Each slave device 722, 724, and 726 receives the first configuration data 780 from the master device 710 as per the previous examples and generates respective second configuration data (CD2) 792, 794, and 796, which is communicated back to the master device 710 via the one or more communication networks 750. The set of second configuration data 792, 794, and 796 from the plurality of slave devices 722, 724, and 726, respectively, may be used to update parameters for the first version of the neural network model [(that is, wherein the access does not include access to secure data from the first or second secure data sets, and further wherein the second subsystem and the commonly accessible area are separated by a third firewall)]”);
* * *
Though Jaderberg and Amirguliyev teach the features of population-based training in a distributed environment with a firewall, the combination of Jaderberg and Amirguliyev does not explicitly teach –
* * *
and training the optimized candidate student neural network model in accordance with operation by the trained teacher neural network model on one or more additional datasets to solve the predetermined problem or make the prediction, wherein the one or more optimized candidate student neural network model is smaller than the optimized candidate teacher neural network model.
But Oehmcke teaches that “knowledge distilling” is the distilling of knowledge for neural networks in which a complex model is trained and then serves as the teacher for a simpler model, the student (that is, candidate student model). The student model is then able to achieve nearly the same performance as the complex one, which was not possible when training the simple [student] model without the teacher. (Oehmcke at p. 265, “4.1 Distilling Knowledge,” first paragraph).
Oehmcke also teaches -
and training the optimized candidate student neural network model in accordance with operation by the trained teacher neural network model on one or more additional datasets (Oehmcke at p. 259, “1. Introduction,” second paragraph, teaches a novel extension to [population based training (PBT)] by enabling knowledge sharing across generations. We adapt a knowledge distilling strategy, which is inspired by Hinton et al. [10], where the knowledge about the training data of the best individuals is stored separately and fed back to all individuals via the loss function) to solve the predetermined problem or make the prediction (Oehmcke teaches extensions to population-based training with knowledge distilling in Algorithm 1 (Examiner annotations in dashed-line text boxes):
[Annotated figure: media_image3.png (Oehmcke, Algorithm 1; greyscale, 543 × 799)]
Oehmcke at p. 260, “Knowledge Sharing,” first paragraph, teaches extensions to the [population-based training] with knowledge distilling [as] highlighted with [shaded boxes] in Algorithm 1. The teacher output T = {t1, . . . , tn} ∈ ℝc is initialized with the one-hot-encoded class targets of the true training targets Ytrain (Line 1)). During the evolutionary process, the best models are allowed to contribute to T through the teach-function (Line 13). We implement this teach-function by replacing 20% of the teacher output T with the predicted probability if the individual is from the upper 20% of the population P regarding the fitness p (that is, the one or more additional datasets by the best candidate teacher model). Depending on the population size, we are able to replace the original targets from Y in a few generations and introduce updates from generations continuously (that is, “generations” is evolving); Oehmcke at p. 261, “2.2 Knowledge Sharing,” second & third paragraph, teaches [the] combination of cross entropy and Kullback-Leibler divergence ensures that the models can learn the true labels, while also utilizing the already acquired knowledge of the population. The trade-off parameter α is added to the hyperparameter settings h. To compare the output distributions of the teacher and the individuals (that is, the candidate student models), we employ the Kullback–Leibler divergence DKL inspired by other distilling approaches [23]. The one-hot encoding of the true target as initialization results in a loss function equal to only using cross entropy since the Kullback-Leibler divergence is approximately equal to the cross entropy when all-except-one class probabilities are zero. There are similarities to generative adversarial networks (GANs) [7], where the generator is similar to the teacher (that is, the teacher model) and discriminator is similar to the student (that is, the student model)), wherein the one or more optimized candidate student neural network model is smaller than the optimized candidate teacher neural network model (Oehmcke at p. 265, “4.1 Distilling Knowledge,” first paragraph, teaches “They trained a complex model and let it be the teacher for a simpler model, the student. The student model is then able to achieve nearly the same performance as the complex one, which was not possible when training the simple model without the teacher).
Jaderberg, Amirguliyev, and Oehmcke are from the same or similar field of endeavor. Jaderberg teaches population based training of neural networks. Amirguliyev teaches distributed training of a neural network model where a master device has a first version of the neural network model and a slave device is communicatively coupled to a first data source and the master device, and the first data source is inaccessible by the master device. Oehmcke teaches extensions to population-based training to provide knowledge distilling (or sharing). Thus, it would have been obvious to a person having ordinary skill in the art as of the effective filing date of the Applicant’s invention to modify the combination of Jaderberg and Amirguliyev pertaining to distributed neural network training having a firewall with the population-based training extended to knowledge distillation of Oehmcke.
The motivation to do so is for a novel knowledge distilling scheme where only the best individuals of the population are allowed to share part of their knowledge about the training data with the whole population to embrace the idea of randomness between the models, rather than avoiding it, because the resulting diversity of models is important for the population’s evolution. (Oehmcke, Abstract).
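[For illustration only, the following Python sketch paraphrases the teach-function described in the Oehmcke passages quoted above: an individual whose fitness places it in the top 20% of the population overwrites a randomly chosen 20% of the stored teacher output T with its own predicted probabilities, so that teacher knowledge accumulates across generations. This is an interpretation of the quoted description, not the authors’ code; teach and its parameters are hypothetical.]

# Illustrative sketch only; T and predictions are (n_samples, n_classes) probability arrays.
import numpy as np

def teach(T: np.ndarray, predictions: np.ndarray, fitness: float,
          population_fitness: np.ndarray, share: float = 0.2, top: float = 0.2,
          rng=None) -> np.ndarray:
    rng = rng or np.random.default_rng()
    threshold = np.quantile(population_fitness, 1.0 - top)
    if fitness >= threshold:                           # only top-20% individuals teach
        n = T.shape[0]
        idx = rng.choice(n, size=int(share * n), replace=False)
        T = T.copy()
        T[idx] = predictions[idx]                      # replace 20% of the teacher targets
    return T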
Regarding claims 2, 9, and 13, the combination of Jaderberg, Amirguliyev, and Oehmcke teaches all of the limitations of claims 1, 8, and 12, respectively, as described above in detail.
Jaderberg teaches -
wherein the one or more domain factors are selected from the group consisting of: domain constraints, known domain parameters and formatting rules for a specific representation of each of the candidate teacher neural network models and candidate student neural network models (Jaderberg ¶ 0012 teaches [t]he hyperparameters can include values that impact how the values of the network parameters are updated by the training process e.g., the learning rate or other update rule (that is, formatting rules) that defines how the gradients determine at the current training iteration are used to update the network parameter values, objective function values, e.g., entropy cost, weights assigned to various terms of the objective function, and so on; Jaderberg ¶ 0126 teaches, under a set of training operations for a population of pairs of candidate generator and discriminator neural networks of a GAN task (that is, one or more domain factors); Jaderberg ¶ 0130 teaches [i]f the pair of candidate neural networks is ranked below a certain threshold (e.g., bottom 20% of all pairs of candidate neural networks), the system samples another pair of candidate neural networks ranked above a certain threshold (e.g., top 20% of all pairs candidate neural networks) and sets the new network parameters and new hyperparameters of the generator candidate neural network to the values of the network parameters and hyperparameters of the sampled generator candidate neural network (that is, in relating to the “GAN task,” these are known domain parameters) and sets the new network parameters and new hyperparameters of the discriminator candidate neural network to the values of the network parameters and hyperparameters of the sampled discriminator candidate neural network (that is, for a specific representation of each of the candidate individuals)).
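[For illustration only, the following Python sketch paraphrases the exploit step quoted from Jaderberg ¶ 0130: a candidate generator/discriminator pair ranked in the bottom 20% copies the network parameters and hyperparameters of a pair sampled from the top 20%. The sketch is not Jaderberg’s implementation; exploit_pairs and the dictionary layout are hypothetical.]

# Illustrative sketch only; each pair is {"gen": ..., "disc": ..., "quality": float}.
import copy
import random

def exploit_pairs(pairs: list, fraction: float = 0.2) -> None:
    ranked = sorted(pairs, key=lambda pair: pair["quality"], reverse=True)
    k = max(1, int(fraction * len(ranked)))
    for losing_pair in ranked[-k:]:                    # bottom 20% of pairs
        winning_pair = random.choice(ranked[:k])       # sample from the top 20%
        for role in ("gen", "disc"):                   # copy generator and discriminator together
            losing_pair[role] = copy.deepcopy(winning_pair[role])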
Regarding claims 4, 11, and 15, the combination of Jaderberg, Amirguliyev, and Oehmcke teaches all of the limitations of claims 1, 8, and 12, respectively, as described above in detail.
Amirguliyev teaches -
wherein the first subsystem and the commonly accessible location are separated by a second firewall (Amirguliyev, Fig. 7, teaches a distributed training system 700 [Examiner annotations in dashed-line text boxes]
[Annotated figure: media_image5.png (Amirguliyev, Fig. 7; greyscale, 714 × 520)]
Amirguliyev ¶ 0088 teaches “a plurality of slave devices 722, 724, and 726 via one or more communication networks 750 . . . communicatively coupled to a respective slave data source 732, 734, and 736, which as per FIGS. 1 and 2 may be inaccessible by the master device 710, e.g. may be located on respective private networks 742, 744, and 746. The slave data sources 732, 734 and 736 may also be inaccessible by other slave devices, e.g. the slave device 724 may not be able to access the slave data source 732 or the slave data source 736. Each slave device 722, 724, and 726 receives the first configuration data 780 from the master device 710 as per the previous examples and generates respective second configuration data (CD2) 792, 794, and 796, which is communicated back to the master device 710 via the one or more communication networks 750. The set of second configuration data 792, 794, and 796 from the plurality of slave devices 722, 724, and 726, respectively, may be used to update parameters for the first version of the neural network model [(that is, wherein the first subsystem and the commonly accessible location are separated by a second firewall)]”).
Regarding claims 5 and 16, the combination of Jaderberg, Amirguliyev, and Oehmcke teaches all of the limitations of claims 1 and 12, respectively, as described above in detail.
Jaderberg teaches -
further comprising:
evolving the one or more optimized candidate . . . neural network models in accordance with the one or more domain factors (Jaderberg ¶ 0119 teaches a supervised learning task is specified as the machine learning task for the system (that is, the “supervised learning task” is one or more domain factors); e.g., Jaderberg teaches that these one or more domain factors include a deep reinforcement machine learning task (Jaderberg ¶ 0111), a machine translation learning task (Jaderberg ¶ 0119), and a set of training operations for a population of pairs of candidate and discriminator neural networks of a GAN task (Jaderberg ¶ 0126)), the evolving including:
(viii) creating by a third subsystem, including a third server, a first population of candidate . . . neural network models and assigning a unique candidate identifier to each of the candidate . . . neural network models in the first population (Jaderberg, Fig. 1, teaches an initial population P that assigns a unique candidate identifier to each of the candidate teacher models in the first population [Examiner annotations in dashed-line text boxes]:
[Annotated figure: media_image1.png (Jaderberg, Fig. 1; greyscale, 635 × 556)]
Jaderberg ¶ 0045 teaches an example population based neural network training system (that is, creating by a third subsystem, including a third server, a first population of candidate . . . models and assigning a unique candidate identifier to each of the candidate . . . models in the first population); Jaderberg ¶ 0149 teaches [i]mplementations of the subject matter described in this specification can be implemented in a computing system that includes a back end component, e.g., as a data server (that is, a third subsystem, including a third server));
(ix) transmitting the first population of candidate student neural network models with assigned candidate identifiers, to a fourth subsystem, including a fourth server (Jaderberg ¶ 0039 teaches that the training is such that it may be performed by a distributed system and the described approach only requires values of parameters, hyperparameters, and quality measures to be communicated between candidates (that is, the “distributed system” provides the second subsystem, including a second server); Jaderberg ¶ 0072 teaches [a] population based neural network training system 100 benefits from local optimization by executing, asynchronously and in parallel, an iterative training process for each candidate neural network 120A-N in the population (that is, “local optimization” accessing the population by transmitting the first population of candidate teacher models)), . . . ;
(x) training by the fourth subsystem the first population of candidate . . . neural network models (Jaderberg ¶ 0005 teaches [d]uring the training of the neural network, the system maintains a plurality of candidate neural networks (that is, training . . . the first population of candidate teacher models); Jaderberg ¶ 0039 teaches that the training is such that it may be performed by a distributed system and the described approach only requires values of parameters, hyperparameters, and quality measures to be communicated between candidates (that is, the “distributed system” provides the second subsystem)) . . . ;
(xi) determining by the fourth subsystem a set of performance metrics for each of the candidate . . . neural network models (Jaderberg ¶ 0051 teaches [d]uring training, the network parameters, the hyperparameters, and the quality measure for a candidate neural network are updated in accordance with training operations, including an iterative training process (that is, “during training, . . . the quality measure . . . [is] updated” is determining by the second subsystem a set of performance metrics for each of the candidate teacher models)), . . . ;
(xii) providing the set of performance metrics for each of the candidate student neural network models in accordance with assigned candidate identifier to the third subsystem (Jaderberg ¶ 0054 teaches that [t]he quality measure for a candidate neural network is a measure of the performance of the candidate neural network on the particular machine learning task (that is, providing the set of performance metrics for each of the candidate teacher models in accordance with assigned candidate identifier to the third subsystem));
(xiii) creating a next population of candidate student neural network models from the sets of performance metrics for each of the candidate student neural network models from the fourth subsystem (Jaderberg ¶ 0009 teaches the system determines a new quality measure for the candidate neural network in accordance with the new values of the network parameters for the candidate neural network and the new values of the hyperparameters for the candidate neural network and updates the maintained data for the candidate neural network to specify the new values of the hyperparameters, the new values of the network parameters, and the new quality measure; Jaderberg ¶ 0010 teaches [a]fter repeatedly performing the set of training operations, the system selects the trained values of the network parameters from the parameter values in the maintained data based on the maintained quality measures for the candidate neural networks after the training operations have repeatedly been performed (that is, a population of “updated candidate neural networks” is creating a next population of candidate student models from the sets of performance metrics for each of the candidate student models from the fourth subsystem));
(xiv) repeating steps (ix) to (xiii) for multiple generations until an optimized candidate . . . neural network model is determined in accordance with one or more predetermined conditions (Jaderberg ¶ 0136 teaches [w]hen training is over, the system generates data specifying a trained neural network by selecting the candidate neural network from the population with the highest quality measure (shown in line 16 in TABLE 1) (that is, a best candidate teacher model is determined in accordance with a predetermined condition); Jaderberg ¶ 0136 teaches the system checks if the termination criteria are satisfied (shown as a conditional statement in line 6 in TABLE 1 executing ready(p, t, P)) (that is, the “termination criteria” is a predetermined condition)); and
providing access to the optimized candidate student neural network model in the commonly accessible location (Jaderberg ¶ 0066 teaches [an] optimal candidate neural network selected is sometimes referred to in this specification as the “best” candidate neural network 120A-N in the population. A candidate neural network that has a higher quality measure than another candidate neural network is considered “better” than the other candidate neural network); and
* * *
Oehmcke teaches that “knowledge distilling” is the distilling of knowledge for neural networks in which a complex model is trained and then serves as the teacher for a simpler model, the student (that is, candidate student model). The student model is then able to achieve nearly the same performance as the complex one, which was not possible when training the simple [student] model without the teacher. (Oehmcke at p. 265, “4.1 Distilling Knowledge,” first paragraph).
Oehmcke also teaches –
and training the optimized candidate student neural network model in accordance with operation on the one or more additional datasets by the optimized candidate teacher neural network model (Oehmcke at p. 259, “1. Introduction,” second paragraph, teaches a novel extension to PBT by enabling knowledge sharing across generations. We adapt a knowledge distilling strategy, which is inspired by Hinton et al. [10], where the knowledge about the training data of the best individuals (that is, the one or more additional datasets by the optimized candidate teacher neural network model) is stored separately and fed back to all individuals (that is, training the optimized candidate student neural network model) via the loss function) to solve the predetermined problem or make the prediction (Oehmcke teaches extensions to population-based training with knowledge distilling in Algorithm 1 (Examiner annotations in dashed-line text boxes):
[Image: Oehmcke, Algorithm 1 (population-based training extended with knowledge distilling), reproduced with Examiner annotations in dashed-line text boxes]
Oehmcke at p. 260, “Knowledge Sharing,” first paragraph, teaches extensions to the [population-based training] with knowledge distilling [as] highlighted with [shaded boxes] in Algorithm 1. The teacher output T = {t_1, . . . , t_n} ∈ ℝ^c is initialized with the one-hot-encoded class targets of the true training targets Y_train (Line 1). During the evolutionary process, the best models are allowed to contribute to T through the teach-function (Line 13). We implement this teach-function by replacing 20% of the teacher output T with the predicted probability if the individual is from the upper 20% of the population P regarding the fitness p (that is, the one or more additional datasets by the optimized candidate teacher neural network model). Depending on the population size, we are able to replace the original targets from Y in a few generations and introduce updates from generations continuously (that is, “generations” is evolving); Oehmcke at p. 261, “2.2 Knowledge Sharing,” second and third paragraphs, teaches [the] combination of cross entropy and Kullback-Leibler divergence ensures that the models can learn the true labels, while also utilizing the already acquired knowledge of the population. The trade-off parameter α is added to the hyperparameter settings h. To compare the output distributions of the teacher and the individuals (that is, the candidate student models), we employ the Kullback-Leibler divergence D_KL inspired by other distilling approaches [23]. The one-hot encoding of the true target as initialization results in a loss function equal to only using cross entropy since the Kullback-Leibler divergence is approximately equal to the cross entropy when all-except-one class probabilities are zero. There are similarities to generative adversarial networks (GANs) [7], where the generator is similar to the teacher (that is, the teacher model) and discriminator is similar to the student (that is, the student model)).
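As a hedged, illustrative example only, the combined loss described in the passage above (cross entropy against the true one-hot targets plus a Kullback-Leibler term against the shared teacher output T, with trade-off parameter α) can be sketched in Python as follows. The particular weighting (1 − α)·CE + α·KL is one plausible reading and is not asserted to be Oehmcke's exact formulation; the function and variable names are hypothetical.

import numpy as np

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def distillation_loss(student_logits, true_onehot, teacher_probs, alpha=0.5, eps=1e-12):
    p = softmax(student_logits)
    # Cross entropy with the true (one-hot) targets Y_train.
    ce = -np.mean(np.sum(true_onehot * np.log(p + eps), axis=1))
    # KL divergence D_KL(teacher || student) against the shared teacher output T.
    kl = np.mean(np.sum(teacher_probs * (np.log(teacher_probs + eps) - np.log(p + eps)), axis=1))
    # Note: while T is still the one-hot initialization, the KL term reduces to
    # cross entropy (a one-hot distribution has zero entropy), consistent with
    # the observation quoted from Oehmcke above.
    return (1.0 - alpha) * ce + alpha * kl

# Tiny usage example with hypothetical numbers.
logits = np.array([[2.0, 0.5, -1.0]])
y = np.array([[1.0, 0.0, 0.0]])          # true one-hot target (also the initial teacher output)
teacher = np.array([[0.7, 0.2, 0.1]])    # teacher probabilities contributed by a top individual
print(distillation_loss(logits, y, teacher, alpha=0.3))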
Regarding claims 6 and 17, the combination of Jaderberg, Amirguliyev, and Oehmcke teaches all of the limitations of claims 5 and 16, respectively, as described above in detail.
Jaderberg teaches -
wherein the first subsystem and the third subsystem are the same (Jaderberg ¶ 0142 teaches an “engine” is a software-based system, subsystem, or process that is programmed to perform one or more specific functions. Generally, an engine will be implemented as one or more software modules or components, installed on one or more computers in one or more locations. In some cases, one or more computers will be dedicated to a particular engine; in other cases, multiple engines can be installed and running on the same computer or computers (that is, “running on the same computer or computers,” is the first subsystem and the third subsystem are the same)).
Regarding claim 7, the combination of Jaderberg, Amirguliyev, and Oehmcke teaches all of the limitations of claim 1, as described above in detail.
Oehmcke teaches -
wherein the one or more additional datasets used to train the one or more optimized candidate student neural network models includes at least a portion of data from the first secure dataset (Oehmcke at p. 261, “2.2 Knowledge Sharing,” first paragraph, teaches this teach-function by replacing 20% of the teacher output T with the predicted probability if the individual is from the upper 20% of the population P regarding the fitness p. (that is, the “replacing 20% of the teacher output T” includes at least a portion of the data from the first secure dataset)).
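For illustration of the quoted teach-function behavior only, and as an assumption about one way such a replacement could be implemented rather than Oehmcke's actual code, a minimal Python sketch of replacing a fraction of the stored teacher output T with a top-20% individual's predicted probabilities is shown below; all names and shapes are hypothetical.

import numpy as np

def teach(teacher_output, predicted_probs, rng, fraction=0.2):
    # Replace `fraction` of the rows of T with the individual's predictions.
    n = teacher_output.shape[0]
    idx = rng.choice(n, size=max(1, int(fraction * n)), replace=False)
    teacher_output = teacher_output.copy()
    teacher_output[idx] = predicted_probs[idx]   # overwrite those training targets in T
    return teacher_output

rng = np.random.default_rng(0)
T = np.eye(3)[rng.integers(0, 3, size=10)]       # one-hot initialization of T (Line 1 of Algorithm 1)
preds = rng.dirichlet(np.ones(3), size=10)       # hypothetical predictions of a top-20% individual
T = teach(T, preds, rng)
print(T)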
Regarding claim 18, the combination of Jaderberg, Amirguliyev, and Oehmcke teaches all of the limitations of claim 1, as described above in detail.
Oehmcke teaches -
wherein the one or more additional datasets used to train the one or more optimized candidate student models includes at least a portion of data from the first secure dataset (Oehmcke at p. 261, “2.2 Knowledge Sharing,” first paragraph, teaches this teach-function by replacing 20% of the teacher output T with the predicted probability if the individual is from the upper 20% of the population P regarding the fitness p. (that is, the “replacing 20% of the teacher output T” includes at least a portion of data from one or more of the multiple individual first secure datasets)).
Response to Arguments
8. Examiner has fully considered Applicant’s arguments, and responds below accordingly.
Claim Rejections – 35 U.S.C. § 101
9. Under Step 2A Prong One, Applicant submits, “for purposes of any appeal, the Applicant renews and maintains the position that the present claims could not be practically performed in the human mind which renders the claims patent eligible.” (Response at pp. 12-13).
Under Step 2A Prong Two, Applicant argues that “[t]he reviewing panel’s reasoning in Dejardins is directly applicable to the present rejection and the present claims. As was pointed out in the prior response, the Board has explicitly recognized claims to artificial intelligence, data mining technologies, and neural networks as being patent-eligible.” (Response at p. 11).
Also, Applicant argues that “[t]he present claims are absolutely directed to an improvement to a technical field, i.e., AI innovation. The evolution and training of neural networks falls squarely into any definition of AI innovation. Neural networks are a foundational idea in AI and NN directly power modern AI. And neuroevolution is a form of AI for generation of NNs based on the concepts of evolution as we know them from the biological world. Further, the courts and the USPTO have long recognized data security as being a technical field. See Finjan, Inc. v. Blue Coat Systems, Inc., 879 F.3d 1299 (Fed. Cir. 2018). The present claims are directed to the improvement of multiple technical fields, i.e., neural network evolution, neural network training and data security. Accordingly, for at least these reasons, the claims are directed to improvements to a technical field and not to an abstract idea under.” (Response at p. 12).
Examiner’s Response:
Examiner WITHDRAWS the rejection under Section 101 to Applicant’s claims 1, 2, 4-9, 11-13, and 15-18.
Claim Rejections – 35 U.S.C. §103
10. Applicant argues “Jaderberg does not disclose
• evolving and providing access to a teacher neural network model trained on private data
• for use in evolving a child neural network model for solving a predetermined problem or making a prediction.
Jaderberg simply describes an example of prior art population based training (PBT), wherein multiple models are trained concurrently and underperforming models are replaced with better performing models until some consensus metric is achieved. Thus, Jaderberg is missing critical elements of the claims. There is no description or suggestion in Jaderberg of training models on secure or private data. In fact, all models in Jaderberg are trained on the same data (see FIG. 1).” (Response at p. 13).
Examiner’s Response:
The language argued by Applicant is recited in the preamble of the claims, and accordingly, is not positively recited within the body of the claim. (see claim 1, lines 1-3, claim 8, lines 1-3, and claim 12, lines 1-3). For example, the terms “private data” and “a child neural network model” are only recited in the claim preambles. (see claims 1, 8, and 12).
In other words, Applicant’s preamble does not afford patentable weight to the Applicant’s claims because the claim preamble is not “necessary to give life, meaning, and vitality” to the claim. Moreover, because the Applicant’s preamble merely states the purpose or intended use of the invention rather than any distinct definition of any of the claimed invention’s limitations, the preamble is not considered a limitation and is of no significance to claim construction.
The bulk of claims 1, 8, and 12 is directed to “evolving a teacher neural network model in accordance with one or more domain factors.” (claim 1, lines 4-5 & lines 6-26; see also claim 8, lines 1-2, claim 12, lines 1-2). The last limitation recites providing access to the “teacher neural network,” in which the intended use of the model is “to train one or more optimized candidate student neural network models.” (claim 1, lines 27-30). That is:
* * *
[providing access to the optimized candidate teacher neural network model] . . . , further wherein the optimized candidate teacher neural network model operates on one or more additional datasets to train one or more optimized candidate student neural network models to solve the predetermined problem or make the prediction, wherein the one or more optimized candidate student neural network model is smaller than the optimized candidate teacher neural network model.
(claim 1, lines 29-33 (emphasis added by Examiner)).
Jaderberg is relied upon as teaching the features of “candidate teacher neural network models,” which under the broadest reasonable interpretation covers the “candidate neural network A . . . N” of Jaderberg (see, e.g., Jaderberg ¶¶ 0045, 0142, 0149 & Fig. 1), which is not inconsistent with the Specification. (MPEP § 2111).
Also, the plain meaning of a “metric” is a standard for measuring a qualitative assessment, and under the broadest reasonable interpretation, “the set of performance metrics” covers such qualitative assessments with regard to a “fitness of each of the candidate teacher models” for the intended purpose, which is not inconsistent with Applicant’s disclosure. Accordingly, the claim limitations cover the teachings of Jaderberg, as set out above in detail.
Moreover, the rejections hereinabove clearly set forth which claim limitations are taught by each of the prior art references, and the reasons why it would have been obvious to a person having ordinary skill in the art as of the effective filing date of the Applicant's invention to combine their teachings, and Applicant has not explained why the cited prior art references cannot be combined in the manner set forth in the rejection.
Conclusion
11. THIS ACTION IS MADE FINAL. Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a).
A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action. In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any nonprovisional extension fee (37 CFR 1.17(a)) pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action. In no event, however, will the statutory period for reply expire later than SIX MONTHS from the mailing date of this final action.
12. The prior art made of record and not relied upon is considered pertinent to applicant's disclosure:
(Parisi et al., “Cultural Evolution in Neural Networks,” IEEE (1997)) teaches a model in which organisms learn from teachers (or "conspecifics"). Learning in this scenario differs from that of standard machine learning in that a pool of teachers is assumed, and the best teachers are selected to train the students of the next generation.
(Denaro et al., “Cultural Evolution in a Population of Neural Networks,” Springer (1997)) teaches that if only the best individuals of each generation function as teachers and, furthermore, the teaching input provided by the teacher is modified by noise before it is used by the learner, then not only do culturally transmitted behaviors not deteriorate, but initially nonexistent behavioral capacities can emerge evolutionarily via pure cultural transmission, as they can be shown to emerge via genetic transmission.
13. Any inquiry concerning this communication or earlier communications from the Examiner should be directed to KEVIN L. SMITH whose telephone number is (571) 272-5964. Normally, the Examiner is available on Monday-Thursday 0730-1730.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the Examiner by telephone are unsuccessful, the Examiner’s supervisor, KAKALI CHAKI can be reached on 571-272-3719. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of an application may be obtained from the Patent Application Information Retrieval (PAIR) system. Status information for published applications may be obtained from either Private PAIR or Public PAIR. Status information for unpublished applications is available through Private PAIR only. For more information about the PAIR system, see http://pair-direct.uspto.gov. Should you have questions on access to the Private PAIR system, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative or access to the automated information system, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.
/K.L.S./
Examiner, Art Unit 2122
/KAKALI CHAKI/Supervisory Patent Examiner, Art Unit 2122
1 “In 1987 Dorothy E. Denning proposed intrusion detection as is an approach to counter the
computer and networking attacks and misuses.” Hoque et al., "An Implementation of Intrusion Detection System using Genetic Algorithms," Int'l Journal of Network Security (2012), at p. 109.