DETAILED ACTION
Claims 1, 16 and 22 are independent.
Claims 1, 6-7 and 16-22 are amended.
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Response to Arguments
Applicant's arguments filed 10/14/2025 have been fully considered but they are not persuasive.
In regard to 35 U.S.C. 101, see Applicant’s Remarks, pgs. 9-10: Claims 16-21 have been amended to recite a non-transitory computer-readable medium, placing those claims in proper form. However, Applicant argues that “The claims, as amended, are not directed to an abstract idea but rather a particular application that provides concrete technological improvement…” Applicant also states that amended claim 1 operates “to thereby modify the training of the DNN to produce a more computationally efficient and compact trained DNN having reduced memory requirements.” Examiner would like to point out that, as the claim is currently written, the abstract idea is a mathematical concept.
Specifically:
carrying out a clustering-based regularization process at at least one layer l of the DNN having neurons j, in which process a regularization activity penalty is added to a loss function for the batch of training data which is to be optimized during training, (limitation is directed to a mathematical concept).
The other elements of the claim do not add significantly more, nor do they show an improvement to the technology. For example, Applicant points to “the regularization activity penalty is structured to induce activations of neurons to converge to the prior probability distribution and includes components associated with respective neurons in the layer which are dependent on the respective classes of the training data,” stating that this dynamically guides the neuron activations. Examiner would like to point out that this is mere instruction to apply the exception under MPEP 2106.05(f), and/or generally links the exception to a particular technological environment (neural networks) or field of use (training). Therefore, the rejection under 35 U.S.C. 101 is maintained.
Applicant’s arguments, see pg. 10, paragraph 3, filed 10/14/2025, with respect to c have been fully considered and are persuasive. The rejection under 35 U.S.C. 112 in the Office action of 07/14/2025 has been withdrawn.
In regard to 35 U.S.C. 103, see Applicant’s Remarks pgs. 10-13, Applicant argues that claims 1, 16 and 22 are amended to recite a penalty structured to make the neuron activations converge to a respective prior probability distribution. Applicant also points out that Baker is directed to activations that tend to agree with the corresponding nodes. Examiner would like to point out that, in Baker, the nodes are being interpreted as neurons, and the nodes’ tendency to agree is being interpreted as the activations converging to be alike.
Specifically:
the regularization activity penalty is structured to induce activations of the neurons j to converge to the respective prior probability distribution and includes components associated with respective neurons in layer l which are dependent on the respective classes of the training data. (Baker, paragraph 0075, “FIG. 8 illustrates another variant of the embodiment illustrated in FIG. 7. In this variant another set of nodes 505 is added to the selected layer. These added nodes 505 are in a one-to-one relationship with the virtual nodes 401 and a regularization is applied to make their activations tend to agree [to induce activations of neurons to converge to the respective prior probability distribution.] with the corresponding virtual nodes.” And “These added nodes 505 are in a one-to-one relationship with the virtual nodes 401 and a regularization is applied to make their activations tend to agree with the corresponding virtual nodes. Regularization is a well-known technique to those skilled in the art of statistical estimation that smoothes statistical estimates and makes them more robust.”)
Applicant also points out that Sargent does not disclose obtaining a prior probability distribution for each class. Examiner would like to point out that paragraph 0066 of Sargent clearly states that p(Ci) describes the prior probability distribution of each class.
Specifically:
the clustering-based regularization process comprises, before adding the regularization activity penalty: for each class, obtaining a prior probability distribution over activations of the neurons j for the class, (Sargent, paragraph 0066, “The p(Ci) describes the prior probability distribution of each LC/LU class [for each class, obtaining a prior probability distribution over activations of the neurons j for the class,]. In this method, we do not specify any priors for the classification, meaning that the joint distribution is equivalent to the modelled conditional distribution. The conditional probability p(F|Ci) for the LC is initially estimated by the probabilistic MLP at the pixel level representing the membership association. Those LC conditional probabilities are then fed into the OCNN model to learn and classify each LU category. The estimated LU probabilities together with the original images are then re-used as input layers for LC classification using MLP in the next iteration.”)
Lastly, Applicant argues that Bengio does not disclose that each prior probability distribution is a sparse distribution. Examiner would like to point out that Bengio explains that sparsity of the representation corresponds to the prior that, for a given input scene, most of the explanatory factors are irrelevant; this sparsity prior is being interpreted as the claimed prior probability distribution.
Specifically:
wherein each prior probability distribution is a sparse distribution in which only a proportion of neurons in layer l that is less than a predetermined threshold are activated for the class; and (Bengio, pg. 2, 1.1 More Motivations and Conditional Computation, paragraph 2, “Stochastic neurons [proportion of neurons] with binary outputs are also interesting because they can easily give rise to sparse representations (that have many zeros), a form of regularization that has been used in many representation learning algorithms (Bengio et al., 2013). Sparsity of the representation corresponds to the prior that, for a given input scene, most of the explanatory factors are irrelevant (and that would be represented by many zeros in the representation).” And paragraph 3, “As argued by Bengio (2013), sparse representations may be a useful ingredient of conditional computation, by which only a small subset of the model parameters are “activated” (and need to be visited) for any particular example, [sparse distribution in which only proportion of neurons in the layer l are activated for the class.] thereby greatly reducing the number of computations needed per example. Sparse gating units may be trained to select which part of the model actually need to be computed for a given example.” And pg. 4, paragraph 6, “For example, if the moving average of being non-zero falls below a threshold [less than a predetermined threshold], the bias is pushed up until that average comes back above the threshold.”)
Therefore, the 35 USC 103 rejection is maintained.
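For illustration of the disputed limitation only, the general mechanism at issue (a regularization activity penalty added to the batch loss that induces each example’s layer-l activations to converge toward the prior probability distribution for its class) can be sketched as follows. The function names, the L2 form of the penalty, and the weight lam are assumptions for illustration, not the claimed formula, which appears in the record only as an image.

```python
import numpy as np

def clustering_regularization_penalty(activations, classes, priors, lam=0.01):
    """Illustrative sketch only: for a batch of N training data with
    activations A[i, j] of neuron j in layer l, accumulate a penalty
    component per neuron that pulls the activations of example i toward the
    prior probability distribution for its class c_i. The L2 form and the
    weight lam are assumptions, not the claimed formula."""
    N = activations.shape[0]
    penalty = 0.0
    for i in range(N):
        prior = priors[classes[i]]  # prior distribution over neurons j for class c_i
        penalty += np.sum((activations[i] - prior) ** 2)  # one component per neuron j
    return lam * penalty / N

def batch_loss_with_regularization(base_loss, activations, classes, priors, lam=0.01):
    # The regularization activity penalty is added to the loss function for
    # the batch, which is then optimized during training.
    return base_loss + clustering_regularization_penalty(activations, classes, priors, lam)
```

Under this sketch, the penalty vanishes when each example’s activations already match its class prior, and grows with the per-neuron deviation otherwise.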
Claim Rejections - 35 USC § 101
35 U.S.C. 101 reads as follows:
Whoever invents or discovers any new and useful process, machine, manufacture, or composition of matter, or any new and useful improvement thereof, may obtain a patent therefor, subject to the conditions and requirements of this title.
Claims 1-22 are rejected under 35 U.S.C. 101 because the claimed invention is directed to an abstract idea without significantly more.
Regarding claim 1,
Step 1: Is the claim to a process, machine, manufacture or composition of matter?
Claim 1 is directed to a process.
Step 1: yes.
Step 2A, prong 1: Does the claim recite an abstract idea, law of nature, or natural phenomenon?
carrying out a clustering-based regularization process at at least one layer l of the DNN having neurons j, in which process a regularization activity penalty is added to a loss function for the batch of training data which is to be optimized during training, (limitation is directed to a mathematical concept).
Step 2A, prong 1: yes.
Step 2A, prong 2: Does the claim recite additional elements that integrate the judicial exception into a practical application?
for a batch of N training data Xi, where i = 1 to N and ci is the class of training data Xi… the clustering-based regularization process includes, before adding the regularization activity penalty: for each class, obtaining a prior probability distribution over activations of the neurons j for the class, wherein each prior probability distribution is a sparse distribution in which only a proportion of neurons in layer l that is less than a predetermined threshold are activated for the class; (data gathering/inputting is insignificant extra-solution activity under MPEP 2106.05(g); the type of data being gathered does not cause the data gathering to practically integrate the exception)
the regularization activity penalty is structured to induce activations of the neurons j to converge to the respective prior probability distribution and includes components associated with respective neurons in layer l which are dependent on the respective classes of the training data. (mere instruction to apply the exception under MPEP 2106.05(f)/generally linking the exception to a particular technological environment (neural networks) or field of use (training)).
Step 2A, prong 2: no
Step 2B: Does the claim recite additional elements that amount to significantly more than the judicial exception?
for a batch of N training data Xi, where i = 1 to N and ci is the class of training data Xi… the clustering-based regularization process includes, before adding the regularization activity penalty: for each class, obtaining a prior probability distribution over neuron activations for the class, wherein each prior probability distribution is a sparse distribution in which only a low proportion of neurons in the layer l are activated for the class; (data gathering/inputting is insignificant extra-solution activity under MPEP 2106.05(g); the type of data being gathered does not cause the data gathering to practically integrate the exception; receiving data is well-understood, routine and conventional under MPEP 2106.05(d)(II))
the regularization activity penalty is structured to induce activations of neurons to converge to the prior probability distribution and includes components associated with respective neurons in the layer which are dependent on the respective classes of the training data. (mere instruction to apply the exception under MPEP 2106.05(f)/generally linking the exception to a particular technological environment (neural networks) or field of use (training)).
Step 2B: no.
Regarding claim 4 and analogous claim 17,
Claim 4 incorporates the analysis of the process of claim 1.
wherein the prior probability distributions of at least some classes intersect. (limitation is directed to a mathematical concept).
Regarding claim 5 and analogous claim 18,
Claim 5 incorporates the analysis of the process of claim 1.
wherein the clustering-based regularization process further comprises calculating, for each neuron, the component of the regularization activity penalty associated with the neuron, the amount of the component being determined by the probabilities of the neuron activating according to the prior probability distributions pjci. (limitation is directed to a mathematical concept).
Regarding claim 6,
Claim 6 incorporates the analysis of the process of claim 5.
wherein the component of the regularization activity penalty is calculated using the formula:
[Equation image: media_image1.png]
where Aijl is the activation of neuron j in layer l for training data Xi. (limitation is directed to a mathematical concept).
Regarding claim 7,
Claim 7 incorporates the analysis of the process of claim 6.
wherein the regularization activity penalty R(W1:l) is calculated using the formula:
[Equation image: media_image2.png]
where W1:l denotes the set of weights from layer 1 up to l. (limitation is directed to a mathematical concept).
Regarding claim 8 and analogous claim 19,
Claim 8 incorporates the analysis of the process of claim 1.
wherein: the clustering-based regularization process further comprises, before adding the regularization activity penalty, determining the prior probability distribution for each class at each iteration of the process. (limitation is directed to a mathematical concept).
Regarding claim 9 and analogous claim 20,
Claim 9 incorporates the analysis of the process of claim 8.
wherein determining the prior probability distribution for each class includes using neuron activations for the class from previous iterations to define the probability distribution. (limitation is directed to a mental process (i.e. “judgement”)).
Regarding claim 10 and analogous claim 21,
Claim 10 incorporates the analysis of the process of claim 8.
wherein: the clustering-based regularization process further comprises using the determined prior probability distribution to identify a group of neurons for which the number of activations of the neuron for the class meets a predefined criterion. (limitation is directed to a mental process (i.e. “judgement”)).
Regarding claim 11,
Claim 11 incorporates the analysis of the process of claim 10.
whether, when the neurons are ranked according to the number of activations of the neuron for the class from the prior probability distribution, the neuron is ranked within the top K neurons, where K is an integer; (limitation is directed to a mental process (i.e. “judgement”)).
whether the number of activations of the neuron for the class from the prior probability distribution exceeds a predefined activation threshold. (limitation is directed to a mental process (i.e. “observation”)).
Regarding claim 12,
Claim 12 incorporates the analysis of the process of claim 10.
wherein the regularization activity penalty includes penalty components calculated for each neuron outside the group but no penalty component for the neurons within the group. (limitation is directed to a mathematical concept).
Regarding claim 13,
Claim 13 incorporates the analysis of the process of claim 10.
wherein the regularization activity penalty includes penalty components calculated for each neuron in the layer, the amount of the penalty component for neurons outside the group being greater than for neurons within the group. (limitation is directed to a mathematical concept).
Regarding claim 14,
Claim 14 incorporates the analysis of the process of claim 13.
wherein in the clustering-based regularization process the neurons are ranked according to the number of activations of the neuron for the class from the prior probability distribution, and the penalty component for each neuron is inversely proportional to the ranking of the neuron. (limitation is directed to a mental process (i.e. “evaluation”)).
Regarding claim 15,
Claim 15 incorporates the analysis of the process of claim 1.
further comprising determining saliency of the neurons in the layer and discarding at least one neuron in the layer which is less salient than others in the layer. (limitation is directed to a mental process (i.e. “judgement”)).
Regarding claim 16,
Step 1: Is the claim to a process, machine, manufacture or composition of matter?
Claim 16 is directed to a manufacture.
Step 1: yes
The rest of the analysis for claim 16 is analogous to claim 1.
Regarding claim 22,
Step 1: Is the claim to a process, machine, manufacture or composition of matter?
Claim 22 is directed to a machine.
Step 1: yes
The rest of the analysis for claim 22 is analogous to claim 1.
Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.
The factual inquiries for establishing a background for determining obviousness under 35 U.S.C. 103 are summarized as follows:
1. Determining the scope and contents of the prior art.
2. Ascertaining the differences between the prior art and the claims at issue.
3. Resolving the level of ordinary skill in the pertinent art.
4. Considering objective evidence present in the application indicating obviousness or non-obviousness.
This application currently names joint inventors. In considering patentability of the claims the examiner presumes that the subject matter of the various claims was commonly owned as of the effective filing date of the claimed invention(s) absent any evidence to the contrary. Applicant is advised of the obligation under 37 CFR 1.56 to point out the inventor and effective filing dates of each claim that was not commonly owned as of the effective filing date of the later invention in order for the examiner to consider the applicability of 35 U.S.C. 102(b)(2)(C) for any potential 35 U.S.C. 102(a)(2) prior art against the later invention.
Claims 1, 4-13 and 16-22 are rejected under 35 U.S.C. 103 as being unpatentable over Baker (U.S. Published Patent Application No. 20200184337) [Jun. 11, 2020], in view of Shirahata (U.S. Published Patent Application No. 20180150745) [May 31, 2018] and Patel et al. (U.S. Published Patent Application No. 20180082172, "Patel") [Mar. 22, 2018], and in further view of Bengio et al. (Estimating or Propagating Gradients Through Stochastic Neurons for Conditional Computation, "Bengio") [Aug. 15, 2013] and Sargent et al. (U.S. Published Patent Application No. 20200065968, "Sargent") [Feb. 27, 2020].
In regard to claim 1 and analogous claims 16 and 22, Baker teaches carrying out a clustering-based regularization process at at least one layer l of the DNN having neurons j, (Baker, paragraph 0075, “FIG. 8 illustrates another variant of the embodiment illustrated in FIG. 7. In this variant another set of nodes 505 is added to the selected layer [process at at least one layer l of the DNN having neurons j,]. These added nodes 505 are in a one-to-one relationship with the virtual nodes 401 and a regularization is applied to make their activations tend to agree with the corresponding virtual nodes.” And paragraph 0088, “For example, there are two forms of soft tying node activations that will be explained below. In addition, once feature nodes or clusters have been trained [clustering-based regularization process], say in machine learning systems 1023 and 1024 respectively, the knowledge may be used for supervised training of other systems such as machine learning systems 1025 and 1026 respectively.”)
the regularization activity penalty is structured to induce activations of the neurons j to converge to the respective prior probability distribution and includes components associated with respective neurons in layer l which are dependent on the respective classes of the training data. (Baker, paragraph 0075, “FIG. 8 illustrates another variant of the embodiment illustrated in FIG. 7. In this variant another set of nodes 505 is added to the selected layer. These added nodes 505 are in a one-to-one relationship with the virtual nodes 401 and a regularization is applied to make their activations tend to agree [to induce activations of neurons to converge to the respective prior probability distribution.] with the corresponding virtual nodes.” And “These added nodes 505 are in a one-to-one relationship with the virtual nodes 401 and a regularization is applied to make their activations tend to agree with the corresponding virtual nodes. Regularization is a well-known technique to those skilled in the art of statistical estimation that smoothes statistical estimates and makes them more robust.”)
However, Baker does not explicitly teach for a batch of N training data Xi, where i = 1 to N and ci is the class of training data Xi,
in which process a regularization activity penalty is added to a loss function for the batch of training data which is to be optimized during training,
the clustering-based regularization process comprises, before adding the regularization activity penalty; for each class, obtaining a prior probability distribution over activations of the neurons j for the class,
wherein each prior probability distribution is a sparse distribution in which only a proportion of neurons in layer l that is less than a predetermined threshold are activated for the class; and
Shirahata teaches in which process a regularization activity penalty is added to a loss function for the batch of training data which is to be optimized during training, (Shirahata, paragraph 0150, “Furthermore, the learning may also be performed by adding a restriction for preventing overlearning. The restriction includes, for example, L1 regularization and L2 regularization [loss function]. For example, at the end of the learning process, the error data of the parameter may also be calculated from Equations [regularization activity penalty is added to a loss function for the batch of training data which is to be optimized during training, ] (17-1) and (17-2) below in each layer.”)
Baker and Shirahata are related to the same field of endeavor (i.e. training neural networks). In view of the teachings of Shirahata, it would have been obvious for a person with ordinary skill in the art to apply the teachings of Shirahata to Baker before the effective filing date of the claimed invention in order to have a high recognition accuracy. (Shirahata, paragraph 0003, “The machine learning using such neural networks having multilayer structure is also called deep learning. Multi layering of neural networks is improved in deep learning and effectiveness is confirmed in various fields. For example, in recognition of images/voices, deep learning exhibits high recognition accuracy equal to that performed by a human.”)
However, Baker and Shirahata do not explicitly teach for a batch of N training data Xi, where i = 1 to N and ci is the class of training data Xi,
the clustering-based regularization process comprises, before adding the regularization activity penalty; for each class, obtaining a prior probability distribution over activations of the neurons j for the class,
wherein each prior probability distribution is a sparse distribution in which only a proportion of neurons in layer l that is less than a predetermined threshold are activated for the class; and
Patel teaches for a batch of N training data Xi, where i = 1 to N and ci is the class of training data Xi, (Patel, paragraph 0066, “We are given a D-pixel, multi-channel image I of an object, with intensity I(x,w) at pixel x and channel w (e.g., w ∈ {red, green, blue}). We seek to infer the object's identity (class) c ∈ C, where C is a finite set of classes. (The restriction for C to be finite can be removed by using a nonparametric prior such as a Chinese Restaurant Process.)”)
Baker, Shirahata and Patel are related to the same field of endeavor (i.e. training neural networks). In view of the teachings of Patel, it would have been obvious for a person with ordinary skill in the art to apply the teachings of Patel to Baker and Shirahata before the effective filing date of the claimed invention in order to have high accuracy in recognition. (Patel, paragraph 0183, “It achieves state-of-the-art accuracy in face recognition and clustering on several public benchmarks.”)
However, Baker, Shirahata and Patel do not explicitly teach the clustering-based regularization process comprises, before adding the regularization activity penalty; for each class, obtaining a prior probability distribution over activations of the neurons j for the class,
wherein each prior probability distribution is a sparse distribution in which only a proportion of neurons in layer l that is less than a predetermined threshold are activated for the class; and
Bengio teaches wherein each prior probability distribution is a sparse distribution in which only a proportion of neurons in layer l that is less than a predetermined threshold are activated for the class; and (Bengio, pg. 2, 1.1 More Motivations and Conditional Computation, paragraph 2, “Stochastic neurons [proportion of neurons] with binary outputs are also interesting because they can easily give rise to sparse representations (that have many zeros), a form of regularization that has been used in many representation learning algorithms (Bengio et al., 2013). Sparsity of the representation corresponds to the prior that, for a given input scene, most of the explanatory factors are irrelevant (and that would be represented by many zeros in the representation).” And paragraph 3, “As argued by Bengio (2013), sparse representations may be a useful ingredient of conditional computation, by which only a small subset of the model parameters are “activated” (and need to be visited) for any particular example, [sparse distribution in which only proportion of neurons in the layer l are activated for the class.] thereby greatly reducing the number of computations needed per example. Sparse gating units may be trained to select which part of the model actually need to be computed for a given example.” And pg. 4, paragraph 6, “For example, if the moving average of being non-zero falls below a threshold [less than a predetermined threshold], the bias is pushed up until that average comes back above the threshold.”)
Baker, Shirahata, Patel and Bengio are related to the same field of endeavor (i.e. training neural networks). In view of the teachings of Bengio, it would have been obvious for a person with ordinary skill in the art to apply the teachings of Bengio to Baker, Shirahata and Patel before the effective filing date of the claimed invention in order to gain computational efficiency. (Bengio, pg. 11, A.1 Sparsity Constraint, “Computational efficiency is gained by imposing a sparsity constraint on the output of the gater.”)
However, Baker, Shirahata, Patel and Bengio do not explicitly teach the clustering-based regularization process comprises, before adding the regularization activity penalty; for each class, obtaining a prior probability distribution over activations of the neurons j for the class,
Sargent teaches the clustering-based regularization process comprises, before adding the regularization activity penalty: for each class, obtaining a prior probability distribution over activations of the neurons j for the class, (Sargent, paragraph 0066, “The p(Ci) describes the prior probability distribution of each LC/LU class [for each class, obtaining a prior probability distribution over activations of the neurons j for the class,]. In this method, we do not specify any priors for the classification, meaning that the joint distribution is equivalent to the modelled conditional distribution. The conditional probability p(F|Ci) for the LC is initially estimated by the probabilistic MLP at the pixel level representing the membership association. Those LC conditional probabilities are then fed into the OCNN model to learn and classify each LU category. The estimated LU probabilities together with the original images are then re-used as input layers for LC classification using MLP in the next iteration.”)
Baker, Shirahata, Patel, Bengio and Sargent are related to the same field of endeavor (i.e. training neural networks). In view of the teachings of Sargent, it would have been obvious for a person with ordinary skill in the art to apply the teachings of Sargent to Baker, Shirahata, Patel, Bengio before the effective filing date of the claimed invention in order to allow for accuracy of classification to be excellent. (Sargent, paragraph 0016, “One advantage of this OCNN method is that it produces excellent LU classification accuracy and computational efficiency, consistently outperforming its sub-modules, as well as other benchmark comparators.”)
In regard to claim 4 and analogous claim 17, Baker, Shirahata, Patel, Bengio and Sargent teach the method of claim 1.
Shirahata further teaches wherein the prior probability distributions of at least some classes intersect. (Shirahata, paragraph 0046, “Consequently, n pieces of neuron data Xi that are the operation result obtained by using the neural network are converted to probability distribution with the probability ai(x) of each of the recognition targets i. The identification layer uses the type of image associated with the neuron data having the maximum probability distribution as the identification result [prior probability distributions of at least some classes intersect.]. Furthermore, when performing learning, the identification layer obtains an error by comparing the recognition result with the correct answer. For example, the identification layer obtains an error between the target probability distribution (correct answer) by using a cross-entropy error function.”)
Baker and Shirahata are combinable for the same rationale as set forth above with respect to claim 1.
In regard to claim 5 and analogous claim 18, Baker, Shirahata, Patel, Bengio and Sargent teach the method of claim 1.
Baker further teaches wherein the clustering-based regularization process further comprises calculating, for each neuron, the component of the regularization activity penalty associated with the neuron, the amount of the component being determined by the probabilities of the neuron activating according to the prior probability distributions pjci. (Baker, paragraph 0091, “Given a set of assignments of data examples to clusters, selected nodes within machine learning system 1021 can be designated as potential feature nodes for one or more clusters. Each potential feature node n designated for a cluster has its activations values an(x) soft tied for all data examples x associated with that cluster [the amount of the component being determined by the probabilities of the neuron activating according to the prior probability distributions pjci.]. In this form of soft tying, an extra regularization term is added to the cost function for the potential feature node. For a data example x associated with the cluster, the regularization cost term can be based on the difference between the value an(x) and the average activation value averaged across all data assigned to the cluster. For example, the soft tying regularization can be the L2 norm [calculating, for each neuron, the component of the regularization activity penalty associated with the neuron,], L2n(x) = (an(x) − μn)^2. The value μn is the mean activation for node n over all of the data associated with the cluster. To save computation in some embodiments, this mean value is estimated from the mean value in the previous iteration.”)
In regard to claim 6, Baker, Shirahata, Patel, Bengio and Sargent teach the method of claim 5.
Baker further teaches wherein the component of the regularization activity penalty is calculated using the formula:
[Equation image: media_image1.png]
where Aijl is the activation of neuron j in layer l for training data Xi. (Baker, paragraph 0091, “Given a set of assignments of data examples to clusters, selected nodes within machine learning system 1021 can be designated as potential feature nodes for one or more clusters [j in layer l for training data Xi.]. Each potential feature node n designated for a cluster has its activations values an(x) soft tied for all data examples x associated with that cluster. In this form of soft tying, an extra regularization term is added to the cost function for the potential feature node. For a data example x associated with the cluster, the regularization cost term can be based on the difference between the value an(x) and the average activation value averaged across all data assigned to the cluster. For example, the soft tying regularization can be the L2 norm, L2n(x) = (an(x) − μn)^2. The value μn is the mean activation for node n [where Aijl is the activation of neuron] over all of the data associated with the cluster. To save computation in some embodiments, this mean value is estimated from the mean value in the previous iteration.”)
In regard to claim 7, Baker, Shirahata, Patel, Bengio and Sargent teach the method of claim 6.
Baker further teaches wherein the regularization activity penalty R(W1:l) is calculated using the formula:
[Equation reproduced as image: media_image2.png (Greyscale PNG)]
where W1:l denotes the set of weights from layer 1 up to l. (Baker, paragraph 0091, “Given a set of assignments of data examples to clusters, selected nodes within machine learning system 1021 can be designated as potential feature nodes for one or more clusters. Each potential feature node n designated for a cluster has its activations values an(x) soft tied for all data examples x associated with that cluster. In this form of soft tying, an extra regularization term is added to the cost function for the potential feature node. For a data example x associated with the cluster, the regularization cost term can be based on the difference between the value an(x) and the average activation value averaged across all data assigned to the cluster. For example, the soft tying regularization can be the L2 norm [regularization activity penalty], L2n(x)=(an(x)-μn)2. The value μn is the mean activation for node n over all of the data associated with the cluster. To save computation in some embodiments, this mean value is estimated from the mean value in the previous iteration.” and paragraph 0096, “For example, if machine learning system 1021 is a deep neural network, these internal variables include the node activations of all of the inner layer nodes as well as the input and computed output values. In addition, during training these internal variables include the partial derivatives of the cost function with respect to each of the node activations and with respect to each of the connection weights [where W1:l denotes the set of weights from layer 1 up to l] and any other learned parameters.”)
In regard to claim 8 and analogous claim 19, Baker, Shirahata, Patel, Bengio and Sargent teach the method of claim 1.
Baker further teaches the clustering-based regularization process further comprises, before adding the regularization activity penalty, determining the prior probability distribution for each class at each iteration of the process. (Baker, paragraph 0075, “Regularization is a well-known technique to those skilled in the art of statistical estimation that smoothes statistical estimates [prior probability distribution for each class at each iteration of the process] and makes them more robust. In this case, the regularization consists of an additional term in the objective function during training that penalizes differences between each node in set 505 and its corresponding virtual node in set 401. The regularization and the respective dropout rates of the virtual nodes 401 and the regularized nodes 505 are all controlled by the learning coach 303, with an objective that is optimized by testing on practice data.”)
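As an illustrative sketch only (not part of the record), the claimed step of determining a prior probability distribution for each class at each iteration could be implemented as a running per-class estimate of each neuron's activation probability. The smoothing factor, activation threshold, and class names below are the sketch's assumptions, not teachings of Baker.

```python
import numpy as np

class ActivationPrior:
    """Running estimate of the probability p[c, j] that neuron j
    activates for class c, refreshed at each training iteration.
    A minimal sketch: momentum smoothing and the activation
    threshold are illustrative choices.
    """
    def __init__(self, num_neurons, num_classes, momentum=0.9, threshold=0.0):
        self.p = np.zeros((num_classes, num_neurons))
        self.momentum = momentum
        self.threshold = threshold

    def update(self, activations, labels):
        # activations: (batch, num_neurons); labels: (batch,) class indices
        active = (activations > self.threshold).astype(float)
        for c in np.unique(labels):
            # fraction of this class's examples on which each neuron fired
            rate = active[labels == c].mean(axis=0)
            self.p[c] = self.momentum * self.p[c] + (1 - self.momentum) * rate
        return self.p
```

With momentum above zero, the estimate at each iteration blends in activations from previous iterations, consistent with the limitation addressed for claim 9 below.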
In regard to claim 9 and analogous claim 20, Baker, Shirahata, Patel, Bengio and Sargent teach the method of claim 8.
Baker further teaches wherein determining the prior probability distribution for each class includes using neuron activations for the class from previous iterations to define the probability distribution. (Baker, paragraph 0084, “If the nodes represent intervals of a deterministic variable, then only the node corresponding to the value of the variable would be activated. However, if the nodes represent states in a hidden stochastic process or intervals for an estimated random variable, then the node activations would represent some form of probability distribution. If the data observations are made as a function of time, then the activation values might represent either joint probabilities or conditional probabilities. The activation probabilities might be conditioned on (or joint with) either the past or the future, or both. [using neuron activations for the class from previous iterations to define the probability distribution.] In some embodiments, the node activations might be the probabilities themselves, perhaps normalized to sum to one across the nodes in a given set.”)
In regard to claim 10 and analogous claim 21, Baker, Shirahata, Patel, Bengio and Sargent teach the method of claim 8.
Baker further teaches wherein: the clustering-based regularization process further comprises using the determined prior probability distribution to identify a group of neurons for which the number of activations of the neuron for the class meets a predefined criterion. (Baker, paragraph 0084, “If the nodes represent intervals of a deterministic variable, then only the node corresponding to the value of the variable would be activated. However, if the nodes represent states in a hidden stochastic process or intervals for an estimated random variable, then the node activations would represent some form of probability distribution. If the data observations are made as a function of time, then the activation values might represent either joint probabilities or conditional probabilities. The activation probabilities might be conditioned on (or joint with) either the past or the future, or both [using the determined prior probability distribution to identify a group of neurons for which the number of activations of the neuron for the class meets a predefined criterion.]. In some embodiments, the node activations might be the probabilities themselves, perhaps normalized to sum to one across the nodes in a given set.”)
In regard to claim 11, Baker, Shirahata, Patel, Bengio and Sargent teach the method of claim 10.
Baker further teaches whether, when the neurons are ranked according to the number of activations of the neuron for the class from the prior probability distribution, the neuron is ranked within the top K neurons, where K is an integer; (Baker, paragraph 0084, “If the nodes represent intervals of a deterministic variable, then only the node corresponding to the value of the variable would be activated [number of activations of the neuron for the class from the prior probability distribution; Examiner would like to point out that the node with corresponding variables is being interpreted as ranking.]. However, if the nodes represent states in a hidden stochastic process or intervals for an estimated random variable, then the node activations would represent some form of probability distribution. If the data observations are made as a function of time, then the activation values might represent either joint probabilities or conditional probabilities. The activation probabilities might be conditioned on (or joint with) either the past or the future, or both. In some embodiments, the node activations might be the probabilities themselves, perhaps normalized to sum to one across the nodes in a given set [the neuron is ranked within the top K neurons, where K is an integer;].”)
Shirahata further teaches whether the number of activations of the neuron for the class from the prior probability distribution exceeds a predefined activation threshold. (Shirahata, paragraph 0036, “As the nonlinear activation function a, for example, a ramp function (ReLU) may also be used. FIG. 2B is a schematic diagram illustrating an example of the ReLU. In the example illustrated in FIG. 2B, if an input X is less than zero, zero is output to an output Y. Furthermore, if the input X exceeds zero [number of activations of the neuron for the class from the prior probability distribution exceeds a predefined activation threshold.], a value of the input X is output to the output Y.”)
Baker and Shirahata are combinable for the same rationale as set forth above with respect to claim 1.
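For illustration only (not part of the record), the top-K ranking criterion recited in claim 11 can be sketched as follows; the function name and the choice of a set return type are assumptions of the sketch.

```python
import numpy as np

def top_k_neurons(activation_counts, k):
    """Rank neurons by their number of activations for a class (taken
    from the prior probability distribution) and return the indices of
    the top-K ranked neurons; K is a free integer parameter.

    activation_counts : (num_neurons,) activation counts for one class
    """
    order = np.argsort(activation_counts)[::-1]  # descending by count
    return set(order[:k].tolist())
```

A neuron then meets the claimed criterion exactly when its index appears in the returned set.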
In regard to claim 12, Baker, Shirahata, Patel, Bengio and Sargent teach the method of claim 10.
However, Baker, Shirahata and Patel do not explicitly teach wherein the regularization activity penalty includes penalty components calculated for each neuron outside the group but no penalty component for the neurons within the group.
Bengio teaches wherein the regularization activity penalty includes penalty components calculated for each neuron outside the group but no penalty component for the neurons within the group. (Bengio, pg. 11, A.1 Sparsity Constraint, paragraph 1, “Computational efficiency is gained by imposing a sparsity constraint on the output of the gater. All experiments aim for an average sparsity of 10%, such that for 2000 expert hidden units we will only require computing approximately 200 of them in average. Theoretically, efficiency can be gained by only propagating the input activations to the selected expert units, and only using these to compute the network output. For imposing the sparsity constraint we use a KL-divergence criterion for sigmoids and an L1-norm criterion for rectifiers, where the amount of penalty is adapted to achieve the target level of average sparsity. [calculated for each neuron outside the group but no penalty component for the neurons within the group.]”)
Baker and Bengio are combinable for the same rationale as set forth above with respect to claim 1.
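As an illustrative sketch only (not part of the record), the claim 12 limitation of penalizing each neuron outside the group while applying no penalty component to neurons within the group can be expressed as a masked sparsity penalty. An L1 penalty on activations is used here following Bengio's choice for rectifier units; the masking scheme and parameter names are assumptions of the sketch.

```python
import numpy as np

def masked_l1_penalty(activations, exempt_group, strength=1.0):
    """Sparsity penalty with components calculated for each neuron
    outside the group and no penalty component for neurons within
    the group (a sketch of the claim 12 limitation).

    activations  : (num_neurons,) activation magnitudes per neuron
    exempt_group : iterable of neuron indices exempt from the penalty
    """
    mask = np.ones_like(activations)
    mask[list(exempt_group)] = 0.0              # zero penalty inside the group
    return strength * np.abs(activations * mask).sum()
```

Claim 13's variant, where in-group neurons receive a smaller but nonzero component, would follow by replacing the zero in the mask with a reduced weight.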
In regard to claim 13, Baker, Shirahata, Patel, Bengio and Sargent teach the method of claim 10.
However, Baker, Shirahata and Patel do not explicitly teach wherein the regularization activity penalty includes penalty components calculated for each neuron in the layer, the amount of the penalty component for neurons outside the group being greater than for neurons within the group.
Bengio teaches wherein the regularization activity penalty includes penalty components calculated for each neuron in the layer, the amount of the penalty component for neurons outside the group being greater than for neurons within the group. (Bengio, pg. 11, A.1 Sparsity Constraint, paragraph 1, “Computational efficiency is gained by imposing a sparsity constraint on the output of the gater. All experiments aim for an average sparsity of 10%, such that for 2000 expert hidden units we will only require computing approximately 200 of them in average. Theoretically, efficiency can be gained by only propagating the input activations to the selected expert units [the group], and only using these to compute the network output. For imposing the sparsity constraint we use a KL-divergence criterion for sigmoids and an L1-norm criterion for rectifiers, where the amount of penalty [the amount of the penalty component for neurons outside the group being greater than for neurons within the group., adapting the penalty is being interpreted as being greater] is adapted to achieve the target level of average sparsity.”)
Baker and Bengio are combinable for the same rationale as set forth above with respect to claim 1.
Claims 14-15 are rejected under 35 U.S.C. 103 as being unpatentable over Baker, in view of Shirahata, Patel, Bengio, Sargent and in further view of Rahangdale et al. (Deep Neural Network Regularization for Feature Selection in Learning-to-Rank, "Rahangdale").
In regard to claim 14, Baker, Shirahata, Patel, Bengio and Sargent teach the method of claim 13.
However, Baker, Shirahata, Patel, Bengio and Sargent do not explicitly teach wherein in the clustering-based regularization process the neurons are ranked according to the number of activations of the neuron for the class from the prior probability distribution, and the penalty component for each neuron is inversely proportional to the ranking of the neuron.
Rahangdale teaches wherein in the clustering-based regularization process the neurons are ranked according to the number of activations of the neuron for the class from the prior probability distribution, and the penalty component for each neuron is inversely proportional to the ranking of the neuron. (Rahangdale, pg. 53989, Col. 2, paragraph 2, “Principal way to achieve this objective is ℓ1 regularization wherein loss function is optimized with a penalty term obtained by aggregating absolute value of weights during training [42]. It is an indirect way to solve the problem of network pruning with feature selection. Here, low-rank (a low absolute value for weight) neuron with all its incoming and outgoing connection are set to zero and it does not participate in further learning [the penalty component for each neuron is inversely proportional to the ranking of the neuron.]. However, this is a highly sub-optimal solution to an equally sparse network. Hence, we prefer group level sparsity that gives more structured level sparsity and keeps smaller number of neurons per layer. The group level sparsity imposes sparsity such that all variables within a group are either simultaneously zero or none. Thus, we achieve our objectives of optimizing weight of neural network and selecting the active neurons at input layer by using ℓ1 regularization technique [process the neurons are ranked according to the number of activations of the neuron].”)
Baker, Shirahata, Patel, Bengio, Sargent and Rahangdale are related to the same field of endeavor (i.e. training neural networks). In view of the teachings of Rahangdale, it would have been obvious for a person with ordinary skill in the art to apply the teachings of Rahangdale to Baker, Shirahata, Patel, Bengio and Sargent before the effective filing date of the claimed invention in order to apply ranking-based regularization to deep neural networks. (Rahangdale, Abstract, ”The proposed model makes use of the deep neural network for learning-to-rank for document retrieval.”)
In regard to claim 15, Baker, Shirahata, Patel, Bengio and Sargent teach the method of claim 1.
However, Baker, Shirahata, Patel, Bengio and Sargent do not explicitly teach further comprising determining saliency of the neurons in the layer and discarding at least one neuron in the layer which is less salient than others in the layer.
Rahangdale teaches further comprising determining saliency of the neurons in the layer and discarding at least one neuron in the layer which is less salient than others in the layer. (Rahangdale, pg. 53989, Col. 2, paragraph 3, “As an additional variety, we have adopted group ℓ1 and sparse group ℓ1 regularization to speed up the learning and improve result significantly. The sparse group ℓ1 is mainly used to induce sparsity on non-sparse group. The group of features can be formed based on all outgoing connections of the neuron. In this way, optimization of deep neural network is forced to remove [discarding at least one neuron in the layer] the low-rank neurons [determining saliency of the neurons in the layer] at learning time.”)
Baker and Rahangdale are combinable for the same rationale as set forth above with respect to claim 14.
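For illustration only (not part of the record), the claim 15 step of determining neuron saliency and discarding less salient neurons can be sketched as below. Using the L2 norm of each neuron's outgoing weights as the saliency measure follows the group-of-outgoing-connections framing in Rahangdale; the norm choice, keep_fraction parameter, and mask interface are assumptions of the sketch.

```python
import numpy as np

def prune_least_salient(weights_out, keep_fraction=0.9):
    """Determine per-neuron saliency as the L2 norm of the neuron's
    outgoing weights, then return a boolean keep-mask that discards
    the least salient neurons in the layer (a sketch).

    weights_out : (num_neurons, fan_out) outgoing weight matrix
    """
    saliency = np.linalg.norm(weights_out, axis=1)         # group norm per neuron
    k = max(1, int(round(keep_fraction * len(saliency))))  # neurons to keep
    keep = np.argsort(saliency)[::-1][:k]                  # most salient first
    mask = np.zeros(len(saliency), dtype=bool)
    mask[keep] = True
    return mask
```

Neurons with a False entry in the mask would be removed along with their incoming and outgoing connections.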
Conclusion
Applicant's amendment necessitated the new ground(s) of rejection presented in this Office action. Accordingly, THIS ACTION IS MADE FINAL. See MPEP § 706.07(a). Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a).
A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action. In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any nonprovisional extension fee (37 CFR 1.17(a)) pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action. In no event, however, will the statutory period for reply expire later than SIX MONTHS from the mailing date of this final action.
Any inquiry concerning this communication or earlier communications from the examiner should be directed to SKYLAR K VANWORMER whose telephone number is (703)756-1571. The examiner can normally be reached M-F 6:00 am to 3:00 pm.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Usmaan Saeed can be reached on (571) 272-4046. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.
/S.K.V./Examiner, Art Unit 2146
/USMAAN SAEED/Supervisory Patent Examiner, Art Unit 2146