DETAILED ACTION
This action is responsive to the amendment filed on 09/25/2025. Claims 1-2, 4-6, 8-10, and 12 are pending in the case. Claims 1-2, 5-6, and 9-10 are currently amended. Claims 1, 5, and 9 are independent claims.
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Priority
Acknowledgement is made of applicant’s claim for domestic priority based on international application no. PCT/JP2019/045281 filed on 11/19/2019.
Information Disclosure Statement
The information disclosure statement (IDS) submitted on 09/25/2025 is being considered by the examiner.
Claim Rejections - 35 USC § 112
The following is a quotation of 35 U.S.C. 112(b):
(b) CONCLUSION.—The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the inventor or a joint inventor regards as the invention.
The following is a quotation of 35 U.S.C. 112 (pre-AIA), second paragraph:
The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the applicant regards as his invention.
Claims 2, 6, and 10 are rejected under 35 U.S.C. 112(b) or 35 U.S.C. 112 (pre-AIA), second paragraph, as being indefinite for failing to particularly point out and distinctly claim the subject matter which the inventor or a joint inventor (or for applications subject to pre-AIA 35 U.S.C. 112, the applicant), regards as the invention.
Regarding claim 2, the claim recites “the input ANN model” in lines 3-4. There is insufficient antecedent basis for this limitation in the claim. The parent claim recites “an ANN model”. It is unclear if applicant is attempting to recite a new claim element or attempting to refer to a previously recited claim element. For examination purposes this claim limitation has been interpreted to mean “the ANN model”, referring to the previously recited claim element.
Further regarding claim 2, the claim recites “the New training data” in line 4. There is insufficient antecedent basis for this limitation in the claim. The parent claim recites “training data”. It is unclear if applicant is attempting to recite a new claim element or attempting to refer to a previously recited claim element. For examination purposes this claim limitation has been interpreted to mean “the training data”, referring to the previously recited claim element.
Further, claim 2 recites “the input Policy model” in line 11. There is insufficient antecedent basis for this limitation in the claim. The parent claim recites “a policy model”. It is unclear if applicant is attempting to recite a new claim element or attempting to refer to a previously recited claim element. For examination purposes this claim limitation has been interpreted to mean “the policy model”, referring to the previously recited claim element.
Regarding claim 6, the claim recites “the input ANN model” in line 2. There is insufficient antecedent basis for this limitation in the claim. The parent claim recites “an ANN model”. It is unclear if applicant is attempting to recite a new claim element or attempting to refer to a previously recited claim element. For examination purposes this claim limitation has been interpreted to mean “the ANN model”, referring to the previously recited claim element.
Further, claim 6 recites “the input Policy model” in line 6. There is insufficient antecedent basis for this limitation in the claim. The parent claim recites “a Policy model”. It is unclear if applicant is attempting to refer to a previously recited claim element or if applicant is attempting to recite a new claim element. For examination purposes this claim limitation has been interpreted to mean “the policy model”, referring to the previously recited claim element.
Regarding claim 10, the claim recites “the input ANN model” in line 3. There is insufficient antecedent basis for this limitation in the claim. The parent claim recites “an ANN model”. It is unclear if applicant is attempting to recite a new claim element or attempting to refer to a previously recited claim element. For examination purposes this claim limitation has been interpreted to mean “the ANN model”, referring to the previously recited claim element.
Further, claim 10 recites “the input Policy model” in line 8. There is insufficient antecedent basis for this limitation in the claim. The parent claim recites “a Policy model”. It is unclear if applicant is attempting to refer to a previously recited claim element or if applicant is attempting to recite a new claim element. For examination purposes this claim limitation has been interpreted to mean “the policy model”, referring to the previously recited claim element.
Claim Rejections - 35 USC § 101
35 U.S.C. 101 reads as follows:
Whoever invents or discovers any new and useful process, machine, manufacture, or composition of matter, or any new and useful improvement thereof, may obtain a patent therefor, subject to the conditions and requirements of this title.
Claims 1-2, 4-6, 8-10, and 12 are rejected under 35 U.S.C. 101 because the claimed invention is directed to an abstract idea without significantly more.
Regarding claim 1:
Step 1 Statutory Category: Claim 1 is directed to an apparatus which falls under one of the four statutory categories.
Step 2A Prong 1 Judicial Exception: Claim 1 recites, in part, “compute information matrix of each sample in the training data using training information extracted”. This limitation is the abstract idea of a mathematical calculation, as directed to “a claim that recites a mathematical calculation, when the claim is given its broadest reasonable interpretation in light of the specification, will be considered as falling within the "mathematical concepts" grouping. A mathematical calculation is a mathematical operation (such as multiplication) or an act of calculating using mathematical methods to determine a variable or number”. See MPEP § 2106.04(a)(2)(I)(C).
Step 2A Prong 2 Integration into a practical application: This judicial exception is not integrated into a practical application. In particular the claim recites: “an information processing apparatus”, “at least one memory configured to store program code”, “at least one processor configured to operate as instructed by the program code”, “training code configured to cause the at least one processor to operate as an ANN (artificial neural networks) model trainer”, “computation code configured to cause the at least one processor to…”, and “policy training code configured to cause the at least one processor to operate as a policy model trainer”. These limitations are additional elements that amount to adding the words “apply it” (or an equivalent) with the judicial exception, or mere instructions to implement an abstract idea on a computer, or merely uses a computer in its ordinary capacity as a tool to perform an existing process. See MPEP §2106.05(f). Further, the claim recites “train an ANN model using training data” and “train a Policy model, by using, as teacher data, a policy vector which can be determined by comparing a threshold with the information matrix, using the training data and the information matrix, wherein values included in the policy vector cause at least one layer of the policy model to skip processing during an inference phase”. These limitations are recited at a high level of generality and amount to adding the words “apply it” (or an equivalent) with the judicial exception, or mere instructions to implement an abstract idea on a computer, or merely uses a computer in its ordinary capacity as a tool to perform an existing process. See MPEP §2106.05(f).
Step 2B Significantly more: The claims do not include additional elements that are sufficient to amount to significantly more than the judicial exception. As discussed with respect to integration of the abstract idea into a practical application, the additional elements: “an information processing apparatus”, “at least one memory configured to store program code”, “at least one processor configured to operate as instructed by the program code”, “training code configured to cause the at least one processor to operate as an ANN (artificial neural networks) model trainer”, “computation code configured to cause the at least one processor to…”, “policy training code configured to cause the at least one processor to operate as a policy model trainer”, “train an ANN model using training data”, and “train a Policy model, by using, as teacher data, a policy vector which can be determined by comparing a threshold with the information matrix, using the training data and the information matrix, wherein values included in the policy vector cause at least one layer of the policy model to skip processing during an inference phase” are additional elements that amount to adding the words “apply it” (or an equivalent) with the judicial exception, or mere instructions to implement an abstract idea on a computer, or merely uses a computer in its ordinary capacity as a tool to perform an existing process. See MPEP §2106.05(f). Elements that merely amount to adding the words “apply it” (or an equivalent) with the judicial exception, or mere instructions to implement an abstract idea on a computer, or merely uses a computer in its ordinary capacity as a tool to perform an existing process cannot provide an inventive concept. The claim is not patent eligible.
Regarding claim 2: The rejection of claim 1 is incorporated, and further: claim 2 recites, in part, “compute the information matrix of each sample in the new training data using the training information”. This limitation is the abstract idea of a mathematical calculation, as directed to “a claim that recites a mathematical calculation, when the claim is given its broadest reasonable interpretation in light of the specification, will be considered as falling within the "mathematical concepts" grouping. A mathematical calculation is a mathematical operation (such as multiplication) or an act of calculating using mathematical methods to determine a variable or number”. See MPEP § 2106.04(a)(2)(I)(C).
Further, the claim recites the additional elements: “incremental training code configured to cause the at least one processor to operate as an incremental ANN model trainer”, “the computation code”, and “incremental policy training code configured to cause the at least one processor to operate as an incremental policy model trainer”. These limitations are additional elements that amount to adding the words “apply it” (or an equivalent) with the judicial exception, or mere instructions to implement an abstract idea on a computer, or merely uses a computer in its ordinary capacity as a tool to perform an existing process. See MPEP §2106.05(f). Further, the claim recites “train the ANN model incrementally from the input ANN model with the new training data including pairs of input and output of training and validation for an incremental training phase” and “train the Policy model incrementally from the input policy model by using, as teacher data, a policy vector which can be determined by comparing a threshold with the information matrix, using the New training data and the information matrix”. These limitations are recited at a high level of generality and amount to adding the words “apply it” (or an equivalent) with the judicial exception, or mere instructions to implement an abstract idea on a computer, or merely uses a computer in its ordinary capacity as a tool to perform an existing process. See MPEP §2106.05(f). Elements that merely amount to adding the words “apply it” (or an equivalent) with the judicial exception, or mere instructions to implement an abstract idea on a computer, or merely uses a computer in its ordinary capacity as a tool to perform an existing process cannot provide an inventive concept. The claim is not patent eligible.
Regarding claim 4, the rejection of claim 1 is incorporated, and further, the claim recites “wherein the policy model is a light-weight policy model based on a traditional machine learning model with a supervised learning”. This is an additional element that generally links the use of the judicial exception to a particular technological environment or field of use. See MPEP § 2106.05(h). Elements that merely generally link the use of the judicial exception to a particular technological environment or field of use cannot provide an inventive concept. The claim is not patent eligible.
Regarding claim 5:
Step 1 Statutory Category: Claim 5 is directed to a method which falls under one of the four statutory categories.
Step 2A Prong 1 Judicial Exception: Claim 5 recites, in part, “computing an information matrix of each sample in the training data using training information extracted during the ANN model training”. This limitation is the abstract idea of a mathematical calculation, as directed to “a claim that recites a mathematical calculation, when the claim is given its broadest reasonable interpretation in light of the specification, will be considered as falling within the "mathematical concepts" grouping. A mathematical calculation is a mathematical operation (such as multiplication) or an act of calculating using mathematical methods to determine a variable or number”. See MPEP § 2106.04(a)(2)(I)(C).
Step 2A Prong 2 Integration into a practical application: This judicial exception is not integrated into a practical application. In particular the claim recites: “training an ANN (artificial neural network) model using training data” and “training a Policy model by using, as teacher data, a policy vector which can be determined by comparing a threshold with the information matrix, using the training data and the information matrix, wherein values included in the policy vector cause at least one layer of the policy model to skip processing during an inference phase”. These limitations are recited at a high level of generality and amount to adding the words “apply it” (or an equivalent) with the judicial exception, or mere instructions to implement an abstract idea on a computer, or merely uses a computer in its ordinary capacity as a tool to perform an existing process. See MPEP §2106.05(f).
Step 2B Significantly more: The claims do not include additional elements that are sufficient to amount to significantly more than the judicial exception. As discussed with respect to integration of the abstract idea into a practical application, the additional elements: “training an ANN (artificial neural network) model using training data”, and “training a Policy model by using, as teacher data, a policy vector which can be determined by comparing a threshold with the information matrix, using the training data and the information matrix, wherein values included in the policy vector cause at least one layer of the policy model to skip processing during an inference phase” are additional elements that amount to adding the words “apply it” (or an equivalent) with the judicial exception, or mere instructions to implement an abstract idea on a computer, or merely uses a computer in its ordinary capacity as a tool to perform an existing process. See MPEP §2106.05(f). Elements that merely amount to adding the words “apply it” (or an equivalent) with the judicial exception, or mere instructions to implement an abstract idea on a computer, or merely uses a computer in its ordinary capacity as a tool to perform an existing process cannot provide an inventive concept. The claim is not patent eligible.
Regarding claim 6, the rejection of claim 5 is incorporated, and further, claim 6 is substantially similar to claim 2 and is rejected in the same manner, with the same reasoning applying.
Regarding claim 8, the rejection of claim 5 is incorporated, and further, claim 8 is substantially similar to claim 4 and is rejected in the same manner, with the same reasoning applying.
Regarding claim 9:
Step 1 Statutory Category: Claim 9 is directed to a machine which falls under one of the four statutory categories.
Step 2A Prong 1 Judicial Exception: Claim 9 recites, in part, “computing an information matrix of each sample in the training data using training information extracted during the ANN model training”. This limitation is the abstract idea of a mathematical calculation, as directed to “a claim that recites a mathematical calculation, when the claim is given its broadest reasonable interpretation in light of the specification, will be considered as falling within the "mathematical concepts" grouping. A mathematical calculation is a mathematical operation (such as multiplication) or an act of calculating using mathematical methods to determine a variable or number”. See MPEP § 2106.04(a)(2)(I)(C).
Step 2A Prong 2 Integration into a practical application: This judicial exception is not integrated into a practical application. In particular the claim recites: “a non-transitory computer readable medium storing a program for causing a computer to execute an information processing method”. This limitation is an additional element that amounts to adding the words “apply it” (or an equivalent) with the judicial exception, or mere instructions to implement an abstract idea on a computer, or merely uses a computer in its ordinary capacity as a tool to perform an existing process. See MPEP §2106.05(f). Further, the claim recites “training an ANN (artificial neural network) model using training data” and “training a Policy model by using, as teacher data, a policy vector which can be determined by comparing a threshold with the information matrix, using the training data and the information matrix, wherein values included in the policy vector cause at least one layer of the policy model to skip processing during an inference phase”. These limitations are recited at a high level of generality and amount to adding the words “apply it” (or an equivalent) with the judicial exception, or mere instructions to implement an abstract idea on a computer, or merely uses a computer in its ordinary capacity as a tool to perform an existing process. See MPEP §2106.05(f).
Step 2B Significantly more: The claims do not include additional elements that are sufficient to amount to significantly more than the judicial exception. As discussed with respect to integration of the abstract idea into a practical application, the additional elements: “a non-transitory computer readable medium storing a program for causing a computer to execute an information processing method”, “training an ANN (artificial neural network) model using training data”, and “training a Policy model by using, as teacher data, a policy vector which can be determined by comparing a threshold with the information matrix, using the training data and the information matrix, wherein values included in the policy vector cause at least one layer of the policy model to skip processing during an inference phase” are additional elements that amount to adding the words “apply it” (or an equivalent) with the judicial exception, or mere instructions to implement an abstract idea on a computer, or merely uses a computer in its ordinary capacity as a tool to perform an existing process. See MPEP §2106.05(f). Elements that merely amount to adding the words “apply it” (or an equivalent) with the judicial exception, or mere instructions to implement an abstract idea on a computer, or merely uses a computer in its ordinary capacity as a tool to perform an existing process cannot provide an inventive concept. The claim is not patent eligible.
Regarding claim 10, the rejection of claim 9 is incorporated, and further, claim 10 is substantially similar to claims 2 and 6 and is rejected in the same manner, with the same reasoning applying.
Regarding claim 12, the rejection of claim 9 is incorporated, and further, claim 12 is substantially similar to claims 4 and 8 and is rejected in the same manner, with the same reasoning applying.
Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.
Claims 1-2, 4-6, 8-10, and 12 are rejected under 35 U.S.C. 103 as being unpatentable over Spasov et al., Dynamic Neural Network Channel Execution for Efficient Training, 05/15/2019, https://arxiv.org/abs/1905.06435, hereinafter referred to as “Spasov” in view of Theis et al., Faster gaze prediction with dense networks and Fisher pruning, 07/09/2019, https://arxiv.org/abs/1801.05787, hereinafter referred to as “Theis” in further view of Luo et al., AutoPruner: An End-to-End Trainable Filter Pruning Method for Efficient Deep Model Inference, 01/17/2019, https://arxiv.org/pdf/1805.08941, hereinafter referred to as “Luo”.
Regarding claim 1, Spasov teaches An information processing apparatus comprising: at least one memory configured to store program code; and at least one processor configured to operate as instructed by the program code (Spasov, Abstract, Lines 3-6, “in this work, we propose a novel method which reduces the memory footprint and number of computing operations required for training and inference. Our framework efficiently integrates pruning as part of the training procedure by exploring and tracking the relative importance of convolutional channels” The method of Spasov reduces memory footprint and the number of computing operations, which provides evidence that the method of Spasov is performed on a computer/processor which is considered to be the “information processing apparatus”; Spasov, Page 4, Section 3, Line 2, “We conduct all experiments in PyTorch [27]”; Spasov, Page 5, Section 3.1, Lines 1-3, “We evaluate the performance of our method on three datasets: CIFAR-10, CIFAR-100 [28] and Street View House Number (SVHN) [29]. Both CIFAR datasets consist of coloured natural images with resolution 32x32, and comprise 50,000 training examples and 10,000 testing examples”; A person of ordinary skill would recognize that a computer must have performed the experiments disclosed, and further because the experiments were conducted in “PyTorch”, the memory must have been configured to store program code, and the processor to execute that code), the program code comprising:
training code configured to cause the at least one processor to operate as an ANN (artificial neural networks) model trainer configured to train an ANN model using training data (Spasov, Page 6, Section 3.3, Line 1, “All network models are trained using SGD with Nesterov momentum of 0.9 without dampening”; Spasov, Page 2, Section 2, Lines 1-6, “We consider a supervised learning problem with a set of training examples $D = \{X = \{x_1, x_2, \ldots, x_N\}, Y = \{y_1, y_2, \ldots, y_N\}\}$, where x and y represent an input and a label, respectively. Given a CNN model with L convolutional layers, let each layer $l \in 1 \ldots L$ comprise $K_l$ channels, $C_l^k$, where $k \in 1 \ldots K_l$ is the channel index. In each training step t, we 1) sample a batch of B data samples $(x^{1:B}, y^{1:B})$; 2) select and activate a subset S of convolutional channels (see Figure 2); 3) run a forward and a backward pass on the “thin” network, that is only on the active channels; and 4) observe the revealed saliency estimates ($SAL^k_{l,t}$) of the activated channels” The “CNN model with L convolutional layers” is considered to be the “ANN model” that is trained using “a set of training examples” which are considered to be the “training data”);
policy training code configured to cause the at least one processor to operate as a Policy model trainer configured to train a Policy model using the Training data (Spasov, Page 2, Section 2, Lines 4-5, “2) select and activate a subset S of convolutional channels (see Figure 2)”; Spasov, Page 4, Algorithm 1, Steps 1-2, 4-11; Spasov, Page 3, Section 2.1, Paragraph 2, Lines 1-3, “Algorithm 1 can be loosely divided in two stages: firstly, an initialization round of exploring the saliencies of the channels in the network, and a second stage where we start exploiting and refining the initial saliency estimates to guide the channel selection procedure” Spasov, Page 3, Section 2.2, Lines 1-2, “Our dynamic channel selection framework requires the estimation of channel saliency, or the contribution of each active channel to the overall network performance” The “channel selection framework” shown in Algorithm 1 is considered to be the “Policy model”; because the “initial saliency estimates” are exploited and refined, the model is considered to be trained using at least part of Algorithm 1).
Spasov does not explicitly teach an Information matrix computation unit configured to compute information matrix of each sample in the training data using training information extracted by the ANN model trainer, nor training the policy model by using, as teacher data, a policy vector which can be determined by comparing a threshold with the information matrix, using the training data and the Information matrix, wherein values included in the policy vector cause at least one layer of the Policy model to skip processing during an inference phase.
Theis teaches an Information matrix computation unit configured to compute information matrix of each sample in the training data using training information extracted by the ANN model trainer (Theis, Page 3, Section 2.1, Lines 1-2, “Our goal is to remove feature maps or parameters which contribute little to the overall performance of the model”; Theis, Page 4, Paragraph 1, Lines 1-4, “For convolutional architectures, it makes sense to try to prune entire feature maps instead of individual parameters, since typical implementations of convolutions may not be able to exploit sparse kernels for speedups. Let $a_{nkij}$ be the activation of the kth feature map at spatial location i,j for the nth datapoint”; Theis, Page 4, Paragraph 1, Lines 7-9, “The gradient of the loss for the nth datapoint with respect to $m_k$ is $g_{nk} = -\sum_{ij} a_{nkij} \frac{\partial}{\partial a_{nkij}} \log Q(z_n \mid I_n)$ (9) and the pruning signal is therefore $\Delta_k = \frac{1}{2N} \sum_n g_{nk}^2$, since $m_k^2 = 1$ before pruning”; Theis, Page 18, Paragraph 3, and Equations 27-32).
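For illustration only, the pruning signal quoted from Theis above, $\Delta_k = \frac{1}{2N} \sum_n g_{nk}^2$, can be sketched as follows. The function name `fisher_pruning_signal` and the sample gradient values are hypothetical and are not taken from Theis or from the application; the per-datapoint gradients $g_{nk}$ are assumed to have been computed already.

```python
def fisher_pruning_signal(per_sample_grads):
    """Compute Delta_k = (1 / 2N) * sum_n g_nk^2 for every feature map k.

    per_sample_grads[n][k] holds g_nk for datapoint n and feature map k
    (assumed to be given; how g_nk is obtained is outside this sketch).
    """
    n_samples = len(per_sample_grads)
    n_maps = len(per_sample_grads[0])
    signal = []
    for k in range(n_maps):
        # Sum of squared per-datapoint gradients for feature map k.
        total = sum(per_sample_grads[n][k] ** 2 for n in range(n_samples))
        signal.append(total / (2 * n_samples))
    return signal


# Hypothetical gradients: 2 datapoints, 3 feature maps.
grads = [[1.0, 0.0, 2.0],
         [3.0, 0.0, 2.0]]
delta = fisher_pruning_signal(grads)
print(delta)  # [2.5, 0.0, 2.0]
```

In this made-up example the second feature map has a pruning signal of zero, so under the quoted reasoning it would be the first candidate for removal.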
Spasov in view of Theis also teaches training the policy model by using, as teacher data, a policy … which can be determined by comparing a threshold with the information matrix (Theis, Page 5, Section 2.3, Paragraph 2, Lines 1-3, and Equation 14, “For a given β, a feature should be pruned if Equation 13 is negative, that is, when doing so reduces the overall cost because it decreases the computational cost more than it increases the cross-entropy: $\Delta L_i + \beta \cdot \Delta C_i \le 0$ (14)”; “0” is considered to be the “threshold”), using the training data and the Information matrix (Spasov, Page 3, Section 2.1, Paragraph 2, Lines 1-3, “Algorithm 1 can be loosely divided in two stages: firstly, an initialization round of exploring the saliencies of the channels in the network, and a second stage where we start exploiting and refining the initial saliency estimates to guide the channel selection procedure”; Spasov, Page 3, Section 2.2, Lines 1-2, “Our dynamic channel selection framework requires the estimation of channel saliency, or the contribution of each active channel to the overall network performance”; Theis, Page 4, Paragraph 1, Lines 1-4, “For convolutional architectures, it makes sense to try to prune entire feature maps instead of individual parameters, since typical implementations of convolutions may not be able to exploit sparse kernels for speedups. Let $a_{nkij}$ be the activation of the kth feature map at spatial location i,j for the nth datapoint”; Theis, Page 4, Paragraph 1, Lines 7-9, “The gradient of the loss for the nth datapoint with respect to $m_k$ is $g_{nk} = -\sum_{ij} a_{nkij} \frac{\partial}{\partial a_{nkij}} \log Q(z_n \mid I_n)$ (9) and the pruning signal is therefore $\Delta_k = \frac{1}{2N} \sum_n g_{nk}^2$, since $m_k^2 = 1$ before pruning”; Theis, Page 18, Paragraph 3, and Equations 27-32; Spasov uses the “saliency estimates” to train the “channel selection framework” and Theis teaches using the Fisher Information matrix to estimate a “pruning signal” which is considered to be equivalent to the “saliency estimates”; therefore, using the Fisher Information Matrix method of Theis to estimate the “saliency”, the policy model of Spasov is trained using the information matrix).
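As a hedged illustration of the threshold comparison in Equation 14 of Theis quoted above, the sketch below marks a feature for pruning when $\Delta L_i + \beta \cdot \Delta C_i \le 0$. The function name `prune_decisions`, the value of β, and the numeric inputs are hypothetical. Encoding the result as a list of 0/1 values is an assumption made for illustration only and is not a statement of how either reference or the application encodes a policy.

```python
def prune_decisions(delta_loss, delta_cost, beta):
    """Apply the rule delta_loss[i] + beta * delta_cost[i] <= 0 per feature.

    Returns 0 where the feature should be pruned (rule satisfied) and
    1 where it should be kept. The threshold being compared against is 0.
    """
    return [0 if dl + beta * dc <= 0 else 1
            for dl, dc in zip(delta_loss, delta_cost)]


# Hypothetical values: removing a feature raises the loss (positive
# delta_loss) and lowers the computational cost (negative delta_cost).
delta_loss = [0.5, 0.25, 2.0]
delta_cost = [-1.0, -4.0, -1.0]
decisions = prune_decisions(delta_loss, delta_cost, beta=0.5)
print(decisions)  # [0, 0, 1]
```

The first feature sits exactly on the threshold (0.5 - 0.5 = 0) and is pruned because the quoted inequality is non-strict.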
It would have been obvious to a person of ordinary skill in the art, before the effective filing date of the invention, to have modified the information processing method of Spasov to use the Fisher information matrix method of kernel ranking across layers as taught by Theis. The motivation for doing so would have been that Spasov presented the method taught by Theis as an alternative choice to the kernel ranking method taught by Spasov (Spasov, Page 3, Section 2.2, Lines 2-6, “We need a saliency metric which enables us to rank the filters of the entire network globally, that is across layers. Molchanov et al. [3] propose a pruning framework which leverages a first-order Taylor approximation for global channel ranking, whereas Theis et al. [25] use Fisher information to achieve kernel ranking across layers [26]. Our channel ranking approach is based on Molchanov et al. [3] although both methods would be applicable”). Further, Theis notes that its method is similar to the one used by Molchanov, but provides a more principled motivation (Theis, Page 4, Paragraph 2, Lines 1-4, “We note that this pruning signal is very similar to the one used by Molchanov et al. [8] – which uses absolute gradients instead of squared gradients and a certain normalization of the pruning signal – but our derivation provides a more principled motivation”).
Theis does not explicitly teach the policy being a vector nor wherein values included in the policy vector cause at least one layer of the Policy model to skip processing during an inference phase.
Luo teaches the policy being a vector and wherein values included in the policy vector cause at least one layer of the Policy model to skip processing during an inference phase (Luo, Page 3, Section 3.1, Paragraph 2, Lines 1-4, “After training, the binary index code is used for filter pruning. All the filters in previous layer and all the channels in the filters of the next layer will be removed if their corresponding index value is 0”; see also Luo, Page 3, Figure 2, “$X \in B^{C}$” is considered to be the “policy vector” which is used in “element-wise multiplication” to result in an output where “some layers are pruned”).
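The element-wise multiplication by a binary index code described in the Luo citation above can be sketched as follows. This is a minimal illustration under assumed names (`apply_index_code`, `kept_channels`, `channel_outputs`, `index_code`) and made-up values; it is not Luo's implementation and does not reproduce how the index code is learned.

```python
def apply_index_code(channel_outputs, index_code):
    """Multiply each channel's output by its 0/1 index value.

    Channels whose index value is 0 are zeroed out, which corresponds to
    the filters that would be removed after training.
    """
    if len(channel_outputs) != len(index_code):
        raise ValueError("one index value is required per channel")
    return [[value * bit for value in channel]
            for channel, bit in zip(channel_outputs, index_code)]


def kept_channels(index_code):
    """Return the indices of channels that survive (index value 1)."""
    return [i for i, bit in enumerate(index_code) if bit == 1]


# Hypothetical outputs of 3 channels and a hypothetical binary index code.
channel_outputs = [[0.2, 0.4], [1.0, 3.0], [5.0, 6.0]]
index_code = [1, 0, 1]
masked = apply_index_code(channel_outputs, index_code)
print(masked)                     # [[0.2, 0.4], [0.0, 0.0], [5.0, 6.0]]
print(kept_channels(index_code))  # [0, 2]
```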
It would have been obvious to a person of ordinary skill in the art, before the effective filing date of the invention, to have modified the information processing method of the proposed combination to include using a policy vector and skipping layers during inference as taught by Luo. The motivation for doing so would have been that the binarization of the vector allows for pruning and fine-tuning to be integrated together (Luo, Page 4, Section 3.1.3, Paragraph 4).
Regarding claim 2, the rejection of claim 1 is incorporated, and further, the proposed combination teaches incremental training code configured to cause the at least one processor to operate as an Incremental ANN model trainer configured to train the ANN model incrementally from the input ANN model with the New training data including pairs of input and output of training and validation for an incremental training phase (Spasov, Page 6, Section 3.3, Line 1, “All network models are trained using SGD with Nesterov momentum of 0.9 without dampening”; Spasov, Page 2, Section 2, Lines 1-6, “We consider a supervised learning problem with a set of training examples $D = \{X = \{x_1, x_2, \ldots, x_N\}, Y = \{y_1, y_2, \ldots, y_N\}\}$, where x and y represent an input and a label, respectively. Given a CNN model with L convolutional layers, let each layer $l \in 1 \ldots L$ comprise $K_l$ channels, $C_l^k$, where $k \in 1 \ldots K_l$ is the channel index. In each training step t, we 1) sample a batch of B data samples $(x^{1:B}, y^{1:B})$; 2) select and activate a subset S of convolutional channels (see Figure 2); 3) run a forward and a backward pass on the “thin” network, that is only on the active channels; and 4) observe the revealed saliency estimates ($SAL^k_{l,t}$) of the activated channels” The training is done in “step[s] t” which is considered to be “train[ing] the ANN model incrementally”);
the computation code is further configured to compute the information matrix of each sample in the New training data using the training information (Spasov, Page 3, Section 2.2, Lines 2-5, “We need a saliency metric which enables us to rank the filters of the entire network globally, that is across layers. … Theis et al. [25] use Fisher information to achieve kernel ranking across layers [26]”; Theis, Page 3, Section 2.1, Lines 1-2, “Our goal is to remove feature maps or parameters which contribute little to the overall performance of the model”; Theis, Page 4, Paragraph 1, Lines 1-4, “For convolutional architectures, it makes sense to try to prune entire feature maps instead of individual parameters, since typical implementations of convolutions may not be able to exploit sparse kernels for speedups. Let
$a_{nkij}$ be the activation of the kth feature map at spatial location i,j for the nth datapoint”; Theis, Page 4, Paragraph 1, Lines 7-9, “The gradient of the loss for the nth datapoint with respect to $m_k$ is $g_{nk} = -\sum_{ij} a_{nkij} \frac{\partial}{\partial a_{nkij}} \log Q(z_n \mid I_n)$ (9) and the pruning signal is therefore $\Delta_k = \frac{1}{2N} \sum_n g_{nk}^2$, since $m_k^2 = 1$ before pruning”; Theis, Page 18, Paragraph 3, and Equations 27-32) and
incremental policy training code configured to cause the at least one processor to operate as an Incremental policy model trainer configured to train the Policy model incrementally from the input Policy model by using, as teacher data, a policy … which can be determined by comparing a threshold with the information matrix (Theis, Page 5, Section 2.3, Paragraph 2, Lines 1-3, and Equation 14, “For a given β, a feature should be pruned if Equation 13 is negative, that is, when doing so reduces the overall cost because it decreases the computational cost more than it increases the cross-entropy: ∆Li + β · ∆Ci ≤ 0 (14)” “0” is considered to be the “threshold”) using the New training data (Spasov, Page 2, Section 2, Lines 4-5, “2) select and activate a subset S of convolutional channels (see Figure 2)”; Spasov, Page 4, Algorithm 1, Steps 1-2, 4-11; Spasov, Page 3, Section 2.1, Paragraph 2, Lines 1-3, “Algorithm 1 can be loosely divided in two stages: firstly, an initialization round of exploring the saliencies of the channels in the network, and a second stage where we start exploiting and refining the initial saliency estimates to guide the channel selection procedure”; Spasov, Page 3, Section 2.2, Lines 1-2, “Our dynamic channel selection framework requires the estimation of channel saliency, or the contribution of each active channel to the overall network performance” The “channel selection framework” shown in Algorithm 1 is considered to be the “Policy model”; because the “initial saliency estimates” are exploited and refined, the model is considered to be trained using at least part of Algorithm 1; Algorithm 1 shows that the policy model is trained iteratively via the “for” loops) and the Information matrix (Spasov, Page 3, Section 2.1, Paragraph 2, Lines 1-3, “Algorithm 1 can be loosely divided in two stages: firstly, an initialization round of exploring the saliencies of the channels in the network, and a second stage where we start exploiting and refining the initial saliency estimates to guide the channel selection procedure”; Spasov, Page 3, Section 2.2, Lines 1-2, “Our dynamic channel selection framework requires the estimation of channel saliency, or the contribution of each active channel to the overall network performance”; Theis, Page 4, Paragraph 1, Lines 1-4, “For convolutional architectures, it makes sense to try to prune entire feature maps instead of individual parameters, since typical implementations of convolutions may not be able to exploit sparse kernels for speedups. Let
$a_{nkij}$ be the activation of the kth feature map at spatial location i,j for the nth datapoint”; Theis, Page 4, Paragraph 1, Lines 7-9, “The gradient of the loss for the nth datapoint with respect to $m_k$ is $g_{nk} = -\sum_{ij} a_{nkij} \frac{\partial}{\partial a_{nkij}} \log Q(z_n \mid I_n)$ (9) and the pruning signal is therefore $\Delta_k = \frac{1}{2N} \sum_n g_{nk}^2$, since $m_k^2 = 1$ before pruning”; Theis, Page 18, Paragraph 3, and Equations 27-32; Spasov uses the “saliency estimates” to train the “channel selection framework” and Theis teaches using the Fisher Information matrix to estimate a “pruning signal” which is considered to be equivalent to the “saliency estimates”, therefore, using the Fisher Information Matrix method of Theis to estimate the “saliency”, the policy model of Spasov is trained using the information matrix).
Luo teaches the policy being a vector (Luo, Page 3, Section 3.1, Paragraph 2, Lines 1-4, “After training, the binary index code is used for filter pruning. All the filters in previous layer and all the channels in the filters of the next layer will be removed if their corresponding index value is 0”; see also Luo, Page 3, Figure 2, “
$X \in B^{C}$” is considered to be the “policy vector”).
Regarding claim 4, the rejection of claim 1 is incorporated, and further, the proposed combination teaches wherein the Policy model is a light-weight Policy model based on a traditional machine learning model with a supervised learning (Spasov, Page 3, Lines 1-2, “We consider a supervised learning problem with a set of training examples D = {X = {x1, x2, . . . , xN}, Y = {y1, y2, . . . , yN}}, where x and y represent an input and a label, respectively” The training algorithm takes training samples and labels as input, and thus is considered a “supervised learning”; Spasov, Page 3, Section 2.2, Lines 1-2, “Our dynamic channel selection framework requires the estimation of channel saliency, or the contribution of each active channel to the overall network performance”; Spasov, Page 2, Paragraph 1, Lines 10-13, “we formulate the channel selection problem as a combinatorial multi-armed bandit problem, and propose to use the combinatorial upper confidence bound algorithm (CUCB) [5] to solve it. An advantage of this approach is that we do not require additional model complexity to implement the channel selection procedure”, See also Figure 2, “Compute channel saliencies” is shown as occurring during “Backward pass”; Theis, Page 4, Paragraph 1, Lines 9-11, “The gradient with respect to the activations is available during the backward pass of computing the network’s gradient and the pruning signal can therefore be computed at little extra computational cost” Because the policy model adds little to no additional model complexity, it is considered to be “light-weight”).
Regarding claim 5, Spasov teaches an information processing method (Spasov, Abstract, Lines 3-6, “in this work, we propose a novel method which reduces the memory footprint and number of computing operations required for training and inference. Our framework efficiently integrates pruning as part of the training procedure by exploring and tracking the relative importance of convolutional channels”) comprising:
training an ANN (artificial neural networks) model using training data (Spasov, Page 6, Section 3.3, Line 1, “All network models are trained using SGD with Nesterov momentum of 0.9 without dampening”; Spasov, Page 2, Section 2, Lines 1-6, “We consider a supervised learning problem with a set of training examples D = {X = {x1, x2, . . . , xN}, Y = {y1, y2, . . . , yN}}, where x and y represent an input and a label, respectively. Given a CNN model with L convolutional layers, let each layer l ∈ 1 . . . L comprise Kl channels, C k l , where k ∈ 1 . . . Kl is the channel index. In each training step t, we 1) sample a batch of B data samples (x 1:B, y 1:B); 2) select and activate a subset S of convolutional channels (see Figure 2); 3) run a forward and a backward pass on the “thin” network, that is only on the active channels; and 4) observe the revealed saliency estimates (SALk l,t) of the activated channels” The “CNN model with L convolutional layers” is considered to be the “ANN model” that is trained using “a set of training examples” which are considered to be the “training data”);
training a policy model using the training data (Spasov, Page 2, Section 2, Lines 4-5, “2) select and activate a subset S of convolutional channels (see Figure 2)”; Spasov, Page 4, Algorithm 1, Steps 1-2, 4-11; Spasov, Page 3, Section 2.1, Paragraph 2, Lines 1-3, “Algorithm 1 can be loosely divided in two stages: firstly, an initialization round of exploring the saliencies of the channels in the network, and a second stage where we start exploiting and refining the initial saliency estimates to guide the channel selection procedure” Spasov, Page 3, Section 2.2, Lines 1-2, “Our dynamic channel selection framework requires the estimation of channel saliency, or the contribution of each active channel to the overall network performance” The “channel selection framework” shown in Algorithm 1 is considered to be the “Policy model”; because the “initial saliency estimates” are exploited and refined, the model is considered to be trained using at least part of Algorithm 1).
Spasov does not explicitly teach computing an information matrix of each sample in the training data using training information extracted during the ANN model training, nor training the policy model by using, as teacher data, a policy vector which can be determined by comparing a threshold with the information matrix, using the training data and the Information matrix, wherein values included in the policy vector cause at least one layer of the Policy model to skip processing during an inference phase.
Theis teaches an Information matrix computation unit configured to compute information matrix of each sample in the training data using training information extracted during the ANN model training (Theis, Page 3, Section 2.1, Lines 1-2, “Our goal is to remove feature maps or parameters which contribute little to the overall performance of the model”; Theis, Page 4, Paragraph 1, Lines 1-4, “For convolutional architectures, it makes sense to try to prune entire feature maps instead of individual parameters, since typical implementations of convolutions may not be able to exploit sparse kernels for speedups. Let
$a_{nkij}$ be the activation of the kth feature map at spatial location i,j for the nth datapoint”; Theis, Page 4, Paragraph 1, Lines 7-9, “The gradient of the loss for the nth datapoint with respect to $m_k$ is $g_{nk} = -\sum_{ij} a_{nkij} \frac{\partial}{\partial a_{nkij}} \log Q(z_n \mid I_n)$ (9) and the pruning signal is therefore $\Delta_k = \frac{1}{2N} \sum_n g_{nk}^2$, since $m_k^2 = 1$ before pruning”; Theis, Page 18, Paragraph 3, and Equations 27-32).
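For illustration, the per-datapoint gradient of Equation 9 and the resulting pruning signal quoted above from Theis can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not code from Theis or Spasov: the function name fisher_pruning_signal, the nested-list layout (datapoint n, feature map k, flattened spatial positions ij), and the availability of the derivatives of log Q(z_n | I_n) with respect to each activation as an input are all hypothetical.

```python
def fisher_pruning_signal(activations, grad_log_q):
    """Sketch of the pruning signal Delta_k quoted from Theis.

    Assumed (hypothetical) layout:
      activations[n][k] -> list of a_{nkij} over all spatial positions ij
      grad_log_q[n][k]  -> list of d log Q(z_n | I_n) / d a_{nkij}, same order
    Returns a list with one Delta_k per feature map k.
    """
    n_samples = len(activations)
    n_maps = len(activations[0])
    signal = [0.0] * n_maps
    for a_n, d_n in zip(activations, grad_log_q):
        for k in range(n_maps):
            # g_{nk} = - sum_{ij} a_{nkij} * d/da_{nkij} log Q(z_n | I_n)  (Eq. 9)
            g_nk = -sum(a * d for a, d in zip(a_n[k], d_n[k]))
            # accumulate the squared per-datapoint gradient
            signal[k] += g_nk ** 2
    # Delta_k = 1 / (2N) * sum_n g_{nk}^2
    return [s / (2.0 * n_samples) for s in signal]
```

Under these assumptions, a feature map whose gradient is zero for every datapoint receives a pruning signal of zero, which is consistent with the quoted goal of removing feature maps that contribute little to the overall performance of the model.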
Spasov in view of Theis also teaches training the policy model by using, as teacher data, a policy … which can be determined by comparing a threshold with the information matrix (Theis, Page 5, Section 2.3, Paragraph 2, Lines 1-3, and Equation 14, “For a given β, a feature should be pruned if Equation 13 is negative, that is, when doing so reduces the overall cost because it decreases the computational cost more than it increases the cross-entropy: ∆Li + β · ∆Ci ≤ 0 (14)” “0” is considered to be the “threshold”), using the training data and the Information matrix (Spasov, Page 3, Section 2.1, Paragraph 2, Lines 1-3, “Algorithm 1 can be loosely divided in two stages: firstly, an initialization round of exploring the saliencies of the channels in the network, and a second stage where we start exploiting and refining the initial saliency estimates to guide the channel selection procedure” Spasov, Page 3, Section 2.2, Lines 1-2, “Our dynamic channel selection framework requires the estimation of channel saliency, or the contribution of each active channel to the overall network performance”; Theis, Page 4, Paragraph 1, Lines 1-4, “For convolutional architectures, it makes sense to try to prune entire feature maps instead of individual parameters, since typical implementations of convolutions may not be able to exploit sparse kernels for speedups. Let
$a_{nkij}$ be the activation of the kth feature map at spatial location i,j for the nth datapoint”; Theis, Page 4, Paragraph 1, Lines 7-9, “The gradient of the loss for the nth datapoint with respect to $m_k$ is $g_{nk} = -\sum_{ij} a_{nkij} \frac{\partial}{\partial a_{nkij}} \log Q(z_n \mid I_n)$ (9) and the pruning signal is therefore $\Delta_k = \frac{1}{2N} \sum_n g_{nk}^2$, since $m_k^2 = 1$ before pruning”; Theis, Page 18, Paragraph 3, and Equations 27-32; Spasov uses the “saliency estimates” to train the “channel selection framework” and Theis teaches using the Fisher Information matrix to estimate a “pruning signal” which is considered to be equivalent to the “saliency estimates”, therefore, using the Fisher Information Matrix method of Theis to estimate the “saliency”, the policy model of Spasov is trained using the information matrix).
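The comparison relied upon above (Theis, Equation 14) can be illustrated by the following minimal sketch. It is not taken from Theis: the function name policy_from_threshold and the output encoding (1 = keep, 0 = prune, chosen here only to mirror a binary index code) are assumptions for illustration. The inputs stand for the estimated changes ∆Li and ∆Ci and the weight β of the quoted passage, and the default threshold of 0 corresponds to the “0” identified as the “threshold”.

```python
def policy_from_threshold(delta_loss, delta_cost, beta, threshold=0.0):
    """Return one keep/prune decision per feature (hypothetical encoding).

    A feature i is marked for pruning (0) when
        delta_loss[i] + beta * delta_cost[i] <= threshold
    and kept (1) otherwise, following the inequality quoted as Eq. 14.
    """
    if len(delta_loss) != len(delta_cost):
        raise ValueError("delta_loss and delta_cost must have the same length")
    policy = []
    for d_l, d_c in zip(delta_loss, delta_cost):
        total = d_l + beta * d_c  # combined change in loss and weighted cost
        policy.append(0 if total <= threshold else 1)
    return policy
```

For example, with a cost change of -1.0 for every feature and β = 0.1, a feature whose removal increases the loss by 0.5 is kept, while a feature whose removal increases the loss by only 0.01 is marked for pruning.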
It would have been obvious, to a person of ordinary skill in the art, before the effective filing date of the invention to have modified the information processing method of Spasov to use the Fisher information matrix method of kernel ranking across layers as taught by Theis. The motivation for doing so would have been that Spasov presented the method taught by Theis as an alternative choice to the kernel ranking method taught by Spasov (Spasov, Page 3, Section 2.2, Lines 2-6, “We need a saliency metric which enables us to rank the filters of the entire network globally, that is across layers. Molchanov et al. [3] propose a pruning framework which leverages a first-order Taylor approximation for global channel ranking, whereas Theis et al. [25] use Fisher information to achieve kernel ranking across layers [26]. Our channel ranking approach is based on Molchanov et al. [3] although both methods would be applicable”), further, Theis also notes that the method is similar to the one used by Molchanov, but provides a more principled motivation (Theis, Page 4, Paragraph 2, Lines 1-4, “We note that this pruning signal is very similar to the one used by Molchanov et al. [8] – which uses absolute gradients instead of squared gradients and a certain normalization of the pruning signal – but our derivation provides a more principled motivation”).
Theis does not explicitly teach the policy being a vector nor wherein values included in the policy vector cause at least one layer of the Policy model to skip processing during an inference phase.
Luo teaches the policy being a vector and wherein values included in the policy vector cause at least one layer of the Policy model to skip processing during an inference phase (Luo, Page 3, Section 3.1, Paragraph 2, Lines 1-4, “After training, the binary index code is used for filter pruning. All the filters in previous layer and all the channels in the filters of the next layer will be removed if their corresponding index value is 0”; see also Luo, Page 3, Figure 2, “
$X \in B^{C}$” is considered to be the “policy vector” which is used in “element-wise multiplication” to result in an output where “some layers are pruned”).
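The effect attributed above to the binary index code of Luo can be sketched as follows. This is an illustrative sketch only, not Luo's implementation: the names apply_index_code and surviving_channels and the list-of-lists layout (one inner list of values per channel) are hypothetical.

```python
def apply_index_code(channel_outputs, index_code):
    """Element-wise multiplication of each channel by its binary index value.

    channel_outputs: list with one list of values per channel (assumed layout)
    index_code:      list of 0/1 values, one per channel
    Channels whose index value is 0 are zeroed out.
    """
    if len(channel_outputs) != len(index_code):
        raise ValueError("one index value is required per channel")
    return [[value * bit for value in channel]
            for channel, bit in zip(channel_outputs, index_code)]


def surviving_channels(index_code):
    """Indices of channels that remain after those with index value 0 are removed."""
    return [c for c, bit in enumerate(index_code) if bit != 0]
```

With an index code of [1, 0, 1], the second channel contributes only zeros to the output and can be removed, which leaves channels 0 and 2.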
It would have been obvious to a person of ordinary skill in the art, before the effective filing date of the invention, to have modified the information processing method of the proposed combination to include using a policy vector and skipping layers during inference as taught by Luo. The motivation for doing so would have been that the binarization of the vector allows for pruning and fine-tuning to be integrated together (Luo, Page 4, Section 3.1.3, Paragraph 4).
Regarding claim 6, the rejection of claim 5 is incorporated, and further, the proposed combination teaches training the ANN model incrementally from the input ANN model with a new training data including pairs of input and output for training and validation in an incremental training phase (Spasov, Page 6, Section 3.3, Line 1, “All network models are trained using SGD with Nesterov momentum of 0.9 without dampening”; Spasov, Page 2, Section 2, Lines 1-6, “We consider a supervised learning problem with a set of training examples D = {X = {x1, x2, . . . , xN}, Y = {y1, y2, . . . , yN}}, where x and y represent an input and a label, respectively. Given a CNN model with L convolutional layers, let each layer l ∈ 1 . . . L comprise Kl channels, C k l , where k ∈ 1 . . . Kl is the channel index. In each training step t, we 1) sample a batch of B data samples (x 1:B, y 1:B); 2) select and activate a subset S of convolutional channels (see Figure 2); 3) run a forward and a backward pass on the “thin” network, that is only on the active channels; and 4) observe the revealed saliency estimates (SALk l,t) of the activated channels” The training is done in “step[s] t”, which is considered to be “train[ing] the ANN model incrementally”);
computing the information matrix of each sample in the new training data using the training information (Spasov, Page 3, Section 2.2, Lines 2-5, “We need a saliency metric which enables us to rank the filters of the entire network globally, that is across layers. … Theis et al. [25] use Fisher information to achieve kernel ranking across layers [26]”; Theis, Page 3, Section 2.1, Lines 1-2, “Our goal is to remove feature maps or parameters which contribute little to the overall performance of the model”; Theis, Page 4, Paragraph 1, Lines 1-4, “For convolutional architectures, it makes sense to try to prune entire feature maps instead of individual parameters, since typical implementations of convolutions may not be able to exploit sparse kernels for speedups. Let
$a_{nkij}$ be the activation of the kth feature map at spatial location i,j for the nth datapoint”; Theis, Page 4, Paragraph 1, Lines 7-9, “The gradient of the loss for the nth datapoint with respect to $m_k$ is $g_{nk} = -\sum_{ij} a_{nkij} \frac{\partial}{\partial a_{nkij}} \log Q(z_n \mid I_n)$ (9) and the pruning signal is therefore $\Delta_k = \frac{1}{2N} \sum_n g_{nk}^2$, since $m_k^2 = 1$ before pruning”; Theis, Page 18, Paragraph 3, and Equations 27-32); and
training the policy model incrementally from the input policy model by using, as teacher data, a policy … which can be determined by comparing a threshold with the information matrix (Theis, Page 5, Section 2.3, Paragraph 2, Lines 1-3, and Equation 14, “For a given β, a feature should be pruned if Equation 13 is negative, that is, when doing so reduces the overall cost because it decreases the computational cost more than it increases the cross-entropy: ∆Li + β · ∆Ci ≤ 0 (14)” “0” is considered to be the “threshold”), using the new training data (Spasov, Page 3, Section 2.2, Lines 2-5, “We need a saliency metric which enables us to rank the filters of the entire network globally, that is across layers. … Theis et al. [25] use Fisher information to achieve kernel ranking across layers [26]”; Theis, Page 3, Section 2.1, Lines 1-2, “Our goal is to remove feature maps or parameters which contribute little to the overall performance of the model”; Theis, Page 4, Paragraph 1, Lines 1-4, “For convolutional architectures, it makes sense to try to prune entire feature maps instead of individual parameters, since typical implementations of convolutions may not be able to exploit sparse kernels for speedups. Let
$a_{nkij}$ be the activation of the kth feature map at spatial location i,j for the nth datapoint”; Theis, Page 4, Paragraph 1, Lines 7-9, “The gradient of the loss for the nth datapoint with respect to $m_k$ is $g_{nk} = -\sum_{ij} a_{nkij} \frac{\partial}{\partial a_{nkij}} \log Q(z_n \mid I_n)$ (9) and the pruning signal is therefore $\Delta_k = \frac{1}{2N} \sum_n g_{nk}^2$, since $m_k^2 = 1$ before pruning”; Theis, Page 18, Paragraph 3, and Equations 27-32) and the information matrix (Spasov, Page 3, Section 2.1, Paragraph 2, Lines 1-3, “Algorithm 1 can be loosely divided in two stages: firstly, an initialization round of exploring the saliencies of the channels in the network, and a second stage where we start exploiting and refining the initial saliency estimates to guide the channel selection procedure”; Spasov, Page 3, Section 2.2, Lines 1-2, “Our dynamic channel selection framework requires the estimation of channel saliency, or the contribution of each active channel to the overall network performance”; Theis, Page 4, Paragraph 1, Lines 1-4, “For convolutional architectures, it makes sense to try to prune entire feature maps instead of individual parameters, since typical implementations of convolutions may not be able to exploit sparse kernels for speedups. Let
$a_{nkij}$ be the activation of the kth feature map at spatial location i,j for the nth datapoint”; Theis, Page 4, Paragraph 1, Lines 7-9, “The gradient of the loss for the nth datapoint with respect to $m_k$ is $g_{nk} = -\sum_{ij} a_{nkij} \frac{\partial}{\partial a_{nkij}} \log Q(z_n \mid I_n)$ (9) and the pruning signal is therefore $\Delta_k = \frac{1}{2N} \sum_n g_{nk}^2$, since $m_k^2 = 1$ before pruning”; Theis, Page 18, Paragraph 3, and Equations 27-32; Spasov uses the “saliency estimates” to train the “channel selection framework” and Theis teaches using the Fisher Information matrix to estimate a “pruning signal” which is considered to be equivalent to the “saliency estimates”, therefore, using the Fisher Information Matrix method of Theis to estimate the “saliency”, the policy model of Spasov is trained using the information matrix).
Luo teaches the policy being a vector (Luo, Page 3, Section 3.1, Paragraph 2, Lines 1-4, “After training, the binary index code is used for filter pruning. All the filters in previous layer and all the channels in the filters of the next layer will be removed if their corresponding index value is 0”; see also Luo, Page 3, Figure 2, “
$X \in B^{C}$” is considered to be the “policy vector”).
Regarding claim 8, the rejection of claim 5 is incorporated, and further, Spasov teaches wherein the Policy model is a light-weight Policy model based on a traditional machine learning model with a supervised learning (Spasov, Page 3, Lines 1-2, “We consider a supervised learning problem with a set of training examples D = {X = {x1, x2, . . . , xN}, Y = {y1, y2, . . . , yN}}, where x and y represent an input and a label, respectively” The training algorithm takes training samples and labels as input, and thus is considered a “supervised learning”; Spasov, Page 3, Section 2.2, Lines 1-2, “Our dynamic channel selection framework requires the estimation of channel saliency, or the contribution of each active channel to the overall network performance”; Spasov, Page 2, Paragraph 1, Lines 10-13, “we formulate the channel selection problem as a combinatorial multi-armed bandit problem, and propose to use the combinatorial upper confidence bound algorithm (CUCB) [5] to solve it. An advantage of this approach is that we do not require additional model complexity to implement the channel selection procedure”, See also Figure 2, “Compute channel saliencies” is shown as occurring during “Backward pass”; Theis, Page 4, Paragraph 1, Lines 9-11, “The gradient with respect to the activations is available during the backward pass of computing the network’s gradient and the pruning signal can therefore be computed at little extra computational cost” Because the policy model adds little to no additional model complexity, it is considered to be “light-weight”).
Regarding claim 9, Spasov teaches a non-transitory computer readable medium storing a program for causing a computer to execute an information processing method (Spasov, Abstract, Lines 3-6, “in this work, we propose a novel method which reduces the memory footprint and number of computing operations required for training and inference. Our framework efficiently integrates pruning as part of the training procedure by exploring and tracking the relative importance of convolutional channels” The method of Spasov reduces memory footprint and the number of computing operations, which provides evidence that the method of Spasov is performed on a computer/processor using a non-transitory computer readable medium), the information processing method comprising:
training an ANN (artificial neural networks) model using training data (Spasov, Page 6, Section 3.3, Line 1, “All network models are trained using SGD with Nesterov momentum of 0.9 without dampening”; Spasov, Page 2, Section 2, Lines 1-6, “We consider a supervised learning problem with a set of training examples D = {X = {x1, x2, . . . , xN}, Y = {y1, y2, . . . , yN}}, where x and y represent an input and a label, respectively. Given a CNN model with L convolutional layers, let each layer l ∈ 1 . . . L comprise Kl channels, C k l , where k ∈ 1 . . . Kl is the channel index. In each training step t, we 1) sample a batch of B data samples (x 1:B, y 1:B); 2) select and activate a subset S of convolutional channels (see Figure 2); 3) run a forward and a backward pass on the “thin” network, that is only on the active channels; and 4) observe the revealed saliency estimates (SALk l,t) of the activated channels” The “CNN model with L convolutional layers” is considered to be the “ANN model” that is trained using “a set of training examples” which are considered to be the “training data”);
training a policy model … using the training data (Spasov, Page 2, Section 2, Lines 4-5, “2) select and activate a subset S of convolutional channels (see Figure 2)”; Spasov, Page 4, Algorithm 1, Steps 1-2, 4-11; Spasov, Page 3, Section 2.1, Paragraph 2, Lines 1-3, “Algorithm 1 can be loosely divided in two stages: firstly, an initialization round of exploring the saliencies of the channels in the network, and a second stage where we start exploiting and refining the initial saliency estimates to guide the channel selection procedure” Spasov, Page 3, Section 2.2, Lines 1-2, “Our dynamic channel selection framework requires the estimation of channel saliency, or the contribution of each active channel to the overall network performance” The “channel selection framework” shown in Algorithm 1 is considered to be the “Policy model”; because the “initial saliency estimates” are exploited and refined, the model is considered to be trained using at least part of Algorithm 1).
Spasov does not explicitly teach computing an information matrix of each sample in the training data using training information extracted during the ANN model training, nor training the policy model by using, as teacher data, a policy vector which can be determined by comparing a threshold with the information matrix, using the training data and the Information matrix, wherein values included in the policy vector cause at least one layer of the Policy model to skip processing during an inference phase.
Theis teaches computing an information matrix of each sample in the training data using training information extracted during the ANN model training (Theis, Page 3, Section 2.1, Lines 1-2, “Our goal is to remove feature maps or parameters which contribute little to the overall performance of the model”; Theis, Page 4, Paragraph 1, Lines 1-4, “For convolutional architectures, it makes sense to try to prune entire feature maps instead of individual parameters, since typical implementations of convolutions may not be able to exploit sparse kernels for speedups. Let
$a_{nkij}$ be the activation of the kth feature map at spatial location i,j for the nth datapoint”; Theis, Page 4, Paragraph 1, Lines 7-9, “The gradient of the loss for the nth datapoint with respect to $m_k$ is $g_{nk} = -\sum_{ij} a_{nkij} \frac{\partial}{\partial a_{nkij}} \log Q(z_n \mid I_n)$ (9) and the pruning signal is therefore $\Delta_k = \frac{1}{2N} \sum_n g_{nk}^2$, since $m_k^2 = 1$ before pruning”; Theis, Page 18, Paragraph 3, and Equations 27-32).
Spasov in view of Theis also teaches training the policy model by using, as teacher data, a policy … which can be determined by comparing a threshold with the information matrix (Theis, Page 5, Section 2.3, Paragraph 2, Lines 1-3, and Equation 14, “For a given β, a feature should be pruned if Equation 13 is negative, that is, when doing so reduces the overall cost because it decreases the computational cost more than it increases the cross-entropy: ∆Li + β · ∆Ci ≤ 0 (14)” “0” is considered to be the “threshold”), using the training data and the Information matrix (Spasov, Page 3, Section 2.1, Paragraph 2, Lines 1-3, “Algorithm 1 can be loosely divided in two stages: firstly, an initialization round of exploring the saliencies of the channels in the network, and a second stage where we start exploiting and refining the initial saliency estimates to guide the channel selection procedure” Spasov, Page 3, Section 2.2, Lines 1-2, “Our dynamic channel selection framework requires the estimation of channel saliency, or the contribution of each active channel to the overall network performance”; Theis, Page 4, Paragraph 1, Lines 1-4, “For convolutional architectures, it makes sense to try to prune entire feature maps instead of individual parameters, since typical implementations of convolutions may not be able to exploit sparse kernels for speedups. Let
$a_{nkij}$ be the activation of the kth feature map at spatial location i,j for the nth datapoint”; Theis, Page 4, Paragraph 1, Lines 7-9, “The gradient of the loss for the nth datapoint with respect to $m_k$ is $g_{nk} = -\sum_{ij} a_{nkij} \frac{\partial}{\partial a_{nkij}} \log Q(z_n \mid I_n)$ (9) and the pruning signal is therefore $\Delta_k = \frac{1}{2N} \sum_n g_{nk}^2$, since $m_k^2 = 1$ before pruning”; Theis, Page 18, Paragraph 3, and Equations 27-32; Spasov uses the “saliency estimates” to train the “channel selection framework” and Theis teaches using the Fisher Information matrix to estimate a “pruning signal” which is considered to be equivalent to the “saliency estimates”, therefore, using the Fisher Information Matrix method of Theis to estimate the “saliency”, the policy model of Spasov is trained using the information matrix).
It would have been obvious, to a person of ordinary skill in the art, before the effective filing date of the invention to have modified the information processing method of Spasov to use the Fisher information matrix method of kernel ranking across layers as taught by Theis. The motivation for doing so would have been that Spasov presented the method taught by Theis as an alternative choice to the kernel ranking method taught by Spasov (Spasov, Page 3, Section 2.2, Lines 2-6, “We need a saliency metric which enables us to rank the filters of the entire network globally, that is across layers. Molchanov et al. [3] propose a pruning framework which leverages a first-order Taylor approximation for global channel ranking, whereas Theis et al. [25] use Fisher information to achieve kernel ranking across layers [26]. Our channel ranking approach is based on Molchanov et al. [3] although both methods would be applicable”), further, Theis also notes that the method is similar to the one used by Molchanov, but provides a more principled motivation (Theis, Page 4, Paragraph 2, Lines 1-4, “We note that this pruning signal is very similar to the one used by Molchanov et al. [8] – which uses absolute gradients instead of squared gradients and a certain normalization of the pruning signal – but our derivation provides a more principled motivation”).
Theis does not explicitly teach the policy being a vector nor wherein values included in the policy vector cause at least one layer of the Policy model to skip processing during an inference phase.
Luo teaches the policy being a vector and wherein values included in the policy vector cause at least one layer of the Policy model to skip processing during an inference phase (Luo, Page 3, Section 3.1, Paragraph 2, Lines 1-4, “After training, the binary index code is used for filter pruning. All the filters in previous layer and all the channels in the filters of the next layer will be removed if their corresponding index value is 0”; see also Luo, Page 3, Figure 2, “
$X \in B^{C}$” is considered to be the “policy vector” which is used in “element-wise multiplication” to result in an output where “some layers are pruned”).
It would have been obvious to a person of ordinary skill in the art, before the effective filing date of the invention, to have modified the information processing method of the proposed combination to include using a policy vector and skipping layers during inference as taught by Luo. The motivation for doing so would have been that the binarization of the vector allows for pruning and fine-tuning to be integrated together (Luo, Page 4, Section 3.1.3, Paragraph 4).
Regarding claim 10, the rejection of claim 9 is incorporated, and further, the proposed combination teaches training the ANN model incrementally from the input ANN model with a new training data including pairs of input and output for training and validation in an incremental training phase (Spasov, Page 6, Section 3.3, Line 1, “All network models are trained using SGD with Nesterov momentum of 0.9 without dampening”; Spasov, Page 2, Section 2, Lines 1-6, “We consider a supervised learning problem with a set of training examples D = {X = {x1, x2, . . . , xN}, Y = {y1, y2, . . . , yN}}, where x and y represent an input and a label, respectively. Given a CNN model with L convolutional layers, let each layer l ∈ 1 . . . L comprise Kl channels, C k l , where k ∈ 1 . . . Kl is the channel index. In each training step t, we 1) sample a batch of B data samples (x 1:B, y 1:B); 2) select and activate a subset S of convolutional channels (see Figure 2); 3) run a forward and a backward pass on the “thin” network, that is only on the active channels; and 4) observe the revealed saliency estimates (SALk l,t) of the activated channels” The training is done in “step[s] t” which is considered to be “train[ing] the ANN model incrementally”);
computing the information matrix of each sample in the new training data using the training data (Spasov, Page 3, Section 2.2, Lines 2-5, “We need a saliency metric which enables us to rank the filters of the entire network globally, that is across layers. … Theis et al. [25] use Fisher information to achieve kernel ranking across layers [26]”; Theis, Page 3, Section 2.1, Lines 1-2, “Our goal is to remove feature maps or parameters which contribute little to the overall performance of the model”; Theis, Page 4, Paragraph 1, Lines 1-4, “For convolutional architectures, it makes sense to try to prune entire feature maps instead of individual parameters, since typical implementations of convolutions may not be able to exploit sparse kernels for speedups. Let $a_{nkij}$ be the activation of the kth feature map at spatial location i,j for the nth datapoint”; Theis, Page 4, Paragraph 1, Lines 7-9, “The gradient of the loss for the nth datapoint with respect to $m_k$ is $g_{nk} = -\sum_{ij} a_{nkij} \frac{\partial}{\partial a_{nkij}} \log Q(z_n \mid I_n)$ (9) and the pruning signal is therefore $\Delta_k = \frac{1}{2N} \sum_n g_{nk}^2$, since $m_k^2 = 1$ before pruning”; Theis, Page 18, Paragraph 3, and Equations 27-32);
training the policy model incrementally from the input policy model by using, as teacher data, a policy … which can be determined by comparing a threshold with the information matrix (Theis, Page 5, Section 2.3, Paragraph 2, Lines 1-3, and Equation 14, “For a given β, a feature should be pruned if Equation 13 is negative, that is, when doing so reduces the overall cost because it decreases the computational cost more than it increases the cross-entropy: ∆Li + β · ∆Ci ≤ 0 (14)”; “0” is considered to be the “threshold”), using the new training data (Spasov, Page 2, Section 2, Lines 4-5, “2) select and activate a subset S of convolutional channels (see Figure 2)”; Spasov, Page 4, Algorithm 1, Steps 1-2, 4-11; Spasov, Page 3, Section 2.1, Paragraph 2, Lines 1-3, “Algorithm 1 can be loosely divided in two stages: firstly, an initialization round of exploring the saliencies of the channels in the network, and a second stage where we start exploiting and refining the initial saliency estimates to guide the channel selection procedure”; Spasov, Page 3, Section 2.2, Lines 1-2, “Our dynamic channel selection framework requires the estimation of channel saliency, or the contribution of each active channel to the overall network performance” The “channel selection framework” shown in Algorithm 1 is considered to be the “Policy model”; because the “initial saliency estimates” are exploited and refined, the model is considered to be trained using at least part of Algorithm 1; Algorithm 1 shows that the policy model is trained iteratively via the “for” loops) and the information matrix (Spasov, Page 3, Section 2.1, Paragraph 2, Lines 1-3, “Algorithm 1 can be loosely divided in two stages: firstly, an initialization round of exploring the saliencies of the channels in the network, and a second stage where we start exploiting and refining the initial saliency estimates to guide the channel selection procedure”; Spasov, Page 3, Section 2.2, Lines 1-2, “Our dynamic channel selection framework requires the estimation of channel saliency, or the contribution of each active channel to the overall network performance”; Theis, Page 4, Paragraph 1, Lines 1-4, “For convolutional architectures, it makes sense to try to prune entire feature maps instead of individual parameters, since typical implementations of convolutions may not be able to exploit sparse kernels for speedups. Let $a_{nkij}$ be the activation of the kth feature map at spatial location i,j for the nth datapoint”; Theis, Page 4, Paragraph 1, Lines 7-9, “The gradient of the loss for the nth datapoint with respect to $m_k$ is $g_{nk} = -\sum_{ij} a_{nkij} \frac{\partial}{\partial a_{nkij}} \log Q(z_n \mid I_n)$ (9) and the pruning signal is therefore $\Delta_k = \frac{1}{2N} \sum_n g_{nk}^2$, since $m_k^2 = 1$ before pruning”; Theis, Page 18, Paragraph 3, and Equations 27-32; Spasov uses the “saliency estimates” to train the “channel selection framework” and Theis teaches using the Fisher Information matrix to estimate a “pruning signal” which is considered to be equivalent to the “saliency estimates”, therefore, using the Fisher Information Matrix method of Theis to estimate the “saliency”, the policy model of Spasov is trained using the information matrix).
Luo teaches the policy being a vector (Luo, Page 3, Section 3.1, Paragraph 2, Lines 1-4, “After training, the binary index code is used for filter pruning. All the filters in previous layer and all the channels in the filters of the next layer will be removed if their corresponding index value is 0”; see also Luo, Page 3, Figure 2, “$X \in B^{C}$” is considered to be the “policy vector”).
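For illustration only, the mapping relied upon in the rejection of claim 10 can be sketched as follows, assuming the pruning signal quoted from Theis, Δ_k = (1/2N) Σ_n g_nk², and the decision rule of Theis Equation 14, ∆L + β · ∆C ≤ 0, with “0” as the threshold. The sketch is not code from any cited reference; the function names, the convention that a policy entry of 0 means “prune”, and the numeric values are assumptions made for this example.

```python
from typing import List


def fisher_pruning_signal(grads: List[List[float]]) -> List[float]:
    """Compute Delta_k = (1 / 2N) * sum over n of g_nk squared.

    grads[n][k] is assumed to hold g_nk, the per-datapoint gradient for feature map k.
    """
    n = len(grads)
    if n == 0:
        raise ValueError("need at least one datapoint")
    num_maps = len(grads[0])
    return [sum(row[k] ** 2 for row in grads) / (2 * n) for k in range(num_maps)]


def policy_from_signal(delta_loss: List[float], delta_cost: List[float], beta: float) -> List[int]:
    """Return a binary policy vector: 0 (prune) when delta_loss + beta * delta_cost <= 0, else 1."""
    return [0 if dl + beta * dc <= 0 else 1 for dl, dc in zip(delta_loss, delta_cost)]


# Assumed gradients for N = 2 datapoints and 2 feature maps.
grads = [[1.0, 0.0], [3.0, 2.0]]
signal = fisher_pruning_signal(grads)  # (1 + 9) / 4 and (0 + 4) / 4

# Assumed change in computational cost if each map is removed (negative = cheaper).
delta_cost = [-1.0, -1.0]
policy = policy_from_signal(signal, delta_cost, beta=2.0)
```

With these assumed values the first feature map is kept, because its estimated loss increase outweighs the cost saving, and the second is pruned.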
Regarding claim 12, the rejection of claim 9 is incorporated, and further, Spasov teaches wherein the Policy model is a light-weight Policy model based on a traditional machine learning model with a supervised learning (Spasov, Page 3, Lines 1-2, “We consider a supervised learning problem with a set of training examples D = {X = {x1, x2, . . . , xN}, Y = {y1, y2, . . . , yN}}, where x and y represent an input and a label, respectively” The training algorithm takes training samples and labels as input, and thus is considered a “supervised learning”; Spasov, Page 3, Section 2.2, Lines 1-2, “Our dynamic channel selection framework requires the estimation of channel saliency, or the contribution of each active channel to the overall network performance”; Spasov, Page 2, Paragraph 1, Lines 10-13, “we formulate the channel selection problem as a combinatorial multi-armed bandit problem, and propose to use the combinatorial upper confidence bound algorithm (CUCB) [5] to solve it. An advantage of this approach is that we do not require additional model complexity to implement the channel selection procedure”, See also Figure 2, “Compute channel saliencies” is shown as occurring during “Backward pass”; Theis, Page 4, Paragraph 1, Lines 9-11, “The gradient with respect to the activations is available during the backward pass of computing the network’s gradient and the pruning signal can therefore be computed at little extra computational cost” Because the policy model adds little to no additional model complexity, it is considered to be “light-weight”).
Response to Arguments
Applicant’s amendments to the specification with respect to objections to the drawings have been fully considered, and overcome the objections set forth in the nonfinal office action dated 06/25/2025.
Applicant’s amendments to claims 6 and 9-11 with respect to objections to the claims have been fully considered, and overcome the objections set forth in the nonfinal office action dated 06/25/2025. Consequently, the objections to the claims have been withdrawn.
Applicant’s amendments to claims 1-2, 5-6, and 9-10 with respect to the 35 U.S.C. 112(b) indefiniteness rejections have been fully considered, and some of the rejections set forth in the nonfinal office action dated 06/25/2025 have been overcome. Consequently, some of the previous grounds of rejections have been withdrawn. However, there were several 35 U.S.C. 112(b) indefiniteness rejections not addressed with amendments and those have been maintained. Please see the 35 U.S.C. 112(b) indefiniteness rejections above.
Applicant’s arguments regarding the 35 U.S.C. 101 rejections of the claims have been fully considered but are unpersuasive.
Applicant argues, on page 12, paragraphs 2-4 of the response, that independent claims 1, 5, and 9 integrate the judicial exception into a practical application of “an improved machine learning system which trains more computationally efficient policy models than alternative systems”, and that embodiments of the applications “enable an increase in the efficiency of a trained machine learning policy model” and thus “represent an improved method of training a machine learning model”. However, this argument is unpersuasive. Claiming improved speed or efficiency inherent with applying the abstract idea on a computer does not integrate the judicial exception into a practical application or provide an inventive concept, see MPEP 2106.05(f).
Applicant's arguments regarding the remainder of the claims rely upon the arguments asserted with respect to the independent claims, and are thus unpersuasive.
Applicant’s arguments regarding the 35 U.S.C. 103 rejections of the claims have been fully considered but are unpersuasive.
Although new grounds of rejection have been applied, the examiner has determined that a response is necessary for the portion of the remarks [Remarks Page 14, Paragraph 2] wherein the references applied in the prior rejection of record are still being relied upon in the new grounds of rejection to teach or suggest the subject matter being challenged in applicant’s argument.
The remaining remarks, while having been considered, are moot because the new grounds of rejection do not rely on the references applied in the prior rejection of record for the subject matter being challenged in applicant’s argument.
Applicant argues none of the cited prior art teaches or suggests “compute information matrix of each sample in the training data using training information extracted by the ANN model trainer”. Examiner respectfully disagrees. Spasov in view of Theis teaches this limitation (Theis, Page 4, Paragraph 1, Lines 1-4; Theis, Page 18, Paragraph 3, and Equations 27-32). Applicant has made a mere allegation of patentability without disputing the previous rejection including how Applicant’s claims are different from the cited art. Please see the updated 35 U.S.C. 103 rejection above for further details.
Applicant's arguments regarding the remainder of the claims rely upon the arguments asserted with respect to the independent claims, and are thus unpersuasive.
Conclusion
Applicant's amendment necessitated the new ground(s) of rejection presented in this Office action. Accordingly, THIS ACTION IS MADE FINAL. See MPEP § 706.07(a). Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a).
A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action. In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any nonprovisional extension fee (37 CFR 1.17(a)) pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action. In no event, however, will the statutory period for reply expire later than SIX MONTHS from the mailing date of this final action.
Any inquiry concerning this communication or earlier communications from the examiner should be directed to MOLLY CLARKE SIPPEL whose telephone number is (571)272-3270. The examiner can normally be reached Monday - Friday, 7:30 a.m. - 4:30 p.m. ET.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Kakali Chaki, can be reached at (571)272-3719. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.
/M.C.S./Examiner, Art Unit 2122
/KAKALI CHAKI/Supervisory Patent Examiner, Art Unit 2122