DETAILED ACTION
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Priority
Receipt is acknowledged of certified copies of papers required by 37 CFR 1.55.
Information Disclosure Statement
The information disclosure statement (IDS) submitted on 01/10/2023 is in compliance with the provisions of 37 CFR 1.97. Accordingly, the information disclosure statement is being considered by the examiner.
Claim Objections
Claim 16 is objected to because of the following informalities:
In claim 16, line 2, “the program” should read “the computer program” to properly reference “a computer program” in line 1.
Appropriate correction is required.
Claim Rejections - 35 USC § 112
The following is a quotation of 35 U.S.C. 112(b):
(b) CONCLUSION.—The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the inventor or a joint inventor regards as the invention.
The following is a quotation of 35 U.S.C. 112 (pre-AIA), second paragraph:
The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the applicant regards as his invention.
Claims 5 and 10-13 are rejected under 35 U.S.C. 112(b) or 35 U.S.C. 112 (pre-AIA), second paragraph, as being indefinite for failing to particularly point out and distinctly claim the subject matter which the inventor or a joint inventor (or for applications subject to pre-AIA 35 U.S.C. 112, the applicant), regards as the invention.
Claim 5 recites the limitation “the episodic buffer” in lines 3-4. There is insufficient antecedent basis for this limitation in the claim. For examination purposes, “the episodic buffer” has been interpreted as “the episodic memory” in reference to “an episodic memory” in lines 1-2.
Claim 10 recites the limitation “the network” in line 2. There is insufficient antecedent basis for this limitation in the claim. For examination purposes, “the network” has been interpreted as “the artificial neural network” in reference to “an artificial neural network” in line 1 of claim 1.
Claim 11 recites the limitation “the activation count” in line 3. There is insufficient antecedent basis for this limitation in the claim. For examination purposes, “the activation count” has been interpreted as “an activation count”.
Claim 12 recites the limitation “the lowest class-wise activity count” in lines 2-3. There is insufficient antecedent basis for this limitation in the claim. For examination purposes, “the lowest class-wise activity count” has been interpreted as “a lowest class-wise activity count”.
Claim 12 recites the limitation “the global activity count” in line 4. There is insufficient antecedent basis for this limitation in the claim. For examination purposes, “the global activity count” has been interpreted as “a global activity count”.
Claim 13 recites the limitation “the class-wise activity count” in line 3. There is insufficient antecedent basis for this limitation in the claim. For examination purposes, “the class-wise activity count” has been interpreted as “a class-wise activity count”.
Claim Rejections - 35 USC § 101
35 U.S.C. 101 reads as follows:
Whoever invents or discovers any new and useful process, machine, manufacture, or composition of matter, or any new and useful improvement thereof, may obtain a patent therefor, subject to the conditions and requirements of this title.
Claim 15 is rejected under 35 U.S.C. 101 because the claimed invention is directed to non-statutory subject matter. The claim does not fall within at least one of the four categories of patent eligible subject matter because the claim could be construed to cover a signal per se.
Independent claim 15 recites “computer-readable medium.” The broadest reasonable interpretation of a claim that recites "computer-readable medium," in view of the present specification, covers forms of non-transitory tangible media and transitory propagating signals per se in view of the ordinary and customary meaning of computer usable medium, particularly when the specification is silent. See MPEP 2111.01. When the broadest reasonable interpretation of a claim covers a signal per se, the claim must be rejected under 35 U.S.C. § 101 as covering non-statutory subject matter. See In re Nuijten, 500 F.3d 1346, 1356-57 (Fed. Cir. 2007) (transitory embodiments are not directed to statutory subject matter) and Interim Examination Instructions for Evaluating Subject Matter Eligibility Under 35 U.S.C. § 101, Aug. 24, 2009; p. 2. 1351 Off. Gaz. Pat. Off. 212 (2010). Under broadest reasonable interpretation, "computer-readable medium" recited in claim 15 encompasses a transitory, propagating signal, which is not a process, machine, manufacture, or composition of matter. Nuijten, 500 F.3d at 1357. The claim "covers material not found in any of the four statutory categories [and thus] falls outside the plainly expressed scope of § 101." Id. at 1354. A recommended amendment is to recite “non-transitory computer-readable medium” (emphasis added).
Claim Rejections - 35 USC § 103
In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis (i.e., changing from AIA to pre-AIA) for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.
The factual inquiries for establishing a background for determining obviousness under 35 U.S.C. 103 are summarized as follows:
1. Determining the scope and contents of the prior art.
2. Ascertaining the differences between the prior art and the claims at issue.
3. Resolving the level of ordinary skill in the pertinent art.
4. Considering objective evidence present in the application indicating obviousness or nonobviousness.
This application currently names joint inventors. In considering patentability of the claims the examiner presumes that the subject matter of the various claims was commonly owned as of the effective filing date of the claimed invention(s) absent any evidence to the contrary. Applicant is advised of the obligation under 37 CFR 1.56 to point out the inventor and effective filing dates of each claim that was not commonly owned as of the effective filing date of the later invention in order for the examiner to consider the applicability of 35 U.S.C. 102(b)(2)(C) for any potential 35 U.S.C. 102(a)(2) prior art against the later invention.
Claims 1-2, 8-9, and 11-16 are rejected under 35 U.S.C. 103 as being unpatentable over Shaker et al. (US 2021/0064989 A1) in view of Piot et al. ("Dual-Memory Model for Incremental Learning: The Handwriting Recognition Use Case") and further in view of Abbasi et al. ("Sparsity and Heterogeneous Dropout for Continual Learning in the Null Space of Neural Activations").
Regarding claim 1,
Shaker et al. teaches a computer-implemented method for continual learning in an artificial neural network ([0002]: "The present invention relates to a method and system for continual learning of artificial intelligent systems" teaches a method for continual learning in artificial intelligent systems (e.g. neural network)) comprising the steps of:
maintaining an instance-based episodic memory ([0106]: "Given the current task's dataset Dt = {Dt^tr, Dt^val}, embodiments can keep two datasets M = {Mtr, Mval}, the episodic memory, where the embodiments can store previous tasks' samples" teaches maintaining an episodic memory).
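For illustration only (not part of the cited Shaker et al. disclosure), an episodic memory that stores previous tasks' samples in a fixed-capacity buffer can be sketched as follows; the class name, reservoir-sampling policy, and capacity parameter are assumptions, not details taught by the reference:

```python
import random

class EpisodicMemory:
    """Fixed-capacity buffer of previously seen (sample, label) pairs.

    Filled by reservoir sampling, so every sample seen in the stream has
    an equal probability of being retained in the buffer."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.buffer = []
        self.num_seen = 0

    def add(self, sample, label):
        self.num_seen += 1
        if len(self.buffer) < self.capacity:
            self.buffer.append((sample, label))
        else:
            # Replace a random slot with probability capacity / num_seen.
            j = random.randrange(self.num_seen)
            if j < self.capacity:
                self.buffer[j] = (sample, label)

    def sample(self, k):
        """Draw up to k stored examples for rehearsal."""
        return random.sample(self.buffer, min(k, len(self.buffer)))
```

Under this sketch, rehearsal during later tasks would draw minibatches from `sample()` and mix them with current-task data.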
Shaker et al. does not appear to explicitly teach training a working memory by using a continuous data stream containing a sequence of tasks; maintaining a long-term memory by progressively aggregating synaptic weights of the working memory as tasks are sequentially learned by the working memory; and enforcing activation sparsity along with a complementary dropout mechanism by activating similar neurons for semantically similar tasks while reducing overlap among neural activations for samples belonging to different tasks.
However, Piot et al. teaches training a working memory by using a continuous data stream containing a sequence of tasks (Section III, B, first paragraph: "Incremental Learning. This process takes place in the working memory (WM). The concept of working memory is sometimes used as a synonym for short-term memory (STM), but it refers to a system that not only temporarily holds information (about 7 for the human brain), as short-term memory does, but also processes and manipulates it. In particular, WM is responsible for consolidation of new knowledge before it is incorporated into long-term memory" teaches a working memory (WM) for performing incremental (continual) learning. Section III, B, 3): "The entire process is summarized in Algorithm 1. Let C be the total number of consolidations to be performed, K the number of enhancements between each consolidation and N the number of incremental data presented during each enhancement. RF is the random forest trained on the first task (no SBS applied) and D is the distribution of the training data (c.f. III-A2). Let slot be the stream of incremental data added at each enhancement. This stream, referred as IncrementalStream, transmits MNIST training data in packets (e.g. slots) of size N" teaches that the working memory receives a continuous stream of data for a sequence of tasks. Algorithm 1 teaches that the working memory is trained using the data stream for the sequence of tasks); and
maintaining a long-term memory by progressively aggregating synaptic weights of the working memory as tasks are sequentially learned by the working memory (Section II, A, first paragraph: "The second, called working memory (WM), combines/integrates new experiences with those already encountered and stored in the long-term memory (LTM)" teaches a long-term memory (LTM) that aggregates new experiences of the working memory. Algorithm 1 teaches that the LTM aggregates updated weights from the WM as the WM learns tasks from the data stream).
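For illustration only (not part of the cited Piot et al. disclosure), progressive aggregation of working-memory weights into a long-term memory is commonly realized as an exponential moving average; the function name, the `decay` parameter, and the EMA form itself are assumptions standing in for the reference's actual consolidation rule:

```python
def update_long_term_memory(ltm_weights, wm_weights, decay=0.99):
    """Progressively aggregate working-memory (WM) weights into the
    long-term memory (LTM), parameter by parameter:

        ltm <- decay * ltm + (1 - decay) * wm

    A decay close to 1 makes the LTM change slowly, consolidating
    knowledge across sequentially learned tasks."""
    return {name: decay * ltm_weights[name] + (1.0 - decay) * wm_weights[name]
            for name in ltm_weights}
```

Called once after each task (or each update step), this keeps the LTM a slowly moving consensus of the WM's task-specific solutions.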
Shaker et al. and Piot et al. are analogous to the claimed invention because they are directed towards continual learning in neural networks.
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to incorporate training a working memory by using a continuous data stream containing a sequence of tasks; maintaining a long-term memory by progressively aggregating synaptic weights of the working memory as tasks are sequentially learned by the working memory as taught by Piot et al. to the disclosed invention of Shaker et al.
One of ordinary skill in the art would have been motivated to make this modification because the dual-memory model "has the advantage of not requiring the storage of all the training data and is less computationally expensive" (Piot et al. Section V, first paragraph).
Shaker et al. in view of Piot et al. does not appear to explicitly teach enforcing activation sparsity along with a complementary dropout mechanism by activating similar neurons for semantically similar tasks while reducing overlap among neural activations for samples belonging to different tasks.
However, Abbasi et al. teaches enforcing activation sparsity along with a complementary dropout mechanism by activating similar neurons for semantically similar tasks while reducing overlap among neural activations for samples belonging to different tasks (Section 2, third-fourth paragraphs: "In this work, we leverage k-winner activations to induce sparsity in our network and show the benefit of this induced sparsity in the GPM framework … In our work, we introduce a task conditional Dropout that encourages non-overlapping sparse activations between tasks. We show that the addition of task-conditional Dropout to our sparse neural activation framework provides an additional boost in the performance of GPM" teaches inducing sparsity on neural activations along with a task conditional dropout (complementary dropout mechanism) that promotes non-overlapping neural activations between tasks. Section 3.2, fourth paragraph: "To address this newly emerged practical issue in continual learning, and motivated by the approaches using non-overlapping neural representation for continual learning, we propose a conditional dropout between tasks, that encourages diverse neural activations between different tasks" teaches that similar tasks activate similar neurons whereas different tasks have diverse neural activation (e.g. reduced overlap) thanks to the non-overlapping neural activations promoted by the dropout mechanism).
Shaker et al., Piot et al., and Abbasi et al. are analogous to the claimed invention because they are directed towards continual learning in neural networks.
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to incorporate enforcing activation sparsity along with a complementary dropout mechanism by activating similar neurons for semantically similar tasks while reducing overlap among neural activations for samples belonging to different tasks as taught by Abbasi et al. to the disclosed invention of Shaker et al. in view of Piot et al.
One of ordinary skill in the art would have been motivated to make this modification to "significantly increase a continual learner’s performance over a long sequence of tasks" (Abbasi et al. Abstract).
Regarding claim 2,
Shaker et al. in view of Piot et al. and further in view of Abbasi et al. teaches the computer-implemented method of claim 1.
In addition, Piot et al. further teaches further comprising the step of initializing the long-term memory by using weights and sparsity constraints of the working memory (Algorithm 1 teaches that the LTM is initialized based on weights from the WM and an SBS algorithm to remove some features (acting as a sparsity constraint)).
Shaker et al., Piot et al., and Abbasi et al. are analogous to the claimed invention because they are directed towards continual learning in neural networks.
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to incorporate further comprising the step of initializing the long-term memory by using weights and sparsity constraints of the working memory as taught by Piot et al. to the disclosed invention of Shaker et al. in view of Abbasi et al.
One of ordinary skill in the art would have been motivated to make this modification because the dual-memory model "has the advantage of not requiring the storage of all the training data and is less computationally expensive" (Piot et al. Section V, first paragraph).
Regarding claim 8,
Shaker et al. in view of Piot et al. and further in view of Abbasi et al. teaches the computer-implemented method of claim 1.
In addition, Piot et al. further teaches wherein the step of training the working memory is followed by stochastically updating the long-term memory (Algorithm 1 teaches that the LTM is updated with updated weights from the WM learning process (training the WM) as the WM learns tasks from the data stream).
Shaker et al., Piot et al., and Abbasi et al. are analogous to the claimed invention because they are directed towards continual learning in neural networks.
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to incorporate wherein the step of training the working memory is followed by stochastically updating the long-term memory as taught by Piot et al. to the disclosed invention of Shaker et al. in view of Abbasi et al.
One of ordinary skill in the art would have been motivated to make this modification because the dual-memory model "has the advantage of not requiring the storage of all the training data and is less computationally expensive" (Piot et al. Section V, first paragraph).
Regarding claim 9,
Shaker et al. in view of Piot et al. and further in view of Abbasi et al. teaches the computer-implemented method of claim 1.
In addition, Abbasi et al. further teaches further comprising the step of employing a k-winner-take-all activation function, wherein an activation score is assigned to each filter in a current layer by calculating an absolute sum of an activation map of the current layer, and by propagating the activation map of top-k filters to a next layer while setting the activation map of non-propagated filters to zero (Fig. 1; Section 3.2, first-second paragraphs: "We leverage sparsity in neural activations with the target of: 1) reducing power consumption, and 2) reducing the rank of the subspaces S_t^l for ∀t ∈ T and l ∈ {2, ..., L}. Of course, sparsity of x_{t,i}^l does not guarantee a low rank S_t^l, e.g., even one-sparse activations could lead to neural activation subspaces that are full rank. However, we show numerically that neural networks trained with sparse activations often form low-dimensional activation subspace, S_t^l. … Sparse Activations: Following the recent work of Ahmad & Scheinkman (2019) we leverage k-winner activations to induce sparse neural representations. The framework is similar to the work of Majani et al. (1988), Makhzani & Frey (2013), and Srivastava et al. (2013). In short, each layer of our network follows x_{t,i}^{l+1} = f(W^l x_{t,i}^l), where f(·) is an adaptive threshold corresponding to the k'th largest activation. Hence, only the top-k activations in each layer are allowed to propagate to the next layer, leading to ||x_{t,i}^{l+1}||_0 ≤ k. One advantage of the k-winner framework is that we have control over the sparsity of neural activations through parameter k" teaches employing k-winner activation sparsity (k-winner-take-all activation function) with an activation score x assigned for each activation (filter) in a layer, wherein only the top-k activations (filters) are propagated to the next layer, with the others being set to zero due to sparsity).
Shaker et al., Piot et al., and Abbasi et al. are analogous to the claimed invention because they are directed towards continual learning in neural networks.
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to incorporate further comprising the step of employing a k-winner-take-all activation function, wherein an activation score is assigned to each filter in a current layer by calculating an absolute sum of an activation map of the current layer, and by propagating the activation map of top-k filters to a next layer while setting the activation map of non-propagated filters to zero as taught by Abbasi et al. to the disclosed invention of Shaker et al. in view of Piot et al.
One of ordinary skill in the art would have been motivated to make this modification to "significantly increase a continual learner’s performance over a long sequence of tasks" (Abbasi et al. Abstract).
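For illustration only (not part of the cited Abbasi et al. disclosure), the k-winner-take-all mechanism recited in claim 9 — scoring each filter by the absolute sum of its activation map and zeroing the non-winning filters — can be sketched as follows; the function name, NumPy usage, and array layout are assumptions:

```python
import numpy as np

def kwta_filters(activation_maps, k):
    """k-winner-take-all across convolutional filters.

    activation_maps: array of shape (num_filters, H, W).
    Each filter's score is the absolute sum of its activation map;
    only the top-k filters' maps propagate to the next layer, and
    the remaining filters' maps are set to zero."""
    scores = np.abs(activation_maps).sum(axis=(1, 2))
    top_k = np.argsort(scores)[-k:]          # indices of the k largest scores
    out = np.zeros_like(activation_maps)
    out[top_k] = activation_maps[top_k]      # pass winners through unchanged
    return out
```

The parameter k directly controls the sparsity of the propagated representation, consistent with the quoted passage.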
Regarding claim 11,
Shaker et al. in view of Piot et al. and further in view of Abbasi et al. teaches the computer-implemented method of claim 1.
In addition, Abbasi et al. further teaches further comprising the step of utilizing two sets of activation trackers: a global activity count for tracking the activation count of each neuron throughout the training (Section 3.3, first paragraph: "While training a network on a task, we keep track of the frequency of the neural activations. In short, we assign an activation counter per neuron, which increments when a neuron's activation is in the top-k activations in its layer (i.e., the neuron is activated). Let [b_t^l]_j denote the activation counter for the j'th neuron in the l'th layer of the network, after learning task t. Note that [b_t^l]_j represents the number of times the j'th neuron in layer l was in the top k activations over all previously seen tasks τ ∈ [1, ..., t]. Then, while learning task t + 1, we would like to encourage the network to utilize the less activated neurons" teaches an activation counter per neuron during training (global activity count) for tracking the activation count of each neuron throughout training); and
a class-wise activity count for tracking the activation count of each neuron processing samples belonging to a particular class (Section 4.2, first paragraph: "Here we numerically confirm that our heterogeneous dropout leads to fewer overlaps between neural representations of different tasks. For these experiments, we use the GPM algorithm on a model with k-winner activations and learn two tasks from Permuted-MNIST sequentially, where the first task is MNIST and the second task is a permuted version. After training on Task 1, we calculate the number of times each neuron is activated for all samples in the validation set of Task 1. Then, we learn Task 2 using gradient-projection and afterwards calculate the number of times each neuron is activated for all samples in the validation set of Task 2. For task t and for the j'th neuron in layer l, we denote the neural activations on the validation set as [v_t^l]_j. Note that v_t^l is different from b_t^l introduced in Subsection 3.3, as it is calculated on the validation set (as opposed to the training set), and it is calculated per task, while b_t^l is the accumulation of activations over all tasks" teaches an activation counter per neuron for each task (class-wise activity count) for tracking the activation count of each neuron for each particular task (e.g. activating for processing samples of a particular task/class)).
Shaker et al., Piot et al., and Abbasi et al. are analogous to the claimed invention because they are directed towards continual learning in neural networks.
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to incorporate further comprising the step of utilizing two sets of activation trackers: a global activity count for tracking the activation count of each neuron throughout the training; and a class-wise activity count for tracking the activation count of each neuron processing samples belonging to a particular class as taught by Abbasi et al. to the disclosed invention of Shaker et al. in view of Piot et al.
One of ordinary skill in the art would have been motivated to make this modification to "significantly increase a continual learner’s performance over a long sequence of tasks" (Abbasi et al. Abstract).
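For illustration only (not part of the cited Abbasi et al. disclosure), the two activation trackers recited in claim 11 — a global per-neuron counter and a per-class per-neuron counter — can be sketched as follows; the class name and update interface are assumptions:

```python
import numpy as np

class ActivationTrackers:
    """Two per-neuron activity counters for one layer.

    global_count[j]    : how often neuron j was among the top-k
                         activations over all training samples.
    class_count[c, j]  : how often neuron j was among the top-k
                         activations on samples of class c."""

    def __init__(self, num_neurons, num_classes):
        self.global_count = np.zeros(num_neurons, dtype=np.int64)
        self.class_count = np.zeros((num_classes, num_neurons), dtype=np.int64)

    def update(self, active_idx, class_id):
        """active_idx: indices of the top-k (winning) neurons for one
        sample; class_id: the sample's class label."""
        self.global_count[active_idx] += 1
        self.class_count[class_id, active_idx] += 1
```

A dropout mechanism can then read either counter to bias which neurons are retained for a given task or class.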
Regarding claim 12,
Shaker et al. in view of Piot et al. and further in view of Abbasi et al. teaches the computer-implemented method of claim 1.
In addition, Abbasi et al. further teaches further comprising the step of employing heterogeneous dropout for each task wherein new classes are learned by using neurons having the lowest class-wise activity count for previously seen classes and by setting a probability of a neuron being dropped to be inversely proportional to the global activity count of the neuron (Section 3.3, first-second paragraphs: "While training a network on a task, we keep track of the frequency of the neural activations. In short, we assign an activation counter per neuron, which increments when a neuron's activation is in the top-k activations in its layer (i.e., the neuron is activated). Let [b_t^l]_j denote the activation counter for the j'th neuron in the l'th layer of the network, after learning task t. Note that [b_t^l]_j represents the number of times the j'th neuron in layer l was in the top k activations over all previously seen tasks τ ∈ [1, ..., t]. Then, while learning task t + 1, we would like to encourage the network to utilize the less activated neurons. To that end, we propose a dropout (Srivastava et al., 2014) mechanism that favors to retain neurons that are less activated in previous tasks. We define a binary Bernoulli random variable, [δ_{t+1}^l]_j, for the j'th neuron in layer l during training on task t + 1 that indicates whether the neuron is disabled by the dropout or not. In particular, we set P([δ_{t+1}^l]_j = 1) = [p_{t+1}^l]_j for: [equation reproduced as image: media_image1.png] where α > 0 is a hyper-parameter of our proposed dropout mechanism. Larger P corresponds to less dropout and larger values of α lead to a more stringent enforcement of non-overlapping representations … We call the proposed dropout a heterogeneous dropout as the probability of dropout is different for various neurons in the network. Importantly, the probability of dropout is directly correlated with the frequency of activations of a neuron for previous tasks. Hence, heterogeneous dropout will encourage the network to use non-overlapping neural activations for different tasks" teaches heterogeneous dropout for each task, wherein a new task (new classes) is learned by using neurons that are less activated in previous tasks (e.g. lower class-wise activity count), with the probability of a neuron being dropped out being inversely proportional to the global activity count b of the neuron).
Shaker et al., Piot et al., and Abbasi et al. are analogous to the claimed invention because they are directed towards continual learning in neural networks.
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to incorporate further comprising the step of employing heterogeneous dropout for each task wherein new classes are learned by using neurons having the lowest class-wise activity count for previously seen classes and by setting a probability of a neuron being dropped to be inversely proportional to the global activity count of the neuron as taught by Abbasi et al. to the disclosed invention of Shaker et al. in view of Piot et al.
One of ordinary skill in the art would have been motivated to make this modification to "significantly increase a continual learner’s performance over a long sequence of tasks" (Abbasi et al. Abstract).
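For illustration only (not part of the cited Abbasi et al. disclosure, whose exact retention-probability expression appears in the record only as an image), a heterogeneous dropout mask in which frequently activated neurons are dropped more often can be sketched as follows; the exponential form of `p_keep`, the function name, and the normalization by the maximum count are assumed stand-ins for the reference's formula:

```python
import numpy as np

def heterogeneous_dropout_mask(global_count, alpha, rng=None):
    """Boolean keep/drop mask over neurons for the next task.

    Retention probability decays with how often a neuron fired on
    earlier tasks (an illustrative form):

        p_keep[j] = exp(-alpha * global_count[j] / max(global_count))

    so heavily used neurons are dropped more often, freeing them-up
    less-used neurons for the new task. Larger alpha enforces
    non-overlapping representations more stringently."""
    if rng is None:
        rng = np.random.default_rng()
    max_count = max(int(global_count.max()), 1)
    p_keep = np.exp(-alpha * global_count / max_count)
    return rng.random(global_count.shape) < p_keep
```

Applied per layer while training task t + 1, the mask steers learning toward neurons with low accumulated activity.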
Regarding claim 13,
Shaker et al. in view of Piot et al. and further in view of Abbasi et al. teaches the computer-implemented method of claim 1.
In addition, Abbasi et al. further teaches further comprising the step of emerging class-wise activations followed by the step of employing semantic dropout wherein a probability of retention of a neuron for a class is set to be proportional to the class-wise activity count of the neuron for the class (Section 4.2, first-second paragraphs: "Here we numerically confirm that our heterogeneous dropout leads to fewer overlaps between neural representations of different tasks. For these experiments, we use the GPM algorithm on a model with k-winner activations and learn two tasks from Permuted-MNIST sequentially, where the first task is MNIST and the second task is a permuted version. After training on Task 1, we calculate the number of times each neuron is activated for all samples in the validation set of Task 1. Then, we learn Task 2 using gradient-projection and afterwards calculate the number of times each neuron is activated for all samples in the validation set of Task 2. For task t and for the j'th neuron in layer l, we denote the neural activations on the validation set as [v_t^l]_j. Note that v_t^l is different from b_t^l introduced in Subsection 3.3, as it is calculated on the validation set (as opposed to the training set), and it is calculated per task, while b_t^l is the accumulation of activations over all tasks. Let v_t^l = Σ_{j=1}^{d_l} [v_t^l]_j, where d_l is the number of neurons in the l'th layer; then we can define a probability mass function of activations for each layer as [q_t^l]_j = [v_t^l]_j / v_t^l. Finally, we measure the neural activation overlap between tasks t1 and t2 via the Jensen-Shannon divergence (i.e., the symmetric KL-divergence) between their neural activation probability mass functions, i.e.: [equation reproduced as image: media_image2.png] Figure 3 measures the overlap between neural representations (between Task 1, MNIST, and Task 2, a Permuted MNIST) when the networks are trained with and without our heterogeneous dropout and for different values of α. Higher Jensen-Shannon divergence means less overlap. In short, α = 0 means no dropout, and we confirm that higher α translates to less overlap (higher JS-divergence) between the neural representations" teaches an activation counter per neuron for each task (class-wise activity count) for tracking the activation count of each neuron for each particular task (e.g. activating for processing samples of a particular task/class), followed by heterogeneous dropout (semantic dropout) to have fewer overlaps between neural representations of different tasks, with the probability of retention for a neuron activation of the task (class) being proportional to the class-wise activity count of the neuron).
Shaker et al., Piot et al., and Abbasi et al. are analogous to the claimed invention because they are directed towards continual learning in neural networks.
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to incorporate further comprising the step of emerging class-wise activations followed by the step of employing semantic dropout wherein a probability of retention of a neuron for a class is set to be proportional to the class-wise activity count of the neuron for the class as taught by Abbasi et al. to the disclosed invention of Shaker et al. in view of Piot et al.
One of ordinary skill in the art would have been motivated to make this modification to "significantly increase a continual learner’s performance over a long sequence of tasks" (Abbasi et al. Abstract).
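For illustration only (not part of the cited Abbasi et al. disclosure), the semantic-dropout reading applied to claim 13 — retention probability proportional to a neuron's class-wise activity count — can be sketched as follows; the function name and the normalization by the class's maximum count are assumptions:

```python
import numpy as np

def semantic_dropout_mask(class_count, rng=None):
    """Boolean keep/drop mask over neurons for one class.

    class_count: per-neuron activity counts for the class in question.
    Retention probability is proportional to the count (normalized by
    the class's maximum), so neurons that fired often for this class
    are preferentially retained when processing its samples."""
    if rng is None:
        rng = np.random.default_rng()
    p_keep = class_count / max(int(class_count.max()), 1)
    return rng.random(class_count.shape) < p_keep
```

This is the mirror image of heterogeneous dropout: class-wise counts pull activations toward a class's established pattern, while global counts push different tasks apart.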
Regarding claim 14,
Shaker et al. in view of Piot et al. and further in view of Abbasi et al. teaches the computer-implemented method of claim 1.
In addition, Abbasi et al. further teaches further comprising the step of updating probabilities for semantic dropout and heterogeneous dropout at an end of each epoch and each task respectively for enforcing an emerged pattern (Section 3.3, first-second paragraphs: "While training a network on a task, we keep track of the frequency of the neural activations. In short, we assign an activation counter per neuron, which increments when a neuron's activation is in the top-k activations in its layer (i.e., the neuron is activated). Let [b_t^l]_j denote the activation counter for the j'th neuron in the l'th layer of the network, after learning task t. Note that [b_t^l]_j represents the number of times the j'th neuron in layer l was in the top k activations over all previously seen tasks τ ∈ [1, ..., t]. Then, while learning task t + 1, we would like to encourage the network to utilize the less activated neurons. To that end, we propose a dropout (Srivastava et al., 2014) mechanism that favors to retain neurons that are less activated in previous tasks. We define a binary Bernoulli random variable, [δ_{t+1}^l]_j, for the j'th neuron in layer l during training on task t + 1 that indicates whether the neuron is disabled by the dropout or not. In particular, we set P([δ_{t+1}^l]_j = 1) = [p_{t+1}^l]_j for: [equation reproduced as image: media_image1.png] where α > 0 is a hyper-parameter of our proposed dropout mechanism. Larger P corresponds to less dropout and larger values of α lead to a more stringent enforcement of non-overlapping representations … We call the proposed dropout a heterogeneous dropout as the probability of dropout is different for various neurons in the network. Importantly, the probability of dropout is directly correlated with the frequency of activations of a neuron for previous tasks. Hence, heterogeneous dropout will encourage the network to use non-overlapping neural activations for different tasks" teaches calculating a probability of heterogeneous/semantic dropout for each task for each layer (i.e. at the end of each task).