DETAILED ACTION
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Priority
Receipt is acknowledged of certified copies of papers required by 37 CFR 1.55.
Information Disclosure Statement
The information disclosure statement (IDS) submitted on 01/10/2023 is in compliance with the provisions of 37 CFR 1.97. Accordingly, the information disclosure statement is being considered by the examiner.
Claim Objections
Claim 16 is objected to because of the following informalities:
In claim 16, line 2, “the program” should read “the computer program” to properly reference “a computer program” in line 1.
Appropriate correction is required.
Claim Rejections - 35 USC § 112
The following is a quotation of 35 U.S.C. 112(b):
(b) CONCLUSION.—The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the inventor or a joint inventor regards as the invention.
The following is a quotation of 35 U.S.C. 112 (pre-AIA), second paragraph:
The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the applicant regards as his invention.
Claims 5 and 10-13 are rejected under 35 U.S.C. 112(b) or 35 U.S.C. 112 (pre-AIA), second paragraph, as being indefinite for failing to particularly point out and distinctly claim the subject matter which the inventor or a joint inventor (or for applications subject to pre-AIA 35 U.S.C. 112, the applicant), regards as the invention.
Claim 5 recites the limitation “the episodic buffer” in lines 3-4. There is insufficient antecedent basis for this limitation in the claim. For examination purposes, “the episodic buffer” has been interpreted as “the episodic memory” in reference to “an episodic memory” in lines 1-2.
Claim 10 recites the limitation “the network” in line 2. There is insufficient antecedent basis for this limitation in the claim. For examination purposes, “the network” has been interpreted as “the artificial neural network” in reference to “an artificial neural network” in line 1 of claim 1.
Claim 11 recites the limitation “the activation count” in line 3. There is insufficient antecedent basis for this limitation in the claim. For examination purposes, “the activation count” has been interpreted as “an activation count”.
Claim 12 recites the limitation “the lowest class-wise activity count” in lines 2-3. There is insufficient antecedent basis for this limitation in the claim. For examination purposes, “the lowest class-wise activity count” has been interpreted as “a lowest class-wise activity count”.
Claim 12 recites the limitation “the global activity count” in line 4. There is insufficient antecedent basis for this limitation in the claim. For examination purposes, “the global activity count” has been interpreted as “a global activity count”.
Claim 13 recites the limitation “the class-wise activity count” in line 3. There is insufficient antecedent basis for this limitation in the claim. For examination purposes, “the class-wise activity count” has been interpreted as “a class-wise activity count”.
Claim Rejections - 35 USC § 101
35 U.S.C. 101 reads as follows:
Whoever invents or discovers any new and useful process, machine, manufacture, or composition of matter, or any new and useful improvement thereof, may obtain a patent therefor, subject to the conditions and requirements of this title.
Claim 15 is rejected under 35 U.S.C. 101 because the claimed invention is directed to non-statutory subject matter. The claim does not fall within at least one of the four categories of patent eligible subject matter because the claim could be considered signal per se.
Independent claim 15 recites “computer-readable medium.” The broadest reasonable interpretation of a claim that recites "computer-readable medium," in view of the present specification, covers forms of non-transitory tangible media and transitory propagating signals per se in view of the ordinary and customary meaning of computer usable medium, particularly when the specification is silent. See MPEP 2111.01. When the broadest reasonable interpretation of a claim covers a signal per se, the claim must be rejected under 35 U.S.C. § 101 as covering non-statutory subject matter. See In re Nuijten, 500 F.3d 1346, 1356-57 (Fed. Cir. 2007) (transitory embodiments are not directed to statutory subject matter) and Interim Examination Instructions for Evaluating Subject Matter Eligibility Under 35 U.S.C. § 101, Aug. 24, 2009; p. 2. 1351 Off. Gaz. Pat. Off. 212 (2010). Under broadest reasonable interpretation, "computer-readable medium" recited in claim 15 encompasses a transitory, propagating signal, which is not a process, machine, manufacture, or composition of matter. Nuijten, 500 F.3d at 1357. The claim "covers material not found in any of the four statutory categories [and thus] falls outside the plainly expressed scope of § 101." Id. at 1354. A recommended amendment is to recite “non-transitory computer-readable medium” (emphasis added).
Claim Rejections - 35 USC § 103
In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis (i.e., changing from AIA to pre-AIA) for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.
The factual inquiries for establishing a background for determining obviousness under 35 U.S.C. 103 are summarized as follows:
1. Determining the scope and contents of the prior art.
2. Ascertaining the differences between the prior art and the claims at issue.
3. Resolving the level of ordinary skill in the pertinent art.
4. Considering objective evidence present in the application indicating obviousness or nonobviousness.
This application currently names joint inventors. In considering patentability of the claims the examiner presumes that the subject matter of the various claims was commonly owned as of the effective filing date of the claimed invention(s) absent any evidence to the contrary. Applicant is advised of the obligation under 37 CFR 1.56 to point out the inventor and effective filing dates of each claim that was not commonly owned as of the effective filing date of the later invention in order for the examiner to consider the applicability of 35 U.S.C. 102(b)(2)(C) for any potential 35 U.S.C. 102(a)(2) prior art against the later invention.
Claims 1-2, 8-9, and 11-16 are rejected under 35 U.S.C. 103 as being unpatentable over Shaker et al. (US 2021/0064989 A1) in view of Piot et al. ("Dual-Memory Model for Incremental Learning: The Handwriting Recognition Use Case") and further in view of Abbasi et al. ("Sparsity and Heterogeneous Dropout for Continual Learning in the Null Space of Neural Activations").
Regarding claim 1,
Shaker et al. teaches a computer-implemented method for continual learning in an artificial neural network ([0002]: "The present invention relates to a method and system for continual learning of artificial intelligent systems" teaches a method for continual learning in artificial intelligent systems (e.g. neural network)) comprising the steps of:
maintaining an instance-based episodic memory ([0106]: "Given the current task's dataset $D_t = \{D_t^{tr}, D_t^{val}\}$, embodiments can keep two datasets $M = \{M^{tr}, M^{val}\}$, the episodic memory, where the embodiments can store previous tasks' samples" teaches maintaining an episodic memory).
Shaker et al. does not appear to explicitly teach training a working memory by using a continuous data stream containing a sequence of tasks; maintaining a long-term memory by progressively aggregating synaptic weights of the working memory as tasks are sequentially learned by the working memory; and enforcing activation sparsity along with a complementary dropout mechanism by activating similar neurons for semantically similar tasks while reducing overlap among neural activations for samples belonging to different tasks.
However, Piot et al. teaches training a working memory by using a continuous data stream containing a sequence of tasks (Section III, B, first paragraph: "Incremental Learning. This process takes place in the working memory (WM). The concept of working memory is sometimes used as a synonym for short-term memory (STM), but it refers to a system that not only temporarily holds information (about 7 for the human brain), as short-term memory does, but also processes and manipulates it. In particular, WM is responsible for consolidation of new knowledge before it is incorporated into long-term memory" teaches a working memory (WM) for performing incremental (continual) learning. Section III, B, 3): "The entire process is summarized in Algorithm 1. Let C be the total number of consolidations to be performed, K the number of enhancements between each consolidation and N the number of incremental data presented during each enhancement. RF is the random forest trained on the first task (no SBS applied) and D is the distribution of the training data (c.f. III-A2). Let slot be the stream of incremental data added at each enhancement. This stream, referred as IncrementalStream, transmits MNIST training data in packets (e.g. slots) of size N" teaches that the working memory receives a continuous stream of data for a sequence of tasks. Algorithm 1; teaches that the working memory is trained using the data stream for the sequence of tasks); and
maintaining a long-term memory by progressively aggregating synaptic weights of the working memory as tasks are sequentially learned by the working memory (Section II, A, first paragraph: "The second, called working memory (WM), combines/integrates new experiences with those already encountered and stored in the long-term memory (LTM)" teaches a long-term memory (LTM) that aggregates new experiences of the working memory. Algorithm 1; teaches that the LTM aggregates updated weights from the WM as the WM learns tasks from the data stream).
Shaker et al. and Piot et al. are analogous to the claimed invention because they are directed towards continual learning in neural networks.
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to incorporate training a working memory by using a continuous data stream containing a sequence of tasks; maintaining a long-term memory by progressively aggregating synaptic weights of the working memory as tasks are sequentially learned by the working memory as taught by Piot et al. to the disclosed invention of Shaker et al.
One of ordinary skill in the art would have been motivated to make this modification because the dual-memory model "has the advantage of not requiring the storage of all the training data and is less computationally expensive" (Piot et al. Section V, first paragraph).
Shaker et al. in view of Piot et al. does not appear to explicitly teach enforcing activation sparsity along with a complementary dropout mechanism by activating similar neurons for semantically similar tasks while reducing overlap among neural activations for samples belonging to different tasks.
However, Abbasi et al. teaches enforcing activation sparsity along with a complementary dropout mechanism by activating similar neurons for semantically similar tasks while reducing overlap among neural activations for samples belonging to different tasks (Section 2, third-fourth paragraphs: "In this work, we leverage k-winner activations to induce sparsity in our network and show the benefit of this induced sparsity in the GPM framework … In our work, we introduce a task conditional Dropout that encourages non-overlapping sparse activations between tasks. We show that the addition of task-conditional Dropout to our sparse neural activation framework provides an additional boost in the performance of GPM" teaches inducing sparsity on neural activations along with a task conditional dropout (complementary dropout mechanism) that promotes non-overlapping neural activations between tasks. Section 3.2, fourth paragraph: "To address this newly emerged practical issue in continual learning, and motivated by the approaches using non-overlapping neural representation for continual learning, we propose a conditional dropout between tasks, that encourages diverse neural activations between different tasks" teaches that similar tasks activate similar neurons whereas different tasks have diverse neural activation (e.g. reduced overlap) thanks to the non-overlapping neural activations promoted by the dropout mechanism).
Shaker et al., Piot et al., and Abbasi et al. are analogous to the claimed invention because they are directed towards continual learning in neural networks.
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to incorporate enforcing activation sparsity along with a complementary dropout mechanism by activating similar neurons for semantically similar tasks while reducing overlap among neural activations for samples belonging to different tasks as taught by Abbasi et al. to the disclosed invention of Shaker et al. in view of Piot et al.
One of ordinary skill in the art would have been motivated to make this modification to "significantly increase a continual learner’s performance over a long sequence of tasks" (Abbasi et al. Abstract).
Regarding claim 2,
Shaker et al. in view of Piot et al. and further in view of Abbasi et al. teaches the computer-implemented method of claim 1.
In addition, Piot et al. further teaches further comprising the step of initializing the long-term memory by using weights and sparsity constraints of the working memory (Algorithm 1; teaches that the LTM is initialized based on weights from the WM and a SBS algorithm to remove some features (acts as a sparsity constraint)).
Shaker et al., Piot et al., and Abbasi et al. are analogous to the claimed invention because they are directed towards continual learning in neural networks.
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to incorporate further comprising the step of initializing the long-term memory by using weights and sparsity constraints of the working memory as taught by Piot et al. to the disclosed invention of Shaker et al. in view of Abbasi et al.
One of ordinary skill in the art would have been motivated to make this modification because the dual-memory model "has the advantage of not requiring the storage of all the training data and is less computationally expensive" (Piot et al. Section V, first paragraph).
Regarding claim 8,
Shaker et al. in view of Piot et al. and further in view of Abbasi et al. teaches the computer-implemented method of claim 1.
In addition, Piot et al. further teaches wherein the step of training the working memory is followed by stochastically updating the long-term memory (Algorithm 1; teaches that the LTM is updated with updated weights from the WM learning process (training the WM) as the WM learns tasks from the data stream).
Shaker et al., Piot et al., and Abbasi et al. are analogous to the claimed invention because they are directed towards continual learning in neural networks.
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to incorporate wherein the step of training the working memory is followed by stochastically updating the long-term memory as taught by Piot et al. to the disclosed invention of Shaker et al. in view of Abbasi et al.
One of ordinary skill in the art would have been motivated to make this modification because the dual-memory model "has the advantage of not requiring the storage of all the training data and is less computationally expensive" (Piot et al. Section V, first paragraph).
Regarding claim 9,
Shaker et al. in view of Piot et al. and further in view of Abbasi et al. teaches the computer-implemented method of claim 1.
In addition, Abbasi et al. further teaches further comprising the step of employing a k-winner-take-all activation function, wherein an activation score is assigned to each filter in a current layer by calculating an absolute sum of an activation map of the current layer, and by propagating the activation map of top-k filters to a next layer while setting the activation map of non-propagated filters to zero (Fig. 1; Section 3.2, first-second paragraphs: "We leverage sparsity in neural activations with the target of: 1) reducing power consumption, and 2) reducing the rank of the subspaces $S_t^l$ for ∀t ∈ T and l ∈ {2, ..., L}. Of course, sparsity of $x_{t,i}^l$ does not guarantee a low rank $S_t^l$, e.g., even one-sparse activations could lead to neural activation subspaces that are full rank. However, we show numerically that neural networks trained with sparse activations often form low-dimensional activation subspace, $S_t^l$. … Sparse Activations: Following the recent work of Ahmad & Scheinkman (2019) we leverage k-winner activations to induce sparse neural representations. The framework is similar to the work of Majani et al. (1988), Makhzani & Frey (2013), and Srivastava et al. In short, each layer of our network follows, $x_{t,i}^{l+1} = f(W^l x_{t,i}^l)$ where f(·) is an adaptive threshold corresponding to the k’th largest activation. Hence, only the top-k activations in each layer are allowed to propagate to the next layer leading to $\|x_{t,i}^{l+1}\|_0 \le k$. One advantage of the k-winner framework is that we have control over the sparsity of neural activations through parameter k" teaches employing k-winner activation sparsity (k-winner-take-all activation function) with an activation score x assigned for each activation (filter) in a layer, wherein only the top-k activations (filters) are propagated to the next layer, with the others being set to zero due to sparsity).
Shaker et al., Piot et al., and Abbasi et al. are analogous to the claimed invention because they are directed towards continual learning in neural networks.
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to incorporate further comprising the step of employing a k-winner-take-all activation function, wherein an activation score is assigned to each filter in a current layer by calculating an absolute sum of an activation map of the current layer, and by propagating the activation map of top-k filters to a next layer while setting the activation map of non-propagated filters to zero as taught by Abbasi et al. to the disclosed invention of Shaker et al. in view of Piot et al.
One of ordinary skill in the art would have been motivated to make this modification to "significantly increase a continual learner’s performance over a long sequence of tasks" (Abbasi et al. Abstract).
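For illustration only, the filter-level k-winner-take-all operation recited in claim 9 can be sketched as below. This is a minimal sketch and is not code from Abbasi et al. or from the application: the function name `k_wta_filters`, the nested-list layout of the activation maps, and the tie-breaking rule are all assumptions made for the example.

```python
def k_wta_filters(activation_maps, k):
    """Keep the k filters with the largest absolute-sum score; zero the rest.

    activation_maps: list of 2-D maps (lists of rows), one per filter.
    The layout and the tie-breaking rule are assumptions of this sketch.
    """
    # Activation score per filter: absolute sum over its activation map.
    scores = [sum(abs(v) for row in fmap for v in row) for fmap in activation_maps]
    # Rank filters by descending score; ties go to the lower filter index.
    ranked = sorted(range(len(scores)), key=lambda i: (-scores[i], i))
    keep = set(ranked[:max(k, 0)])
    # Propagate the top-k maps unchanged and replace the others with zeros.
    return [
        [list(row) for row in fmap] if i in keep
        else [[0.0] * len(row) for row in fmap]
        for i, fmap in enumerate(activation_maps)
    ]


maps = [
    [[1.0, -2.0], [0.5, 0.0]],   # absolute sum 3.5
    [[0.1, 0.1], [0.1, 0.1]],    # absolute sum about 0.4
    [[-3.0, 0.0], [0.0, 1.0]],   # absolute sum 4.0
]
sparse_maps = k_wta_filters(maps, k=2)
print(sparse_maps[1])  # [[0.0, 0.0], [0.0, 0.0]]
```

With k = 2, the first and third maps are passed through and the second, which has the lowest score, is zeroed.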
Regarding claim 11,
Shaker et al. in view of Piot et al. and further in view of Abbasi et al. teaches the computer-implemented method of claim 1.
In addition, Abbasi et al. further teaches further comprising the step of utilizing two sets of activation trackers: a global activity count for tracking the activation count of each neuron throughout the training (Section 3.3, first paragraph: "While training a network on a task, we keep track of the frequency of the neural activations. In short, we assign an activation counter per neuron, which increments when a neuron’s activation is in the top-k activations in its layer (i.e., the neuron is activated). Let $[b_t^l]_j$ denote the activation counter for the j’th neuron in the l’th layer of the network, after learning task t. Note that $[b_t^l]_j$ represents the number of times the j’th neuron in layer l was in the top k activations over all previously seen tasks τ ∈ [1, ..., t]. Then, while learning task t + 1, we would like to encourage the network to utilize the less activated neurons" teaches an activation counter per neuron during training (global activity count) for tracking the activation count of each neuron throughout training); and
a class-wise activity count for tracking the activation count of each neuron processing samples belonging to a particular class (Section 4.2, first paragraph: "Here we numerically confirm that our heterogeneous dropout leads to fewer overlaps between neural representations of different tasks. For these experiments, we use the GPM algorithm on a model with k-winner activations and learn two tasks from Permuted-MNIST sequentially, where the first task is MNIST and the second task is a permuted version. After training on Task 1, we calculate the number of times each neuron is activated for all samples in the validation set of Task 1. Then, we learn Task 2 using gradient-projection and afterwards calculate the number of times each neuron is activated for all samples in the validation set of Task 2. For task t and for the j’th neuron in layer l, we denote the neural activations on the validation set as $[v_t^l]_j$. Note that $v_t^l$ is different from $b_t^l$ introduced in Subsection 3.3, as it is calculated on the validation set (as opposed to the training set), and it is calculated per task, while $b_t^l$ is the accumulation of activations over all tasks" teaches an activation counter per neuron for each task (class-wise activity count) for tracking the activation count of each neuron for each particular task (e.g. activating for processing samples of a particular task/class)).
Shaker et al., Piot et al., and Abbasi et al. are analogous to the claimed invention because they are directed towards continual learning in neural networks.
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to incorporate further comprising the step of utilizing two sets of activation trackers: a global activity count for tracking the activation count of each neuron throughout the training; and a class-wise activity count for tracking the activation count of each neuron processing samples belonging to a particular class as taught by Abbasi et al. to the disclosed invention of Shaker et al. in view of Piot et al.
One of ordinary skill in the art would have been motivated to make this modification to "significantly increase a continual learner’s performance over a long sequence of tasks" (Abbasi et al. Abstract).
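The two activation trackers discussed for claim 11 can be pictured with the following minimal sketch, offered for illustration only. The class name `ActivationTracker`, its method names, and the use of a label key for the per-class (or per-task) counter are hypothetical; neither the application nor Abbasi et al. is represented as using this structure.

```python
from collections import defaultdict


class ActivationTracker:
    """Counts how often each neuron lands in the top-k activations."""

    def __init__(self, num_neurons):
        # Global activity count: accumulated over all samples seen so far.
        self.global_count = [0] * num_neurons
        # Class-wise activity count: one counter vector per label.
        self.class_count = defaultdict(lambda: [0] * num_neurons)

    def update(self, activations, k, label):
        # Neurons ranked by descending activation; ties go to the lower index.
        ranked = sorted(range(len(activations)), key=lambda i: (-activations[i], i))
        for i in ranked[:k]:
            self.global_count[i] += 1
            self.class_count[label][i] += 1


tracker = ActivationTracker(num_neurons=4)
tracker.update([0.9, 0.1, 0.5, 0.0], k=2, label="cat")  # neurons 0 and 2 win
tracker.update([0.0, 0.8, 0.7, 0.1], k=2, label="dog")  # neurons 1 and 2 win
print(tracker.global_count)        # [1, 1, 2, 0]
print(tracker.class_count["cat"])  # [1, 0, 1, 0]
```

The global counter plays the role of the accumulated count over all tasks, while each per-label counter records activity for one class or task only.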
Regarding claim 12,
Shaker et al. in view of Piot et al. and further in view of Abbasi et al. teaches the computer-implemented method of claim 1.
In addition, Abbasi et al. further teaches further comprising the step of employing heterogeneous dropout for each task wherein new classes are learned by using neurons having the lowest class-wise activity count for previously seen classes and by setting a probability of a neuron being dropped to be inversely proportional to the global activity count of the neuron (Section 3.3, first-second paragraphs: "While training a network on a task, we keep track of the frequency of the neural activations. In short, we assign an activation counter per neuron, which increments when a neuron’s activation is in the top-k activations in its layer (i.e., the neuron is activated). Let $[b_t^l]_j$ denote the activation counter for the j’th neuron in the l’th layer of the network, after learning task t. Note that $[b_t^l]_j$ represents the number of times the j’th neuron in layer l was in the top k activations over all previously seen tasks τ ∈ [1, ..., t]. Then, while learning task t + 1, we would like to encourage the network to utilize the less activated neurons. To that end, we propose a dropout (Srivastava et al., 2014) mechanism that favors to retain neurons that are less activated in previous tasks. We define a binary Bernoulli random variable, $[\delta_{t+1}^l]_j$, for the j’th neuron in layer l during training on task t + 1 that indicates whether the neuron is disabled by the dropout or not. In particular, we set $P([\delta_{t+1}^l]_j = 1) = [p_{t+1}^l]_j$ for: [equation image: media_image1.png] where α > 0 is a hyper-parameter of our proposed dropout mechanism. Larger P corresponds to less dropout and larger values of α lead to a more stringent enforcement of non-overlapping representations … We call the proposed dropout a heterogeneous dropout as the probability of dropout is different for various neurons in the network. Importantly, the probability of dropout is directly correlated with the frequency of activations of a neuron for previous tasks. Hence, heterogeneous dropout will encourage the network to use non-overlapping neural activations for different tasks" teaches heterogeneous dropout for each task, wherein a new task (new classes) is learned by using neurons that are less activated in previous tasks (e.g. lower class-wise activity count), with the probability of a neuron being dropped out being inversely proportional to the global activity count b of the neuron).
Shaker et al., Piot et al., and Abbasi et al. are analogous to the claimed invention because they are directed towards continual learning in neural networks.
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to incorporate further comprising the step of employing heterogeneous dropout for each task wherein new classes are learned by using neurons having the lowest class-wise activity count for previously seen classes and by setting a probability of a neuron being dropped to be inversely proportional to the global activity count of the neuron as taught by Abbasi et al. to the disclosed invention of Shaker et al. in view of Piot et al.
One of ordinary skill in the art would have been motivated to make this modification to "significantly increase a continual learner’s performance over a long sequence of tasks" (Abbasi et al. Abstract).
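As an illustration of the heterogeneous dropout discussed for claim 12, the sketch below assigns each neuron a retention probability that decreases as its activation counter grows, following the quoted statement that the mechanism favors retaining neurons that were less activated in previous tasks. The exponential form `exp(-alpha * count / max_count)` is an assumption chosen for this example, because the governing equation is reproduced in this action only as an image; the function names are likewise hypothetical.

```python
import math
import random


def retention_probs(counts, alpha):
    """Per-neuron probability of being retained (not dropped).

    ASSUMPTION: exponential decay in the normalized activation count.
    This form is illustrative and is not taken from the cited equation.
    """
    peak = max(counts)
    if peak == 0:
        # No neuron has been activated yet, so nothing is penalized.
        return [1.0] * len(counts)
    return [math.exp(-alpha * c / peak) for c in counts]


def sample_mask(probs, rng):
    """Draw one Bernoulli keep/drop decision per neuron (1 = retained)."""
    return [1 if rng.random() < p else 0 for p in probs]


counts = [0, 5, 10]                    # activation counters from earlier tasks
probs = retention_probs(counts, alpha=1.0)
mask = sample_mask(probs, random.Random(0))
print(probs[0])                        # 1.0: a never-activated neuron is always retained
```

Under this assumed form, setting `alpha` to 0 yields a retention probability of 1 for every neuron (no dropout), and larger values of `alpha` penalize frequently activated neurons more strongly.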
Regarding claim 13,
Shaker et al. in view of Piot et al. and further in view of Abbasi et al. teaches the computer-implemented method of claim 1.
In addition, Abbasi et al. further teaches further comprising the step of emerging class-wise activations followed by the step of employing semantic dropout wherein a probability of retention of a neuron for a class is set to be proportional to the class-wise activity count of the neuron for the class (Section 4.2, first-second paragraphs: "Here we numerically confirm that our heterogeneous dropout leads to fewer overlaps between neural representations of different tasks. For these experiments, we use the GPM algorithm on a model with k-winner activations and learn two tasks from Permuted-MNIST sequentially, where the first task is MNIST and the second task is a permuted version. After training on Task 1, we calculate the number of times each neuron is activated for all samples in the validation set of Task 1. Then, we learn Task 2 using gradient-projection and afterwards calculate the number of times each neuron is activated for all samples in the validation set of Task 2. For task t and for the j’th neuron in layer l, we denote the neural activations on the validation set as $[v_t^l]_j$. Note that $v_t^l$ is different from $b_t^l$ introduced in Subsection 3.3, as it is calculated on the validation set (as opposed to the training set), and it is calculated per task, while $b_t^l$ is the accumulation of activations over all tasks. Let $v_t^l = \sum_{j=1}^{d_l} [v_t^l]_j$ where $d_l$ is the number of neurons in the l’th layer, then we can define a probability mass function of activations for each layer as $[q_t^l]_j = [v_t^l]_j / v_t^l$. Finally, we measure the neural activation overlap between tasks t1 and t2, via the Jensen-Shannon divergence (i.e., the symmetric KL-divergence) between their neural activation probability mass functions, i.e.,: [equation image: media_image2.png] Figure 3 measures the overlap between neural representations (between Task 1, MNIST, and Task 2, a Permuted MNIST) when the networks are trained with and without our heterogeneous dropout and for different values of α. Higher Jensen-Shannon divergence means less overlap. In short, α = 0 means no dropout, and we confirm that higher α translates to less overlap (higher JS-divergence) between the neural representations" teaches an activation counter per neuron for each task (class-wise activity count) for tracking the activation count of each neuron for each particular task (e.g. activating for processing samples of a particular task/class) followed by heterogeneous dropout (semantic dropout) to have fewer overlaps between neural representations of different tasks, with the probability of retention for a neuron activation of the task (class) being proportional to the class-wise activity count of the neuron).
Shaker et al., Piot et al., and Abbasi et al. are analogous to the claimed invention because they are directed towards continual learning in neural networks.
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to incorporate further comprising the step of emerging class-wise activations followed by the step of employing semantic dropout wherein a probability of retention of a neuron for a class is set to be proportional to the class-wise activity count of the neuron for the class as taught by Abbasi et al. to the disclosed invention of Shaker et al. in view of Piot et al.
One of ordinary skill in the art would have been motivated to make this modification to "significantly increase a continual learner’s performance over a long sequence of tasks" (Abbasi et al. Abstract).
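The overlap measurement described in the passage quoted for claim 13 (normalizing per-task activation counts into a probability mass function and comparing two tasks with a Jensen-Shannon divergence) can be sketched as follows. This is an illustrative sketch only: it uses the standard textbook definition of the Jensen-Shannon divergence with natural logarithms, which is assumed rather than copied from the equation image, and all function names are hypothetical.

```python
import math


def to_pmf(counts):
    """Normalize per-neuron activation counts into a probability mass function."""
    total = sum(counts)
    if total == 0:
        raise ValueError("at least one neuron must have a nonzero count")
    return [c / total for c in counts]


def kl(p, q):
    """KL divergence KL(p || q); terms with p_i = 0 contribute nothing."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_divergence(p, q):
    """Standard Jensen-Shannon divergence (an assumed definition here)."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


task1 = to_pmf([8, 2, 0, 0])   # activation counts on the Task 1 validation set
task2 = to_pmf([0, 0, 3, 7])   # activation counts on the Task 2 validation set
print(js_divergence(task1, task1))  # 0.0: identical activation patterns
print(js_divergence(task1, task2))  # log(2): disjoint patterns, i.e. no overlap
```

A higher value corresponds to less overlap between the two tasks' activation patterns, consistent with the quoted statement that higher Jensen-Shannon divergence means less overlap.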
Regarding claim 14,
Shaker et al. in view of Piot et al. and further in view of Abbasi et al. teaches the computer-implemented method of claim 1.
In addition, Abbasi et al. further teaches further comprising the step of updating probabilities for semantic dropout and heterogeneous dropout at an end of each epoch and each task respectively for enforcing an emerged pattern (Section 3.3, first-second paragraphs: "While training a network on a task, we keep track of the frequency of the neural activations. In short, we assign an activation counter per neuron, which increments when a neuron’s activation is in the top-k activations in its layer (i.e., the neuron is activated). Let
[
b
t
l
]
j denote the activation counter for the j’th neuron in the l’th layer of the network, after learning task t. Note that
[
b
t
l
]
j represents the number of times the j’th neuron in layer l was in the top k activations over all previously seen tasks τ ∈ [1, ..., t]. Then, while learning task t + 1, we would like to encourage the network to utilize the less activated neurons. To that end, we propose a dropout (Srivastava et al., 2014) mechanism that favors to retain neurons that are less activated in previous tasks. We define a binary Bernoulli random variable, [
δ
t
+
1
l
]j, for the j’th neuron in layer l during training on task t + 1 that indicates whether the neuron is disabled by the dropout or not. In particular, we set P([
δ
t
+
1
l
]j = 1) = [
p
t
+
1
l
]j for:
PNG
media_image1.png
48
190
media_image1.png
Greyscale
where α > 0 is a hyper-parameter of our proposed dropout mechanism. Larger P corresponds to less dropout and larger values of α lead to a more stringent enforcement of non-overlapping representations … We call the proposed dropout a heterogeneous dropout as the probability of dropout is different for various neurons in the network. Importantly, the probability of dropout is directly correlated with the frequency of activations of a neuron for previous tasks. Hence, heterogeneous dropout will encourage the network to use non-overlapping neural activations for different tasks" teaches calculating a probability of heterogeneous/semantic dropout for each task for each layer (i.e. at end of each epoch and each task). Section 5, first paragraph: "we proposed a heterogeneous dropout mechanism, which encourages non-overlapping patterns of neural activations between tasks" teaches that the heterogeneous/semantic dropout enforces non-overlapping patterns of neural activations between tasks).
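For illustration only, the heterogeneous dropout mechanism described in the quoted passage can be sketched as follows. The equation itself survives in this record only as an image, so the functional form used here, exp(-alpha * b_j / max_i b_i), is an assumption chosen to match the quoted description (retention probability falls as the activation count rises, and a larger α enforces this more stringently). The names retention_probabilities and sample_keep_mask are hypothetical and do not come from Abbasi et al.

```python
import math
import random


def retention_probabilities(counts, alpha):
    """Map per-neuron activation counts for one layer to retention probabilities.

    Assumed form: p_j = exp(-alpha * b_j / max_i b_i). This is an illustrative
    assumption, not a formula reproduced from the cited reference.
    """
    peak = max(counts)
    if peak == 0:
        # No neuron has been counted as active yet, so nothing is dropped.
        return [1.0 for _ in counts]
    return [math.exp(-alpha * count / peak) for count in counts]


def sample_keep_mask(probs, rng):
    """Draw one Bernoulli keep/drop decision per neuron (1 = kept)."""
    return [1 if rng.random() < p else 0 for p in probs]


# Activation counters for three neurons in one layer after earlier tasks.
counts = [0, 5, 10]
probs = retention_probabilities(counts, alpha=1.0)
mask = sample_keep_mask(probs, random.Random(0))
```

Under this assumed form, a neuron that was never among the top-k activations is always retained, while the most frequently activated neuron is retained with probability exp(-α).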
Shaker et al., Piot et al., and Abbasi et al. are analogous to the claimed invention because they are directed towards continual learning in neural networks.
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to incorporate the step of updating probabilities for semantic dropout and heterogeneous dropout at an end of each epoch and each task respectively for enforcing an emerged pattern, as taught by Abbasi et al., into the disclosed invention of Shaker et al. in view of Piot et al.
One of ordinary skill in the art would have been motivated to make this modification to "significantly increase a continual learner’s performance over a long sequence of tasks" (Abbasi et al. Abstract).
Regarding claim 15,
Shaker et al. in view of Piot et al. and further in view of Abbasi et al. teaches the computer-implemented method of claim 1.
In addition, Shaker et al. further teaches a computer-readable medium provided with a computer program wherein when the computer program is loaded and executed by a computer, the computer program causes the computer to carry out the steps of the computer-implemented method according to claim 1 ([0002]: "The present invention relates to a method and system for continual learning of artificial intelligent systems" teaches a method for continual learning in artificial intelligent systems (e.g. neural network). [0037]: "In an embodiment, a tangible, non-transitory computer-readable medium includes instructions thereon which, upon being executed by one or more processors, alone or in combination, provide for execution of the method" teaches a computer-readable medium with instructions (computer program) for execution by a processor (computer) to implement the method for continual learning).
Regarding claim 16,
Shaker et al. in view of Piot et al. and further in view of Abbasi et al. teaches the computer-implemented method of claim 1.
In addition, Shaker et al. further teaches an autonomous vehicle comprising a data processing system loaded with a computer program, wherein the program is arranged for causing the data processing system to carry out the steps of the computer-implemented method according to claim 1 for enabling the autonomous vehicle to continually adapt and acquire knowledge from an environment surrounding the autonomous vehicle ([0002]: "The present invention relates to a method and system for continual learning of artificial intelligent systems" teaches a method for continual learning in artificial intelligent systems (e.g. neural network). [0004]-[0005]: "Eventually, a trained system might be exposed to a new and unfamiliar environment (also called a “new task”) in which the desired distribution of input and output data is different than the distribution of input and output data encountered during training. Examples of new tasks are when a trained robot operates in a new environment or when new categories (e.g., classifications) are added to trained image recognition system ... Within machine learning, the field of continual learning endeavors to finds an architecture and a learning or training algorithm that allows the learning of new tasks while not forgetting the past tasks, without the necessity to store the previous experience, re-train the full network or store multiple networks per task" teaches that the continual learning method can be used for enabling a robot (autonomous vehicle) to adapt and acquire knowledge from a surrounding environment. [0061]: "An embodiment is directed to where public transport vehicles can be dispatched based on demand of transport. Each vehicle has a pre-defined route and is dispatched in a specific time interval if the predicted demand justifies its deployment and if maximum delay is met. 
Each vehicle can be autonomous and configured to automatically route based on the prediction model" teaches the embodied method can be used with an autonomous vehicle).
Claims 3 and 6-7 are rejected under 35 U.S.C. 103 as being unpatentable over Shaker et al. (US 2021/0064989 A1) in view of Piot et al. ("Dual-Memory Model for Incremental Learning: The Handwriting Recognition Use Case") in view of Abbasi et al. ("Sparsity and Heterogeneous Dropout for Continual Learning in the Null Space of Neural Activations") and further in view of Aljundi ("Continual Learning in Neural Networks").
Regarding claim 3,
Shaker et al. in view of Piot et al. and further in view of Abbasi et al. teaches the computer-implemented method of claim 1.
Shaker et al. in view of Piot et al. and further in view of Abbasi et al. does not appear to explicitly teach wherein the step of maintaining a long-term memory by aggregating the synaptic weights of the working memory comprises the step of calculating an exponentially moving average of the synaptic weights of the working memory in a stochastic manner.
However, Aljundi teaches wherein the step of maintaining a long-term memory by aggregating the synaptic weights of the working memory comprises the step of calculating an exponentially moving average of the synaptic weights of the working memory in a stochastic manner (Page 111, second paragraph: "Accumulating importance weights: As we frequently update the importance weights, simply adding the new estimated importance values to the previous ones would lead to very high values and exploding gradients. Instead, we maintain a cumulative moving average of the estimated importance weights. Note, one could deploy a decaying factor that allows replacing old knowledge in the long term" teaches accumulating weights (e.g. in long-term memory) based on a moving average of updated weights (e.g. from working memory). Page 117, last paragraph: "The second factor is the mechanism for accumulating importance weights across updates. In our system we use a cumulative moving average, which gives all the estimated importance weights the same weight. An alternative is to deploy a decaying average. This reduces the impact of old importance weights in favor of the newest ones" teaches aggregating the weights of the updated model (e.g. from the working memory) by calculating a moving average of the weights).
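For illustration only, the two accumulation schemes contrasted in the quoted passages (a cumulative moving average that weights every estimate equally, and a decaying average that favors the newest estimates) can be sketched as follows. This is a minimal sketch and not code from Aljundi; the function names and the decay value of 0.5 are hypothetical, and the sketch does not model the "stochastic manner" recited in the claim.

```python
def cumulative_average(prev_avg, new_value, n_seen):
    """Update an average so that all n_seen + 1 values carry equal weight."""
    return prev_avg + (new_value - prev_avg) / (n_seen + 1)


def decaying_average(prev_avg, new_value, decay):
    """Exponential moving average: older values fade by `decay` each step."""
    return decay * prev_avg + (1.0 - decay) * new_value


# Three stable estimates followed by one large new estimate.
values = [1.0, 1.0, 1.0, 5.0]
cma = values[0]
ema = values[0]
for n_seen, value in enumerate(values[1:], start=1):
    cma = cumulative_average(cma, value, n_seen)
    ema = decaying_average(ema, value, decay=0.5)
```

With these inputs the cumulative average moves to 2.0 while the decaying average moves to 3.0, which shows how the decaying form reduces the impact of older values in favor of the newest one.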
Shaker et al., Piot et al., Abbasi et al., and Aljundi are analogous to the claimed invention because they are directed towards continual learning in neural networks.
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to incorporate the limitation wherein the step of maintaining a long-term memory by aggregating the synaptic weights of the working memory comprises the step of calculating an exponentially moving average of the synaptic weights of the working memory in a stochastic manner, as taught by Aljundi, into the disclosed invention of Shaker et al. in view of Piot et al. and further in view of Abbasi et al.
One of ordinary skill in the art would have been motivated to make this modification to "significantly reduce the percentage of parameters dedicated to each task and as a consequence remarkably improve the continual learning performance" (Aljundi, Abstract).
Regarding claim 6,
Shaker et al. in view of Piot et al. and further in view of Abbasi et al. teaches the computer-implemented method of claim 1.
Shaker et al. in view of Piot et al. and further in view of Abbasi et al. does not appear to explicitly teach the step of interleaving samples from a current task with random samples from the episodic memory.
However, Aljundi teaches further comprising the step of interleaving samples from a current task with random samples from the episodic memory (Section 9.1, third paragraph: "The replay-based approach stores the information in the example space either directly in a replay buffer or in a generative model. When learning new data, old examples are reproduced from the replay buffer or generative model, which is used for rehearsal/retraining or used as constraints for the current learning" teaches a replay buffer (episodic memory) storing past samples (random samples from episodic memory) that are used (interleaved) with new data (samples from current task) for training/learning).
Shaker et al., Piot et al., Abbasi et al., and Aljundi are analogous to the claimed invention because they are directed towards continual learning in neural networks.
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to incorporate the step of interleaving samples from a current task with random samples from the episodic memory, as taught by Aljundi, into the disclosed invention of Shaker et al. in view of Piot et al. and further in view of Abbasi et al.
One of ordinary skill in the art would have been motivated to make this modification to "significantly reduce the percentage of parameters dedicated to each task and as a consequence remarkably improve the continual learning performance" (Aljundi, Abstract).
Regarding claim 7,
Shaker et al. in view of Piot et al. in view of Abbasi et al. and further in view of Aljundi teaches the computer-implemented method of claim 6.
In addition, Aljundi further teaches further comprising the step of training the working memory by combining a cross-entropy loss on the interleaved samples with a knowledge retrieval loss on the random samples from the episodic memory (Equation 8.2; Page 111, lines 17-21: "After updating the importance weights, the model continues the learning process while penalizing changes to parameters that have been identified as important so far. As such our final objective function is:
[Equation image: media_image3.png]
where θ∗ are the parameters values at the last importance weight update step" teaches that learning/training of the working model (working memory) is based on an objective function that combines a cross-entropy loss of the data samples (l(f(xt; θ),yt)) with a knowledge retrieval loss on the samples from the replay buffer/episodic memory (l(f(XB; θ),YB)). Section 8.4.3, second paragraph: "SGD optimizer with cross-entropy loss is used" teaches that the loss function is cross-entropy loss (i.e. the (l(f(xt; θ),yt)) loss term of the objective function is a cross-entropy loss of data samples). Equation 8.1; lines 11-14: "We formulate the learning objective of an online system as follows. Given an input model with parameters θ, the system at each time step reduces the empirical risk based on the recently received samples and a small buffer B = (XB, YB) composed of updated hard samples" teaches that the loss (l(f(XB; θ),YB)) is based on samples from a small buffer of updated hard samples (random samples from episodic memory) (i.e. the (l(f(XB; θ),YB)) loss term is retrieval loss on samples from episodic memory)).
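For illustration only, an objective of the kind described above (a cross-entropy term on the recently received samples, a second loss term on samples drawn from the buffer, and a penalty on changes to parameters identified as important) can be sketched as follows. The equation itself survives in this record only as an image, so the quadratic form of the penalty, the factor of one half, and all function and parameter names are assumptions, not a reproduction of Aljundi's Equation 8.2.

```python
import math


def mean_cross_entropy(batch_probs, labels):
    """Mean negative log-probability assigned to the true labels."""
    losses = [-math.log(probs[label]) for probs, label in zip(batch_probs, labels)]
    return sum(losses) / len(losses)


def importance_penalty(theta, theta_star, importance, lam):
    """Assumed quadratic penalty on drift from theta_star, scaled per parameter."""
    drift = sum(w * (a - b) ** 2 for w, a, b in zip(importance, theta, theta_star))
    return 0.5 * lam * drift


def combined_objective(new_probs, new_labels, buf_probs, buf_labels,
                       theta, theta_star, importance, lam):
    """Sum of the loss on new samples, the loss on buffer samples, and the penalty."""
    loss_new = mean_cross_entropy(new_probs, new_labels)
    loss_buf = mean_cross_entropy(buf_probs, buf_labels)
    return loss_new + loss_buf + importance_penalty(theta, theta_star, importance, lam)


total = combined_objective(
    new_probs=[[0.5, 0.5]], new_labels=[0],
    buf_probs=[[0.5, 0.5]], buf_labels=[1],
    theta=[1.0, 2.0], theta_star=[1.0, 1.0], importance=[1.0, 2.0], lam=1.0,
)
```

In this toy call each loss term equals ln 2 and the penalty equals 1.0, so the three contributions can be checked by hand.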
Shaker et al., Piot et al., Abbasi et al., and Aljundi are analogous to the claimed invention because they are directed towards continual learning in neural networks.
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to incorporate the step of training the working memory by combining a cross-entropy loss on the interleaved samples with a knowledge retrieval loss on the random samples from the episodic memory, as taught by Aljundi, into the disclosed invention of Shaker et al. in view of Piot et al. and further in view of Abbasi et al.
One of ordinary skill in the art would have been motivated to make this modification to "significantly reduce the percentage of parameters dedicated to each task and as a consequence remarkably improve the continual learning performance" (Aljundi, Abstract).
Claims 4 and 5 are rejected under 35 U.S.C. 103 as being unpatentable over Shaker et al. (US 2021/0064989 A1) in view of Piot et al. ("Dual-Memory Model for Incremental Learning: The Handwriting Recognition Use Case") in view of Abbasi et al. ("Sparsity and Heterogeneous Dropout for Continual Learning in the Null Space of Neural Activations") and further in view of Chaudhry et al. ("Continual Learning with Tiny Episodic Memories").
Regarding claim 4,
Shaker et al. in view of Piot et al. and further in view of Abbasi et al. teaches the computer-implemented method of claim 1.
Shaker et al. in view of Piot et al. and further in view of Abbasi et al. does not appear to explicitly teach the step of assigning a fixed size to the instance-based episodic memory.
However, Chaudhry et al. teaches the step of assigning a fixed size to the instance-based episodic memory (Algorithm 1, line 2: "2: M ← {} ∗ mem_sz > Allocate memory buffer of size mem_sz" teaches allocating a fixed size to the episodic memory M).
Shaker et al., Piot et al., Abbasi et al., and Chaudhry et al. are analogous to the claimed invention because they are directed towards continual learning in neural networks.
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to incorporate the step of assigning a fixed size to the instance-based episodic memory, as taught by Chaudhry et al., into the disclosed invention of Shaker et al. in view of Piot et al. and further in view of Abbasi et al.
One of ordinary skill in the art would have been motivated to make this modification to "offer a very large performance boost at a very marginal increase of computational cost compared to the finetuning baseline" (Chaudhry et al. Conclusions, first paragraph).
Regarding claim 5,
Shaker et al. in view of Piot et al. and further in view of Abbasi et al. teaches the computer-implemented method of claim 1.
Shaker et al. in view of Piot et al. and further in view of Abbasi et al. does not appear to explicitly teach wherein the step of maintaining an episodic memory comprises the step of maintaining the episodic memory with reservoir sampling by assigning to each incoming sample of the continuous data stream equal probability of being represented in the episodic buffer.
However, Chaudhry et al. teaches wherein the step of maintaining an episodic memory comprises the step of maintaining the episodic memory with reservoir sampling by assigning to each incoming sample of the continuous data stream equal probability of being represented in the episodic buffer (Algorithm 1; Section 4, second paragraph: "The overall training procedure is given in Alg. 1. Compared to the simplest baseline model that merely fine-tunes the parameters on the new task starting from the previous task parameter vector, ER makes two modifications. First, it has an episodic memory which is updated at every time step, line 8" teaches maintaining an episodic memory M. Section 4, fourth-fifth paragraphs: "Since we study the usage of tiny episodic memories, the sample that the learner selects to populate the memory becomes crucial, see line 8 of the algorithm. For this, we describe various strategies to write into the memory. All these strategies assume access to a continuous stream of data and a small episodic memory … Reservoir Sampling: Similarly to Riemer et al. (2019), Reservoir sampling (Vitter, 1985) takes as input a stream of data of unknown length and returns a random subset of items from that stream. If ‘n’ is the number of points observed so far and ‘mem_sz’ is the size of the reservoir (sampling buffer), this selection strategy samples each data point with a probability (mem_sz)/n" teaches that the episodic memory M is maintained with reservoir sampling by assigning incoming samples of the continuous data stream with an equal probability of being represented in the sampling buffer (episodic buffer)).
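For illustration only, the reservoir sampling strategy described in the quoted passage can be sketched as follows. This is a minimal sketch of the standard reservoir algorithm (Vitter, 1985) and not code from Chaudhry et al.; the names reservoir_update, fill_memory, n_seen, and rng are hypothetical, while mem_sz follows the quoted text.

```python
import random


def reservoir_update(memory, sample, n_seen, mem_sz, rng):
    """Offer one incoming sample to a fixed-size memory.

    n_seen is the number of samples observed before this one, so the
    incoming sample is the (n_seen + 1)-th item of the stream.
    """
    if len(memory) < mem_sz:
        # Memory is not full yet: every sample is stored.
        memory.append(sample)
        return
    # Pick a slot uniformly from 0..n_seen (inclusive). The sample is kept
    # only if the slot falls inside the memory, i.e. with probability
    # mem_sz / (n_seen + 1).
    slot = rng.randint(0, n_seen)
    if slot < mem_sz:
        memory[slot] = sample


def fill_memory(stream, mem_sz, seed=0):
    """Run reservoir sampling over a stream of unknown length."""
    rng = random.Random(seed)
    memory = []
    for n_seen, sample in enumerate(stream):
        reservoir_update(memory, sample, n_seen, mem_sz, rng)
    return memory


memory = fill_memory(range(1000), mem_sz=10)
```

Because each arriving sample replaces a uniformly chosen slot with probability mem_sz/n, where n counts the samples observed so far including the new one, every sample in the stream ends up with the same chance of being present in the memory, and the memory never grows beyond mem_sz entries.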
Shaker et al., Piot et al., Abbasi et al., and Chaudhry et al. are analogous to the claimed invention because they are directed towards continual learning in neural networks.
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to incorporate the limitation wherein the step of maintaining an episodic memory comprises the step of maintaining the episodic memory with reservoir sampling by assigning to each incoming sample of the continuous data stream equal probability of being represented in the episodic buffer, as taught by Chaudhry et al., into the disclosed invention of Shaker et al. in view of Piot et al. and further in view of Abbasi et al.
One of ordinary skill in the art would have been motivated to make this modification to "offer a very large performance boost at a very marginal increase of computational cost compared to the finetuning baseline" (Chaudhry et al. Conclusions, first paragraph).
Claim 10 is rejected under 35 U.S.C. 103 as being unpatentable over Shaker et al. (US 2021/0064989 A1) in view of Piot et al. ("Dual-Memory Model for Incremental Learning: The Handwriting Recognition Use Case") in view of Abbasi et al. ("Sparsity and Heterogeneous Dropout for Continual Learning in the Null Space of Neural Activations") and further in view of Oswald et al. ("Learning where to learn: Gradient sparsity in meta and continual learning").
Regarding claim 10,
Shaker et al. in view of Piot et al. and further in view of Abbasi et al. teaches the computer-implemented method of claim 1.
Shaker et al. in view of Piot et al. and further in view of Abbasi et al. does not appear to explicitly teach the step of setting a sparsity ratio for each layer of the network wherein earlier layers have a lower sparsity ratio than later layers.
However, Oswald et al. teaches further comprising the step of setting a sparsity ratio for each layer of the network wherein earlier layers have a lower sparsity ratio than later layers (Section 4.1, last paragraph: "As in our few-shot learning experiments, structured sparsity emerges across the different parameter groups of the network (cf. Figure 3). We observe that now sparsity is highest closest to the output layer, the exact opposite of the trend found in our few-shot learning experiments. This provides evidence that online meta-learning can discover how to rewire low-level features without interference in order to accommodate different tasks that share high-level structure. We further investigate a multi-pass setting, where the examples from each task are visited multiple times (10 epochs instead of 1) before proceeding to the next task. In this setting, it can be seen that sparsity levels (displayed in Figure 4) tend to converge within tasks and then raise again when tasks switch, presumably to preserve past memories via gradient sparsification. Taken together, our results support the hypothesis that gradient sparsity is beneficial for continual learning and that appropriate patterns of sparsity can be discovered by simple online gradient-based meta-learning" teaches that in continual learning a sparsity is set for each layer, with the sparsity (sparsity ratio) being the highest closest to the output layer (i.e. earlier layers have a lower sparsity ratio than later layers)).
Shaker et al., Piot et al., Abbasi et al., and Oswald et al. are analogous to the claimed invention because they are directed towards continual learning in neural networks.
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to incorporate the step of setting a sparsity ratio for each layer of the network wherein earlier layers have a lower sparsity ratio than later layers, as taught by Oswald et al., into the disclosed invention of Shaker et al. in view of Piot et al. and further in view of Abbasi et al.
One of ordinary skill in the art would have been motivated to make this modification because "this selective sparsity results in better generalization and less interference in a range of few-shot and continual learning problems" (Oswald et al. Abstract).
Conclusion
Any inquiry concerning this communication or earlier communications from the examiner should be directed to BRIAN J HALES, whose telephone number is (571) 272-0878. The examiner can normally be reached M-F 9:00am - 5:00pm.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Kamran Afshar, can be reached at (571) 272-7796. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.
/BRIAN J HALES/Examiner, Art Unit 2125
/KAMRAN AFSHAR/Supervisory Patent Examiner, Art Unit 2125