DETAILED ACTION
This action is responsive to the claims filed on 08/15/2025. Claims 1-27 are pending for examination.
This action is Final.
Response to Arguments
Applicant’s arguments with respect to the 35 U.S.C. 102/103 traversal of the claims have been considered but are moot because the new ground of rejection does not rely on any reference applied in the prior rejection of record for any teaching or matter specifically challenged in the argument.
Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.
The factual inquiries for establishing a background for determining obviousness under 35 U.S.C. 103 are summarized as follows:
1. Determining the scope and contents of the prior art.
2. Ascertaining the differences between the prior art and the claims at issue.
3. Resolving the level of ordinary skill in the pertinent art.
4. Considering objective evidence present in the application indicating obviousness or non-obviousness.
This application currently names joint inventors. In considering patentability of the claims the examiner presumes that the subject matter of the various claims was commonly owned as of the effective filing date of the claimed invention(s) absent any evidence to the contrary. Applicant is advised of the obligation under 37 CFR 1.56 to point out the inventor and effective filing dates of each claim that was not commonly owned as of the effective filing date of the later invention in order for the examiner to consider the applicability of 35 U.S.C. 102(b)(2)(C) for any potential 35 U.S.C. 102(a)(2) prior art against the later invention.
Claims 1-6, 10-11, 13-15, and 26-27 are rejected under 35 U.S.C. 103 as being unpatentable over Michielli et al. (Michielli, N., Acharya, U. R., & Molinari, F. (2019). Cascaded LSTM recurrent neural network for automated sleep stage classification using single-channel EEG signals. Computers in biology and medicine, 106, 71-81.), hereafter referred to as Michielli, in view of Kim et al. (Kim, J., El-Khamy, M., & Lee, J. (2017). Residual LSTM: Design of a deep recurrent architecture for distant speech recognition. arXiv preprint arXiv:1701.03360.), hereafter referred to as Kim.
Claim 1: Michielli teaches the following limitations:
A method performed by one or more computers, the method comprising: obtaining a network input; (Michielli, abstract, paragraph 2, “In this study, a novel cascaded recurrent neural network (RNN) architecture based on long short-term memory (LSTM) blocks, is proposed for the automated scoring of sleep stages using EEG signals derived from a single-channel. Fifty-five time and frequency-domain features were extracted from the EEG signals and fed to feature reduction algorithms to select the most relevant ones. The selected features constituted as the inputs to the LSTM networks”, features extracted from signals serve as inputs to the network.)
and generating a network output for the network input by, at each time step of a time step sequence comprising a plurality of time steps: (Michielli, page 74, col. 2, section 2.6, paragraph 2, “In the RNN architecture the input layer is a sequence layer which takes the input a sequence of vectors { … } , which contain all the features for each timestep; then the network computes a sequence of hidden activations { … } and the output vector { … } for T timesteps”, network output is generated for each timestep t.)
processing a time step input derived from the network input using a cascaded neural network to generate a candidate network output for the time step, (Michielli, page 75, col. 2, paragraph 1, “The LSTM memory cell consists of five components: the memory cell c <t > (a new variable computed for each timestep), the candidate value c˜ <t > for replacing the memory cell at each timestep and three gates defined as update gate u, forget gate f and output gate o. The memory cell is useful to remember certain values even for a long time during the training process.”, candidate values are provided to the generation of the output for a particular time step.)
wherein the cascaded neural network comprises a plurality of neural network blocks that are arranged in a stack one after another, (Michielli, page 76, col. 1, paragraph 1, “In this study, we employed a cascade of two RNNs with LSTM units. The first network took the input the features selected by mRMR algorithm and performed 4-class (W, N1-REM, N2 and N3) classification (the N1 and REM epochs were merged into a single class), while the second network used the input new features computed by PCA of the correctly classified N1-REM epochs by the first RNN and classified these epochs into two classes (N1 and REM).”, the cascaded neural network comprises a plurality of neural networks arranged in a cascade (a stack one after another).)
and wherein each of the plurality of neural network blocks is configured to, for each particular time step of a plurality of particular time steps in the time step sequence: receive a block input for the neural network block for the particular time step; (Michielli, page 74, col. 2, section 2.6, paragraph 2, “In the RNN architecture the input layer is a sequence layer which takes the input a sequence of vectors { … } , which contain all the features for each timestep;”, each of the RNN blocks in the cascade are provided an input layer for each timestep.)
apply a learned block transformation to the block input for the particular time step to generate a transformed block input for the particular time step; (Michielli, page 72, section 2, paragraph 1, “Both RNNs shared the first three steps: data acquisition, signal pre-processing and feature extraction from single-channel EEG signals. Subsequently, a feature selection or feature transformation method was adopted to reduce the number of input features for neural network.”, a feature transformation is applied to the input to generate a reduced input feature set (a transformed block input) for a particular time step.)
Kim, in the same field of neural network implementation, teaches the following limitations which Michielli fails to teach:
[Image: media_image1.png]
Figure 1 of Kim, Residual LSTM: A shortcut from a prior output layer $h_t^{l-1}$ is added to a projection output $m_t^l$. $W_h^l$ is a dimension matching matrix between input and output. If K is equal to M, it is replaced with an identity matrix.
and generate a block output for the particular time step, comprising applying a skip connection for the neural network block to at least (i) the block input for the particular time step (Kim, page 2, section 3, “a shortcut path is added to an LSTM output layer $h_t^l$ … $h_t^l = o_t^l \cdot (m_t^l + W_h^l x_t^l)$; Where $W_h^l$ can be replaced by an identity matrix if the dimension of $x_t^l$ matches that of $h_t^l$.”, the shortcut path is the skip connection at the block output that adds the block input x_t (via $W_h^l$ or the identity matrix) to the block's internal (“delay path”) output $m_t^l$.)
and (ii) an output of a delay component within the neural network block that operates on respective transformed block inputs that have each been generated by the neural network block by applying the learned block transformation to a respective block input for each of one or more preceding time steps that precede the current time step in the time step sequence. (Kim, page 2, section 2.3, “$c_t^l = f_t^l \cdot c_{t-1}^l + i_t^l \cdot \tanh(W_x^l x_t^l + W_{hc}^l h_{t-1}^l + b_c^l)$ … $x_t^l$ is an input from (l−1)th layer … $h_{t-1}^l$ is an lth output layer at time t−1 and $c_{t-1}^l$ is an internal cell state at t−1.”, This expressly shows the delay component operating on outputs from the immediately preceding time step (t-1); unrolled over time, this captures one or more preceding steps.
Kim, page 2, section 3, “Equations (4), (5), (6) and (7) do not change for residual LSTM. The updated equations are as follows: $r_t^l = \tanh(c_t^l)$; $m_t^l = W_p^l \cdot r_t^l$; $h_t^l = o_t^l \cdot (m_t^l + W_h^l x_t^l)$; Where $W_h^l$ can be replaced by an identity matrix if the dimension of $x_t^l$ matches that of $h_t^l$.”, In Residual LSTM, the same LSTM delay over t-1 remains in place, and the skip is applied at the block output by summing the delay path output $m_t^l$ with the block input $x_t^l$.)
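For illustration only, the recurrence relied upon from Kim may be sketched as follows. This is a minimal scalar sketch and not Kim's implementation: the function names, the parameter dictionary `p` and its keys are hypothetical, peephole connections are omitted, and it is assumed that the output gate scales the sum of the delay path output m_t and the shortcut term W_h x_t, consistent with the explanation above that the skip sums m_t with the block input.

```python
import math

def sigmoid(z):
    """Logistic function used for the input, forget and output gates."""
    return 1.0 / (1.0 + math.exp(-z))

def residual_lstm_step(x_t, h_prev, c_prev, p):
    """One time step of a scalar residual-LSTM-style block (illustrative only).

    x_t    : block input at the current time step
    h_prev : block output at the preceding time step (t-1)
    c_prev : internal cell state at the preceding time step (t-1)
    p      : hypothetical dict of scalar weights and biases
    """
    i_t = sigmoid(p["w_xi"] * x_t + p["w_hi"] * h_prev + p["b_i"])  # input gate
    f_t = sigmoid(p["w_xf"] * x_t + p["w_hf"] * h_prev + p["b_f"])  # forget gate
    # Cell state: carries information from t-1 forward (the "delay" path).
    c_t = f_t * c_prev + i_t * math.tanh(
        p["w_xc"] * x_t + p["w_hc"] * h_prev + p["b_c"]
    )
    o_t = sigmoid(p["w_xo"] * x_t + p["w_ho"] * h_prev + p["b_o"])  # output gate
    r_t = math.tanh(c_t)
    m_t = p["w_p"] * r_t                  # projection output
    h_t = o_t * (m_t + p["w_h"] * x_t)    # shortcut adds the block input
    return h_t, c_t

def run_block(xs, p):
    """Apply the block over a sequence, carrying h and c across time steps."""
    h, c = 0.0, 0.0
    outputs = []
    for x_t in xs:
        h, c = residual_lstm_step(x_t, h, c, p)
        outputs.append(h)
    return outputs
```

Setting `w_h` to 1.0 corresponds to the identity shortcut described in the quoted passage for the case where the input and output dimensions match.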
Claim 2: Michielli in view of Kim teaches the limitations of claim 1. Michielli further teaches:
The method of claim 1, wherein the network output is the candidate network output for the last time step. (Michielli, page 76, col. 1, paragraph 2, “Both RNNs proposed in this study presented the same structure: the input layer was a sequence layer with 30 timesteps; the LSTM layers were used to learn the features from EEG signals; the fully connected (FC) layer was used to convert the output size of the previous layers into the number of sleep stages to recognize”, the input sequence consists of 30 time steps. Naturally, the final output is produced by the fully connected layer after the last time step has been processed by the LSTM layers.)
Claim 3: Michielli in view of Kim teaches the limitations of claim 1. Kim further teaches:
The method of claim 1, wherein generating the block output comprises applying the skip connection for the neural network block to (i) the block input for the particular time step (Kim, page 2, section 3, “a shortcut path is added to an LSTM output layer $h_t^l$ … $h_t^l = o_t^l \cdot (m_t^l + W_h^l x_t^l)$; Where $W_h^l$ can be replaced by an identity matrix if the dimension of $x_t^l$ matches that of $h_t^l$.”, the shortcut path is the skip connection at the block output that adds the block input x_t (via $W_h^l$ or the identity matrix) to the block's internal (“delay path”) output $m_t^l$.)
and (ii) the output of the delay component within the neural network block that operates on only the respective transformed block input generated by the neural network block by applying the learned block transformation to the respective block input for the immediately preceding time step that immediately precedes the particular time step in the time step sequence. (Kim, page 2, section 2.3, “$c_t^l = f_t^l \cdot c_{t-1}^l + i_t^l \cdot \tanh(W_x^l x_t^l + W_{hc}^l h_{t-1}^l + b_c^l)$ … $x_t^l$ is an input from (l−1)th layer … $h_{t-1}^l$ is an lth output layer at time t−1 and $c_{t-1}^l$ is an internal cell state at t−1.”, This expressly shows the delay component operating on outputs from the immediately preceding time step (t-1); unrolled over time, this captures one or more preceding steps.
Kim, page 2, section 3, “Equations (4), (5), (6) and (7) do not change for residual LSTM. The updated equations are as follows: $r_t^l = \tanh(c_t^l)$; $m_t^l = W_p^l \cdot r_t^l$; $h_t^l = o_t^l \cdot (m_t^l + W_h^l x_t^l)$; Where $W_h^l$ can be replaced by an identity matrix if the dimension of $x_t^l$ matches that of $h_t^l$.”, In Residual LSTM, the same LSTM delay over t-1 remains in place, and the skip is applied at the block output by summing the delay path output $m_t^l$ with the block input $x_t^l$.)
Claim 4: Michielli in view of Kim teaches the limitations of claim 3. Michielli further teaches:
The method of claim 3, wherein generating the block output comprises: computing a sum of (i) the block input for the particular time step and (ii) the output of the delay component within the neural network block by applying the learned block transformation to the respective block input that operates on the respective transformed block input generated by the neural network block for the immediately preceding time step that immediately precedes the particular time step in the time step sequence. (Michielli, page 74, section 2.6, paragraph 2, “The activation and the output prediction at time t are expressed as:”
[Image: media_image2.png]
The current input (i) x_t is explicitly included in the block output computation and further is computed in a sum in combination with (ii) the transformed block input for the immediate preceding timestep a_<t-1>, the delayed component, as shown in the formula above.)
Claim 5: Michielli in view of Kim teaches the limitations of claim 4. Michielli further teaches:
The method of claim 4, wherein generating block output further comprises: applying a non-linearity to the sum. (Michielli, page 75, col. 2, paragraph 2, “Finally, the output gate is the section where the activation at the current timestep is generated and can be defined as:”
[Image: media_image3.png]
“In the previous expressions, σ represents the sigmoid function.”, the activation function σ which is applied to the outputted sum, is a sigmoid function which is inherently non-linear.)
Claim 6: Michielli in view of Kim teaches the limitations of claim 1. Kim further teaches:
The method of claim 1, wherein generating the block output comprises applying the skip connection for the neural network block to (i) the block input for the particular time step, (Kim, page 2, section 3, “a shortcut path is added to an LSTM output layer $h_t^l$ … $h_t^l = o_t^l \cdot (m_t^l + W_h^l x_t^l)$; Where $W_h^l$ can be replaced by an identity matrix if the dimension of $x_t^l$ matches that of $h_t^l$.”, the shortcut path is the skip connection at the block output that adds the block input x_t (via $W_h^l$ or the identity matrix) to the block's internal (“delay path”) output $m_t^l$.)
(ii) the output of the delay component within the neural network block that operates on the respective transformed block input for the particular time step and (Kim, page 2, section 2.3, “$c_t^l = f_t^l \cdot c_{t-1}^l + i_t^l \cdot \tanh(W_x^l x_t^l + W_{hc}^l h_{t-1}^l + b_c^l)$.”, at the time t, the LSTM of Kim operates on the current (particular) time step input via the transformed term $\tanh(W_x^l x_t^l + W_{hc}^l h_{t-1}^l + b_c^l)$ gated by $i_t^l$.)
(iii) the output of the delay component within the neural network block that operates on the respective transformed block inputs that have each been generated by the neural network block by applying the learned block transformation to the respective block input for each of all preceding time steps that precede the particular time step in the time step sequence. (Kim, page 2, section 2.3, “$c_t^l = f_t^l \cdot c_{t-1}^l + i_t^l \cdot \tanh(W_x^l x_t^l + W_{hc}^l h_{t-1}^l + b_c^l)$ … $x_t^l$ is an input from (l−1)th layer … $h_{t-1}^l$ is an lth output layer at time t−1 and $c_{t-1}^l$ is an internal cell state at t−1.”, This expressly shows the delay component operating on outputs from the immediately preceding time step (t-1); unrolled over time, this captures one or more preceding steps.
Kim, page 2, section 3, “Equations (4), (5), (6) and (7) do not change for residual LSTM. The updated equations are as follows: $r_t^l = \tanh(c_t^l)$; $m_t^l = W_p^l \cdot r_t^l$; $h_t^l = o_t^l \cdot (m_t^l + W_h^l x_t^l)$; Where $W_h^l$ can be replaced by an identity matrix if the dimension of $x_t^l$ matches that of $h_t^l$.”, In Residual LSTM, the same LSTM delay over t-1 remains in place, and the skip is applied at the block output by summing the delay path output $m_t^l$ with the block input $x_t^l$.)
Claim 10: Michielli in view of Kim teaches the limitations of claim 1. Michielli further teaches:
The method of claim 1, wherein the time step input for each time step in the sequence is the network input. (Michielli, page 76, col. 1, paragraph 2, “Both RNNs proposed in this study presented the same structure: the input layer was a sequence layer with 30 timesteps;”, the network input consists of a sequence of inputs split into 30 timesteps, and each time step input comes directly from the original input sequence.)
Claim 11: Michielli in view of Kim teaches the limitations of claim 1. Michielli further teaches:
The method of claim 1, wherein the network input changes over time and each time step input is the network input as of a corresponding time point. (Michielli, page 73, col. 1, paragraph 1, “All EEG time-series data were filtered to remove the frequency components outside the range of 0.3–45 Hz. Next, the 30 s filtered epochs were subdivided in blocks of 1 s duration, thus for each epoch we obtained 30 time-segments. A number of timesteps equals to 30 was set for the RNN classifier.”, the input changes over time because each time block corresponds to a different portion of the input EEG signals at a corresponding time.)
Claim 13: Michielli in view of Kim teaches the limitations of claim 1. Michielli further teaches:
The method of claim 1, wherein, for each time step and for each block after the first block in the stack, the block input is the block output of the preceding block in the stack for the time step. (Michielli, page 76, col. 1, paragraph 2, “In this study, we employed a cascade of two RNNs with LSTM units. The first network took the input the features selected by mRMR algorithm and performed 4-class (W, N1-REM, N2 and N3) classification…, while the second network used the input new features computed by PCA of the correctly classified N1-REM epochs by the first RNN”, the cascading logic shows the output of one block (the first network) feeding into the input of the next.)
Claim 14: Michielli in view of Kim teaches the limitations of claim 1. Michielli further teaches:
The method of claim 1, wherein, for each time step and for the first block in the stack, the block input is the time step input for the time step. (Michielli, page 73, col. 1, paragraph 1, “All EEG time-series data were filtered to remove the frequency components outside the range of 0.3–45 Hz. Next, the 30 s filtered epochs were subdivided in blocks of 1 s duration, thus for each epoch we obtained 30 time-segments. A number of timesteps equals to 30 was set for the RNN classifier.”, each input for the first network corresponds to the EEG signal input for that time step.)
Claim 15: Michielli in view of Kim teaches the limitations of claim 1. Michielli further teaches:
The method of claim 1, wherein, for each time step and for the first block in the stack, the block input is one of: an output of one or more initial layers of the neural network for the time step and generated by processing the time step input for the time step; (Michielli, page 76, col. 1, paragraph 2, “the LSTM layers were used to learn the features from EEG signals”, the first LSTM block receives as input the feature vectors extracted for each current time step.)
an output of one or more initial layers of the neural network for the immediately preceding time step and generated by processing the time step input for the immediately preceding time step; (Michielli, page 75, col. 1, paragraph 1, “The characteristic of RNN is that each neuron of the hidden layer receives the activation of the previous time step to compute the activation of the current time step.”, this explicitly teaches that the network uses the previous hidden activation (the output of earlier processing of the previous time step input). Thus, the block input can come from an earlier immediately preceding time step.)
or a combination of the output of one or more initial layers of the neural network for the time step and generated by processing the time step input for the time step and respective outputs of the one or more initial layers of the neural network for one or more preceding time steps that are each generated by processing the time step input for the preceding time step. (Michielli, page 75, col. 2, paragraph 2, “Finally, the output gate is the section where the activation at the current timestep is generated and can be defined as:”
[Image: media_image3.png]
“In the previous expressions, σ represents the sigmoid function.”, this formula shows that the current activation a_t that is outputted is computed based on both x_t (input x at a current timestep) and a_t-1 (the activation from the previous time step). Thus the block input is a combination of outputs from the current time step and the preceding time step.)
Claims 26 and 27 recite limitations substantially similar to the limitations of claim 1, and as such a similar analysis applies.
Claims 7, 8, and 9 are rejected under 35 U.S.C. 103 as being unpatentable over Michielli in view of Kim as applied to claims 1-6, 10-11, 13-15, and 26-27 above, and further in view of Dixon et al. (Dixon, M. F., Halperin, I., & Bilokon, P. (2020). Machine learning in finance (Vol. 1170). New York, NY, USA: Springer International Publishing.), hereafter referred to as Dixon.
Claim 7: Michielli in view of Kim teaches the limitations of claim 6. Dixon, in the same field of neural network transformations, further teaches the following limitations, which Michielli and Kim fail to teach:
The method of claim 6, wherein generating the block output comprises: computing an exponentially weighted smoothing sum of the (ii) the output of the delay component within the neural network block that operates on the respective transformed block input for the particular time step and (iii) the output of the delay component within the neural network block that operates on the respective transformed block inputs that have each been generated by the neural network block by applying the learned block transformation to the respective block input for each of all preceding time steps that precede the particular time step in the time step sequence; (Dixon, page 20, section 2.10, “Exponential smoothing is a type of forecasting or filtering method that exponentially decreases the weight of past and current observations to give smoothed predictions y˜t+1… Writing this as a geometric decaying autoregressive series back to the first observation:”
[Image: media_image4.png]
The expression shows an exponentially weighted smoothing sum that is calculated as a sum of past values weighted exponentially by α, which includes current and all past transformed inputs.)
and computing a sum of the block input for the particular time step and the exponentially weighted smoothing sum. (Dixon, page 20, section 2.10, “Writing this as a geometric decaying autoregressive series back to the first observation:”
[Image: media_image4.png]
The current input (y_t) being added to a weighted version of past smoothed output (y˜_t), which corresponds to summing the current block input and a smoothed version of past transformed inputs.)
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have combined the teachings of Michielli and Kim with those of Dixon so as to include an exponentially weighted smoothing sum in the cascaded network block output. A motivation for doing so would be the incorporation of information from an entire time series, unlike autoregressive models that only consider short-term time frames. (Dixon, page 204, paragraph 2, “hence we observe that smoothing introduces a long-term model of the entire observed data, not just a sub-sequence used for prediction in a AR model, for example.”, exponential smoothing allows for the modeling of long-term temporal structures.)
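For reference, the step from the one-step exponential smoothing update to a weighted sum over all earlier observations can be written out as below. This is a sketch under stated assumptions: it uses the update form described for Dixon in this action (prior forecast weighted by 1 − α, new observation weighted by α), and the initial forecast $\tilde{y}_1$ is an assumed starting value that the cited passages do not specify.

```latex
\begin{align*}
\tilde{y}_{t+1} &= \tilde{y}_t + \alpha\,(y_t - \tilde{y}_t)
                 = \alpha\, y_t + (1-\alpha)\,\tilde{y}_t \\
                &= \alpha\, y_t + \alpha(1-\alpha)\, y_{t-1} + (1-\alpha)^2\,\tilde{y}_{t-1} \\
                &= \alpha \sum_{s=0}^{t-1} (1-\alpha)^s\, y_{t-s} + (1-\alpha)^t\,\tilde{y}_1
\end{align*}
```

Each earlier observation therefore enters with a weight that decays geometrically in its distance from the current time step.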
Claim 8: Michielli in view of Kim and Dixon teaches the limitations of claim 7. Michielli further teaches:
The method of claim 7, wherein generating block output further comprises: applying a non-linearity to the sum. (Michielli, page 75, col. 2, paragraph 2, “Finally, the output gate is the section where the activation at the current timestep is generated and can be defined as:”
[Image: media_image3.png]
“In the previous expressions, σ represents the sigmoid function.”, the activation function σ which is applied to the outputted sum, is a sigmoid function which is inherently non-linear. It would have been obvious to one of ordinary skill in the art before the effective filing date of the invention to have applied a non-linear transformation to the exponentially weighted smoothing sum disclosed in Dixon.)
Claim 9: Michielli in view of Kim and Dixon teaches the limitations of claim 7. Dixon, in the same field of neural network transformations, further teaches the following limitations, which Michielli and Kim fail to teach:
The method of claim 7, wherein computing the exponentially weighted smoothing sum comprises: accessing a previous exponentially weighted smoothing sum that was computed at the immediately preceding time step; (Dixon, page 20, section 2.10, “Writing this as a geometric decaying autoregressive series back to the first observation:”
[Image: media_image4.png]
The current input (y_t) being added to a weighted version of past smoothed output (y˜_t), which corresponds to summing the current block input and a smoothed version of past transformed inputs including the immediate preceding time step.)
and computing a sum of (i) the previous exponentially weighted smoothing sum weighted by α and (ii) the output of the delay component within the neural network block that operates on the respective transformed block input for the particular time step weighted by (1 - α), (Dixon, page 204, section 2.10, paragraph 1, “Exponential smoothing takes the forecast for the previous period y˜t and adjusts with the forecast error, yt − ˜yt . The forecast for the next period becomes”
[Image: media_image5.png]
the exponential smoothing formula used above explicitly computes a sum of the previous exponentially smoothed value y˜t weighted by (1 - α) and the new transformed input for that particular time step yt weighted by α, as similarly disclosed in the claim language.)
wherein α is a constant smoothing factor between zero and one. (Dixon, page 20, section 2.10, “It requires a single parameter, α, also called the smoothing factor or smoothing coefficient. This parameter controls the rate at which the influence of the observations at prior time steps decay exponentially. α is often set to a value between 0 and 1.”, the coefficient α is explicitly described as a constant between 0 and 1.)
It would have been obvious to one of ordinary skill in the art before the effective filing date of the invention to have combined the teachings of Michielli and Kim with those of Dixon. The rationale for the combination is similar to that of claim 7 above.
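As an illustration of the recursive computation discussed for claim 9, the sketch below keeps only the previous smoothed value as state and checks that the result matches the explicit geometrically weighted sum. The function names and the initial value `y0` are hypothetical, and the weighting follows the form attributed to Dixon above (previous smoothed value weighted by 1 − α, new value weighted by α); it is not code from any cited reference.

```python
def ewma_update(prev_smoothed, y_t, alpha):
    """One smoothing step: only the previous smoothed value is accessed."""
    return alpha * y_t + (1.0 - alpha) * prev_smoothed

def ewma_recursive(ys, alpha, y0=0.0):
    """Run the one-step update over the whole sequence."""
    smoothed = y0
    for y_t in ys:
        smoothed = ewma_update(smoothed, y_t, alpha)
    return smoothed

def ewma_unrolled(ys, alpha, y0=0.0):
    """Same quantity written as an explicit weighted sum over all observations."""
    total = (1.0 - alpha) ** len(ys) * y0
    # The most recent observation gets weight alpha, the one before it
    # alpha * (1 - alpha), and so on back to the first observation.
    for age, y in enumerate(reversed(ys)):
        total += alpha * (1.0 - alpha) ** age * y
    return total
```

The recursive form needs constant memory per time step, whereas the unrolled form revisits every earlier value; both give the same number.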
Claims 16-20 are rejected under 35 U.S.C. 103 as being unpatentable over Michielli in view of Kim as applied to claims 1-6, 10-11, 13-15, and 26-27 above, and further in view of Jeong et al. (Jeong, T., & Kim, H. (2020). Ood-maml: Meta-learning for few-shot out-of-distribution detection and classification. Advances in Neural Information Processing Systems, 33, 3907-3916.), hereafter referred to as Jeong.
Claim 16: Michielli in view of Kim teaches the limitations of claim 1. Jeong, in the same field of neural network transformations, further teaches the following limitations, which Michielli and Kim fail to teach:
The method of claim 1, further comprising: determining that criteria for terminating processing of the network input have been satisfied; (Jeong, page 6, paragraph 1, “The threshold λ can be determined based on some criteria such as the true positive ratio (TPR), or simply set to 0.5 as a default value for binary classification”, The OOD (Out-of-Distribution) meta learning model of Jeong uses criteria like TPR thresholds to determine when a task (or classification) is considered successfully adapted. Thus the processing of network input is performed when λ is satisfied)
and in response, refraining from processing for any time steps after the last time step in the sequence and selecting the candidate network output for the last time step in the sequence as the network output. (Jeong, page 6, paragraph 1, “If all the N elements in p j (x) are less than λ, we assign the unseen class (out-of-distribution) for x. Otherwise, we assign the maximum index of p j (x) as the class for x among N classes”, Once the threshold criteria are evaluated, the model immediately classifies the sample or detects it as OOD without further processing and assigns it the unseen class for x outputted by the candidate network.)
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have combined the teachings of Michielli and Kim with those of Jeong so as to include meta-learning models for OOD detection and early termination. A motivation for doing so would be the effective detection of out-of-distribution samples. (Jeong, page 7, section 4.3, paragraph 1, “Table 1 reports the few-shot OOD detection performance of OOD-MAML with α% TPR threshold and the considered competing methods. It lists the average and standard deviation of the TNR and detection accuracy over 1000 different tasks. We can observe that OOD examples were detected more effectively using OOD-MAML than others. In particular, OOD-MAML demonstrated a significant improvement in TNR. We also report OOD detection results of OOD-MAML with λ being simply set as 0.5 in Supplementary material.”)
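The threshold rule quoted from Jeong above (assign the unseen class if all N elements of p_j(x) are less than λ, otherwise assign the maximum index) may be sketched as follows. This is an illustrative sketch only: the function name is hypothetical, the scores are assumed to be a plain list of N per-class values, and `None` is used here as a stand-in label for the unseen (out-of-distribution) class.

```python
def classify_or_flag_ood(scores, lam=0.5):
    """Apply the quoted decision rule to N per-class scores.

    Returns None (standing in for the unseen / OOD class) when every score
    is below the threshold lam; otherwise returns the index of the largest
    score as the predicted class.
    """
    if all(score < lam for score in scores):
        return None
    return max(range(len(scores)), key=lambda n: scores[n])
```

The default of 0.5 mirrors the default threshold value mentioned in the quoted passage for binary classification.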
Claim 17: Michielli in view of Kim, and Jeong teaches the limitations of claim 16. Jeong further teaches:
The method of claim 16, wherein determining that criteria for terminating processing of the network input have been satisfied comprises determining that the criteria have been satisfied from (i) the candidate network outputs, (ii) intermediate logits generated by the cascaded neural network, or both for at least some of the time steps in the sequence. (Jeong, page 6, paragraph 1, “Given x ∈ D j test, we concatenate the adaptation results for x from Tjns, i.e.,
[Image: media_image6.png]
, where p j (x) denotes the K-shot N-way results for Tj . Note that fθ jn adapt (·) are binary classifiers, and the label 0 can be assigned if fθ jn adapt (·) < λ, where λ is a threshold”, the adaptation results pj(x) represent candidate network outputs or intermediate logits for each class. These outputs are evaluated against a threshold to determine if an input should be classified or detected as OOD, thus deciding whether to terminate further processing. Therefore, the decision to stop is based directly on candidate outputs or logits in comparison with the threshold criteria λ.)
The rationale for the combination is similar to that of claim 16 above.
Claim 18: Michielli in view of Kim, and Jeong teaches the limitations of claim 17. Jeong further teaches:
The method of claim 17, wherein determining that criteria are satisfied comprises processing an input derived from (i) the candidate network outputs, (ii) intermediate logits generated by the cascaded neural network, or both for at least some of the time steps in the sequence using a meta-cognitive machine learning model that has been trained to predict whether the last candidate network output should be selected as the network output. (Jeong, page 4, section 3.1, “In the perspective of OOD detection, examples of N classes are in-distribution examples, but our meta learner is trained in a situation wherein Di train contains samples of only a single class. In order to match the situation in the training and testing, we split Tj into multiple sub-tasks {Tjn}n=1,2,...,N , where the nth class is the only seen class in Tjn. Given sub-task Tjn, we define D jn train = {x j nk}1≤k≤K, such that only samples of the nth class belong to the seen class for Tjn, and we obtain the adapted parameters for Tjn.”, the meta-learned models used in OOD-MAML, for each sub-task, generate the network outputs or intermediate logits that form pj(x) as disclosed above. These outputs are then evaluated to detect whether an input should be classified as in-distribution or out-of-distribution. Thus each sub-task's meta-learned model processes these outputs to predict whether to finalize the result.)
The rationale for the combination is similar to that of claim 16 above.
Claim 19: Michielli in view of Kim teaches the limitations of claim 1. Jeong, in the same field of neural network transformations, further teaches the following limitations which Michielli and Kim fail to teach:
The method of claim 1, further comprising: detecting, based on (i) the candidate network outputs, (ii) intermediate logits generated by the cascaded neural network, or both for at least some of the time steps in the sequence, whether the network input is an out-of-distribution (OOD) input. (Jeong, page 6, paragraph 1, “If all the N elements in p^j(x) are less than λ, we assign the unseen class (out-of-distribution) for x. Otherwise, we assign the maximum index of p^j(x) as the class for x among N classes”, the decision whether a network input is an OOD input is directly based on the outputs p^j(x), which are adapted logits, which matches the concept of using candidate neural network outputs or intermediate logits to detect OOD inputs.)
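For illustration only, the decision rule quoted from Jeong above can be sketched as follows. This is a minimal sketch and not code from the reference: the function name `classify_or_flag_ood`, the use of `None` to denote the unseen (OOD) class, and the example scores are all hypothetical.

```python
from typing import Optional, Sequence


def classify_or_flag_ood(p: Sequence[float], lam: float) -> Optional[int]:
    """Apply the quoted thresholding rule to the N scores in p.

    Returns None (hypothetical marker for the unseen/OOD class) when every
    score is below the threshold lam; otherwise returns the index of the
    largest score as the predicted class.
    """
    if all(score < lam for score in p):
        return None  # all N elements are less than lambda: treat x as OOD
    # otherwise assign the index of the maximum element among the N classes
    return max(range(len(p)), key=lambda n: p[n])


# Hypothetical scores for a 3-way task with threshold 0.5.
print(classify_or_flag_ood([0.1, 0.2, 0.05], 0.5))  # no score reaches 0.5
print(classify_or_flag_ood([0.1, 0.8, 0.3], 0.5))   # class 1 has the top score
```

Under these assumptions, the OOD determination depends only on the per-class scores and the threshold λ, which is the point relied upon in the mapping above.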
It would have been obvious to one of ordinary skill in the art before the effective filing date of the invention to have incorporated the teachings of Michielli and Kim with that of Jeong. The rationale for combination being similar to claim 16 above.
Claim 20: Michielli in view of Kim, and Jeong teaches the limitations of claim 19. Jeong further teaches:
The method of claim 19, wherein the detecting comprises processing an input derived from (i) the candidate network outputs, (ii) intermediate logits generated by the cascaded neural network, or both for at least some of the time steps in the sequence using a meta-cognitive machine learning model that has been trained to predict whether the network input is an OOD input. (Jeong, page 6, paragraph 1, “Given x ∈ D^j_test, we concatenate the adaptation results for x from T_jn's, i.e.,
[Equation image: media_image6.png]
, where p^j(x) denotes the K-shot N-way results for T_j. Note that f_{θ^{jn}_adapt}(·) are binary classifiers, and the label 0 can be assigned if f_{θ^{jn}_adapt}(·) < λ, where λ is a threshold”, the classifiers trained in OOD-MAML are meta-learned to predict OOD samples based on outputs as previously disclosed above.)
The rationale for the combination is similar to that of claim 16 above.
Claims 21-24 are rejected under 35 U.S.C. 103 as being unpatentable over Michielli in view of Kim as applied to claims 1-6, 10-11, 13-15, and 26-27 above and Seijen et al. (Seijen, H., & Sutton, R. (2014, January). True online TD (lambda). In International Conference on Machine Learning (pp. 692-700). PMLR.), hereafter referred to as Seijen.
Claim 21: Michielli in view of Kim teaches the limitations of claim 1.
The method of claim 1, wherein the network input is obtained during training, and wherein the method further comprises (Michielli, page 76, col. 2, section 3, paragraph 2, “The entire dataset was split into a training (80% of data), validation (10%) and test (10%) set.”, the data used as the network input is explicitly obtained and used within a training set)
Seijen, in the same field of neural network transformations further teaches the following limitations which Michielli fails to teach:
obtaining a target output for the network input; (Seijen, page 3, section 2.4, “The conventional forward view relates TD(λ) (with accumulating traces) to the λ-return algorithm. This algorithm performs at each time step a standard update (1) with the λ-return as update target.”, the λ-return function is used as the update target in training. The target output is a dynamically computed λ-return value used to guide learning.)
determining, through backpropagation through time, a gradient with respect to the parameters of the cascaded neural network of a temporal difference loss that measures, at each time step in the sequence, a difference between a temporal difference target for the time step and the candidate network output at the time step; (Seijen, page 2, col. 2, section 2.2, paragraph 3, “Although TD(λ) is not strictly speaking a gradient-descent method (Barnard, 1993) it is still useful to view it in these terms (as in Sutton & Barto, 1998). Stochastic gradient descent involves incremental updates of the weight vector θ at each time step in the direction of the gradient of the time step’s error. The updates can generally be written as
[Equation image: media_image7.png]
where U_t is the update target and ∇_{θ_t} v̂_t(S_t) is the gradient of v̂ with respect to θ_t.”
A temporal difference loss is determined and used to compute gradients via stochastic backpropagation through time. The difference between the candidate output (v̂_t(S_t)) and the TD target (U_t) in the formula aligns directly with the claim language.)
and determining an update to the parameters of the cascaded neural network from the gradient. (Seijen, page 2, col. 2, section 2.2, paragraph 3, “Although TD(λ) is not strictly speaking a gradient-descent method (Barnard, 1993) it is still useful to view it in these terms (as in Sutton & Barto, 1998). Stochastic gradient descent involves incremental updates of the weight vector θ at each time step in the direction of the gradient of the time step’s error.”, the update rule directly applies to the neural network's weight parameter vector θ.)
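For illustration only, an incremental update of a weight vector θ toward an update target, of the kind described in the quoted passage, may be sketched as below. This is a minimal sketch under stated assumptions and is not reproduced from Seijen: it assumes a linear estimate v̂ = θ·φ (so that the gradient with respect to θ is the feature vector φ), and the names `sgd_step`, `phi`, `target`, and `alpha` are hypothetical.

```python
from typing import List


def sgd_step(theta: List[float], phi: List[float],
             target: float, alpha: float) -> List[float]:
    """One stochastic-gradient step of a linear estimate toward a target.

    Assumes v_hat = theta . phi, so the gradient of v_hat with respect to
    theta is phi. The step moves theta by alpha * (target - v_hat) * phi.
    """
    v_hat = sum(w * f for w, f in zip(theta, phi))  # current estimate
    error = target - v_hat                          # target minus estimate
    return [w + alpha * error * f for w, f in zip(theta, phi)]


# Hypothetical values: zero weights, two features, target 1.0, step size 0.1.
new_theta = sgd_step([0.0, 0.0], [1.0, 2.0], target=1.0, alpha=0.1)
print(new_theta)
```

In this sketch the size of the parameter update is proportional to the difference between the update target and the current estimate, which is the relationship relied upon in the mapping above.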
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have incorporated the teachings of Michielli and Kim with that of Seijen and include a temporal difference loss into the calculation of block outputs. A motivation of which would be the effective implementation of a learning signal that reflects not just immediate error but also long-term prediction accuracy. (Seijen, abstract, “In our empirical comparisons, our algorithm outperformed TD(λ) in all of its variations. It seems, by adhering more truly to the original goal of TD(λ)—matching an intuitively clear forward view even in the online case—that we have found a new algorithm that simply improves on classical TD(λ).”, the true online temporal difference algorithm shown in Seijen demonstrated superior performance scores compared to other TD-algorithms.)
Claim 22: Michielli in view of Kim, and Seijen teaches the limitations of claim 21. Seijen further teaches:
The method of claim 21, wherein for each time step t, the temporal difference target yt satisfies:
[Equation image: media_image8.png]
wherein T is the total number of time steps in the sequence, ytrue is the target output, and is the candidate network output at time step t+1. (Seijen, page 3, col. 2, section 2.4, “The conventional forward view relates TD(λ) (with accumulating traces) to the λ-return algorithm. This algorithm performs at each time step a standard update (1) with the λ-return as update target. The λ-return G^λ_t is an estimate of the expected return based on a combination of rewards and other estimates:”
[Equation image: media_image9.png]
The formula in Seijen matches the claim’s formula because both compute a λ-weighted target by summing exponentially decayed future predictions and a terminal ground-truth output. The candidate outputs y_{t+i} in the claim correspond to the predicted returns G^{(n)}_{t,θ} in Seijen, while the true target y_true aligns with G^{(T−t)}_{t,θ}. Since the mathematical structure and weighting scheme are the same, Seijen’s λ-return teaches the claimed temporal difference target formulation.)
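For illustration only, a λ-weighted target of the general kind described above (exponentially decayed later predictions plus a terminal ground-truth term) may be sketched as follows. This sketch does not reproduce the claimed formula or Seijen's equation, both of which appear only as images in this action; the particular weighting used here, (1 − λ)·λ^(i−1) on the candidate output i steps ahead and λ^(T−t−1) on the true target, is an assumption chosen so that the weights sum to one, and the name `lambda_targets` is hypothetical.

```python
from typing import List


def lambda_targets(candidates: List[float], y_true: float,
                   lam: float) -> List[float]:
    """Compute an assumed lambda-weighted target for each time step.

    candidates[t] is the candidate output at step t (0-indexed, T steps).
    Each later candidate i steps ahead gets weight (1 - lam) * lam**(i - 1);
    the remaining weight lam**(T - t - 1) goes to the true target.
    """
    T = len(candidates)
    targets = []
    for t in range(T):
        remaining = T - t - 1  # number of candidate outputs after step t
        weighted = sum((1 - lam) * lam ** (i - 1) * candidates[t + i]
                       for i in range(1, remaining + 1))
        targets.append(weighted + lam ** remaining * y_true)
    return targets


# With lam = 0 each target is the next candidate output, and the last
# step's target is the true output (hypothetical values).
print(lambda_targets([0.2, 0.4, 0.6], y_true=1.0, lam=0.0))
```

Under this assumed weighting, smaller values of λ place more weight on the immediately following candidate output, while values of λ closer to one place more weight on the terminal target.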
The rationale for the combination is similar to that of claim 21 above.
Claim 23: Michielli in view of Kim, and Seijen teaches the limitations of claim 22. Seijen further teaches:
The method of claim 22, wherein λ is greater than or equal to zero but less than one. (Seijen, page 9, figure 3, “Figure 3. RMS error of state values at the end of each episode, averaged over the first 10 episodes, as well as 100 independent runs, for different values of λ at the best value of α.”,
[Figure image: media_image10.png (Seijen, Figure 3)]
Figure 3 of Seijen shows a plot of the λ parameter and how its varying configurations between 0 and 1 affect the model’s RMS error during training.)
Claim 24: Michielli in view of Kim, and Seijen teaches the limitations of claim 22. Seijen further teaches:
The method of claim 22, wherein λ is less than .5. (Seijen, page 9, figure 3, “Figure 3. RMS error of state values at the end of each episode, averaged over the first 10 episodes, as well as 100 independent runs, for different values of λ at the best value of α.”, Figure 3 of Seijen shows model performance for λ values below 0.5. Since these values are explicitly tested using the same λ-return (Equation 7) the use of λ < 0.5 is taught.)
Claims 12 and 25 are rejected under 35 U.S.C. 103 as being unpatentable over Michielli in view of Kim as applied to claims 1-6, 10-11, 13-15, and 26-27 above and Bouganis et al. (Kyrkou, C., Theocharides, T., & Bouganis, C. S. (2013, July). An embedded hardware-efficient architecture for real-time cascade support vector machine classification. In 2013 International Conference on Embedded Computer Systems: Architectures, Modeling, and Simulation (SAMOS) (pp. 129-136). IEEE.), hereafter referred to as Bouganis.
Claim 12: Michielli in view of Kim teaches the limitations of claim 11. Bouganis, in the same field of cascaded neural network implementation, teaches the following limitations which the above prior art fails to teach:
The method of claim 11, wherein each time step input is an image of a scene taken at the corresponding time point or a video frame at the corresponding time point in a video. (Bouganis, page 133, col. 2, paragraph 2, “, the image region is processed in a window-by-window fashion. Once, a window has been processed a part of it is shifted out of the array, while new pixels are shifted in; thus a new window is formed at the leftmost region of the scanline buffer and is ready to be processed next.”, the system processes images sequentially, window by window, which is interpreted as analogous to images taken at corresponding time points in a video.)
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have incorporated the teachings of Michielli and Kim with that of Bouganis and include a temporal difference loss into the calculation of block outputs. A motivation of which would be the effective implementation of a learning signal that reflects not just immediate error but also long-term prediction accuracy. (Bouganis, page 135, section C, “Overall, both the adapted cascade and initial cascade implementations achieve a performance of 70 fps, resulting in a 5× speedup over the single parallel SVM classifier implementation which achieved 14 fps, despite processing six windows in parallel.”, the processing speed of the hardware architecture used in Bouganis has proved to show noticeable performance increase compared to other classifiers.)
Claim 25: Michielli in view of Kim teaches the limitations of claim 1. Bouganis, in the same field of cascaded neural network implementation, teaches the following limitations which the above prior art fails to teach:
The method of claim 1, wherein each block is deployed on respective dedicated hardware for the block. (Bouganis, page 133, col. 2, paragraph 2, “The data flow of the left-most registers changes depending on whether the data are fed to the parallel or the sequential processing module. In the case of the parallel module the window data are outputted in parallel. On the other case, the registers form a chain of shift registers so that data are outputted sequentially for the sequential processing module, from the leftmost top row register.”, the cascaded network in Bouganis has modules (parallel and sequential) that each handle specific SVM stages in the cascade, and the examiner interprets each module as a ‘block’ in the cascade. The parallel block processes all input elements at once, which requires fixed hardware for each respective component, while the sequential block processes data step-wise using shift registers, showing that it too has its own dedicated hardware for each computation.)
It would have been obvious to one of ordinary skill in the art before the effective filing date of the invention to have incorporated the teachings of Michielli and Kim with that of Bouganis. The rationale for the combination is similar to that of claim 12 above.
Conclusion
The prior art made of record and not relied upon is considered pertinent to applicant's disclosure.
Zilly, J. G., Srivastava, R. K., Koutník, J., & Schmidhuber, J. (2017, July). Recurrent highway networks. In International conference on machine learning (pp. 4189-4198). PMLR.
Huang, L., Xu, J., Sun, J., & Yang, Y. (2017, July). An improved residual LSTM architecture for acoustic modeling. In 2017 2nd International Conference on Computer and Communication Systems (ICCCS) (pp. 101-105). IEEE.
Pundak, G., & Sainath, T. N. (2017, August). Highway-LSTM and Recurrent Highway Networks for Speech Recognition. In Interspeech (pp. 1303-1307).
THIS ACTION IS MADE FINAL. Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a).
A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action. In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any nonprovisional extension fee (37 CFR 1.17(a)) pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action. In no event, however, will the statutory period for reply expire later than SIX MONTHS from the mailing date of this final action.
Any inquiry concerning this communication or earlier communications from the examiner should be directed to HYUNGJUN B YI whose telephone number is (703)756-4799. The examiner can normally be reached M-F 9-5.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Andrew Jung can be reached at (571) 270-3779. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.
/H.B.Y./Examiner, Art Unit 2146 /ANDREW J JUNG/Supervisory Patent Examiner, Art Unit 2146