Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
This Office Action is in response to the amendment filed on November 24, 2025.
Claims 1, 7, 9, 13, 16, 17, 19 and 20 have been amended.
Claim 8 has been cancelled.
The rejections from the prior correspondence that are not restated here are withdrawn.
Response to Arguments
Applicant’s arguments with respect to claim(s) 1 – 7 and 9 – 20 have been considered but are moot because the new ground of rejection does not rely on any reference applied in the prior rejection of record for any teaching or matter specifically challenged in the argument.
The Examiner has found the arguments with respect to the rejection under 35 U.S.C. § 101 persuasive, and that rejection has been withdrawn.
Claim Rejections - 35 USC § 101
The text of those sections of Title 35, U.S. Code not included in this action can be found in a prior Office action.
Claims 1 - 7 and 9 - 12 are rejected under 35 U.S.C. 101 because the claimed invention is directed to non-statutory subject matter. The claims do not fall within at least one of the four categories of patent eligible subject matter because they recite a computer program per se (often referred to as "software per se") claimed as a product without any structural recitations. It is recommended to positively recite the processor and the memory in the claim limitations.
Claim Rejections - 35 USC § 103
In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis (i.e., changing from AIA to pre-AIA) for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.
The factual inquiries for establishing a background for determining obviousness under 35 U.S.C. 103 are summarized as follows:
1. Determining the scope and contents of the prior art.
2. Ascertaining the differences between the prior art and the claims at issue.
3. Resolving the level of ordinary skill in the pertinent art.
4. Considering objective evidence present in the application indicating obviousness or nonobviousness.
This application currently names joint inventors. In considering patentability of the claims the examiner presumes that the subject matter of the various claims was commonly owned as of the effective filing date of the claimed invention(s) absent any evidence to the contrary. Applicant is advised of the obligation under 37 CFR 1.56 to point out the inventor and effective filing dates of each claim that was not commonly owned as of the effective filing date of the later invention in order for the examiner to consider the applicability of 35 U.S.C. 102(b)(2)(C) for any potential 35 U.S.C. 102(a)(2) prior art against the later invention.
Claims 1 – 7 and 10 – 19 are rejected under 35 U.S.C. 103 as being unpatentable over Fan (NPL: "Multi-Horizon Time Series Forecasting with Temporal Attention Learning") in view of Song (US 20200134804 A1), further in view of Huang (US 20200293653 A1).
Regarding claim 1, Fan teaches a sequence-to-sequence layer configured to detect patterns within a predetermined distance relative to a given time period; (See e.g. [P2529:S3.1:C2], We adopt this sequence-to-sequence [a sequence-to-sequence layer] learning pipeline to encode historical (and future) input variables and decode to future predictions, but with major modifications.) (See e.g. [P2533:S5.3:C1], we utilize only one period [a given time period] of historical data, while we add temporal attention mechanism to help find correct hidden pattern [detect patterns] for future steps.) (See e.g. [P2532:S4.2:C1], To evaluate the full quantile forecasts over all evaluation weeks, losses of all target quantiles (99) for all time periods (12 weeks) [a predetermined distance relative to a given time period] over all forecast horizons are calculated and then averaged)
[a temporal self-attention layer configured] to detect patterns of characteristics observed across an entire analyzed time-window (See e.g. [P2533:S5.3:C1], we utilize only one period of historical data, while we add temporal attention mechanism to help find correct hidden pattern [detect patterns] for future steps.)
the sequence-to-sequence layer [is implemented by one or more computers,] and includes: one or more recurrent neural network (RNN) encoders for generating encoder vectors based on static covariates and time-varying input data captured during respective past time-periods (See e.g. [P2529:S3.1:C2], We adopt this sequence-to-sequence [a sequence-to-sequence layer] learning pipeline to encode historical (and future) input variables and decode to future predictions, but with major modifications.) (See e.g. [P2527:S1:C2], Recently, Recurrent Neural Networks (RNNs), and the frequently used variant Long Short-Term Memory Networks (LSTMs), have been proposed for modeling complicated sequential data, such as natural language [27], audio waves [25], and video frames [7]. LSTMs have also been applied to forecasting tasks in recent studies [4, 9], and have shown the ability to capture complex temporal patterns in dynamic time series.) (See e.g. [P2529:S2:C1], For the above methods, separate models have to be trained for different quantiles for multi-quantile forecasting tasks, which is inefficient in practice. Wen et al. [29] proposed a technique called MQ-RNN to generate multiple quantile forecasts for multiple horizons. MQ-RNN uses an LSTM to encode the history of time [static covariates] (i.e. dates are unique values) series into one hidden vector)
one or more RNN decoders, each RNN decoder configured to predict a pattern within the predetermined distance of a corresponding future time period of a respective time period based on the encoder vectors, the static covariates, and time-varying known future input data; (See e.g. [P2529:S3.1:C2], We adopt this sequence-to-sequence learning pipeline to encode historical (and future) input variables and decode to future predictions, but with major modifications.) (See e.g. [P2527:S1:C2], Recently, Recurrent Neural Networks (RNNs), and the frequently used variant Long Short-Term Memory Networks (LSTMs), have been proposed for modeling complicated sequential data, such as natural language [27], audio waves [25], and video frames [7]. LSTMs have also been applied to forecasting tasks in recent studies [4, 9], and have shown the ability to capture complex temporal patterns [predict a pattern] in dynamic time series.) (See e.g. [P2529:S2:C1], For the above methods, separate models have to be trained for different quantiles for multi-quantile forecasting tasks, which is inefficient in practice. Wen et al. [29] proposed a technique called MQ-RNN to generate multiple quantile forecasts for multiple horizons. MQ-RNN uses an LSTM to encode the history of time [static covariates] (i.e. dates are unique values) series into one hidden vector) (See e.g. [P2533:S5.3:C1], For GEF2014 data, the unit of one period of time is a week [the predetermined distance] of hourly price data (24*7);) (See e.g. [P2533:S5:C1], MQ-RNN [29] uses LSTM encoder to summarize history of the sequence to one hidden feature, and uses MLPs to make forecasts for all future horizons from the hidden feature together with future input variables. [time-varying known future input data])
[wherein each GRN is configured to] adjust a weight of static covariates that influence temporal dynamics corresponding to the respective time period; (See e.g. [P2531:S3.3:C1], Our designed multimodal attention mechanism, as shown in the middle part of Fig 3, is guided by BiLSTM hidden states generated at each future time step t , and multiple attentions are applied to different parts of historical data [static covariates] and fused with attentional weights. [adjust a weight]) (See e.g. [P2529:S2:C1], MQ-RNN uses an LSTM to encode the history of time series into one hidden vector, and uses two MLPs to summarize this hidden vector, together with all future inputs, [temporal dynamics corresponding to the respective time period] into one global context feature and horizon-specific context features for all horizons.)
[a multi-head attention layer configured to receive outputs of the plurality of GRNs as input,] and generate a forecast for each horizon based on the static covariates, the time-varying input data captured during the respective past time-periods, and the time-varying known future input data. (See e.g. [P2529:S2:C1], For the above methods, separate models have to be trained for different quantiles for multi-quantile forecasting tasks, which is inefficient in practice. Wen et al. [29] proposed a technique called MQ-RNN to generate multiple quantile forecasts for multiple horizons. MQ-RNN uses an LSTM to encode the history of time [static covariates] (i.e. dates are unique values) series into one hidden vector) (See e.g. [P2533:S5:C1], MQ-RNN [29] uses LSTM encoder to summarize history of the sequence to one hidden feature [the time-varying input data captured during the respective past time-periods], and uses MLPs to make forecasts for all future horizons from the hidden feature together with future input variables. [time-varying known future input data])
Fan does not teach that the temporal self-attention layer is implemented by the one or more computers and includes a multi-head attention layer.
Song teaches the temporal self-attention layer is implemented by the one or more computers and includes: (See e.g. [0019], A computer-usable or computer readable medium may include any apparatus that stores, communicates, propagates, or transports the program for use by or in connection with the instruction execution system, apparatus, or device.) (See e.g. [0058], Temporal self-attention 420 explicitly encodes the temporal information in the video sequence, extending self-attention mechanism in the transformer model modelling the temporal information of the spatial feature maps generated by the fully convolutional encoder 410 at each level.)
a multi-head attention layer (See e.g. [0063], the temporal self-attention module 420 can implement a multi-head self-attention mechanism…Temporal self-attention module 420 then compares the similarity Dcos 555 between query q(tk) (545) and memory m(tk) (540) feature vectors and generates the attention weights by normalizing across time steps using a softmax function)
Accordingly, it would have been obvious to a person having ordinary skill in the art before the effective filing date of the claimed invention, having the teachings of Fan and Song before them, to include Song’s multi-head attention layer, which would allow Fan’s static covariates to maintain relationships of data. One would have been motivated to make such a combination in order to implement unsupervised, semi-supervised, and fully supervised learning of a neural network, where the method can also generate new data with the same statistics as the training set, as suggested by Song (US 20200134804 A1) at paragraph [0003].
Fan and Song do not teach a static enrichment layer including a plurality of gated residual networks (GRNs), wherein each GRN is associated with a respective time period and receives an output from a corresponding RNN encoder or decoder associated with the respective time period.
Huang teaches a static enrichment layer including a plurality of gated residual networks (GRNs), wherein each GRN is associated with a respective time period and receives an output from a corresponding RNN encoder or decoder associated with the respective time period, and (See e.g. [0025], The mechanisms of the illustrative embodiments comprise a recurrent neural network (RNN) having a plurality of long short term memory (LSTM) or gated recurrent unit (GRU) cells [plurality of gated residual networks (GRNs)] which are trained, through a machine learning training process, to predict sequences of system calls and the probability of each next system call given a previous system call. In one embodiment, a variant of LSTM or GRU cells incorporating timestamp [respective time period] of system calls can be used.) (See e.g. [0036], The RNN receives the vector representation of the system call as an input [receives an output from a corresponding RNN encoder or decoder associated with the respective time period] and processes the input vector via one or more layers of long short term memory (LSTM) memory cells to generate an output indicating each system call feature and a corresponding probability value, and each argument and a corresponding probability value.)
Accordingly, it would have been obvious to a person having ordinary skill in the art before the effective filing date of the claimed invention, having the teachings of Fan, Song and Huang before them, to include Huang’s gated recurrent unit (GRU) network, which would support Fan and Song’s multi-head relationships of data. One would have been motivated to make such a combination in order to perform data processing and anomaly detection, as suggested by Huang (US 20200293653 A1) at paragraph [0002].
Regarding claim 2, Fan, Song and Huang teach the system of claim 1. Fan further teaches a variable selection layer, the variable selection layer configured to generate variable selection weights for each input variable including one or more of the static covariates, the time-varying input data captured during respective past time-periods, or the time-varying known future input data. (See e.g. [P2529:S2:C1], The general idea of attention mechanism is to assign soft weights [variable selection weights] to a series of features [input variable] to measure each feature’s relevance to a given query feature, and combine them by weighted sum to obtain attended feature…their decoder is a feed-forward LSTM which ignores dynamic future information such as promotion and calendar events which could have strong influence in real-world time [one or more of the static covariates] (i.e. dates) series forecasts. In addition, at each future step the hidden state depends on previous attended historical pattern [the time-varying input data captured during respective past time-periods] representation, which could lead to error accumulation.)
Regarding claim 3, Fan, Song and Huang teach the system of claim 2. Fan further teaches wherein the variable selection layer includes a plurality of variable selectors, with each variable selector of the plurality of variable selectors being configured to generate variable selection weights for the input variables at a respective future or past time period. (See e.g. [P2529:S2:C1], The general idea of attention mechanism is to assign soft weights [variable selection weights] to a series of features [input variable] to measure each feature’s relevance to a given query feature, and combine them by weighted sum to obtain attended feature…their decoder is a feed-forward LSTM which ignores dynamic future information such as promotion and calendar events which could have strong influence in real-world time series forecasts. In addition, at each future step the hidden state depends on previous attended historical pattern [past time period] representation, which could lead to error accumulation.)
Regarding claim 4, Fan, Song and Huang teach the system of claim 3. Fan further teaches wherein the variable selection layer includes a variable selector configured to generate variable selection weights for a selection of the static covariates. (See e.g. [P2529:S2:C1], The general idea of attention mechanism is to assign soft weights [generate variable selection] to a series of features to measure each feature’s relevance to a given query feature, and combine them by weighted sum to obtain attended feature. Hori et al. [17] proposed to handle multimodal data by fusing features of different modalities such as texts, audios and videos together with softly assigned weights of each modality. Cinar et al. [6] proposed using an LSTM encoder-decoder with position-based attention model to capture patterns of pseudo-periods in sequence data. They applied the attention mechanism to explore similar local patterns in historical data [the static covariates] for future prediction.)
Regarding claim 5, Fan, Song and Huang teach the system of claim 4. Fan further teaches one or more static covariate encoders configured to encode context vectors based on the variable selection weights for the selection of the static covariates. (See e.g. [P2529:S2:C1], MQ-RNN uses an LSTM to encode the history of time [static covariate encoders] series into one hidden vector, and uses two MLPs to summarize this hidden vector, together with all future inputs, into one global context feature and horizon-specific context features for all horizons… The general idea of attention mechanism is to assign soft weights [the variable selection weights] to a series of features to measure each feature’s relevance to a given query feature, and combine them by weighted sum to obtain attended feature. Hori et al. [17] proposed to handle multimodal data by fusing features of different modalities such as texts, audios and videos together with softly assigned weights of each modality. Cinar et al. [6] proposed using an LSTM encoder-decoder with position-based attention model to capture patterns of pseudo-periods in sequence data.)
Regarding claim 6, Fan, Song and Huang teach the system of claim 5. Fan further teaches wherein the encoded context vectors are passed to each of the plurality of variable selectors in the variable selection layer. (See e.g. [P2530:Figure 2]: Structure of LSTM-based Encoder-Decoder. In the encoding stage, historical inputs [encoded context vectors] x are passed through embedding layers and combined with ground truth quantities y. The output at each step is a prediction of the next step to regularize encoder training.) (See e.g. [P2529:S2:C1], MQ-RNN uses an LSTM to encode the history of time series into one hidden vector [encoded context vectors], and uses two MLPs to summarize this hidden vector, together with all future inputs, into one global context feature and horizon-specific context features for all horizons)
Regarding claim 7, Fan, Song and Huang teach the system of claim 6. Fan further teaches wherein the encoded [context vectors are passed to the plurality of GRNs, and wherein each GRN] is configured to increase the weight of static covariates within the encoded context vectors that influence temporal dynamics corresponding to its respective time period. (See e.g. [P2531:S3.3:C1], Our designed multimodal attention mechanism, as shown in the middle part of Fig 3, is guided by BiLSTM hidden states generated at each future time step t, and multiple attentions are applied to different parts of historical data [static covariates] and fused with attentional weights. [adjust a weight]) (See e.g. [P2529:S2:C1], MQ-RNN uses an LSTM to encode the history of time series into one hidden vector, and uses two MLPs to summarize this hidden vector, together with all future inputs, [temporal dynamics corresponding to the respective time period] into one global context feature and horizon-specific context features for all horizons.)
Fan and Song do not teach that the context vectors are passed to the plurality of GRNs. Huang teaches that the context vectors are passed to the plurality of GRNs (See e.g. [0025], The mechanisms of the illustrative embodiments comprise a recurrent neural network (RNN) having a plurality of long short term memory (LSTM) or gated recurrent unit (GRU) cells [plurality of (GRNs)] which are trained, through a machine learning training process, to predict sequences of system calls and the probability of each next system call given a previous system call. In one embodiment, a variant of LSTM or GRU cells incorporating timestamp of system calls can be used.)
Accordingly, it would have been obvious to a person having ordinary skill in the art before the effective filing date of the claimed invention, having the teachings of Fan, Song and Huang before them, to include Huang’s gated recurrent unit (GRU) network, which would support Fan and Song’s multi-head relationships of data. One would have been motivated to make such a combination in order to perform data processing and anomaly detection, as suggested by Huang (US 20200293653 A1) at paragraph [0002].
Regarding claim 10, Fan, Song and Huang teach the system of claim 1. Fan further teaches wherein the forecast for each future time period is transformed into a quantile forecast. (See e.g. [P2534:S5.5:C1], In Figure 4, we show multi-quantile forecasts provided by Multimodal-Attention(h=3) on two evaluation weeks of distinctive patterns: the upper series have two modalities within 24 hours while the lower series have one. By observing the quantile predictions (yellow shaded areas) of 0.25 and 0.75, we show that our model is able to capture these distinct temporal patterns on future horizons [future time period] by attending to the history.)
Regarding claim 13, Fan teaches determining, [by one or more computers executing a] sequence-to-sequence layer that detects patterns within a predetermined distance relative to a given time period and [a temporal self-attention layer] that detects patterns of characteristics observed across an entire analyzed time-window, temporal characteristics for respective forecasting horizons of one or more time-steps, the determining including: (See e.g. [P2527:S1:C2], By using an LSTM encoder-decoder model to map the history of a time series to its future, multi-step forecasting can be naturally formulated as sequence-to-sequence learning.) (See e.g. [P2528: Figure 1], An example of sales forecasts on real-world online sales data with quantile predictions. The average daily sales of 50,000 products of each month are shown. Given historical daily sales from January to March 2018 as shown with magenta line, the task is to forecast from April to June. The black line shows the true sales; the dark blue line shows forecast of quantile 0.5; the blue shaded areas show quantile forecasts of 0.2 and 0.8. Best viewed in color. [forecasting horizons of one or more time-steps])
generating, using one or more recurrent neural network (RNN) encoders in the sequence-to-sequence layer, encoder vectors based on static covariates and time-varying input data captured during respective past time-periods, wherein static covariates are data that is constant across the time represented in the time-varying input data and in respective future time-periods, and (See e.g. [P2527:S1:C2], Recently, Recurrent Neural Networks (RNNs), and the frequently used variant Long Short-Term Memory Networks (LSTMs)… By using an LSTM encoder-decoder model to map the history of a time series to its future, multi-step forecasting can be naturally formulated as sequence-to-sequence learning.) (See e.g. [P2529:S2:C1], MQ-RNN uses an LSTM to encode the history of time series into one hidden vector, and uses two MLPs to summarize this hidden vector, together with all future inputs, into one global context feature and horizon-specific context features for all horizons. However, this global context feature is too general to capture short-term patterns… Then at each future time [in respective future time-periods] step we attend to several different periods of the history [based on static covariates] and generate attention vector individually)
predicting, using one or more RNN decoders in the sequence-to-sequence layer, a pattern within the predetermined distance of a corresponding future time period of a respective time period based on the encoder vectors, the static covariates, and time-varying known future input data; (See e.g. [P2529:S2:C1], MQ-RNN uses an LSTM to encode the history of time series into one hidden vector) (See e.g. [P2528:S1:C1], Given the daily sales of 50,000 products from January to March, the target is to predict sales of the future [time-varying known future input data], e.g., from April to June. Shaded blue areas demonstrate the quantile predictions between 0.2 and 0.8, which we observe that they are robust to sudden changes.) (See e.g. [P2529:S3:C2], We adopt this sequence-to-sequence learning pipeline to encode historical (and future) input variables and decode to future predictions) (See e.g. [P2530: Figure 2], Structure of LSTM-based Encoder-Decoder. In the encoding stage, historical inputs x [the static covariates] are passed through embedding layers and combined with ground truth quantities y. The output at each step is a prediction of the next step to regularize encoder training. In the decoding stage, future inputs are passed through the same embedding layers shared with encoder, and future quantities are predicted on top of temporal context features and evaluated with quantile losses.)
[adjusting, using the GRN], a weight of static covariates that influence temporal dynamics corresponding to the respective time period; (See e.g. [P2531:S3.3:C1], Our designed multimodal attention mechanism, as shown in the middle part of Fig 3, is guided by BiLSTM hidden states generated at each future time step t , and multiple attentions are applied to different parts of historical data [static covariates] and fused with attentional weights. [adjust a weight]) (See e.g. [P2529:S2:C1], MQ-RNN uses an LSTM to encode the history of time series into one hidden vector, and uses two MLPs to summarize this hidden vector, together with all future inputs, [temporal dynamics corresponding to the respective time period] into one global context feature and horizon-specific context features for all horizons.)
generating, [using the multi-head attention layer of the temporal self-attention layer,] a forecast for each of the respective forecasting horizons based on the static covariates, the time-varying input data captured during the respective past time-periods, and the time-varying known future input data. (See e.g. [P2529:S2:C1], For the above methods, separate models have to be trained for different quantiles for multi-quantile forecasting tasks, which is inefficient in practice. Wen et al. [29] proposed a technique called MQ-RNN to generate multiple quantile forecasts for multiple horizons. MQ-RNN uses an LSTM to encode the history of time [static covariates] (i.e. dates are unique values) series into one hidden vector) (See e.g. [P2533:S5:C1], MQ-RNN [29] uses LSTM encoder to summarize history of the sequence to one hidden feature [the time-varying input data captured during the respective past time-periods], and uses MLPs to make forecasts for all future horizons from the hidden feature together with future input variables. [time-varying known future input data])
Fan does not teach receiving, by the one or more computers implementing the temporal self-attention layer, outputs of the plurality of GRNs as input to a multi-head attention layer.
Song teaches receiving, by the one or more computers implementing the temporal self-attention layer, [outputs of the plurality of GRNs as] input to a multi-head attention layer; (See e.g. [0019], A computer-usable or computer readable medium may include any apparatus that stores, communicates, propagates, or transports the program for use by or in connection with the instruction execution system, apparatus, or device.) (See e.g. [0058], Temporal self-attention 420 explicitly encodes the temporal information in the video sequence, extending self-attention mechanism in the transformer model modelling the temporal information of the spatial feature maps generated by the fully convolutional encoder 410 at each level.) (See e.g. [0063], the temporal self-attention module 420 can implement a multi-head self-attention mechanism…Temporal self-attention module 420 then compares the similarity Dcos 555 between query q(tk) (545) and memory m(tk) (540) feature vectors and generates the attention weights by normalizing across time steps using a softmax function)
Accordingly, it would have been obvious to a person having ordinary skill in the art before the effective filing date of the claimed invention, having the teachings of Fan and Song before them, to include Song’s multi-head attention layer, which would allow Fan’s static covariates to maintain relationships of data. One would have been motivated to make such a combination in order to implement unsupervised, semi-supervised, and fully supervised learning of a neural network, where the method can also generate new data with the same statistics as the training set, as suggested by Song (US 20200134804 A1) at paragraph [0003].
Fan and Song do not teach receiving, using a gated residual network (GRN) of a static enrichment layer including a plurality of GRNs in which each GRN is associated with a respective time period, an output from a corresponding RNN encoder or decoder associated with the respective time period;
Huang teaches receiving, using a gated residual network (GRN) of a static enrichment layer including a plurality of GRNs in which each GRN is associated with a respective time period, an output from a corresponding RNN encoder or decoder associated with the respective time period; (See e.g. [0025], The mechanisms of the illustrative embodiments comprise a recurrent neural network (RNN) having a plurality of long short term memory (LSTM) or gated recurrent unit (GRU) cells [plurality of gated residual networks (GRNs)] which are trained, through a machine learning training process, to predict sequences of system calls and the probability of each next system call given a previous system call. In one embodiment, a variant of LSTM or GRU cells incorporating timestamp [respective time period] of system calls can be used.) (See e.g. [0036], The RNN receives the vector representation of the system call as an input [receives an output from a corresponding RNN encoder or decoder associated with the respective time period] and processes the input vector via one or more layers of long short term memory (LSTM) memory cells to generate an output indicating each system call feature and a corresponding probability value, and each argument and a corresponding probability value.)
Accordingly, it would have been obvious to a person having ordinary skill in the art before the effective filing date of the claimed invention, having the teachings of Fan, Song and Huang before them, to include Huang’s gated recurrent unit (GRU) network, which would support Fan and Song’s multi-head relationships of data. One would have been motivated to make such a combination in order to perform data processing and anomaly detection, as suggested by Huang (US 20200293653 A1) at paragraph [0002].
Regarding claim 14, Fan, Song and Huang teach the method of claim 13. Fan further teaches generating, by a variable selection layer, variable selection weights for each input variable including one or more of the static covariates, the time-varying input data captured during respective past time-periods, or the time-varying known future input data. (See e.g. [P2529:S2:C1], The general idea of attention mechanism is to assign soft weights [variable selection weights] to a series of features [input variable] to measure each feature’s relevance to a given query feature, and combine them by weighted sum to obtain attended feature…their decoder is a feed-forward LSTM which ignores dynamic future information such as promotion and calendar events which could have strong influence in real-world time [one or more of the static covariates] (i.e. dates) series forecasts. In addition, at each future step the hidden state depends on previous attended historical pattern [the time-varying input data captured during respective past time-periods] representation, which could lead to error accumulation.)
Regarding claim 15, Fan, Song and Huang teach the method of claim 14. Fan teaches wherein the variable selection layer includes a plurality of variable selectors, with each variable selector of the plurality of variable selectors being configured to generate variable selection weights for the input variables at a respective future or past time period. (See e.g. [P2529:S2:C1], The general idea of attention mechanism is to assign soft weights [variable selection weights] to a series of features [input variable] to measure each feature’s relevance to a given query feature, and combine them by weighted sum to obtain attended feature…their decoder is a feed-forward LSTM which ignores dynamic future information such as promotion and calendar events which could have strong influence in real-world time series forecasts. In addition, at each future step the hidden state depends on previous attended historical pattern [past time period] representation, which could lead to error accumulation.)
Regarding claim 16, Fan, Song and Huang teach the method of claim 14. Fan teaches wherein the variable selection layer includes a variable selector configured to generate variable selection weights for a selection of the static covariates. (See e.g. [P2529:S2:C1], The general idea of attention mechanism is to assign soft weights [generate variable selection] to a series of features to measure each feature’s relevance to a given query feature, and combine them by weighted sum to obtain attended feature. Hori et al. [17] proposed to handle multimodal data by fusing features of different modalities such as texts, audios and videos together with softly assigned weights of each modality. Cinar et al. [6] proposed using an LSTM encoder-decoder with position-based attention model to capture patterns of pseudo-periods in sequence data. They applied the attention mechanism to explore similar local patterns in historical data [the static covariates] for future prediction.)
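The soft-weighting operation cited above for the variable selection limitations can be illustrated with a short numpy sketch: softmax-normalized scores serve as variable selection weights, and the weighted sum combines the input variables. The feature dimensions and scores below are hypothetical, not drawn from Fan:

```python
import numpy as np

def variable_selection(features, scores):
    """Assign soft weights to a series of input variables via softmax,
    then combine them by weighted sum (attention-style selection)."""
    w = np.exp(scores - scores.max())
    w = w / w.sum()                                 # variable selection weights
    selected = (w[:, None] * features).sum(axis=0)  # weighted sum of variables
    return w, selected

# 3 input variables (e.g. a static covariate and two time-varying inputs),
# each embedded in 4 dimensions; scores measure relevance to the query.
rng = np.random.default_rng(1)
feats = rng.normal(size=(3, 4))
weights, selected = variable_selection(feats, np.array([2.0, 0.5, -1.0]))
```

A per-time-step variable selector, as in claim 15, would simply apply this with scores computed separately for each future or past period.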
Regarding claim 17, Fan, Song and Huang teach the method of claim 16. Fan teaches encoding, by one or more static covariate encoders, context vectors based on the variable selection weights for the selection of the static covariates. (See e.g. [P2529:S2:C1], MQ-RNN uses an LSTM to encode the history of time [static covariate encoders] series into one hidden vector, and uses two MLPs to summarize this hidden vector, together with all future inputs, into one global context feature and horizon-specific context features for all horizons… The general idea of attention mechanism is to assign soft weights [the variable selection weights] to a series of features to measure each feature’s relevance to a given query feature, and combine them by weighted sum to obtain attended feature. Hori et al. [17] proposed to handle multimodal data by fusing features of different modalities such as texts, audios and videos together with softly assigned weights of each modality. Cinar et al. [6] proposed using an LSTM encoder-decoder with position-based attention model to capture patterns of pseudo-periods in sequence data.)
Regarding claim 18, Fan, Song and Huang teach the method of claim 17. Fan teaches passing the encoded context vectors to each of the plurality of variable selectors in the variable selection layer. (See e.g. [P2530:Figure 2]: Structure of LSTM-based Encoder-Decoder. In the encoding stage, historical inputs [encoded context vectors] x are passed through embedding layers and combined with ground truth quantities y. The output at each step is a prediction of the next step to regularize encoder training.) (See e.g. [P2529:S2:C1], MQ-RNN uses an LSTM to encode the history of time series into one hidden vector [encoded context vectors], and uses two MLPs to summarize this hidden vector, together with all future inputs, into one global context feature and horizon-specific context features for all horizons)
Regarding claim 19, Fan, Song and Huang teach the method of claim 18. Fan teaches passing the encoded [context vectors to the plurality of (GRNs)], and increasing, [by each GRN], the weight of static covariates within the encoded context vectors that influence temporal dynamics corresponding to its respective time period. (See e.g. [P2531:S3.3:C1], Our designed multimodal attention mechanism, as shown in the middle part of Fig 3, is guided by BiLSTM hidden states generated at each future time step t , and multiple attentions are applied to different parts of historical data [static covariates] and fused with attentional weights. [adjust a weight]) (See e.g. [P2529:S2:C1], MQ-RNN uses an LSTM to encode the history of time series into one hidden vector, and uses two MLPs to summarize this hidden vector, together with all future inputs, [temporal dynamics corresponding to the respective time period] into one global context feature and horizon-specific context features for all horizons.)
Fan and Song do not teach that the context vectors are passed to the plurality of GRNs. Huang teaches context vectors passed to the plurality of GRNs. (See e.g. [0025], The mechanisms of the illustrative embodiments comprise a recurrent neural network (RNN) having a plurality of long short term memory (LSTM) or gated recurrent unit (GRU) cells [plurality of (GRNs)] which are trained, through a machine learning training process, to predict sequences of system calls and the probability of each next system call given a previous system call. In one embodiment, a variant of LSTM or GRU cells incorporating timestamp of system calls can be used.)
Accordingly, it would have been obvious to a person having ordinary skill in the art before the effective filing date of the claimed invention, having the teachings of Fan, Song and Huang before them, to include Huang’s gated recurrent unit (GRU) network, which would complement Fan and Song’s multi-headed relationship of data. One would have been motivated to make such a combination in order to perform data processing and anomaly detection, as suggested by Huang (US 20200293653 A1) at [0002].
Claim 9 is rejected under 35 U.S.C. 103 as being unpatentable over Fan (NPL Multi-Horizon Time Series Forecasting with Temporal Attention Learning) in view of Song (US 20200134804 A1) further in view of Huang (US 20200293653 A1) further in view of Sankar (US 20210326389 A1).
Regarding claim 9, Fan, Song and Huang teach the system of claim 1.
Fan and Song do not teach wherein the output of each GRN.
Huang teaches wherein the output of each GRN (See e.g. [0025], The mechanisms of the illustrative embodiments comprise a recurrent neural network (RNN) having a plurality of long short term memory (LSTM) or gated recurrent unit (GRU) cells [plurality of (GRNs)] which are trained) (See e.g. [0040], This is done for each output from the output layer [output of each GRN], which reside on top of each LSTM cell in the last layer of the RNN with each LSTM cell)
Accordingly, it would have been obvious to a person having ordinary skill in the art before the effective filing date of the claimed invention, having the teachings of Fan, Song and Huang before them, to include Huang’s gated recurrent unit (GRU) network, which would complement Fan and Song’s multi-headed relationship of data. One would have been motivated to make such a combination in order to perform data processing and anomaly detection, as suggested by Huang (US 20200293653 A1) at [0002].
Fan, Song and Huang do not teach input into a respective mask of the multi-head attention layer, wherein each mask corresponds to a respective time period for causal prediction.
Sankar teaches input into a respective mask of the multi-head attention layer, wherein each mask corresponds to a respective time period for causal prediction. (See e.g. [0126], Temporal multi-head self-attention) (See e.g. [0120], In equation (2), above, √{square root over (F′)} can be a scaling factor (e.g., at step 510) [input into a respective mask] which can scale the linear projected matrices by the dimensionality F′. Further, β.sub.v ∈ custom-character.sup.T×T can be an attention weight matrix obtained by the multiplicative attention function and M ∈ custom-character.sup.T×T can be a mask matrix (e.g., at step 512) [mask of the multi-head attention layer] with each entry M.sub.ij ∈ {−∞, 0}. When M.sub.ij=−∞, the softmax function (e.g., at step 514) can result in a zero attention weight, i.e., β.sub.v.sup.ij=0, which can switch off the attention from time-step i to j. To encode the temporal order) (See e.g. [0102], The output of the temporal self-attention process can include final node representations which may be utilized in training a machine learning model and performing graph context prediction. [causal prediction])
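The masking mechanism cited from Sankar can be illustrated with a short numpy sketch: a mask matrix M with entries in {-inf, 0} is added to the scaled similarity scores before the softmax, so that an entry of -inf yields a zero attention weight and switches off attention from time step i to future step j (causal masking). The score values below are hypothetical:

```python
import numpy as np

def causal_attention_weights(scores):
    """Apply a mask matrix M (entries in {-inf, 0}) before the softmax,
    zeroing attention from each time step to all future time steps."""
    T = scores.shape[0]
    M = np.where(np.triu(np.ones((T, T)), k=1) == 1, -np.inf, 0.0)
    z = scores + M                           # masked similarity scores
    z = z - z.max(axis=1, keepdims=True)     # stabilize the softmax
    e = np.exp(z)                            # exp(-inf) -> 0 attention weight
    return e / e.sum(axis=1, keepdims=True)

beta = causal_attention_weights(np.random.default_rng(2).normal(size=(4, 4)))
```

Each row of `beta` sums to one, and all entries above the diagonal are exactly zero, which is the "switching off" behavior described in Sankar's equation (2).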
Accordingly, it would have been obvious to a person having ordinary skill in the art before the effective filing date of the claimed invention, having the teachings of Fan, Song, Huang and Sankar before them, to include Sankar’s respective masking, which would allow Fan, Song and Huang’s model to selectively limit or ignore input data. One would have been motivated to make such a combination in order to improve scalability to large graphs, as more recent work on graph embeddings has established the effectiveness of random walk based methods, as suggested by Sankar (US 20210326389 A1) at [0078].
Claim 20 is rejected under 35 U.S.C. 103 as being unpatentable over Fan (NPL Multi-Horizon Time Series Forecasting with Temporal Attention Learning) in view of Song (US 20200134804 A1) further in view of Huang (US 20200293653 A1) further in view of Adib (US 20190108440 A1).
Regarding claim 20, Fan teaches determining, [by one or more first computers of the one or more computers executing] a sequence-to-sequence layer that detects patterns within a predetermined distance relative to a given time period and [a temporal self-attention layer having a multi-head attention layer] that detects patterns of characteristics observed across an entire analyzed time-window, temporal characteristics for respective future time-periods, the determining including: (See e.g. [P2529:S3.1:C2], We adopt this sequence-to-sequence [a sequence-to-sequence layer] learning pipeline to encode historical (and future) input variables and decode to future predictions, but with major modifications.) (See e.g. [P2533:S5.3:C1], we utilize only one period [a given time period] of historical data, while we add temporal attention mechanism to help find correct hidden pattern [detect patterns] for future steps.) (See e.g. [P2532:S4.2:C1], To evaluate the full quantile forecasts over all evaluation weeks, losses of all target quantiles (99) for all time periods (12 weeks) [a predetermined distance relative to a given time period] over all forecast horizons are calculated and then averaged)
generating, using one or more recurrent neural network (RNN) encoders in the sequence-to- sequence layer, encoder vectors based on static covariates and time-varying input data captured during respective past time-periods, wherein static covariates are data that is constant across the time represented in the time-varying input data and in the respective future time-periods, and (See e.g. [P2527:S1:C2], Recently, Recurrent Neural Networks (RNNs), and the frequently used variant Long Short-Term Memory Networks (LSTMs)… By using an LSTM encoder-decoder model to map the history of a time series to its future, multi-step forecasting can be naturally formulated as sequence-to-sequence learning.) (See e.g. [P2529:S2:C1], MQ-RNN uses an LSTM to encode the history of time series into one hidden vector, and uses two MLPs to summarize this hidden vector, together with all future inputs, into one global context feature and horizon-specific context features for all horizons. However, this global context feature is too general to capture short-term patterns… Then at each future time [in respective future time-periods] step we attend to several different periods of the history [based on static covariates] and generate attention vector individually)
predicting, using one or more RNN decoders in the sequence-to-sequence layer, pattern within the predetermined distance of a corresponding future time period of a respective time period based on the encoder vectors, the static covariates, and time- varying known future input data; and (See e.g. [P2529:S2:C1], MQ-RNN uses an LSTM to encode the history of time series into one hidden vector) (See e.g. [P2528:S1:C1], Given the daily sales of 50,000 products from January to March, the target is to predict sales of the future [time- varying known future input data], e.g., from April to June. Shaded blue areas demonstrate the quantile predictions between 0.2 and 0.8, which we observe that they are robust to sudden changes.) (See e.g. [P2529:S3:C2], We adopt this sequence-to-sequence learning pipeline to encode historical (and future) input variables and decode to future predictions) (See e.g. [P2530: Figure 2], Structure of LSTM-based Encoder-Decoder. In the encoding stage, historical inputs x [the static covariates] are passed through embedding layers and combined with ground truth quantities y. The output at each step is a prediction of the next step to regularize encoder training. In the decoding stage, future inputs are passed through the same embedding layers shared with encoder, and future quantities are predicted on top of temporal context features and evaluated with quantile losses.)
receiving, [by one or more second computers of the one or more computers implementing the temporal self-attention layer, outputs of the plurality of GRNs as input to the multi-head attention layer], a forecast for respective forecasting horizons based on the static covariates, the time-varying input data captured during the respective past time-periods, and the time-varying known future input data. (See e.g. [P2533:S5:C1], MQ-RNN [29] uses LSTM encoder to summarize history of the sequence to one hidden feature [the time- varying input data captured during the respective past time-periods], and uses MLPs to make forecasts for all future horizons from the hidden feature together with future input variables. [time-varying known future input data])
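The sequence-to-sequence encode/decode pipeline cited from Fan for claim 20 can be illustrated with a minimal numpy sketch: an RNN encoder consumes past time-varying inputs concatenated with static covariates, and a decoder then produces a prediction for each future horizon from known future inputs. The tanh cell (standing in for the LSTM cells of the citation), dimensions, and weights below are assumptions for illustration:

```python
import numpy as np

def rnn_step(h, x, Wh, Wx):
    """One tanh RNN cell step (a simplified stand-in for an LSTM cell)."""
    return np.tanh(Wh @ h + Wx @ x)

def seq2seq_forecast(history, future_inputs, static, Wh, Wx, Wo):
    """Encode past inputs (with static covariates appended) into a hidden
    vector, then decode future steps from known future inputs."""
    h = np.zeros(Wh.shape[0])
    for x in history:                        # encoder over past time periods
        h = rnn_step(h, np.concatenate([x, static]), Wh, Wx)
    preds = []
    for x in future_inputs:                  # decoder over future time periods
        h = rnn_step(h, np.concatenate([x, static]), Wh, Wx)
        preds.append(Wo @ h)                 # forecast for this horizon
    return np.array(preds)

rng = np.random.default_rng(3)
d, k, s = 5, 2, 1                            # hidden size, input vars, statics
Wh, Wx, Wo = (rng.normal(size=(d, d)), rng.normal(size=(d, k + s)),
              rng.normal(size=(1, d)))
yhat = seq2seq_forecast(rng.normal(size=(6, k)), rng.normal(size=(3, k)),
                        np.array([0.7]), Wh, Wx, Wo)
```

The static covariate is held constant across both the past and future steps, matching the claim's definition of static covariates as constant across the represented time.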
Fan does not teach a temporal self-attention layer having a multi-head attention layer.
Song teaches a temporal self-attention layer having a multi-head attention layer. (See e.g. [0058], Temporal self-attention 420 explicitly encodes the temporal information in the video sequence, extending self-attention mechanism in the transformer model modelling the temporal information of the spatial feature maps generated by the fully convolutional encoder 410 at each level.) (See e.g. [0063], the temporal self-attention module 420 can implement a multi-head self-attention mechanism…Temporal self-attention module 420 then compares the similarity Dcos 555 between query q(tk) (545) and memory m(tk) (540) feature vectors and generates the attention weights by normalizing across time steps using a softmax function)
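The multi-head temporal self-attention mechanism cited from Song can be illustrated with a short numpy sketch: the feature dimension is split across heads, each head computes softmax-normalized similarity weights across time steps, and the heads' outputs are concatenated. The identity query/key/value projections are a simplifying assumption, not Song's implementation:

```python
import numpy as np

def multi_head_temporal_attention(X, heads=2):
    """Each head attends over time steps with softmax-normalized scaled
    dot-product weights; head outputs are concatenated."""
    T, F = X.shape
    hd = F // heads
    outs = []
    for i in range(heads):
        Q = K = V = X[:, i * hd:(i + 1) * hd]    # identity projections (sketch)
        s = Q @ K.T / np.sqrt(hd)                # similarity across time steps
        s = s - s.max(axis=1, keepdims=True)
        w = np.exp(s)
        w = w / w.sum(axis=1, keepdims=True)     # softmax over time steps
        outs.append(w @ V)
    return np.concatenate(outs, axis=1)

Y = multi_head_temporal_attention(np.random.default_rng(4).normal(size=(5, 6)),
                                  heads=2)
```

In a trained model each head would carry learned projection matrices, letting different heads attend to different temporal patterns.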
Accordingly, it would have been obvious to a person having ordinary skill in the art before the effective filing date of the claimed invention, having the teachings of Fan and Song before them, to include Song’s multi-head attention layer, which would allow Fan’s static covariates to maintain relationships of data. One would have been motivated to make such a combination in order to implement unsupervised, semi-supervised and fully supervised learning of a neural network, where the method can also generate new data with the same statistics as a training set, as suggested by Song (US 20200134804 A1) at [0003].
Fan and Song do not teach receiving, using a gated residual network (GRN) of a static enrichment layer including a plurality of GRNs in which each GRN is associated with a respective time period, an output from a corresponding RNN encoder or decoder associated with the respective time period;
Huang teaches receiving, using a gated residual network (GRN) of a static enrichment layer including a plurality of GRNs in which each GRN is associated with a respective time period, an output from a corresponding RNN encoder or decoder associated with the respective time period; (See e.g. [0025], The mechanisms of the illustrative embodiments comprise a recurrent neural network (RNN) having a plurality of long short term memory (LSTM) or gated recurrent unit (GRU) cells [plurality of gated residual networks (GRNs)] which are trained, through a machine learning training process, to predict sequences of system calls and the probability of each next system call given a previous system call. In one embodiment, a variant of LSTM or GRU cells incorporating timestamp [respective time period] of system calls can be used.) (See e.g. [0036], The RNN receives the vector representation of the system call as an input [receives an output from a corresponding RNN encoder or decoder associated with the respective time period] and processes the input vector via one or more layers of long short term memory (LSTM) memory cells to generate an output indicating each system call feature and a corresponding probability value, and each argument and a corresponding probability value.)
Accordingly, it would have been obvious to a person having ordinary skill in the art before the effective filing date of the claimed invention, having the teachings of Fan, Song and Huang before them, to include Huang’s gated recurrent unit (GRU) network, which would complement Fan and Song’s multi-headed relationship of data. One would have been motivated to make such a combination in order to perform data processing and anomaly detection, as suggested by Huang (US 20200293653 A1) at [0002].
Fan, Song and Huang do not teach one or more first computers and one or more second computers.
Adib teaches one or more first computers and one or more second computers. (See e.g. [0015], a first computer and a second computer)
Accordingly, it would have been obvious to a person having ordinary skill in the art before the effective filing date of the claimed invention, having the teachings of Fan, Song, Huang and Adib before them, to include Adib’s multiple computers, which would allow Fan, Song and Huang’s model to be trained more quickly and efficiently. One would have been motivated to make such a combination because the disclosed techniques enable systematic cross-leveraging of information inferred from one set of tags to infer more details of the other tags; these details are generally unavailable in conventional techniques, and more details can correspond to higher accuracy, as suggested by Adib (US 20190108440 A1) at [0005].
Conclusion
Any inquiry concerning this communication or earlier communications from the examiner should be directed to KYLE ALLMAN THOMPSON whose telephone number is (571)272-3671. The examiner can normally be reached Monday - Thursday, 6 a.m. - 3 p.m. ET.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Kamran Afshar can be reached at (571) 272-7796. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.
/K.A.T./Examiner, Art Unit 2125
/KAMRAN AFSHAR/Supervisory Patent Examiner, Art Unit 2125