Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Continued Examination Under 37 CFR 1.114
A request for continued examination under 37 CFR 1.114, including the fee set forth in 37 CFR 1.17(e), was filed in this application after final rejection. Since this application is eligible for continued examination under 37 CFR 1.114, and the fee set forth in 37 CFR 1.17(e) has been timely paid, the finality of the previous Office action has been withdrawn pursuant to 37 CFR 1.114. Applicant's submission filed on 05/12/2025 has been entered.
Status of Claims
The present application is being examined based on the claims filed on 05/12/2025.
Claims 1-20 are pending.
Claims 1, 8 and 15 are amended.
Response to Arguments
In reference to rejections under 35 U.S.C. § 103:
Applicant’s arguments filed 05/12/2025, with respect to the rejection(s) of claim(s) under 35 U.S.C. § 103, have been fully considered and are persuasive. Therefore, the rejection has been withdrawn. However, upon further consideration, a new ground(s) of rejection is made in view of Li et al. (“DIFFUSION CONVOLUTIONAL RECURRENT NEURAL NETWORK: DATA-DRIVEN TRAFFIC FORECASTING”).
Claim Rejections - 35 USC § 103
In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis (i.e., changing from AIA to pre-AIA) for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.
Claim(s) 1, 5, 6, 8, 12, 13, 15, 19 and 20 are rejected under 35 U.S.C. 103 as being unpatentable over Ciarlini et al. (US 10,339,235 B1) (hereafter referred to as “Ciarlini”) in view of Flunkert et al. (US 10,936,947 B1) (hereafter referred to as “Flunkert”) and further in view of Li et al. (“DIFFUSION CONVOLUTIONAL RECURRENT NEURAL NETWORK: DATA-DRIVEN TRAFFIC FORECASTING”).
As per claim 1, Ciarlini discloses:
A method implemented on at least one machine including at least one processor, memory, (Ciarlini, Col. 8, Lines 4-6: “each compute node 410 comprises a processor coupled to a memory. The processor may comprise a microprocessor”) and communication platform capable of connecting to a network for machine learning for time series, the method comprising: (Ciarlini, Col. 9, Lines 26-31: “numerous other arrangements of computers, servers storage devices or other components are possible in the exemplary distributed computing environment 400. Such components can communicate with other elements of the exemplary metadata storage environment 100 over any type of network or other communication media.”)
performing hierarchical learning, which comprises deep learning global model parameters of a base model for forecasting time series measurements of a plurality of time series, (Ciarlini, Col. 4, Lines 47-57: “Consider a measurement matrix X where each column is a measured sensor stream or a lag of the measured sensor stream:
[media_image1.png: greyscale image of the measurement matrix X]
. In order to obtain a model for a target series Y, solve the following system of equations: Y = Xb + e, where b is the vector of the coefficients of the model, and e corresponds to the residuals.”) [Examiner’s note: The citation from Ciarlini above describes a linear regression-based model where a target time series (Y) is predicted based on a matrix X of measured sensor streams and their time-lagged versions. Here, X represents a base model structure that includes lagged versions of different sensor time series. The equation Y = Xb + e suggests a linear regression framework, where b represents coefficients and e represents residual errors.]
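As an illustrative aside, the lagged regression structure described in the note above can be sketched as follows (a minimal NumPy sketch; the stream values, lag count, and coefficients are illustrative assumptions, not drawn from Ciarlini):

```python
import numpy as np

# Two illustrative sensor streams (synthetic data, not from Ciarlini).
rng = np.random.default_rng(0)
s1 = rng.normal(size=100)
s2 = rng.normal(size=100)

# Build measurement matrix X: each column is a measured sensor stream
# or a time-lagged version of a stream, as in the cited passage.
lag = 1
X = np.column_stack([s1[lag:], s2[lag:], s1[:-lag], s2[:-lag]])

# Target series Y, constructed from the streams for demonstration.
Y = 0.5 * s1[lag:] - 0.2 * s2[:-lag]

# Solve Y = Xb + e for the coefficient vector b by least squares;
# e is the residual vector.
b, *_ = np.linalg.lstsq(X, Y, rcond=None)
e = Y - X @ b
```

Because Y here lies exactly in the column space of X, the residuals e vanish; with noisy measurements, e would carry the unexplained variation.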
Ciarlini fails to disclose:
querying, across the plurality of time series using the global model parameters of the base model, a forward historical temporal pattern graph and a backward historical temporal pattern graph relevant to a target time series in the plurality of time series,
wherein the target time series has insufficient data to train a target model for forecasting the target time series; and
obtaining target model parameters of the target model by customizing the base model based on the forward historical temporal pattern graph and the backward historical temporal pattern graph.
However, Li explicitly discloses:
querying, across the plurality of time series using the global model parameters of the base model, a forward historical temporal pattern graph and a backward historical temporal pattern graph relevant to a target time series in the plurality of time series, (Li, Pg. 2, Section 2.1: “The goal of traffic forecasting is to predict the future traffic speed given previously observed traffic flow from N correlated sensors on the road network.”, Pg. 2, Section 2.2: “We also include the reversed direction diffusion process, such that the bidirectional diffusion offers the model more flexibility to capture the influence from both the upstream and the downstream traffic”, Pg. 3, Eq. (2) explicitly uses two transition matrices (forward and reverse): it includes both (D_O^{-1}W)^k and (D_I^{-1}W^T)^k, and Li states that these are the transition matrices of the diffusion process and the reverse one, respectively, Pg. 4, Figure 2: “System architecture for the Diffusion Convolutional Recurrent Neural Network designed for spatiotemporal traffic forecasting. The historical time series are fed into an encoder whose final states are used to initialize the decoder. The decoder makes predictions based on either previous ground truth or the model output.”
[media_image2.png: greyscale image of Figure 2, DCRNN system architecture]
Pg. 4, ¶[2]: “The model architecture of DCRNN is shown in Figure 2. The entire network is trained by maximizing the likelihood of generating the target future time series using backpropagation through time. DCRNN is able to capture spatiotemporal dependencies among time series and can be applied to various spatiotemporal forecasting problems.”) [Examiner’s note: Li discloses training the diffusion convolutional recurrent neural network end-to-end using historical observations, whereby the model parameters are adjusted based on the extracted temporal patterns to generate forecasts for a target time series]
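For clarity, the forward and reverse transition matrices cited from Li above can be sketched numerically as follows (a minimal sketch; the adjacency matrix, signal, filter order, and filter parameters are illustrative assumptions, not Li's data or trained weights):

```python
import numpy as np

# Illustrative weighted adjacency matrix W for 3 sensors (not Li's data).
W = np.array([[0., 1., 0.],
              [0., 0., 2.],
              [1., 0., 0.]])
x = np.array([1.0, 2.0, 3.0])  # one signal value per sensor

# Forward and reverse transition matrices, D_O^{-1} W and D_I^{-1} W^T,
# normalized by out-degree and in-degree respectively.
D_out_inv = np.diag(1.0 / W.sum(axis=1))
D_in_inv = np.diag(1.0 / W.sum(axis=0))
P_fwd = D_out_inv @ W
P_rev = D_in_inv @ W.T

# One bidirectional diffusion convolution step with scalar filter
# parameters theta (illustrative values), summing powers k = 0..K-1.
K = 2
theta_fwd = [0.6, 0.3]
theta_rev = [0.5, 0.2]
out = sum(theta_fwd[k] * np.linalg.matrix_power(P_fwd, k) @ x +
          theta_rev[k] * np.linalg.matrix_power(P_rev, k) @ x
          for k in range(K))
```

Each row of P_fwd and P_rev sums to one, so the two matrices act as random-walk transition matrices over the sensor graph in the forward and reverse directions, which is the bidirectional diffusion Li describes.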
obtaining target model parameters of the target model by customizing the base model based on the forward historical temporal pattern graph and the backward historical temporal pattern graph (Li, Pg. 4, Figure 2: “System architecture for the Diffusion Convolutional Recurrent Neural Network designed for spatiotemporal traffic forecasting. The historical time series are fed into an encoder whose final states are used to initialize the decoder. The decoder makes predictions based on either previous ground truth or the model output.”
[media_image2.png: greyscale image of Figure 2, DCRNN system architecture]
Pg. 4, ¶[2]: “The model architecture of DCRNN is shown in Figure 2. The entire network is trained by maximizing the likelihood of generating the target future time series using backpropagation through time. DCRNN is able to capture spatiotemporal dependencies among time series and can be applied to various spatiotemporal forecasting problems.”) [Examiner’s note: Li discloses training the diffusion convolutional recurrent neural network end-to-end using historical observations, whereby the model parameters are adjusted based on the extracted temporal patterns to generate forecasts for a target time series]
It would have been obvious to one of ordinary skill in the art before the effective filing date of the present application to combine Ciarlini and Li. Ciarlini teaches methods for performing massively parallel processing (MPP) of large-scale combinations of time series data. Li teaches modelling traffic flow as a diffusion process on a directed graph and introduces the Diffusion Convolutional Recurrent Neural Network (DCRNN), a deep learning framework for traffic forecasting that incorporates both spatial and temporal dependency in the traffic flow. One of ordinary skill would have been motivated to combine Ciarlini and Li because MPEP 2143 sets forth the Supreme Court rationales for obviousness, including: (D) applying a known technique to a known device (method, or product) ready for improvement to yield predictable results; (E) “obvious to try”: choosing from a finite number of identified, predictable solutions, with a reasonable expectation of success; (F) known work in one field of endeavor may prompt variations of it for use in either the same field or a different one based on design incentives or other market forces if the variations are predictable to one of ordinary skill in the art.
However, Flunkert explicitly discloses:
wherein the target time series has insufficient data to train a target model for forecasting the target time series; and (Flunkert, Col. 3, Lines 46-57: “In at least one embodiment, if a sufficiently large training data set is used, the model may be general enough to be able to make predictions regarding demands for an item for which no (or very few) demand observations were available for training. For example, if information regarding the similarity of a new item Inew to some set of other items {Iold} along one or more dimensions such as item/product category, price, etc. is provided, where demand observations for {Iold} items were used to train the model while demand observations for Inew was not used to train the model, the model may still be able to provide useful predictions for Inew demand in such an embodiment.”) [Examiner’s note: Inew represents the target time series with insufficient data, while Iold represents other time series that have sufficient historical demand data. The model compensates for the missing data by leveraging correlations with similar items to predict the demand for the target item.]
obtaining target model parameters of the target model by customizing the base model based on the extracted temporal patterns relevant to the target time series (Flunkert, Col. 5, Lines 54-65: “If the evaluations indicate that the model does not meet a desired quality/accuracy criterion, the model may be adjusted in some embodiments---e.g., various hyperparameters, initial parameters and/or feature extraction techniques may be modified and the model may be retrained. In at least one embodiment, new versions of the models may be generated over time as new demand observations are obtained. For example, in one implementation, new demand forecasts for K weeks into the future may be generated every week using demand data collected over a time window of the previous N weeks as input for the composite modeling methodology.”) [Examiner’s note: target model parameters i.e., new demand forecasts for K weeks, customizing the base model i.e., adjust or modify the demand data collected over a time window of previous N weeks]
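The rolling schedule in the Flunkert citation above (refit each week on the previous N weeks, then forecast K weeks ahead) can be sketched as follows (a minimal sketch; the window sizes, data, and the trivial moving-average stand-in for the model are illustrative assumptions, not Flunkert's method):

```python
# Rolling retraining schedule: each week, refit on the previous N weeks
# of observations and generate a K-week-ahead forecast.
N, K = 8, 2  # illustrative window sizes, not from Flunkert
history = list(range(20))  # illustrative weekly demand observations

forecasts = {}
for week in range(N, len(history)):
    window = history[week - N:week]   # demand data from the previous N weeks
    mean = sum(window) / N            # trivial moving-average stand-in model
    forecasts[week] = [mean] * K      # K-week-ahead forecast issued this week
```

In Flunkert's framework the stand-in model would instead be a retrained RNN; the point of the sketch is only the weekly refit-and-forecast cadence.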
It would have been obvious to one of ordinary skill in the art before the effective filing date of the present application to combine Ciarlini and Flunkert. Ciarlini teaches methods for performing massively parallel processing (MPP) of large-scale combinations of time series data. Flunkert teaches that a recurrent neural network model is trained using a plurality of time series of demand observations to generate demand forecasts for various items. One of ordinary skill would have been motivated to combine Ciarlini and Flunkert because time series data can exhibit changes over time due to external factors. Extracting temporal patterns allows the model to adapt to these changes by recognizing and incorporating them into predictions. Customizing the model based on these extracted patterns helps maintain the relevance and accuracy of the model even as conditions evolve. (Flunkert, Col. 14, Lines 49-67)
As per claim 5, the combination of Ciarlini and Flunkert discloses all the limitations of Claim 1 (as shown in the rejections above).
Ciarlini in view of Flunkert further discloses:
The method of claim 1, wherein the deep learning of the base model and the customizing the base model are performed simultaneously during the hierarchical learning (Ciarlini, Col. 6: “The model 230 in each group 220 can be computed substantially in parallel on distributed compute nodes and, since the input matrix is small, the processing time is relatively short…In the generated models, the same time series 210 can have multiple coefficients, one for each different time lag”, and Col. 7:
[media_image3.png: greyscale image of an excerpt from Ciarlini, Col. 7]
) [Examiner note: Ciarlini discloses that the models are computed in parallel using the OMP algorithm and that the coefficients of the base model are updated at the same time]
As per claim 6, the combination of Ciarlini and Flunkert discloses all the limitations of Claim 1 (as shown in the rejections above).
Ciarlini in view of Flunkert further discloses:
The method of claim 1, wherein the deep learning of the base model and the customizing the base model are performed in sequence during the hierarchical learning. (Ciarlini, Col. 6-7: “In this case, there is a plurality of hierarchical learning levels to generate the final model and in each intermediate level of the hierarchy, intermediate compute nodes execute both the roles of master compute node for compute nodes of the lower hierarchical level and working compute nodes for the compute nodes of the upper hierarchical level. Intermediate compute nodes receive selected variables and scores from lower-level compute nodes and perform the following steps: rank the variables; select a pre-defined number of variables based on their scores to be considered as input for the generation of an intermediate linear model using an Orthogonal Matching Pursuit algorithm; assign a score to each variable of the intermediate model; and provide such variables and their corresponding scores to the upper level in the hierarchy.”) [Examiner note: the procedure is generalized by creating multiple learning stages and the involvement of intermediate compute nodes in the hierarchical structure reinforces the sequential nature of the process. These nodes receive information from lower-level compute nodes and perform specific steps of customization before providing variables and scores to the upper level.]
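Since the note above turns on Ciarlini's use of the Orthogonal Matching Pursuit algorithm, a minimal sketch of OMP's greedy select-then-refit loop may help (the function name, interface, and test design are illustrative, not Ciarlini's implementation):

```python
import numpy as np

def omp(X, y, n_select):
    """Minimal Orthogonal Matching Pursuit: greedily select the column of X
    most correlated with the current residual, then refit all selected
    columns by least squares. Assumes n_select >= 1."""
    residual = y.copy()
    selected = []
    for _ in range(n_select):
        # Score each column by correlation with the residual.
        scores = np.abs(X.T @ residual)
        scores[selected] = -np.inf  # never reselect a chosen column
        selected.append(int(np.argmax(scores)))
        # Refit coefficients jointly on all selected columns.
        coef, *_ = np.linalg.lstsq(X[:, selected], y, rcond=None)
        residual = y - X[:, selected] @ coef
    return selected, coef
```

In Ciarlini's hierarchy, each intermediate node would run a loop of this shape over its candidate variables, then pass the selected variables and their scores up a level.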
As per claim 8, Ciarlini discloses:
Machine readable and non-transitory medium having information recorded thereon for machine learning for time series, wherein the information, once read by a machine, causes the machine to (Ciarlini, Col. 9, Lines 61-63: “a computer readable storage medium, as used herein, is not to be construed as being transitory signals,”, and Col. 10, Lines 17-19: “the system includes distinct software modules, each being embodied on a tangible computer-readable recordable storage medium”, and Ciarlini, Abstract: “Methods and apparatus are provided for performing massively parallel processing (MPP) large-scale combinations of time series data.”)
perform hierarchical learning by (Ciarlini, Col. 6, Lines 53-54: “the procedure can be generalized by creating multiple hierarchical learning stages”)
performing hierarchical learning, which comprises deep learning global model parameters of a base model for forecasting time series measurements of a plurality of time series, (Ciarlini, Col. 4, Lines 47-57: “Consider a measurement matrix X where each column is a measured sensor stream or a lag of the measured sensor stream:
[media_image1.png: greyscale image of the measurement matrix X]
. In order to obtain a model for a target series Y, solve the following system of equations: Y = Xb + e, where b is the vector of the coefficients of the model, and e corresponds to the residuals.”) [Examiner’s note: The citation from Ciarlini above describes a linear regression-based model where a target time series (Y) is predicted based on a matrix X of measured sensor streams and their time-lagged versions. Here, X represents a base model structure that includes lagged versions of different sensor time series. The equation Y = Xb + e suggests a linear regression framework, where b represents coefficients and e represents residual errors.]
Ciarlini fails to disclose:
querying, across the plurality of time series using the global model parameters of the base model, a forward historical temporal pattern graph and a backward historical temporal pattern graph relevant to a target time series in the plurality of time series,
wherein the target time series has insufficient data to train a target model for forecasting the target time series; and
obtaining target model parameters of the target model by customizing the base model based on the forward historical temporal pattern graph and the backward historical temporal pattern graph.
However, Li explicitly discloses:
querying, across the plurality of time series using the global model parameters of the base model, a forward historical temporal pattern graph and a backward historical temporal pattern graph relevant to a target time series in the plurality of time series, (Li, Pg. 2, Section 2.1: “The goal of traffic forecasting is to predict the future traffic speed given previously observed traffic flow from N correlated sensors on the road network.”, Pg. 2, Section 2.2: “We also include the reversed direction diffusion process, such that the bidirectional diffusion offers the model more flexibility to capture the influence from both the upstream and the downstream traffic”, Pg. 3, Eq. (2) explicitly uses two transition matrices (forward and reverse): it includes both (D_O^{-1}W)^k and (D_I^{-1}W^T)^k, and Li states that these are the transition matrices of the diffusion process and the reverse one, respectively, Pg. 4, Figure 2: “System architecture for the Diffusion Convolutional Recurrent Neural Network designed for spatiotemporal traffic forecasting. The historical time series are fed into an encoder whose final states are used to initialize the decoder. The decoder makes predictions based on either previous ground truth or the model output.”
[media_image2.png: greyscale image of Figure 2, DCRNN system architecture]
Pg. 4, ¶[2]: “The model architecture of DCRNN is shown in Figure 2. The entire network is trained by maximizing the likelihood of generating the target future time series using backpropagation through time. DCRNN is able to capture spatiotemporal dependencies among time series and can be applied to various spatiotemporal forecasting problems.”) [Examiner’s note: Li discloses training the diffusion convolutional recurrent neural network end-to-end using historical observations, whereby the model parameters are adjusted based on the extracted temporal patterns to generate forecasts for a target time series]
obtaining target model parameters of the target model by customizing the base model based on the forward historical temporal pattern graph and the backward historical temporal pattern graph (Li, Pg. 4, Figure 2: “System architecture for the Diffusion Convolutional Recurrent Neural Network designed for spatiotemporal traffic forecasting. The historical time series are fed into an encoder whose final states are used to initialize the decoder. The decoder makes predictions based on either previous ground truth or the model output.”
[media_image2.png: greyscale image of Figure 2, DCRNN system architecture]
Pg. 4, ¶[2]: “The model architecture of DCRNN is shown in Figure 2. The entire network is trained by maximizing the likelihood of generating the target future time series using backpropagation through time. DCRNN is able to capture spatiotemporal dependencies among time series and can be applied to various spatiotemporal forecasting problems.”) [Examiner’s note: Li discloses training the diffusion convolutional recurrent neural network end-to-end using historical observations, whereby the model parameters are adjusted based on the extracted temporal patterns to generate forecasts for a target time series]
It would have been obvious to one of ordinary skill in the art before the effective filing date of the present application to combine Ciarlini and Li. Ciarlini teaches methods for performing massively parallel processing (MPP) of large-scale combinations of time series data. Li teaches modelling traffic flow as a diffusion process on a directed graph and introduces the Diffusion Convolutional Recurrent Neural Network (DCRNN), a deep learning framework for traffic forecasting that incorporates both spatial and temporal dependency in the traffic flow. One of ordinary skill would have been motivated to combine Ciarlini and Li because MPEP 2143 sets forth the Supreme Court rationales for obviousness, including: (D) applying a known technique to a known device (method, or product) ready for improvement to yield predictable results; (E) “obvious to try”: choosing from a finite number of identified, predictable solutions, with a reasonable expectation of success; (F) known work in one field of endeavor may prompt variations of it for use in either the same field or a different one based on design incentives or other market forces if the variations are predictable to one of ordinary skill in the art.
However, Flunkert explicitly discloses:
extracting, across the plurality of time series, temporal patterns relevant to a target time series in the plurality of time series, (Flunkert, Col. 4, Lines 2-8: “In at least one embodiment in which information about the similarity between a particular target item Ia (for which demand is to be forecast) and another item Ib (whose demand observations were used for training) is provided as input to the RNN model, the number of actual demand observations available for lb may exceed the number of demand observations available for Ia”, Col. 3, Lines 50-64: “For example, if information regarding the similarity of a new item Inew to some set of other items {Iold} along one or more dimensions such as item/product category, price, etc. is provided, where demand observations for {Iold} items were used to train the model while demand observations for Inew was not used to train the model, the model may still be able to provide useful predictions for Inew demand in such an embodiment. The accuracy of the forecasts may increase with the amount of information available about the items being considered in at least some embodiments- e.g., if several weeks of actual demand observations an item Ij are used to train the model, and several months or years of demand observations for another item Ik are used to train the model, the forecasts for Ik may tend to be more accurate than the forecasts for Ij”) [Examiner’s note: temporal patterns relevant to the target time series i.e., the number of actual demand observations]
wherein the target time series has insufficient data to train a target model for forecasting the target time series; and (Flunkert, Col. 3, Lines 46-57: “In at least one embodiment, if a sufficiently large training data set is used, the model may be general enough to be able to make predictions regarding demands for an item for which no (or very few) demand observations were available for training. For example, if information regarding the similarity of a new item Inew to some set of other items {Iold} along one or more dimensions such as item/product category, price, etc. is provided, where demand observations for {Iold} items were used to train the model while demand observations for Inew was not used to train the model, the model may still be able to provide useful predictions for Inew demand in such an embodiment.”) [Examiner’s note: Inew represents the target time series with insufficient data, while Iold represents other time series that have sufficient historical demand data. The model compensates for the missing data by leveraging correlations with similar items to predict the demand for the target item.]
obtaining target model parameters of the target model by customizing the base model based on the extracted temporal patterns relevant to the target time series (Flunkert, Col. 5, Lines 54-65: “If the evaluations indicate that the model does not meet a desired quality/accuracy criterion, the model may be adjusted in some embodiments---e.g., various hyperparameters, initial parameters and/or feature extraction techniques may be modified and the model may be retrained. In at least one embodiment, new versions of the models may be generated over time as new demand observations are obtained. For example, in one implementation, new demand forecasts for K weeks into the future may be generated every week using demand data collected over a time window of the previous N weeks as input for the composite modeling methodology.”) [Examiner’s note: target model parameters i.e., new demand forecasts for K weeks, customizing the base model i.e., adjust or modify the demand data collected over a time window of previous N weeks]
It would have been obvious to one of ordinary skill in the art before the effective filing date of the present application to combine Ciarlini and Flunkert. Ciarlini teaches methods for performing massively parallel processing (MPP) of large-scale combinations of time series data. Flunkert teaches that a recurrent neural network model is trained using a plurality of time series of demand observations to generate demand forecasts for various items. One of ordinary skill would have been motivated to combine Ciarlini and Flunkert because time series data can exhibit changes over time due to external factors. Extracting temporal patterns allows the model to adapt to these changes by recognizing and incorporating them into predictions. Customizing the model based on these extracted patterns helps maintain the relevance and accuracy of the model even as conditions evolve. (Flunkert, Col. 14, Lines 49-67)
As per claim 12, the combination of Ciarlini and Flunkert discloses all the limitations of Claim 8 (as shown in the rejections above).
Ciarlini in view of Flunkert further discloses:
wherein the deep learning of the base model and the customizing the base model are performed simultaneously during the hierarchical learning (Ciarlini, Col. 6: “The model 230 in each group 220 can be computed substantially in parallel on distributed compute nodes and, since the input matrix is small, the processing time is relatively short…In the generated models, the same time series 210 can have multiple coefficients, one for each different time lag”, and Col. 7:
[media_image3.png: greyscale image of an excerpt from Ciarlini, Col. 7]
) [Examiner note: Ciarlini discloses that the models are computed in parallel using the OMP algorithm and that the coefficients of the base model are updated at the same time]
As per claim 13, the combination of Ciarlini and Flunkert discloses all the limitations of Claim 8 (as shown in the rejections above).
Ciarlini in view of Flunkert further discloses:
wherein the deep learning of the base model and the customizing the base model are performed in sequence during the hierarchical learning. (Ciarlini, Col. 6-7: “In this case, there is a plurality of hierarchical learning levels to generate the final model and in each intermediate level of the hierarchy, intermediate compute nodes execute both the roles of master compute node for compute nodes of the lower hierarchical level and working compute nodes for the compute nodes of the upper hierarchical level. Intermediate compute nodes receive selected variables and scores from lower-level compute nodes and perform the following steps: rank the variables; select a pre-defined number of variables based on their scores to be considered as input for the generation of an intermediate linear model using an Orthogonal Matching Pursuit algorithm; assign a score to each variable of the intermediate model; and provide such variables and their corresponding scores to the upper level in the hierarchy.”) [Examiner note: the procedure is generalized by creating multiple learning stages and the involvement of intermediate compute nodes in the hierarchical structure reinforces the sequential nature of the process. These nodes receive information from lower-level compute nodes and perform specific steps of customization before providing variables and scores to the upper level.]
As per claim 15, Ciarlini discloses:
A system for machine learning for time series, comprising: (Ciarlini, Abstract: “Methods and apparatus are provided for performing massively parallel processing (MPP) large-scale combinations of time series data.”)
a general deep machine learning mechanism configured for deep learning global model parameters of a base model for forecasting time series measurements of a plurality of time series, (Ciarlini, Col. 4, Lines 47-57: “Consider a measurement matrix X where each column is a measured sensor stream or a lag of the measured sensor stream:
[media_image1.png: greyscale image of the measurement matrix X]
. In order to obtain a model for a target series Y, solve the following system of equations: Y = Xb + e, where b is the vector of the coefficients of the model, and e corresponds to the residuals.”) [Examiner’s note: The citation from Ciarlini above describes a linear regression-based model where a target time series (Y) is predicted based on a matrix X of measured sensor streams and their time-lagged versions. Here, X represents a base model structure that includes lagged versions of different sensor time series. The equation Y = Xb + e suggests a linear regression framework, where b represents coefficients and e represents residual errors.]
Ciarlini fails to disclose:
a customized deep learning mechanism configured for: querying, across the plurality of time series using the global model parameters of the base model, a forward historical temporal pattern graph and a backward historical temporal pattern graph relevant to a target time series in the plurality of time series,
wherein the target time series has insufficient data to train a target model for forecasting the target time series; and
obtaining target model parameters of the target model by customizing the base model based on the forward historical temporal pattern graph and the backward historical temporal pattern graph.
However, Li explicitly discloses:
querying, across the plurality of time series using the global model parameters of the base model, a forward historical temporal pattern graph and a backward historical temporal pattern graph relevant to a target time series in the plurality of time series, (Li, Pg. 2, Section 2.1: “The goal of traffic forecasting is to predict the future traffic speed given previously observed traffic flow from N correlated sensors on the road network.”, Pg. 2, Section 2.2: “We also include the reversed direction diffusion process, such that the bidirectional diffusion offers the model more flexibility to capture the influence from both the upstream and the downstream traffic”, Pg. 3, Eq. (2) explicitly uses two transition matrices (forward and reverse): it includes both (D_O^{-1}W)^k and (D_I^{-1}W^T)^k, and Li states that these are the transition matrices of the diffusion process and the reverse one, respectively, Pg. 4, Figure 2: “System architecture for the Diffusion Convolutional Recurrent Neural Network designed for spatiotemporal traffic forecasting. The historical time series are fed into an encoder whose final states are used to initialize the decoder. The decoder makes predictions based on either previous ground truth or the model output.”
PNG
media_image2.png
361
858
media_image2.png
Greyscale
Pg. 4, ¶[2]: “The model architecture of DCRNN is shown in Figure 2. The entire network is trained by maximizing the likelihood of generating the target future time series using backpropagation through time. DCRNN is able to capture spatiotemporal dependencies among time series and can be applied to various spatiotemporal forecasting problems.”) [Examiner’s note: Li discloses training the diffusion convolutional recurrent neural network end-to-end using historical observations, whereby the model parameters are adjusted based on the extracted temporal patterns to generate forecasts for a target time series]
obtaining target model parameters of the target model by customizing the base model based on the forward historical temporal pattern graph and the backward historical temporal pattern graph (Li, Pg. 4, Figure 2: “System architecture for the Diffusion Convolutional Recurrent Neural Network designed for spatiotemporal traffic forecasting. The historical time series are fed into an encoder whose final states are used to initialize the decoder. The decoder makes predictions based on either previous ground truth or the model output.”
[Image: media_image2.png (reproduction of Li, Figure 2), greyscale]
Pg. 4, ¶[2]: “The model architecture of DCRNN is shown in Figure 2. The entire network is trained by maximizing the likelihood of generating the target future time series using backpropagation through time. DCRNN is able to capture spatiotemporal dependencies among time series and can be applied to various spatiotemporal forecasting problems.”) [Examiner’s note: Li discloses training the diffusion convolutional recurrent neural network end-to-end using historical observations, whereby the model parameters are adjusted based on the extracted temporal patterns to generate forecasts for a target time series]
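For illustration only, the bidirectional diffusion convolution of Li, Eq. (2), as characterized above, can be sketched as follows. This is an editorial sketch assuming the quoted forward and reverse transition matrices D_O^{-1}W and D_I^{-1}W^T; it is not Li's implementation, and the names are hypothetical.

```python
import numpy as np

# Editorial sketch of the bidirectional diffusion step (assumed form, not
# Li's code): the forward transition matrix is D_O^{-1} W (out-degree
# normalized) and the reverse transition matrix is D_I^{-1} W^T (in-degree
# normalized); the convolution sums theta_f[k] P_f^k X + theta_b[k] P_b^k X.
def diffusion_conv(W, X, theta_f, theta_b):
    P_f = np.diag(1.0 / W.sum(axis=1)) @ W    # D_O^{-1} W  (forward)
    P_b = np.diag(1.0 / W.sum(axis=0)) @ W.T  # D_I^{-1} W^T (reverse)
    out = np.zeros_like(X)
    Pf_k = np.eye(W.shape[0])  # P_f^0
    Pb_k = np.eye(W.shape[0])  # P_b^0
    for tf, tb in zip(theta_f, theta_b):
        out += tf * (Pf_k @ X) + tb * (Pb_k @ X)
        Pf_k, Pb_k = Pf_k @ P_f, Pb_k @ P_b
    return out
```

Each transition matrix is row-stochastic by construction, so the k = 0 term with unit forward weight reproduces the input signal unchanged.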
It would have been obvious to one of ordinary skill in the art before the effective filing date of the present application to combine Ciarlini and Li. Ciarlini teaches methods for performing massively parallel processing (MPP) on large-scale combinations of time series data. Li teaches modelling traffic flow as a diffusion process on a directed graph and introduces the Diffusion Convolutional Recurrent Neural Network (DCRNN), a deep learning framework for traffic forecasting that incorporates both spatial and temporal dependency in the traffic flow. One of ordinary skill would have had motivation to combine Ciarlini and Li because MPEP 2143 sets forth the Supreme Court rationales for obviousness, including: (D) applying a known technique to a known device (method, or product) ready for improvement to yield predictable results; (E) “obvious to try”: choosing from a finite number of identified, predictable solutions, with a reasonable expectation of success; (F) known work in one field of endeavor may prompt variations of it for use in either the same field or a different one based on design incentives or other market forces if the variations are predictable to one of ordinary skill in the art.
However, Flunkert explicitly discloses:
a customized deep learning mechanism configured for: extracting, across the plurality of time series, temporal patterns relevant to a target time series in the plurality of time series, (Flunkert, Col. 4, Lines 2-8: “In at least one embodiment in which information about the similarity between a particular target item Ia (for which demand is to be forecast) and another item Ib (whose demand observations were used for training) is provided as input to the RNN model, the number of actual demand observations available for Ib may exceed the number of demand observations available for Ia”, Col. 3, Lines 50-64: “For example, if information regarding the similarity of a new item Inew to some set of other items {Iold} along one or more dimensions such as item/product category, price, etc. is provided, where demand observations for {Iold} items were used to train the model while demand observations for Inew was not used to train the model, the model may still be able to provide useful predictions for Inew demand in such an embodiment. The accuracy of the forecasts may increase with the amount of information available about the items being considered in at least some embodiments- e.g., if several weeks of actual demand observations for an item Ij are used to train the model, and several months or years of demand observations for another item Ik are used to train the model, the forecasts for Ik may tend to be more accurate than the forecasts for Ij”) [Examiner’s note: temporal patterns relevant to the target time series i.e., the number of actual demand observations]
wherein the target time series has insufficient data to train a target model for forecasting the target time series; and (Flunkert, Col. 3, Lines 46-57: “In at least one embodiment, if a sufficiently large training data set is used, the model may be general enough to be able to make predictions regarding demands for an item for which no (or very few) demand observations were available for training. For example, if information regarding the similarity of a new item Inew to some set of other items {Iold} along one or more dimensions such as item/product category, price, etc. is provided, where demand observations for {Iold} items were used to train the model while demand observations for Inew was not used to train the model, the model may still be able to provide useful predictions for Inew demand in such an embodiment.”) [Examiner’s note: Inew represents the target time series with insufficient data, while Iold represents other time series that have sufficient historical demand data. The model compensates for the missing data by leveraging correlations with similar items to predict the demand for the target item.]
obtaining target model parameters of the target model by customizing the base model based on the extracted temporal patterns relevant to the target time series (Flunkert, Col. 5, Lines 54-65: “If the evaluations indicate that the model does not meet a desired quality/accuracy criterion, the model may be adjusted in some embodiments---e.g., various hyperparameters, initial parameters and/or feature extraction techniques may be modified and the model may be retrained. In at least one embodiment, new versions of the
models may be generated over time as new demand observations are obtained. For example, in one implementation, new demand forecasts for K weeks into the future may be generated every week using demand data collected over a time window of the previous N weeks as input for the composite modeling methodology.”) [Examiner’s note: target model parameters i.e., new demand forecasts for K weeks, customizing the base model i.e., adjust or modify the demand data collected over a time window of previous N weeks]
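For illustration only, the customization pattern described in the Flunkert citations above (target model parameters initialized from the base model, then adjusted against the limited labelled data of the target series) can be sketched as follows. This is an editorial sketch with hypothetical names, not code from the reference; a simple linear forecaster stands in for the RNN model.

```python
import numpy as np

# Editorial sketch (hypothetical names): initialize target model parameters
# from the base (global) model, then refine them with gradient steps that
# reduce the squared discrepancy between forecasts and labels on the small
# amount of data available for the target series.
def customize(base_params, X_target, y_target, lr=0.1, steps=200):
    w = np.asarray(base_params, dtype=float).copy()  # start from base model
    for _ in range(steps):
        err = X_target @ w - y_target                 # forecast discrepancy
        grad = 2.0 * X_target.T @ err / len(y_target)
        w -= lr * grad                                # minimize squared loss
    return w
```

Starting from zeroed base parameters on a small, well-conditioned target set, the refined parameters converge to the coefficients that generated the labels.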
It would have been obvious to one of ordinary skill in the art before the effective filing date of the present application to combine Ciarlini and Flunkert. Ciarlini teaches methods for performing massively parallel processing (MPP) on large-scale combinations of time series data. Flunkert teaches that a recurrent neural network model is trained using a plurality of time series of demand observations to generate demand forecasts for various items. One of ordinary skill would have had motivation to combine Ciarlini and Flunkert because time series data can exhibit changes over time due to external factors. Extracting temporal patterns allows the model to adapt to these changes by recognizing and incorporating them into predictions. Customizing the model based on these extracted patterns helps in maintaining the relevance and accuracy of the model even as conditions evolve. (Flunkert, Col. 14, Lines 49-67)
As per claim 19, the combination of Ciarlini and Flunkert discloses all the limitations of Claim 15 (as shown in the rejections above)
Ciarlini in view of Flunkert further discloses:
wherein the deep learning of the base model and the customizing the base model are performed simultaneously during the hierarchical learning (Ciarlini, Col. 6: “The model 230 in each group 220 can be computed substantially in parallel on distributed compute nodes and, since the input matrix is small, the processing time is relatively short…In the generated models, the same time series 210 can have multiple coefficients, one for each different time lag”, and Col. 7:
[Image: media_image3.png (reproduction from Ciarlini, Col. 7), greyscale]
) [Examiner note: Ciarlini discloses that the models are computed in parallel using the OMP algorithm and then the coefficients of the base model will be updated at the same time]
As per claim 20, the combination of Ciarlini and Flunkert discloses all the limitations of Claim 15 (as shown in the rejections above)
Ciarlini in view of Flunkert further discloses:
wherein the deep learning of the base model and the customizing the base model are performed in sequence during the hierarchical learning. (Ciarlini, Col. 6-7: “In this case, there is a plurality of hierarchical learning levels to generate the final model and in each intermediate level of the hierarchy, intermediate compute nodes execute both the roles of master compute node for compute nodes of the lower hierarchical level and working compute nodes for the compute nodes of the upper hierarchical level. Intermediate compute nodes receive selected variables and scores from lower-level compute nodes and perform the following steps: rank the variables; select a pre-defined number of variables based on their scores to be considered as input for the generation of an intermediate linear model using an Orthogonal Matching Pursuit algorithm; assign a score to each variable of the intermediate model; and provide such variables and their corresponding scores to the upper level in the hierarchy.”) [Examiner note: the procedure is generalized by creating multiple learning stages and the involvement of intermediate compute nodes in the hierarchical structure reinforces the sequential nature of the process. These nodes receive information from lower-level compute nodes and perform specific steps of customization before providing variables and scores to the upper level.]
Claim(s) 2, 9, 16 is/are rejected under 35 U.S.C. 103 as being unpatentable over Ciarlini in view of Flunkert and further in view of Deshpande & Sarawagi (“Streaming Adaptation of Deep Forecasting Models using Adaptive Recurrent Units”, Publication Date: 07/04/2019) (hereafter referred to as "Deshpande")
Regarding Claim 2, the combination of Ciarlini and Flunkert discloses all the limitations of Claim 1 (as shown in the rejections above)
Ciarlini in view of Flunkert further discloses:
the target model is learned specifically for forecasting a time series measurement of the corresponding target time series. (Ciarlini, Col. 3, Lines 27-29: “… where one or more of these time series are selected to explain, by a linear model, one particular time series of interest, referred to as a target time series.”, Col. 3, line 58: “The massive parallelization allows, for example, many thousands of time series (and time lags) to be processed that are candidates to be used in the model that explains or predicts a target time series”) [The examiner interprets the “linear model” here as the “target model” which is used for predicting or explaining one particular time series of interest (forecasting the target time series). Examiner note: explain time series, i.e., predict or forecast time series; Ciarlini uses these terms interchangeably to express the meaning of forecasting]
Ciarlini in view of Flunkert fails to disclose:
The method of claim 1, wherein the base model is learned generically for forecasting a time series measurement of any of the plurality of time series; and
However, Deshpande explicitly discloses:
The method of claim 1, wherein the base model is learned generically for forecasting a time series measurement of any of the plurality of time series; and (Deshpande, Page 5, Col. 2, Section 4.2: “As a baseline we use the globally trained model where the decoder makes independent predictions for each of the future K points in time.”) [Examiner note: global model i.e., base model, predictions for each K points in time i.e., forecasting time series measurements of a plurality of time series]
The combination of Ciarlini, Flunkert and Deshpande is analogous art because they are in the same field of training time series data. Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention, having the teachings of Ciarlini, Flunkert and Deshpande before them, to modify the teachings of Ciarlini and Flunkert to include the teachings of Deshpande of training the global model for predicting time series data in order to capture a summary of the historical context that is used as a shared representation for making future predictions in the time series forecasting model (Deshpande, Page 2, Section 2: “Typical deep learning based global models for multi-horizon time series forecasting [11, 29] deploy the encoder-decoder architecture. The output of the encoder is its final state giT at the end of T steps. This can be treated as a summary of the known y values that is relevant as a context for future predictions”)
Regarding claim 9, the combination of Ciarlini and Flunkert discloses all the limitations of Claim 8 (as shown in the rejections above)
Ciarlini in view of Flunkert further discloses:
the target model is learned specifically for forecasting a time series measurement of the corresponding target time series. (Ciarlini, Col. 4, Lines 65-67: “… the resulting model can capture most of the relevant influences on the target time series.”) [The examiner interprets the “resulting model” here as “the target model” (please see above). The authors disclose that the resulting model (target model) after being trained will be capable of capturing or predicting the relevant influences (the target time series measurement) on the target time series. ]
Ciarlini in view of Flunkert fails to disclose:
The medium of claim 8, wherein the base model is learned generically for forecasting a time series measurement of any of the plurality of time series; and
However, Deshpande explicitly discloses:
the base model is learned generically for forecasting a time series measurement of any of the plurality of time series; and (Deshpande, Page 5, Col. 2, Section 4.2: “As a baseline we use the globally trained model where the decoder makes independent predictions for each of the future K points in time.”) [Examiner note: global model i.e., base model, predictions for each K points in time i.e., forecasting time series measurements of a plurality of time series]
The combination of Ciarlini, Flunkert and Deshpande is analogous art because they are in the same field of training time series data. Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention, having the teachings of Ciarlini, Flunkert and Deshpande before them, to modify the teachings of Ciarlini and Flunkert to include the teachings of Deshpande of training the global model for predicting time series data in order to capture a summary of the historical context that is used as a shared representation for making future predictions in the time series forecasting model (Deshpande, Page 2, Section 2: “Typical deep learning based global models for multi-horizon time series forecasting [11, 29] deploy the encoder-decoder architecture. The output of the encoder is its final state giT at the end of T steps. This can be treated as a summary of the known y values that is relevant as a context for future predictions”)
Regarding claim 16, the combination of Ciarlini and Flunkert discloses all the limitations of Claim 15 (as shown in the rejection above)
Ciarlini in view of Flunkert further discloses:
the target model is learned specifically for forecasting a time series measurement of the corresponding target time series. (Ciarlini, Col. 4, Lines 65-67: “… the resulting model can capture most of the relevant influences on the target time series.”) [The examiner interprets the “resulting model” here as “the target model” (please see above). The authors disclose that the resulting model (target model) after being trained will be capable of capturing or predicting the relevant influences (the target time series measurement) on the target time series. ]
Ciarlini in view of Flunkert fails to disclose:
The system of claim 15, wherein the base model is learned generically for forecasting a time series measurement of any of the plurality of time series; and
However, Deshpande explicitly discloses:
The system of claim 15, wherein the base model is learned generically for forecasting a time series measurement of any of the plurality of time series; and (Deshpande, Page 5, Col. 2, Section 4.2: “As a baseline we use the globally trained model where the decoder makes independent predictions for each of the future K points in time.”) [Examiner note: global model i.e., base model, predictions for each K points in time i.e., forecasting time series measurements of a plurality of time series]
The same motivation in Claim 2 applies to Claim 16.
Claim(s) 3, 4, 10, 11, 17, and 18 is/are rejected under 35 U.S.C. 103 as being unpatentable over Ciarlini in view of Flunkert and further in view of Bazrafkan et al. (US 2018/0211164 A1) (hereafter referred to as "Bazrafkan")
Regarding claim 3, the combination of Ciarlini and Flunkert discloses all the limitations of Claim 1 (as shown in the rejections above)
Ciarlini in view of Flunkert further discloses:
The method of claim 1, wherein the step of deep learning comprises:
receiving training data across the plurality of time series; (Ciarlini, Abstract: “A given working compute node in a distributed computing environment obtains a given group of time series data of a plurality of groups of time series data;”)
forecasting time series measurements of the training data based on the global model parameters of the base model; and (Ciarlini, Col. 6, Lines 53-54: “the procedure can be generalized by creating multiple hierarchical learning stages”, and Ciarlini, Col. 4, Line 58:
[Image: media_image4.png (reproduction of Ciarlini, Col. 4, equation), greyscale]
) [Examiner note: the global model parameters i.e., vector b of the coefficients of the base model. The model i.e., base model because the target model is obtained by solving the equation which contains this model] (Ciarlini, Col. 3, Lines 55-58: “The massive parallelization allows, for example, many thousands of time series (and time lags) to be processed that are candidates to be used in the model that explains or predicts a target time series”) [The authors disclose that the model will be used to explain or predicts a target time series of many thousands of time series (a plurality of time series), and as mentioned above, the coefficients vector b is the global parameter of this model, so the examiner interprets that the coefficients vector will be used for predicting time series.]
Ciarlini in view of Flunkert fails to disclose:
updating the global model parameters by minimizing a first loss determined based on the forecasted time series measurements from the training data and labels of the training data.
However, Bazrafkan explicitly discloses:
updating the global model parameters by minimizing a first loss determined based on the forecasted time series measurements from the training data and labels of the training data. (Bazrafkan, Abstract: “the parameters for the generative network are updated using the first and second loss functions”, and [0029]: “During each training epoch, an instance of Network A accepts at least one sample, … and generates Out1, a new sample in the same class so that this new sample reduces the loss function”, and [0032]: “Network A is a neural network, such as a generative model network …”) (Bazrafkan. [0033]: “the direct loss function LA for each instance of network A accepts Out1 and another image Ii from the same class in the dataset I1 ... Ik as input and can be calculated using a mean square error or any similar suitable measure. These measures can then be combined to provide the loss function LA for a batch.”, and [0030]: “The input can for example comprise a feature vector or time-series (like sounds, medical data, etc) as long as this is labelled”) [The authors disclose that the parameters (global model parameters) for the generative network, which is defined as Network A, are updated using first loss function, then they also disclose that during the training process, Network A will generate Out1 for reducing the loss function.] [The authors disclose that the loss function LA accepts an input from the training dataset I1 … Ik , then they also disclose that the input comprises a time series as long as it is labelled. Thus, the examiner notes that the loss function is generated based on the time series from the training data and labels of the training data.]
The combination of Ciarlini, Flunkert and Bazrafkan is analogous art because they are in the same field of methods of training neural networks. Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention, having the teachings of Ciarlini, Flunkert and Bazrafkan before them, to modify the teachings of Ciarlini and Flunkert to include the teachings of Bazrafkan, in which updating the global parameters involves minimizing an initial loss calculated using the predicted time series derived from the training data and the labels associated with the training data, in order to improve the model’s ability to train the local models for accurately forecasting future data points (Bazrafkan, [0028]: “Thus, network A learns the best data augmentation to increase the training accuracy of network B, even letting network A come up with nonintuitive but highly performing augmentation strategies”)
Regarding claim 4, the combination of Ciarlini and Flunkert discloses all the limitations of Claim 1 (as shown in the rejections above)
Ciarlini in view of Flunkert further discloses:
The method of claim 1, wherein the step of obtaining target model parameters
comprises: initializing the target model parameters for the target model based on target time series measurements forecasted based on the base model and labels of training data from the target time series; and (Ciarlini, Col. 7: “By running the OMP algorithm 300, a set of k almost linearly independent series 210 are obtained which explain the target variable”, and Col. 8: “The target time series 210 of interest might vary. The values of the time series 210 are normalized before the training phase, thus coefficients represent how important
each time series 210 (and corresponding lag 215) is in the final outcome 260.”) [Examiner note: target model parameters i.e., target variables, set of k series i.e., labels of training data from the target time series.]
Ciarlini in view of Flunkert fails to disclose:
iteratively updating the target model parameters by minimizing a second loss determined based on a discrepancy between target time series measurements predicted using time series data from the target time series and labels of the time series data from the target time series.
However, Bazrafkan explicitly discloses:
iteratively updating the target model parameters by minimizing a second loss determined based on a discrepancy between target time series measurements predicted using time series data from the target time series and labels of the time series data from the target time series. (Bazrafkan, Abstract: “A second loss function is determined for the target neural network by comparing outputs of instances of the target neural network to one or more targets for the neural network. The parameters for the target neural network are updated using the second loss function”, and [0054], lines 28-34: “A training epoch may comprise a number of batches . . . X(T-1 ), X(T), X(T + 1) ... being processed in sequence with each batch being used to train network A using the loss functions from networks A and B, and to train network B using the augmented images generated while training network A and also original images, which together generate the loss for network B.”) [The authors disclose using the second loss function to update the target parameters, and describing how the loss function for the target neural network is calculated by comparing its outputs to the reference targets. This is a way of quantifying the discrepancy or error in the network’s predictions compared to the labeled data. The examiner interprets determining a second loss function as minimizing that loss function, because this is the well-known goal of determining a loss function in machine learning. The authors also disclose this training epoch may comprise multiple batches being processed in sequence with each of them used to generate the loss function for network B (iteratively updating target model using the second loss function)]
The combination of Ciarlini, Flunkert and Bazrafkan is analogous art because they are in the same field of training neural networks for time series classification. Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention, having the teachings of Ciarlini, Flunkert and Bazrafkan before them, to modify the teachings of Ciarlini and Flunkert to include the teachings of Bazrafkan, in which continuously refining the target model’s parameters involves minimizing a secondary loss calculated based on the difference between target time series measurements predicted using the time series data from the target time series and the labels of that time series data, in order to improve the accuracy of the target model and make it more precise in its predictions (Bazrafkan, [0010]: “The use of augmentation in deep learning is ubiquitous, and when dealing with images, this can include the application of rotation, translation, blurring and other modifications to existing labelled images in order to improve the training of a target network”)
Regarding claim 10, the combination of Ciarlini and Flunkert discloses all the limitations of Claim 8 (as shown in the rejections above)
Ciarlini in view of Flunkert further discloses:
The medium of claim 8, wherein the step of deep learning comprises:
receiving training data across the plurality of time series; (Ciarlini, Abstract: “A given working compute node in a distributed computing environment obtains a given group of time series data of a plurality of groups of time series data;”)
forecasting time series measurements of the training data based on the global model parameters of the base model; and (Ciarlini, Col. 6, Lines 53-54: “the procedure can be generalized by creating multiple hierarchical learning stages”, and Ciarlini, Col. 4, Line 58:
[Image: media_image4.png (reproduction of Ciarlini, Col. 4, equation), greyscale]
) [Examiner note: the global model parameters i.e., vector b of the coefficients of the base model. The model i.e., base model because the target model is obtained by solving the equation which contains this model] (Ciarlini, Col. 3, Lines 55-58: “The massive parallelization allows, for example, many thousands of time series (and time lags) to be processed that are candidates to be used in the model that explains or predicts a target time series”) [The authors disclose that the model will be used to explain or predicts a target time series of many thousands of time series (a plurality of time series), and as mentioned above, the coefficients vector b is the global parameter of this model, so the examiner interprets that the coefficients vector will be used for predicting time series.]
Ciarlini in view of Flunkert fails to disclose:
updating the global model parameters by minimizing a first loss determined based on the forecasted time series measurements from the training data and labels of the training data
However, Bazrafkan explicitly discloses:
updating the global model parameters by minimizing a first loss determined
based on the forecasted time series measurements from the training data and labels of the training data (Bazrafkan, Abstract: “the parameters for the generative network are updated using the first and second loss functions”, and [0029]: “During each training epoch, an instance of Network A accepts at least one sample, … and generates Out1, a new sample in the same class so that this new sample reduces the loss function”, and [0032]: “Network A is a neural network, such as a generative model network …”) [The authors disclose that the parameters (global model parameters) for the generative network, which is defined as Network A, are updated using first loss function, then they also disclose that during the training process, Network A will generate Out1 for reducing the loss function.]. (Bazrafkan. [0033]: “the direct loss function LA for each instance of network A accepts Out1 and another image Ii from the same class in the dataset I1 ... Ik as input and can be calculated using a mean square error or any similar suitable measure. These measures can then be combined to provide the loss function LA for a batch.”, and [0030]: “The input can for example comprise a feature vector or time-series (like sounds, medical data, etc) as long as this is labelled”) [The authors disclose that the loss function LA accepts an input from the training dataset I1 … Ik , then they also disclose that the input comprises a time series as long as it is labelled. Thus, the examiner notes that the loss function is generated based on the time series from the training data and labels of the training data.]
The combination of Ciarlini, Flunkert and Bazrafkan is analogous art because they are in the same field of methods of training neural networks. Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention, having the teachings of Ciarlini, Flunkert and Bazrafkan before them, to modify the teachings of Ciarlini and Flunkert to include the teachings of Bazrafkan, in which updating the global parameters involves minimizing an initial loss calculated using the predicted time series derived from the training data and the labels associated with the training data, in order to improve the model’s ability to train the local models for accurately forecasting future data points (Bazrafkan, [0028]: “Thus, network A learns the best data augmentation to increase the training accuracy of network B, even letting network A come up with nonintuitive but highly performing augmentation strategies”)
Regarding claim 11, the combination of Ciarlini and Flunkert discloses all the limitations of Claim 8 (as shown in the rejections above)
Ciarlini in view of Flunkert further discloses:
The medium of claim 8, wherein the step of obtaining target model parameters comprises: initializing the target model parameters for the target model based on target time series measurements forecasted based on the base model and labels of training data from the target time series; and (Ciarlini, Col. 7: “By running the OMP algorithm 300, a set of k almost linearly independent series 210 are obtained which explain the target variable”, and Col. 8: “The target time series 210 of interest might vary. The values of the time series 210 are normalized before the training phase, thus coefficients represent how important
each time series 210 (and corresponding lag 215) is in the final outcome 260.”) [Examiner note: target model parameters i.e., target variables, set of k series i.e., labels of training data from the target time series.]
Ciarlini in view of Flunkert fails to disclose:
iteratively updating the target model parameters by minimizing a second loss determined based on a discrepancy between target time series measurements predicted using time series data from the target time series and labels of the time series data from the target time series
However, Bazrafkan explicitly discloses:
iteratively updating the target model parameters by minimizing a second loss determined based on a discrepancy between target time series measurements predicted using time series data from the target time series and labels of the time series data from the target time series. (Bazrafkan, Abstract: “A second loss function is determined for the target neural network by comparing outputs of instances of the target neural network to one or more targets for the neural network. The parameters for the target neural network are updated using the second loss function”, and [0054], lines 28-34: “A training epoch may comprise a number of batches . . . X(T-1 ), X(T), X(T + 1) ... being processed in sequence with each batch being used to train network A using the loss functions from networks A and B, and to train network B using the augmented images generated while training network A and also original images, which together generate the loss for network B.”) [The authors disclose using the second loss function to update the target parameters, and describe how the loss function for the target neural network is calculated by comparing its outputs to the reference targets. This is a way of quantifying the discrepancy, or error, in the network’s predictions compared to the labeled data. The examiner interprets determining a second loss function as minimizing that loss function, because minimization is the well-known purpose of determining a loss function in machine learning. The authors also disclose that a training epoch may comprise multiple batches processed in sequence, with each batch used to generate the loss function for network B (iteratively updating the target model using the second loss function).]
The combination of Ciarlini, Flunkert and Bazrafkan is analogous art because all three references are in the same field of training neural networks for time series classification. Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention, having the teachings of Ciarlini, Flunkert and Bazrafkan before them, to modify the teachings of Ciarlini and Flunkert to include the teachings of Bazrafkan, wherein continuously refining the target model’s parameters involves minimizing a secondary loss calculated based on the difference between target time series measurements predicted using the time series data from the target time series and the labels of that time series data, in order to improve the accuracy of the target model and make its predictions more precise (Bazrafkan, [0010]: “The use of augmentation in deep learning is ubiquitous, and when dealing with images, this can include the application of rotation, translation, blurring and other modifications to existing labelled images in order to improve the training of a target network”).
Regarding claim 17, the combination of Ciarlini and Flunkert discloses all the limitations of Claim 15 (as shown in the rejections above).
Ciarlini in view of Flunkert further discloses:
The system of claim 15, wherein the general deep machine learning mechanism performs deep learning by: receiving training data across the plurality of time series; (Ciarlini, Abstract: “A given working compute node in a distributed computing environment obtains a given group of time series data of a plurality of groups of time series data;”)
forecasting time series measurements of the training data based on the global model parameters of the base model; and (Ciarlini, Col. 6, Lines 53-54: “the procedure can be generalized by creating multiple hierarchical learning stages”, and Ciarlini, Col. 4, Line 58:
[equation reproduced as grayscale image: media_image4.png]
) [Examiner note: the global model parameters i.e., vector b of the coefficients of the base model. The model i.e., the base model, because the target model is obtained by solving the equation which contains this model.] (Ciarlini, Col. 3, Lines 55-58: “The massive parallelization allows, for example, many thousands of time series (and time lags) to be processed that are candidates to be used in the model that explains or predicts a target time series”) [The authors disclose that the model will be used to explain or predict a target time series among many thousands of time series (a plurality of time series), and, as mentioned above, the coefficients vector b is the global parameter of this model, so the examiner interprets that the coefficients vector will be used for predicting time series.]
Ciarlini in view of Flunkert fails to disclose:
updating the global model parameters by minimizing a first loss determined based on the forecasted time series measurements from the training data and labels of the training data.
However, Bazrafkan explicitly discloses:
updating the global model parameters by minimizing a first loss determined based on the forecasted time series measurements from the training data and labels of the training data. (Bazrafkan, Abstract: “the parameters for the generative network are updated using the first and second loss functions”, and [0029]: “During each training epoch, an instance of Network A accepts at least one sample, … and generates Out1, a new sample in the same class so that this new sample reduces the loss function”, and [0032]: “Network A is a neural network, such as a generative model network …”) [The authors disclose that the parameters (global model parameters) for the generative network, which is defined as Network A, are updated using the first loss function; they also disclose that during the training process, Network A generates Out1 to reduce the loss function.] (Bazrafkan, [0033]: “the direct loss function LA for each instance of network A accepts Out1 and another image Ii from the same class in the dataset I1 ... Ik as input and can be calculated using a mean square error or any similar suitable measure. These measures can then be combined to provide the loss function LA for a batch.”, and [0030]: “The input can for example comprise a feature vector or time-series (like sounds, medical data, etc) as long as this is labelled”) [The authors disclose that the loss function LA accepts an input from the training dataset I1 … Ik; they further disclose that the input may comprise a time series as long as it is labelled. Thus, the examiner notes that the loss function is generated based on the time series from the training data and the labels of the training data.]
The combination of Ciarlini, Flunkert and Bazrafkan is analogous art because all three references are in the same field of methods for training neural networks. Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention, having the teachings of Ciarlini, Flunkert and Bazrafkan before them, to modify the teachings of Ciarlini and Flunkert to include the teachings of Bazrafkan, wherein updating the global parameters involves minimizing an initial loss calculated using the predicted time series derived from the training data and the labels associated with the training data, in order to improve the model’s ability to train the local models for accurately forecasting future data points (Bazrafkan, [0028]: “Thus, network A learns the best data augmentation to increase the training accuracy of network B, even letting network A come up with nonintuitive but highly performing augmentation strategies”).
Regarding claim 18, the combination of Ciarlini and Flunkert discloses all the limitations of Claim 15 (as shown in the rejections above).
Ciarlini in view of Flunkert further discloses:
The system of claim 15, wherein the customized deep learning mechanism performs obtaining target model parameters by: initializing the target model parameters for the target model based on target time series measurements forecasted based on the base model and labels of training data from the target time series; and (Ciarlini, Col. 7: “By running the OMP algorithm 300, a set of k almost linearly independent series 210 are obtained which explain the target variable”, and Col. 8: “The target time series 210 of interest might vary. The values of the time series 210 are normalized before the training phase, thus coefficients represent how important each time series 210 (and corresponding lag 215) is in the final outcome 260.”) [Examiner note: target model parameters i.e., target variables; set of k series i.e., labels of training data from the target time series.]
Ciarlini in view of Flunkert fails to disclose:
iteratively updating the target model parameters by minimizing a second loss determined based on a discrepancy between target time series measurements predicted using time series data from the target time series and labels of the time series data from the target time series.
However, Bazrafkan explicitly discloses:
iteratively updating the target model parameters by minimizing a second loss determined based on a discrepancy between target time series measurements predicted using time series data from the target time series and labels of the time series data from the target time series. (Bazrafkan, Abstract: “A second loss function is determined for the target neural network by comparing outputs of instances of the target neural network to one or more targets for the neural network. The parameters for the target neural network are updated using the second loss function”, and [0054], lines 28-34: “A training epoch may comprise a number of batches . . . X(T-1 ), X(T), X(T + 1) ... being processed in sequence with each batch being used to train network A using the loss functions from networks A and B, and to train network B using the augmented images generated while training network A and also original images, which together generate the loss for network B.”) [The authors disclose using the second loss function to update the target parameters, and describe how the loss function for the target neural network is calculated by comparing its outputs to the reference targets. This is a way of quantifying the discrepancy, or error, in the network’s predictions compared to the labeled data. The examiner interprets determining a second loss function as minimizing that loss function, because minimization is the well-known purpose of determining a loss function in machine learning. The authors also disclose that a training epoch may comprise multiple batches processed in sequence, with each batch used to generate the loss function for network B (iteratively updating the target model using the second loss function).]
The combination of Ciarlini, Flunkert and Bazrafkan is analogous art because all three references are in the same field of training neural networks for time series classification. Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention, having the teachings of Ciarlini, Flunkert and Bazrafkan before them, to modify the teachings of Ciarlini and Flunkert to include the teachings of Bazrafkan, wherein continuously refining the target model’s parameters involves minimizing a secondary loss calculated based on the difference between target time series measurements predicted using the time series data from the target time series and the labels of that time series data, in order to improve the accuracy of the target model and make its predictions more precise (Bazrafkan, [0010]: “The use of augmentation in deep learning is ubiquitous, and when dealing with images, this can include the application of rotation, translation, blurring and other modifications to existing labelled images in order to improve the training of a target network”).
Claim(s) 7, 14 is/are rejected under 35 U.S.C. 103 as being unpatentable over Ciarlini in view of Flunkert and Bazrafkan, and further in view of Huang et al. (“Adaptive Sampling Towards Fast Graph Representation Learning”, Publication Date: 11/19/2018) (hereafter referred to as “Huang”).
Regarding claim 7, the combination of Ciarlini, Flunkert and Bazrafkan discloses all the limitations of claim 3 (as shown in the rejections above).
Ciarlini in view of Flunkert and Bazrafkan fails to disclose:
The method of claim 3, wherein the first loss includes a graph based portion related to enrichment of hidden representations associated with the base model.
However, Huang explicitly discloses:
The method of claim 3, wherein the first loss includes a graph based portion related to enrichment of hidden representations associated with the base model. (Huang, Page 4:
[equation reproduced as grayscale image: media_image5.png]
and Huang Page 5:
[equation reproduced as grayscale image: media_image6.png]
) [Examiner note: first loss i.e., classification loss LC; graph-based portion i.e., node vi. The process of computing hidden features during bottom-up propagation effectively enriches the representations, and this hidden feature is associated with the base model because vi is defined as the parent node.]
The combination of Ciarlini, Flunkert, Bazrafkan and Huang is analogous art because all four references are in the same field of hierarchical learning. Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention, having the teachings of Ciarlini, Flunkert, Bazrafkan and Huang before them, to modify the teachings of Ciarlini, Flunkert and Bazrafkan to include the teachings of Huang, wherein the initial loss comprises a component based on the graph structure aimed at enhancing the hidden representations associated with the foundational model, in order to minimize the variance in the model training (Huang, Page 5: “To fulfill variance reduction, we add the variance to the loss function and explicitly minimize the variance by model training.”).
Regarding claim 14, the combination of Ciarlini, Flunkert and Bazrafkan discloses all the limitations of Claim 10 (as shown in the rejections above).
Ciarlini in view of Flunkert and Bazrafkan fails to disclose:
The medium of claim 10, wherein the first loss includes a graph based portion related to enrichment of hidden representations associated with the base model.
However, Huang explicitly discloses:
The medium of claim 10, wherein the first loss includes a graph based portion related to enrichment of hidden representations associated with the base model. (Huang, Page 4:
[equation reproduced as grayscale image: media_image5.png]
and Huang Page 5:
[equation reproduced as grayscale image: media_image6.png]
) [Examiner note: first loss i.e., classification loss LC; graph-based portion i.e., node vi. The process of computing hidden features during bottom-up propagation effectively enriches the representations, and this hidden feature is associated with the base model because vi is defined as the parent node.]
The combination of Ciarlini, Flunkert, Bazrafkan and Huang is analogous art because all four references are in the same field of hierarchical learning. Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention, having the teachings of Ciarlini, Flunkert, Bazrafkan and Huang before them, to modify the teachings of Ciarlini, Flunkert and Bazrafkan to include the teachings of Huang, wherein the initial loss comprises a component based on the graph structure aimed at enhancing the hidden representations associated with the foundational model, in order to minimize the variance in the model training (Huang, Page 5: “To fulfill variance reduction, we add the variance to the loss function and explicitly minimize the variance by model training.”).
Conclusion
Any inquiry concerning this communication or earlier communications from the examiner should be directed to AMY TRAN whose telephone number is (571)270-0693. The examiner can normally be reached Monday - Friday 7:30 am - 5:00 pm EST.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, David Yi can be reached at (571)270-7519. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.
/AMY TRAN/Examiner, Art Unit 2126
/VAN C MANG/Primary Examiner, Art Unit 2126