Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Continued Examination Under 37 CFR 1.114
A request for continued examination under 37 CFR 1.114, including the fee set forth in 37 CFR 1.17(e), was filed in this application after final rejection. Since this application is eligible for continued examination under 37 CFR 1.114, and the fee set forth in 37 CFR 1.17(e) has been timely paid, the finality of the previous Office action has been withdrawn pursuant to 37 CFR 1.114. Applicant's submission filed on 05/12/2025 has been entered.
Status of Claims
The present application is being examined based on the claims filed on 05/12/2025.
Claims 1-20 are pending.
Claims 1, 8 and 15 are amended.
Response to Arguments
In reference to rejections under 35 U.S.C. § 103:
Applicant’s arguments filed 05/12/2025, with respect to the rejection(s) of claim(s) under 35 U.S.C. § 103, have been fully considered and are persuasive. Therefore, the rejection has been withdrawn. However, upon further consideration, a new ground(s) of rejection is made in view of Li et al. (“DIFFUSION CONVOLUTIONAL RECURRENT NEURAL NETWORK: DATA-DRIVEN TRAFFIC FORECASTING”).
Claim Rejections - 35 USC § 103
In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis (i.e., changing from AIA to pre-AIA) for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.
Claim(s) 1, 5, 6, 8, 12, 13, 15, 19 and 20 are rejected under 35 U.S.C. 103 as being unpatentable over Ciarlini et al. (US 10,339,235 B1) (hereafter referred to as “Ciarlini”) in view of Flunkert et al. (US 10,936,947 B1) (hereafter referred to as “Flunkert”) and further in view of Li et al. (“DIFFUSION CONVOLUTIONAL RECURRENT NEURAL NETWORK: DATA-DRIVEN TRAFFIC FORECASTING”).
As per claim 1, Ciarlini discloses:
A method implemented on at least one machine including at least one processor, memory, (Ciarlini, Col. 8, Lines 4-6: “each compute node 410 comprises a processor coupled to a memory. The processor may comprise a microprocessor”) and communication platform capable of connecting to a network for machine learning for time series, the method comprising: (Ciarlini, Col. 9, Lines 26-31: “numerous other arrangements of computers, servers storage devices or other components are possible in the exemplary distributed computing environment 400. Such components can communicate with other elements of the exemplary metadata storage environment 100 over any type of network or other communication media.”)
performing hierarchical learning, which comprises deep learning global model parameters of a base model for forecasting time series measurements of a plurality of time series, (Ciarlini, Col. 4, Lines 47-57: “Consider a measurement matrix X where each column is a measured sensor stream or a lag of the measured sensor stream:
[media_image1.png: greyscale image of the measurement matrix X]
. In order to obtain a model for a target series Y, solve the following system of equations: Y = Xb + e, where b is the vector of the coefficients of the model, and e corresponds to the residuals.”) [Examiner’s note: The citation from Ciarlini above describes a linear regression-based model where a target time series (Y) is predicted based on a matrix X of measured sensor streams and their time-lagged versions. Here, X represents a base model structure that includes lagged versions of different sensor time series. The equation Y = Xb + e suggests a linear regression framework, where b represents coefficients and e represents residual errors.]
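As an illustrative aside, the lagged regression structure described in the note above can be sketched as follows (a minimal NumPy sketch; the stream values, lag count, and coefficients are illustrative assumptions, not drawn from Ciarlini):

```python
import numpy as np

# Two illustrative sensor streams (synthetic data, not from Ciarlini).
rng = np.random.default_rng(0)
s1 = rng.normal(size=100)
s2 = rng.normal(size=100)

# Build measurement matrix X: each column is a measured sensor stream
# or a time-lagged version of a stream, as in the cited passage.
lag = 1
X = np.column_stack([s1[lag:], s2[lag:], s1[:-lag], s2[:-lag]])

# Target series Y, constructed from the streams for demonstration.
Y = 0.5 * s1[lag:] - 0.2 * s2[:-lag]

# Solve Y = Xb + e for the coefficient vector b by least squares;
# e is the residual vector.
b, *_ = np.linalg.lstsq(X, Y, rcond=None)
e = Y - X @ b
```

Because Y here lies exactly in the column space of X, the residuals e vanish; with noisy measurements, e would carry the unexplained variation.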
Ciarlini fails to disclose:
querying, across the plurality of time series using the global model parameters of the base model, a forward historical temporal pattern graph and a backward historical temporal pattern graph relevant to a target time series in the plurality of time series,
wherein the target time series has insufficient data to train a target model for forecasting the target time series; and
obtaining target model parameters of the target model by customizing the base model based on the forward historical temporal pattern graph and the backward historical temporal pattern graph.
However, Li explicitly discloses:
querying, across the plurality of time series using the global model parameters of the base model, a forward historical temporal pattern graph and a backward historical temporal pattern graph relevant to a target time series in the plurality of time series, (Li, Pg. 2, Section 2.1: “The goal of traffic forecasting is to predict the future traffic speed given previously observed traffic flow from N correlated sensors on the road network.”, Pg. 2, Section 2.2: “We also include the reversed direction diffusion process, such that the bidirectional diffusion offers the model more flexibility to capture the influence from both the upstream and the downstream traffic”, Pg. 3, Eq. (2) explicitly uses two transition matrices (forward and reverse): it includes both (D_O^{-1}W)^k and (D_I^{-1}W^T)^k, and Li states that these are the transition matrices of the diffusion process and the reverse one, respectively, Pg. 4, Figure 2: “System architecture for the Diffusion Convolutional Recurrent Neural Network designed for spatiotemporal traffic forecasting. The historical time series are fed into an encoder whose final states are used to initialize the decoder. The decoder makes predictions based on either previous ground truth or the model output.”
[media_image2.png: greyscale image of Figure 2, DCRNN system architecture]
Pg. 4, ¶[2]: “The model architecture of DCRNN is shown in Figure 2. The entire network is trained by maximizing the likelihood of generating the target future time series using backpropagation through time. DCRNN is able to capture spatiotemporal dependencies among time series and can be applied to various spatiotemporal forecasting problems.”) [Examiner’s note: Li discloses training the diffusion convolutional recurrent neural network end-to-end using historical observations, whereby the model parameters are adjusted based on the extracted temporal patterns to generate forecasts for a target time series]
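For clarity, the forward and reverse transition matrices cited from Li above can be sketched numerically as follows (a minimal sketch; the adjacency matrix, signal, filter order, and filter parameters are illustrative assumptions, not Li's data or trained weights):

```python
import numpy as np

# Illustrative weighted adjacency matrix W for 3 sensors (not Li's data).
W = np.array([[0., 1., 0.],
              [0., 0., 2.],
              [1., 0., 0.]])
x = np.array([1.0, 2.0, 3.0])  # one signal value per sensor

# Forward and reverse transition matrices, D_O^{-1} W and D_I^{-1} W^T,
# normalized by out-degree and in-degree respectively.
D_out_inv = np.diag(1.0 / W.sum(axis=1))
D_in_inv = np.diag(1.0 / W.sum(axis=0))
P_fwd = D_out_inv @ W
P_rev = D_in_inv @ W.T

# One bidirectional diffusion convolution step with scalar filter
# parameters theta (illustrative values), summing powers k = 0..K-1.
K = 2
theta_fwd = [0.6, 0.3]
theta_rev = [0.5, 0.2]
out = sum(theta_fwd[k] * np.linalg.matrix_power(P_fwd, k) @ x +
          theta_rev[k] * np.linalg.matrix_power(P_rev, k) @ x
          for k in range(K))
```

Each row of P_fwd and P_rev sums to one, so the two matrices act as random-walk transition matrices over the sensor graph in the forward and reverse directions, which is the bidirectional diffusion Li describes.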
obtaining target model parameters of the target model by customizing the base model based on the forward historical temporal pattern graph and the backward historical temporal pattern graph (Li, Pg. 4, Figure 2: “System architecture for the Diffusion Convolutional Recurrent Neural Network designed for spatiotemporal traffic forecasting. The historical time series are fed into an encoder whose final states are used to initialize the decoder. The decoder makes predictions based on either previous ground truth or the model output.”
[media_image2.png: greyscale image of Figure 2, DCRNN system architecture]
Pg. 4, ¶[2]: “The model architecture of DCRNN is shown in Figure 2. The entire network is trained by maximizing the likelihood of generating the target future time series using backpropagation through time. DCRNN is able to capture spatiotemporal dependencies among time series and can be applied to various spatiotemporal forecasting problems.”) [Examiner’s note: Li discloses training the diffusion convolutional recurrent neural network end-to-end using historical observations, whereby the model parameters are adjusted based on the extracted temporal patterns to generate forecasts for a target time series]
It would have been obvious to one of ordinary skill in the art before the effective filing date of the present application to combine Ciarlini and Li. Ciarlini teaches methods for performing massively parallel processing (MPP) of large-scale combinations of time series data. Li teaches modelling traffic flow as a diffusion process on a directed graph and introduces the Diffusion Convolutional Recurrent Neural Network (DCRNN), a deep learning framework for traffic forecasting that incorporates both spatial and temporal dependency in the traffic flow. One of ordinary skill would have been motivated to combine Ciarlini and Li because MPEP 2143 sets forth the Supreme Court rationales for obviousness, including: (D) applying a known technique to a known device (method, or product) ready for improvement to yield predictable results; (E) “obvious to try”: choosing from a finite number of identified, predictable solutions, with a reasonable expectation of success; (F) known work in one field of endeavor may prompt variations of it for use in either the same field or a different one based on design incentives or other market forces if the variations are predictable to one of ordinary skill in the art.
However, Flunkert explicitly discloses:
wherein the target time series has insufficient data to train a target model for forecasting the target time series; and (Flunkert, Col. 3, Lines 46-57: “In at least one embodiment, if a sufficiently large training data set is used, the model may be general enough to be able to make predictions regarding demands for an item for which no (or very few) demand observations were available for training. For example, if information regarding the similarity of a new item Inew to some set of other items {Iold} along one or more dimensions such as item/product category, price, etc. is provided, where demand observations for {Iold} items were used to train the model while demand observations for Inew was not used to train the model, the model may still be able to provide useful predictions for Inew demand in such an embodiment.”) [Examiner’s note: Inew represents the target time series with insufficient data, while Iold represents other time series that have sufficient historical demand data. The model compensates for the missing data by leveraging correlations with similar items to predict the demand for the target item.]
obtaining target model parameters of the target model by customizing the base model based on the extracted temporal patterns relevant to the target time series (Flunkert, Col. 5, Lines 54-65: “If the evaluations indicate that the model does not meet a desired quality/accuracy criterion, the model may be adjusted in some embodiments---e.g., various hyperparameters, initial parameters and/or feature extraction techniques may be modified and the model may be retrained. In at least one embodiment, new versions of the models may be generated over time as new demand observations are obtained. For example, in one implementation, new demand forecasts for K weeks into the future may be generated every week using demand data collected over a time window of the previous N weeks as input for the composite modeling methodology.”) [Examiner’s note: target model parameters i.e., new demand forecasts for K weeks, customizing the base model i.e., adjust or modify the demand data collected over a time window of previous N weeks]
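The rolling schedule in the Flunkert citation above (refit each week on the previous N weeks, then forecast K weeks ahead) can be sketched as follows (a minimal sketch; the window sizes, data, and the trivial moving-average stand-in for the model are illustrative assumptions, not Flunkert's method):

```python
# Rolling retraining schedule: each week, refit on the previous N weeks
# of observations and generate a K-week-ahead forecast.
N, K = 8, 2  # illustrative window sizes, not from Flunkert
history = list(range(20))  # illustrative weekly demand observations

forecasts = {}
for week in range(N, len(history)):
    window = history[week - N:week]   # demand data from the previous N weeks
    mean = sum(window) / N            # trivial moving-average stand-in model
    forecasts[week] = [mean] * K      # K-week-ahead forecast issued this week
```

In Flunkert's framework the stand-in model would instead be a retrained RNN; the point of the sketch is only the weekly refit-and-forecast cadence.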
It would have been obvious to one of ordinary skill in the art before the effective filing date of the present application to combine Ciarlini and Flunkert. Ciarlini teaches methods for performing massively parallel processing (MPP) of large-scale combinations of time series data. Flunkert teaches that a recurrent neural network model is trained using a plurality of time series of demand observations to generate demand forecasts for various items. One of ordinary skill would have been motivated to combine Ciarlini and Flunkert because time series data can exhibit changes over time due to external factors. Extracting temporal patterns allows the model to adapt to these changes by recognizing and incorporating them into predictions. Customizing the model based on these extracted patterns helps maintain the relevance and accuracy of the model even as conditions evolve. (Flunkert, Col. 14, Lines 49-67)
As per claim 5, the combination of Ciarlini and Flunkert discloses all the limitations of Claim 1 (as shown in the rejections above).
Ciarlini in view of Flunkert further discloses:
The method of claim 1, wherein the deep learning of the base model and the customizing the base model are performed simultaneously during the hierarchical learning (Ciarlini, Col. 6: “The model 230 in each group 220 can be computed substantially in parallel on distributed compute nodes and, since the input matrix is small, the processing time is relatively short…In the generated models, the same time series 210 can have multiple coefficients, one for each different time lag”, and Col. 7:
[media_image3.png: greyscale image of an excerpt from Ciarlini, Col. 7]
) [Examiner note: Ciarlini discloses that the models are computed in parallel using the OMP algorithm and that the coefficients of the base model are updated at the same time]
As per claim 6, the combination of Ciarlini and Flunkert discloses all the limitations of Claim 1 (as shown in the rejections above).
Ciarlini in view of Flunkert further discloses:
The method of claim 1, wherein the deep learning of the base model and the customizing the base model are performed in sequence during the hierarchical learning. (Ciarlini, Col. 6-7: “In this case, there is a plurality of hierarchical learning levels to generate the final model and in each intermediate level of the hierarchy, intermediate compute nodes execute both the roles of master compute node for compute nodes of the lower hierarchical level and working compute nodes for the compute nodes of the upper hierarchical level. Intermediate compute nodes receive selected variables and scores from lower-level compute nodes and perform the following steps: rank the variables; select a pre-defined number of variables based on their scores to be considered as input for the generation of an intermediate linear model using an Orthogonal Matching Pursuit algorithm; assign a score to each variable of the intermediate model; and provide such variables and their corresponding scores to the upper level in the hierarchy.”) [Examiner note: the procedure is generalized by creating multiple learning stages and the involvement of intermediate compute nodes in the hierarchical structure reinforces the sequential nature of the process. These nodes receive information from lower-level compute nodes and perform specific steps of customization before providing variables and scores to the upper level.]
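Since the note above turns on Ciarlini's use of the Orthogonal Matching Pursuit algorithm, a minimal sketch of OMP's greedy select-then-refit loop may help (the function name, interface, and test design are illustrative, not Ciarlini's implementation):

```python
import numpy as np

def omp(X, y, n_select):
    """Minimal Orthogonal Matching Pursuit: greedily select the column of X
    most correlated with the current residual, then refit all selected
    columns by least squares. Assumes n_select >= 1."""
    residual = y.copy()
    selected = []
    for _ in range(n_select):
        # Score each column by correlation with the residual.
        scores = np.abs(X.T @ residual)
        scores[selected] = -np.inf  # never reselect a chosen column
        selected.append(int(np.argmax(scores)))
        # Refit coefficients jointly on all selected columns.
        coef, *_ = np.linalg.lstsq(X[:, selected], y, rcond=None)
        residual = y - X[:, selected] @ coef
    return selected, coef
```

In Ciarlini's hierarchy, each intermediate node would run a loop of this shape over its candidate variables, then pass the selected variables and their scores up a level.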
As per claim 8, Ciarlini discloses:
Machine readable and non-transitory medium having information recorded thereon for machine learning for time series, wherein the information, once read by a machine, causes the machine to (Ciarlini, Col. 9, Lines 61-63: “a computer readable storage medium, as used herein, is not to be construed as being transitory signals,”, and Col. 10, Lines 17-19: “the system includes distinct software modules, each being embodied on a tangible computer-readable recordable storage medium”, and Ciarlini, Abstract: “Methods and apparatus are provided for performing massively parallel processing (MPP) large-scale combinations of time series data.”)
perform hierarchical learning by (Ciarlini, Col. 6, Lines 53-54: “the procedure can be generalized by creating multiple hierarchical learning stages”)
performing hierarchical learning, which comprises deep learning global model parameters of a base model for forecasting time series measurements of a plurality of time series, (Ciarlini, Col. 4, Lines 47-57: “Consider a measurement matrix X where each column is a measured sensor stream or a lag of the measured sensor stream:
[media_image1.png: greyscale image of the measurement matrix X]
. In order to obtain a model for a target series Y, solve the following system of equations: Y = Xb + e, where b is the vector of the coefficients of the model, and e corresponds to the residuals.”) [Examiner’s note: The citation from Ciarlini above describes a linear regression-based model where a target time series (Y) is predicted based on a matrix X of measured sensor streams and their time-lagged versions. Here, X represents a base model structure that includes lagged versions of different sensor time series. The equation Y = Xb + e suggests a linear regression framework, where b represents coefficients and e represents residual errors.]
Ciarlini fails to disclose:
querying, across the plurality of time series using the global model parameters of the base model, a forward historical temporal pattern graph and a backward historical temporal pattern graph relevant to a target time series in the plurality of time series,
wherein the target time series has insufficient data to train a target model for forecasting the target time series; and
obtaining target model parameters of the target model by customizing the base model based on the forward historical temporal pattern graph and the backward historical temporal pattern graph.
However, Li explicitly discloses:
querying, across the plurality of time series using the global model parameters of the base model, a forward historical temporal pattern graph and a backward historical temporal pattern graph relevant to a target time series in the plurality of time series, (Li, Pg. 2, Section 2.1: “The goal of traffic forecasting is to predict the future traffic speed given previously observed traffic flow from N correlated sensors on the road network.”, Pg. 2, Section 2.2: “We also include the reversed direction diffusion process, such that the bidirectional diffusion offers the model more flexibility to capture the influence from both the upstream and the downstream traffic”, Pg. 3, Eq. (2) explicitly uses two transition matrices (forward and reverse): it includes both (D_O^{-1}W)^k and (D_I^{-1}W^T)^k, and Li states that these are the transition matrices of the diffusion process and the reverse one, respectively, Pg. 4, Figure 2: “System architecture for the Diffusion Convolutional Recurrent Neural Network designed for spatiotemporal traffic forecasting. The historical time series are fed into an encoder whose final states are used to initialize the decoder. The decoder makes predictions based on either previous ground truth or the model output.”
[media_image2.png: greyscale image of Figure 2, DCRNN system architecture]
Pg. 4, ¶[2]: “The model architecture of DCRNN is shown in Figure 2. The entire network is trained by maximizing the likelihood of generating the target future time series using backpropagation through time. DCRNN is able to capture spatiotemporal dependencies among time series and can be applied to various spatiotemporal forecasting problems.”) [Examiner’s note: Li discloses training the diffusion convolutional recurrent neural network end-to-end using historical observations, whereby the model parameters are adjusted based on the extracted temporal patterns to generate forecasts for a target time series]
obtaining target model parameters of the target model by customizing the base model based on the forward historical temporal pattern graph and the backward historical temporal pattern graph (Li, Pg. 4, Figure 2: “System architecture for the Diffusion Convolutional Recurrent Neural Network designed for spatiotemporal traffic forecasting. The historical time series are fed into an encoder whose final states are used to initialize the decoder. The decoder makes predictions based on either previous ground truth or the model output.”
[media_image2.png: greyscale image of Figure 2, DCRNN system architecture]
Pg. 4, ¶[2]: “The model architecture of DCRNN is shown in Figure 2. The entire network is trained by maximizing the likelihood of generating the target future time series using backpropagation through time. DCRNN is able to capture spatiotemporal dependencies among time series and can be applied to various spatiotemporal forecasting problems.”) [Examiner’s note: Li discloses training the diffusion convolutional recurrent neural network end-to-end using historical observations, whereby the model parameters are adjusted based on the extracted temporal patterns to generate forecasts for a target time series]
It would have been obvious to one of ordinary skill in the art before the effective filing date of the present application to combine Ciarlini and Li. Ciarlini teaches methods for performing massively parallel processing (MPP) of large-scale combinations of time series data. Li teaches modelling traffic flow as a diffusion process on a directed graph and introduces the Diffusion Convolutional Recurrent Neural Network (DCRNN), a deep learning framework for traffic forecasting that incorporates both spatial and temporal dependency in the traffic flow. One of ordinary skill would have been motivated to combine Ciarlini and Li because MPEP 2143 sets forth the Supreme Court rationales for obviousness, including: (D) applying a known technique to a known device (method, or product) ready for improvement to yield predictable results; (E) “obvious to try”: choosing from a finite number of identified, predictable solutions, with a reasonable expectation of success; (F) known work in one field of endeavor may prompt variations of it for use in either the same field or a different one based on design incentives or other market forces if the variations are predictable to one of ordinary skill in the art.
However, Flunkert explicitly discloses:
extracting, across the plurality of time series, temporal patterns relevant to a target time series in the plurality of time series, (Flunkert, Col. 4, Lines 2-8: “In at least one embodiment in which information about the similarity between a particular target item Ia (for which demand is to be forecast) and another item Ib (whose demand observations were used for training) is provided as input to the RNN model, the number of actual demand observations available for lb may exceed the number of demand observations available for Ia”, Col. 3, Lines 50-64: “For example, if information regarding the similarity of a new item Inew to some set of other items {Iold} along one or more dimensions such as item/product category, price, etc. is provided, where demand observations for {Iold} items were used to train the model while demand observations for Inew was not used to train the model, the model may still be able to provide useful predictions for Inew demand in such an embodiment. The accuracy of the forecasts may increase with the amount of information available about the items being considered in at least some embodiments- e.g., if several weeks of actual demand observations an item Ij are used to train the model, and several months or years of demand observations for another item Ik are used to train the model, the forecasts for Ik may tend to be more accurate than the forecasts for Ij”) [Examiner’s note: temporal patterns relevant to the target time series i.e., the number of actual demand observations]
wherein the target time series has insufficient data to train a target model for forecasting the target time series; and (Flunkert, Col. 3, Lines 46-57: “In at least one embodiment, if a sufficiently large training data set is used, the model may be general enough to be able to make predictions regarding demands for an item for which no (or very few) demand observations were available for training. For example, if information regarding the similarity of a new item Inew to some set of other items {Iold} along one or more dimensions such as item/product category, price, etc. is provided, where demand observations for {Iold} items were used to train the model while demand observations for Inew was not used to train the model, the model may still be able to provide useful predictions for Inew demand in such an embodiment.”) [Examiner’s note: Inew represents the target time series with insufficient data, while Iold represents other time series that have sufficient historical demand data. The model compensates for the missing data by leveraging correlations with similar items to predict the demand for the target item.]
obtaining target model parameters of the target model by customizing the base model based on the extracted temporal patterns relevant to the target time series (Flunkert, Col. 5, Lines 54-65: “If the evaluations indicate that the model does not meet a desired quality/accuracy criterion, the model may be adjusted in some embodiments---e.g., various hyperparameters, initial parameters and/or feature extraction techniques may be modified and the model may be retrained. In at least one embodiment, new versions of the models may be generated over time as new demand observations are obtained. For example, in one implementation, new demand forecasts for K weeks into the future may be generated every week using demand data collected over a time window of the previous N weeks as input for the composite modeling methodology.”) [Examiner’s note: target model parameters i.e., new demand forecasts for K weeks, customizing the base model i.e., adjust or modify the demand data collected over a time window of previous N weeks]
It would have been obvious to one of ordinary skill in the art before the effective filing date of the present application to combine Ciarlini and Flunkert. Ciarlini teaches methods for performing massively parallel processing (MPP) of large-scale combinations of time series data. Flunkert teaches that a recurrent neural network model is trained using a plurality of time series of demand observations to generate demand forecasts for various items. One of ordinary skill would have been motivated to combine Ciarlini and Flunkert because time series data can exhibit changes over time due to external factors. Extracting temporal patterns allows the model to adapt to these changes by recognizing and incorporating them into predictions. Customizing the model based on these extracted patterns helps maintain the relevance and accuracy of the model even as conditions evolve. (Flunkert, Col. 14, Lines 49-67)
As per claim 12, the combination of Ciarlini and Flunkert discloses all the limitations of Claim 8 (as shown in the rejections above).
Ciarlini in view of Flunkert further discloses:
wherein the deep learning of the base model and the customizing the base model are performed simultaneously during the hierarchical learning (Ciarlini, Col. 6: “The model 230 in each group 220 can be computed substantially in parallel on distributed compute nodes and, since the input matrix is small, the processing time is relatively short…In the generated models, the same time series 210 can have multiple coefficients, one for each different time lag”, and Col. 7:
[media_image3.png: greyscale image of an excerpt from Ciarlini, Col. 7]
) [Examiner note: Ciarlini discloses that the models are computed in parallel using the OMP algorithm and that the coefficients of the base model are updated at the same time]
As per claim 13, the combination of Ciarlini and Flunkert discloses all the limitations of Claim 8 (as shown in the rejections above).
Ciarlini in view of Flunkert further discloses:
wherein the deep learning of the base model and the customizing the base model are performed in sequence during the hierarchical learning. (Ciarlini, Col. 6-7: “In this case, there is a plurality of hierarchical learning levels to generate the final model and in each intermediate level of the hierarchy, intermediate compute nodes execute both the roles of master compute node for compute nodes of the lower hierarchical level and working compute nodes for the compute nodes of the upper hierarchical level. Intermediate compute nodes receive selected variables and scores from lower-level compute nodes and perform the following steps: rank the variables; select a pre-defined number of variables based on their scores to be considered as input for the generation of an intermediate linear model using an Orthogonal Matching Pursuit algorithm; assign a score to each variable of the intermediate model; and provide such variables and their corresponding scores to the upper level in the hierarchy.”) [Examiner note: the procedure is generalized by creating multiple learning stages and the involvement of intermediate compute nodes in the hierarchical structure reinforces the sequential nature of the process. These nodes receive information from lower-level compute nodes and perform specific steps of customization before providing variables and scores to the upper level.]
As per claim 15, Ciarlini discloses:
A system for machine learning for time series, comprising: (Ciarlini, Abstract: “Methods and apparatus are provided for performing massively parallel processing (MPP) large-scale combinations of time series data.”)
a general deep machine learning mechanism configured for deep learning global model parameters of a base model for forecasting time series measurements of a plurality of time series, (Ciarlini, Col. 4, Lines 47-57: “Consider a measurement matrix X where each column is a measured sensor stream or a lag of the measured sensor stream:
[media_image1.png: greyscale image of the measurement matrix X]
. In order to obtain a model for a target series Y, solve the following system of equations: Y = Xb + e, where b is the vector of the coefficients of the model, and e corresponds to the residuals.”) [Examiner’s note: The citation from Ciarlini above describes a linear regression-based model where a target time series (Y) is predicted based on a matrix X of measured sensor streams and their time-lagged versions. Here, X represents a base model structure that includes lagged versions of different sensor time series. The equation Y = Xb + e suggests a linear regression framework, where b represents coefficients and e represents residual errors.]
Ciarlini fails to disclose:
a customized deep learning mechanism configured for: querying, across the plurality of time series using the global model parameters of the base model, a forward historical temporal pattern graph and a backward historical temporal pattern graph relevant to a target time series in the plurality of time series,
wherein the target time series has insufficient data to train a target model for forecasting the target time series; and
obtaining target model parameters of the target model by customizing the base model based on the forward historical temporal pattern graph and the backward historical temporal pattern graph.
However, Li explicitly discloses:
querying, across the plurality of time series using the global model parameters of the base model, a forward historical temporal pattern graph and a backward historical temporal pattern graph relevant to a target time series in the plurality of time series, (Li, Pg. 2, Section 2.1: “The goal of traffic forecasting is to predict the future traffic speed given previously observed traffic flow from N correlated sensors on the road network.”, Pg. 2, Section 2.2: “We also include the reversed direction diffusion process, such that the bidirectional diffusion offers the model more flexibility to capture the influence from both the upstream and the downstream traffic”, Pg. 3, Eq. (2) explicitly uses two transition matrices (forward and reverse): it includes both (D_O^{-1}W)^k and (D_I^{-1}W^T)^k, and Li states that these are the transition matrices of the diffusion process and the reverse one, respectively, Pg. 4, Figure 2: “System architecture for the Diffusion Convolutional Recurrent Neural Network designed for spatiotemporal traffic forecasting. The historical time series are fed into an encoder whose final states are used to initialize the decoder. The decoder makes predictions based on either previous ground truth or the model output.”
PNG
media_image2.png
361
858
media_image2.png
Greyscale
Pg. 4, ¶[2]: “The model architecture of DCRNN is shown in Figure 2. The entire network is trained by maximizing the likelihood of generating the target future time series using backpropagation through time. DCRNN is able to capture spatiotemporal dependencies among time series and can be applied to various spatiotemporal forecasting problems.”) [Examiner’s note: Li discloses training the diffusion convolutional recurrent neural network end-to-end using historical observations, whereby the model parameters are adjusted based on the extracted temporal patterns to generate forecasts for a target time series]
obtaining target model parameters of the target model by customizing the base model based on the forward historical temporal pattern graph and the backward historical temporal pattern graph (Li, Pg. 4, Figure 2: “System architecture for the Diffusion Convolutional Recurrent Neural Network designed for spatiotemporal traffic forecasting. The historical time series are fed into an encoder whose final states are used to initialize the decoder. The decoder makes predictions based on either previous ground truth or the model output.”
[Image: media_image2.png (reproduction of Li, Figure 2), greyscale]
Pg. 4, ¶[2]: “The model architecture of DCRNN is shown in Figure 2. The entire network is trained by maximizing the likelihood of generating the target future time series using backpropagation through time. DCRNN is able to capture spatiotemporal dependencies among time series and can be applied to various spatiotemporal forecasting problems.”) [Examiner’s note: Li discloses training the diffusion convolutional recurrent neural network end-to-end using historical observations, whereby the model parameters are adjusted based on the extracted temporal patterns to generate forecasts for a target time series]
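For illustration only, the bidirectional diffusion convolution of Li, Eq. (2), as characterized above, can be sketched as follows. This is an editorial sketch assuming the quoted forward and reverse transition matrices D_O^{-1}W and D_I^{-1}W^T; it is not Li's implementation, and the names are hypothetical.

```python
import numpy as np

# Editorial sketch of the bidirectional diffusion step (assumed form, not
# Li's code): the forward transition matrix is D_O^{-1} W (out-degree
# normalized) and the reverse transition matrix is D_I^{-1} W^T (in-degree
# normalized); the convolution sums theta_f[k] P_f^k X + theta_b[k] P_b^k X.
def diffusion_conv(W, X, theta_f, theta_b):
    P_f = np.diag(1.0 / W.sum(axis=1)) @ W    # D_O^{-1} W  (forward)
    P_b = np.diag(1.0 / W.sum(axis=0)) @ W.T  # D_I^{-1} W^T (reverse)
    out = np.zeros_like(X)
    Pf_k = np.eye(W.shape[0])  # P_f^0
    Pb_k = np.eye(W.shape[0])  # P_b^0
    for tf, tb in zip(theta_f, theta_b):
        out += tf * (Pf_k @ X) + tb * (Pb_k @ X)
        Pf_k, Pb_k = Pf_k @ P_f, Pb_k @ P_b
    return out
```

Each transition matrix is row-stochastic by construction, so the k = 0 term with unit forward weight reproduces the input signal unchanged.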
It would have been obvious to one of ordinary skill in the art before the effective filing date of the present application to combine Ciarlini and Li. Ciarlini teaches methods for performing massively parallel processing (MPP) on large-scale combinations of time series data. Li teaches modelling traffic flow as a diffusion process on a directed graph and introduces the Diffusion Convolutional Recurrent Neural Network (DCRNN), a deep learning framework for traffic forecasting that incorporates both spatial and temporal dependency in the traffic flow. One of ordinary skill would have had motivation to combine Ciarlini and Li because MPEP 2143 sets forth the Supreme Court rationales for obviousness, including: (D) applying a known technique to a known device (method, or product) ready for improvement to yield predictable results; (E) “obvious to try”: choosing from a finite number of identified, predictable solutions, with a reasonable expectation of success; (F) known work in one field of endeavor may prompt variations of it for use in either the same field or a different one based on design incentives or other market forces if the variations are predictable to one of ordinary skill in the art.
However, Flunkert explicitly discloses:
a customized deep learning mechanism configured for: extracting, across the plurality of time series, temporal patterns relevant to a target time series in the plurality of time series, (Flunkert, Col. 4, Lines 2-8: “In at least one embodiment in which information about the similarity between a particular target item Ia (for which demand is to be forecast) and another item Ib (whose demand observations were used for training) is provided as input to the RNN model, the number of actual demand observations available for Ib may exceed the number of demand observations available for Ia”, Col. 3, Lines 50-64: “For example, if information regarding the similarity of a new item Inew to some set of other items {Iold} along one or more dimensions such as item/product category, price, etc. is provided, where demand observations for {Iold} items were used to train the model while demand observations for Inew was not used to train the model, the model may still be able to provide useful predictions for Inew demand in such an embodiment. The accuracy of the forecasts may increase with the amount of information available about the items being considered in at least some embodiments- e.g., if several weeks of actual demand observations for an item Ij are used to train the model, and several months or years of demand observations for another item Ik are used to train the model, the forecasts for Ik may tend to be more accurate than the forecasts for Ij”) [Examiner’s note: temporal patterns relevant to the target time series i.e., the number of actual demand observations]
wherein the target time series has insufficient data to train a target model for forecasting the target time series; and (Flunkert, Col. 3, Lines 46-57: “In at least one embodiment, if a sufficiently large training data set is used, the model may be general enough to be able to make predictions regarding demands for an item for which no (or very few) demand observations were available for training. For example, if information regarding the similarity of a new item Inew to some set of other items {Iold} along one or more dimensions such as item/product category, price, etc. is provided, where demand observations for {Iold} items were used to train the model while demand observations for Inew was not used to train the model, the model may still be able to provide useful predictions for Inew demand in such an embodiment.”) [Examiner’s note: Inew represents the target time series with insufficient data, while Iold represents other time series that have sufficient historical demand data. The model compensates for the missing data by leveraging correlations with similar items to predict the demand for the target item.]
obtaining target model parameters of the target model by customizing the base model based on the extracted temporal patterns relevant to the target time series (Flunkert, Col. 5, Lines 54-65: “If the evaluations indicate that the model does not meet a desired quality/accuracy criterion, the model may be adjusted in some embodiments---e.g., various hyperparameters, initial parameters and/or feature extraction techniques may be modified and the model may be retrained. In at least one embodiment, new versions of the
models may be generated over time as new demand observations are obtained. For example, in one implementation, new demand forecasts for K weeks into the future may be generated every week using demand data collected over a time window of the previous N weeks as input for the composite modeling methodology.”) [Examiner’s note: target model parameters i.e., new demand forecasts for K weeks, customizing the base model i.e., adjust or modify the demand data collected over a time window of previous N weeks]
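For illustration only, the customization pattern described in the Flunkert citations above (target model parameters initialized from the base model, then adjusted against the limited labelled data of the target series) can be sketched as follows. This is an editorial sketch with hypothetical names, not code from the reference; a simple linear forecaster stands in for the RNN model.

```python
import numpy as np

# Editorial sketch (hypothetical names): initialize target model parameters
# from the base (global) model, then refine them with gradient steps that
# reduce the squared discrepancy between forecasts and labels on the small
# amount of data available for the target series.
def customize(base_params, X_target, y_target, lr=0.1, steps=200):
    w = np.asarray(base_params, dtype=float).copy()  # start from base model
    for _ in range(steps):
        err = X_target @ w - y_target                 # forecast discrepancy
        grad = 2.0 * X_target.T @ err / len(y_target)
        w -= lr * grad                                # minimize squared loss
    return w
```

Starting from zeroed base parameters on a small, well-conditioned target set, the refined parameters converge to the coefficients that generated the labels.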
It would have been obvious to one of ordinary skill in the art before the effective filing date of the present application to combine Ciarlini and Flunkert. Ciarlini teaches methods for performing massively parallel processing (MPP) on large-scale combinations of time series data. Flunkert teaches that a recurrent neural network model is trained using a plurality of time series of demand observations to generate demand forecasts for various items. One of ordinary skill would have had motivation to combine Ciarlini and Flunkert because time series data can exhibit changes over time due to external factors. Extracting temporal patterns allows the model to adapt to these changes by recognizing and incorporating them into predictions. Customizing the model based on these extracted patterns helps in maintaining the relevance and accuracy of the model even as conditions evolve. (Flunkert, Col. 14, Lines 49-67)
As per claim 19, the combination of Ciarlini and Flunkert discloses all the limitations of Claim 15 (as shown in the rejections above)
Ciarlini in view of Flunkert further discloses:
wherein the deep learning of the base model and the customizing the base model are performed simultaneously during the hierarchical learning (Ciarlini, Col. 6: “The model 230 in each group 220 can be computed substantially in parallel on distributed compute nodes and, since the input matrix is small, the processing time is relatively short…In the generated models, the same time series 210 can have multiple coefficients, one for each different time lag”, and Col. 7:
[Image: media_image3.png (reproduction from Ciarlini, Col. 7), greyscale]
) [Examiner note: Ciarlini discloses that the models are computed in parallel using the OMP algorithm and then the coefficients of the base model will be updated at the same time]
As per claim 20, the combination of Ciarlini and Flunkert discloses all the limitations of Claim 15 (as shown in the rejections above)
Ciarlini in view of Flunkert further discloses:
wherein the deep learning of the base model and the customizing the base model are performed in sequence during the hierarchical learning. (Ciarlini, Col. 6-7: “In this case, there is a plurality of hierarchical learning levels to generate the final model and in each intermediate level of the hierarchy, intermediate compute nodes execute both the roles of master compute node for compute nodes of the lower hierarchical level and working compute nodes for the compute nodes of the upper hierarchical level. Intermediate compute nodes receive selected variables and scores from lower-level compute nodes and perform the following steps: rank the variables; select a pre-defined number of variables based on their scores to be considered as input for the generation of an intermediate linear model using an Orthogonal Matching Pursuit algorithm; assign a score to each variable of the intermediate model; and provide such variables and their corresponding scores to the upper level in the hierarchy.”) [Examiner note: the procedure is generalized by creating multiple learning stages and the involvement of intermediate compute nodes in the hierarchical structure reinforces the sequential nature of the process. These nodes receive information from lower-level compute nodes and perform specific steps of customization before providing variables and scores to the upper level.]
Claim(s) 2, 9, 16 is/are rejected under 35 U.S.C. 103 as being unpatentable over Ciarlini in view of Flunkert and further in view of Deshpande & Sarawagi (“Streaming Adaptation of Deep Forecasting Models using Adaptive Recurrent Units”, Publication Date: 07/04/2019) (hereafter referred to as "Deshpande")
Regarding Claim 2, the combination of Ciarlini and Flunkert discloses all the limitations of Claim 1 (as shown in the rejections above)
Ciarlini in view of Flunkert further discloses:
the target model is learned specifically for forecasting a time series measurement of the corresponding target time series. (Ciarlini, Col. 3, Lines 27-29: “… where one or more of these time series are selected to explain, by a linear model, one particular time series of interest, referred to as a target time series.”, Col. 3, line 58: “The massive parallelization allows, for example, many thousands of time series (and time lags) to be processed that are candidates to be used in the model that explains or predicts a target time series”) [The examiner interprets the “linear model” here as the “target model” which is used for predicting or explaining one particular time series of interest (forecasting the target time series). Examiner note: explain time series, i.e., predict or forecast time series; Ciarlini uses these terms interchangeably to express the meaning of forecasting]
Ciarlini in view of Flunkert fails to disclose:
The method of claim 1, wherein the base model is learned generically for forecasting a time series measurement of any of the plurality of time series; and
However, Deshpande explicitly discloses:
The method of claim 1, wherein the base model is learned generically for forecasting a time series measurement of any of the plurality of time series; and (Deshpande, Page 5, Col. 2, Section 4.2: “As a baseline we use the globally trained model where the decoder makes independent predictions for each of the future K points in time.”) [Examiner note: global model i.e., base model, predictions for each K points in time i.e., forecasting time series measurements of a plurality of time series]
The combination of Ciarlini, Flunkert and Deshpande is analogous art because they are in the same field of training time series data. Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention, having the teachings of Ciarlini, Flunkert and Deshpande before them, to modify the teachings of Ciarlini and Flunkert to include the teachings of Deshpande of training the global model for predicting time series data in order to capture a summary of the historical context that is used as a shared representation for making future predictions in the time series forecasting model (Deshpande, Page 2, Section 2: “Typical deep learning based global models for multi-horizon time series forecasting [11, 29] deploy the encoder-decoder architecture. The output of the encoder is its final state giT at the end of T steps. This can be treated as a summary of the known y values that is relevant as a context for future predictions”)
Regarding claim 9, the combination of Ciarlini and Flunkert discloses all the limitations of Claim 8 (as shown in the rejections above)
Ciarlini in view of Flunkert further discloses:
the target model is learned specifically for forecasting a time series measurement of the corresponding target time series. (Ciarlini, Col. 4, Lines 65-67: “… the resulting model can capture most of the relevant influences on the target time series.”) [The examiner interprets the “resulting model” here as “the target model” (please see above). The authors disclose that the resulting model (target model) after being trained will be capable of capturing or predicting the relevant influences (the target time series measurement) on the target time series. ]
Ciarlini in view of Flunkert fails to disclose:
The medium of claim 8, wherein the base model is learned generically for forecasting a time series measurement of any of the plurality of time series; and
However, Deshpande explicitly discloses:
the base model is learned generically for forecasting a time series measurement of any of the plurality of time series; and (Deshpande, Page 5, Col. 2, Section 4.2: “As a baseline we use the globally trained model where the decoder makes independent predictions for each of the future K points in time.”) [Examiner note: global model i.e., base model, predictions for each K points in time i.e., forecasting time series measurements of a plurality of time series]
The combination of Ciarlini, Flunkert and Deshpande is analogous art because they are in the same field of training time series data. Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention, having the teachings of Ciarlini, Flunkert and Deshpande before them, to modify the teachings of Ciarlini and Flunkert to include the teachings of Deshpande of training the global model for predicting time series data in order to capture a summary of the historical context that is used as a shared representation for making future predictions in the time series forecasting model (Deshpande, Page 2, Section 2: “Typical deep learning based global models for multi-horizon time series forecasting [11, 29] deploy the encoder-decoder architecture. The output of the encoder is its final state giT at the end of T steps. This can be treated as a summary of the known y values that is relevant as a context for future predictions”)
Regarding claim 16, the combination of Ciarlini and Flunkert discloses all the limitations of Claim 15 (as shown in the rejection above)
Ciarlini in view of Flunkert further discloses:
the target model is learned specifically for forecasting a time series measurement of the corresponding target time series. (Ciarlini, Col. 4, Lines 65-67: “… the resulting model can capture most of the relevant influences on the target time series.”) [The examiner interprets the “resulting model” here as “the target model” (please see above). The authors disclose that the resulting model (target model) after being trained will be capable of capturing or predicting the relevant influences (the target time series measurement) on the target time series. ]
Ciarlini in view of Flunkert fails to disclose:
The system of claim 15, wherein the base model is learned generically for forecasting a time series measurement of any of the plurality of time series; and
However, Deshpande explicitly discloses:
The system of claim 15, wherein the base model is learned generically for forecasting a time series measurement of any of the plurality of time series; and (Deshpande, Page 5, Col. 2, Section 4.2: “As a baseline we use the globally trained model where the decoder makes independent predictions for each of the future K points in time.”) [Examiner note: global model i.e., base model, predictions for each K points in time i.e., forecasting time series measurements of a plurality of time series]
The same motivation in Claim 2 applies to Claim 16.
Claim(s) 3, 4, 10, 11, 17, and 18 is/are rejected under 35 U.S.C. 103 as being unpatentable over Ciarlini in view of Flunkert and further in view of Bazrafkan et al. (US 2018/0211164 A1) (hereafter referred to as "Bazrafkan")
Regarding claim 3, the combination of Ciarlini and Flunkert discloses all the limitations of Claim 1 (as shown in the rejections above)
Ciarlini in view of Flunkert further discloses:
The method of claim 1, wherein the step of deep learning comprises:
receiving training data across the plurality of time series; (Ciarlini, Abstract: “A given working compute node in a distributed computing environment obtains a given group of time series data of a plurality of groups of time series data;”)
forecasting time series measurements of the training data based on the global model parameters of the base model; and (Ciarlini, Col. 6, Lines 53-54: “the procedure can be generalized by creating multiple hierarchical learning stages”, and Ciarlini, Col. 4, Line 58:
[Image: media_image4.png (reproduction of Ciarlini, Col. 4, equation), greyscale]
) [Examiner note: the global model parameters i.e., vector b of the coefficients of the base model. The model i.e., base model because the target model is obtained by solving the equation which contains this model] (Ciarlini, Col. 3, Lines 55-58: “The massive parallelization allows, for example, many thousands of time series (and time lags) to be processed that are candidates to be used in the model that explains or predicts a target time series”) [The authors disclose that the model will be used to explain or predicts a target time series of many thousands of time series (a plurality of time series), and as mentioned above, the coefficients vector b is the global parameter of this model, so the examiner interprets that the coefficients vector will be used for predicting time series.]
Ciarlini in view of Flunkert fails to disclose:
updating the global model parameters by minimizing a first loss determined based on the forecasted time series measurements from the training data and labels of the training data.
However, Bazrafkan explicitly discloses:
updating the global model parameters by minimizing a first loss determined based on the forecasted time series measurements from the training data and labels of the training data. (Bazrafkan, Abstract: “the parameters for the generative network are updated using the first and second loss functions”, and [0029]: “During each training epoch, an instance of Network A accepts at least one sample, … and generates Out1, a new sample in the same class so that this new sample reduces the loss function”, and [0032]: “Network A is a neural network, such as a generative model network …”) (Bazrafkan. [0033]: “the direct loss function LA for each instance of network A accepts Out1 and another image Ii from the same class in the dataset I1 ... Ik as input and can be calculated using a mean square error or any similar suitable measure. These measures can then be combined to provide the loss function LA for a batch.”, and [0030]: “The input can for example comprise a feature vector or time-series (like sounds, medical data, etc) as long as this is labelled”) [The authors disclose that the parameters (global model parameters) for the generative network, which is defined as Network A, are updated using first loss function, then they also disclose that during the training process, Network A will generate Out1 for reducing the loss function.] [The authors disclose that the loss function LA accepts an input from the training dataset I1 … Ik , then they also disclose that the input comprises a time series as long as it is labelled. Thus, the examiner notes that the loss function is generated based on the time series from the training data and labels of the training data.]
The combination of Ciarlini, Flunkert and Bazrafkan is analogous art because they are in the same field of methods of training neural networks. Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention, having the teachings of Ciarlini, Flunkert and Bazrafkan before them, to modify the teachings of Ciarlini and Flunkert to include the teachings of Bazrafkan, in which updating the global parameters involves minimizing an initial loss calculated using the predicted time series derived from the training data and the labels associated with the training data, in order to improve the model’s ability to train the local models for accurately forecasting future data points (Bazrafkan, [0028]: “Thus, network A learns the best data augmentation to increase the training accuracy of network B, even letting network A come up with nonintuitive but highly performing augmentation strategies”)
Regarding claim 4, the combination of Ciarlini and Flunkert discloses all the limitations of Claim 1 (as shown in the rejections above)
Ciarlini in view of Flunkert further discloses:
The method of claim 1, wherein the step of obtaining target model parameters
comprises: initializing the target model parameters for the target model based on target time series measurements forecasted based on the base model and labels of training data from the target time series; and (Ciarlini, Col. 7: “By running the OMP algorithm 300, a set of k almost linearly independent series 210 are obtained which explain the target variable”, and Col. 8: “The target time series 210 of interest might vary. The values of the time series 210 are normalized before the training phase, thus coefficients represent how important
each time series 210 (and corresponding lag 215) is in the final outcome 260.”) [Examiner note: target model parameters i.e., target variables, set of k series i.e., labels of training data from the target time series.]
Ciarlini in view of Flunkert fails to disclose:
iteratively updating the target model parameters by minimizing a second loss determined based on a discrepancy between target time series measurements predicted using time series data from the target time series and labels of the time series data from the target time series.
However, Bazrafkan explicitly discloses:
iteratively updating the target model parameters by minimizing a second loss determined based on a discrepancy between target time series measurements predicted using time series data from the target time series and labels of the time series data from the target time series. (Bazrafkan, Abstract: “A second loss function is determined for the target neural network by comparing outputs of instances of the target neural network to one or more targets for the neural network. The parameters for the target neural network are updated using the second loss function”, and [0054], lines 28-34: “A training epoch may comprise a number of batches . . . X(T-1 ), X(T), X(T + 1) ... being processed in sequence with each batch being used to train network A using the loss functions from networks A and B, and to train network B using the augmented images generated while training network A and also original images, which together generate the loss for network B.”) [The authors disclose using the second loss function to update the target parameters, and describing how the loss function for the target neural network is calculated by comparing its outputs to the reference targets. This is a way of quantifying the discrepancy or error in the network’s predictions compared to the labeled data. The examiner interprets determining a second loss function as minimizing that loss function, because this is the well-known goal of determining a loss function in machine learning. The authors also disclose this training epoch may comprise multiple batches being processed in sequence with each of them used to generate the loss function for network B (iteratively updating target model using the second loss function)]
The combination of Ciarlini, Flunkert and Bazrafkan is analogous art because they are in the same field of training neural networks for time series classification. Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention, having the teachings of Ciarlini, Flunkert and Bazrafkan before them, to modify the teachings of Ciarlini and Flunkert to include the teachings of Bazrafkan, in which continuously refining the target model’s parameters involves minimizing a secondary loss calculated based on the difference between target time series measurements predicted using the time series data from the target time series and the labels of that time series data, in order to improve the accuracy of the target model and make it more precise in its predictions (Bazrafkan, [0010]: “The use of augmentation in deep learning is ubiquitous, and when dealing with images, this can include the application of rotation, translation, blurring and other modifications to existing labelled images in order to improve the training of a target network”)
Regarding claim 10, the combination of Ciarlini and Flunkert discloses all the limitations of Claim 8 (as shown in the rejections above)
Ciarlini in view of Flunkert further discloses:
The medium of claim 8, wherein the step of deep learning comprises:
receiving training data across the plurality of time series; (Ciarlini, Abstract: “A given working compute node in a distributed computing environment obtains a given group of time series data of a plurality of groups of time series data;”)
forecasting time series measurements of the training data based on the global model parameters of the base model; and (Ciarlini, Col. 6, Lines 53-54: “the procedure can be generalized by creating multiple hierarchical learning stages”, and Ciarlini, Col. 4, Line 58:
[Image: media_image4.png (reproduction of Ciarlini, Col. 4, equation), greyscale]
) [Examiner note: the global model parameters i.e., vector b of the coefficients of the base model. The model i.e., base model because the target model is obtained by solving the equation which contains this model] (Ciarlini, Col. 3, Lines 55-58: “The massive parallelization allows, for example, many thousands of time series (and time lags) to be processed that are candidates to be used in the model that explains or predicts a target time series”) [The authors disclose that the model will be used to explain or predicts a target time series of many thousands of time series (a plurality of time series), and as mentioned above, the coefficients vector b is the global parameter of this model, so the examiner interprets that the coefficients vector will be used for predicting time series.]
Ciarlini in view of Flunkert fails to disclose:
updating the global model parameters by minimizing a first loss determined based on the forecasted time series measurements from the training data and labels of the training data
However, Bazrafkan explicitly discloses:
updating the global model parameters by minimizing a first loss determined
based on the forecasted time series measurements from the training data and labels of the training data (Bazrafkan, Abstract: “the parameters for the generative network are updated using the first and second loss functions”, and [0029]: “During each training epoch, an instance of Network A accepts at least one sample, … and generates Out1, a new sample in the same class so that this new sample reduces the loss function”, and [0032]: “Network A is a neural network, such as a generative model network …”) [The authors disclose that the parameters (global model parameters) for the generative network, which is defined as Network A, are updated using first loss function, then they also disclose that during the training process, Network A will generate Out1 for reducing the loss function.]. (Bazrafkan. [0033]: “the direct loss function LA for each instance of network A accepts Out1 and another image Ii from the same class in the dataset I1 ... Ik as input and can be calculated using a mean square error or any similar suitable measure. These measures can then be combined to provide the loss function LA for a batch.”, and [0030]: “The input can for example comprise a feature vector or time-series (like sounds, medical data, etc) as long as this is labelled”) [The authors disclose that the loss function LA accepts an input from the training dataset I1 … Ik , then they also disclose that the input comprises a time series as long as it is labelled. Thus, the examiner notes that the loss function is generated based on the time series from the training data and labels of the training data.]
The combination of Ciarlini, Flunkert and Bazrafkan is analogous art because they are in the same field of methods of training neural networks. Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention, having the teachings of Ciarlini, Flunkert and Bazrafkan before them, to modify the teachings of Ciarlini and Flunkert to include the teachings of Bazrafkan, in which updating the global parameters involves minimizing an initial loss calculated using the predicted time series derived from the training data and the labels associated with the training data, in order to improve the model’s ability to train the local models for accurately forecasting future data points (Bazrafkan, [0028]: “Thus, network A learns the best data augmentation to increase the training accuracy of network B, even letting network A come up with nonintuitive but highly performing augmentation strategies”)
Regarding claim 11, the combination of Ciarlini and Flunkert discloses all the limitations of Claim 8 (as shown in the rejections above)
Ciarlini in view of Flunkert further discloses:
The medium of claim 8, wherein the step of obtaining target model parameters comprises: initializing the target model parameters for the target model based on target time series measurements forecasted based on the base model and labels of training data from the target time series; and (Ciarlini, Col. 7: “By running the OMP algorithm 300, a set of k almost linearly independent series 210 are obtained which explain the target variable”, and Col. 8: “The target time series 210 of interest might vary. The values of the time series 210 are normalized before the training phase, thus coefficients represent how important
each time series 210 (and corresponding lag 215) is in the final outcome 260.”) [Examiner note: target model parameters i.e., target variables, set of k series i.e., labels of training data from the target time series.]
Ciarlini in view of Flunkert fails to disclose:
iteratively updating the target model parameters by minimizing a second loss determined based on a discrepancy between target time series measurements predicted using time series data from the target time series and labels of the time series data from the target time series
However, Bazrafkan explicitly discloses:
iteratively updating the target model parameters by minimizing a second loss determined based on a discrepancy between target time series measurements predicted using time series data from the target time series and labels of the time series data from the target time series. (Bazrafkan, Abstract: “A second loss function is determined for the target neural network by comparing outputs of instances of the target neural network to one or more targets for the neural network. The parameters for the target neural network are updated using the second loss function”, and [0054], lines 28-34: “A training epoch may comprise a number of batches . . . X(T-1 ), X(T), X(T + 1) ... being processed in sequence with each batch being used to train network A using the loss functions from networks A and B, and to train network B using the augmented images generated while training network A and also original images, which together generate the loss for network B.”) [The authors disclose using the second loss function to update the target parameters, and describe how the loss function for the target neural network is calculated by comparing its outputs to the reference targets. This is a way of quantifying the discrepancy, or error, in the network’s predictions compared to the labeled data. The examiner interprets determining a second loss function as minimizing that loss function, because minimization is the well-known purpose of determining a loss function in machine learning. The authors also disclose that a training epoch may comprise multiple batches processed in sequence, with each batch used to generate the loss function for network B (iteratively updating the target model using the second loss function).]
The combination of Ciarlini, Flunkert and Bazrafkan is analogous art because all three references are in the same field of training neural networks for time series classification. Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention, having the teachings of Ciarlini, Flunkert and Bazrafkan before them, to modify the teachings of Ciarlini and Flunkert to include the teachings of Bazrafkan, wherein continuously refining the target model’s parameters involves minimizing a secondary loss calculated based on the difference between target time series measurements predicted using the time series data from the target time series and the labels of that time series data, in order to improve the accuracy of the target model and make its predictions more precise (Bazrafkan, [0010]: “The use of augmentation in deep learning is ubiquitous, and when dealing with images, this can include the application of rotation, translation, blurring and other modifications to existing labelled images in order to improve the training of a target network”).
Regarding claim 17, the combination of Ciarlini and Flunkert discloses all the limitations of Claim 15 (as shown in the rejections above).
Ciarlini in view of Flunkert further discloses:
The system of claim 15, wherein the general deep machine learning mechanism performs deep learning by: receiving training data across the plurality of time series; (Ciarlini, Abstract: “A given working compute node in a distributed computing environment obtains a given group of time series data of a plurality of groups of time series data;”)
forecasting time series measurements of the training data based on the global model parameters of the base model; and (Ciarlini, Col. 6, Lines 53-54: “the procedure can be generalized by creating multiple hierarchical learning stages”, and Ciarlini, Col. 4, Line 58:
[equation reproduced as grayscale image: media_image4.png]
) [Examiner note: the global model parameters i.e., vector b of the coefficients of the base model. The model i.e., the base model, because the target model is obtained by solving the equation which contains this model.] (Ciarlini, Col. 3, Lines 55-58: “The massive parallelization allows, for example, many thousands of time series (and time lags) to be processed that are candidates to be used in the model that explains or predicts a target time series”) [The authors disclose that the model will be used to explain or predict a target time series among many thousands of time series (a plurality of time series), and, as mentioned above, the coefficients vector b is the global parameter of this model, so the examiner interprets that the coefficients vector will be used for predicting time series.]
Ciarlini in view of Flunkert fails to disclose:
updating the global model parameters by minimizing a first loss determined based on the forecasted time series measurements from the training data and labels of the training data.
However, Bazrafkan explicitly discloses:
updating the global model parameters by minimizing a first loss determined based on the forecasted time series measurements from the training data and labels of the training data. (Bazrafkan, Abstract: “the parameters for the generative network are updated using the first and second loss functions”, and [0029]: “During each training epoch, an instance of Network A accepts at least one sample, … and generates Out1, a new sample in the same class so that this new sample reduces the loss function”, and [0032]: “Network A is a neural network, such as a generative model network …”) [The authors disclose that the parameters (global model parameters) for the generative network, which is defined as Network A, are updated using the first loss function; they also disclose that during the training process, Network A generates Out1 to reduce the loss function.] (Bazrafkan, [0033]: “the direct loss function LA for each instance of network A accepts Out1 and another image Ii from the same class in the dataset I1 ... Ik as input and can be calculated using a mean square error or any similar suitable measure. These measures can then be combined to provide the loss function LA for a batch.”, and [0030]: “The input can for example comprise a feature vector or time-series (like sounds, medical data, etc) as long as this is labelled”) [The authors disclose that the loss function LA accepts an input from the training dataset I1 … Ik; they further disclose that the input may comprise a time series as long as it is labelled. Thus, the examiner notes that the loss function is generated based on the time series from the training data and the labels of the training data.]
The combination of Ciarlini, Flunkert and Bazrafkan is analogous art because all three references are in the same field of methods for training neural networks. Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention, having the teachings of Ciarlini, Flunkert and Bazrafkan before them, to modify the teachings of Ciarlini and Flunkert to include the teachings of Bazrafkan, wherein updating the global parameters involves minimizing an initial loss calculated using the predicted time series derived from the training data and the labels associated with the training data, in order to improve the model’s ability to train the local models for accurately forecasting future data points (Bazrafkan, [0028]: “Thus, network A learns the best data augmentation to increase the training accuracy of network B, even letting network A come up with nonintuitive but highly performing augmentation strategies”).
Regarding claim 18, the combination of Ciarlini and Flunkert discloses all the limitations of Claim 15 (as shown in the rejections above).
Ciarlini in view of Flunkert further discloses:
The system of claim 15, wherein the customized deep learning mechanism performs obtaining target model parameters by: initializing the target model parameters for the target model based on target time series measurements forecasted based on the base model and labels of training data from the target time series; and (Ciarlini, Col. 7: “By running the OMP algorithm 300, a set of k almost linearly independent series 210 are obtained which explain the target variable”, and Col. 8: “The target time series 210 of interest might vary. The values of the time series 210 are normalized before the training phase, thus coefficients represent how important each time series 210 (and corresponding lag 215) is in the final outcome 260.”) [Examiner note: target model parameters i.e., target variables; set of k series i.e., labels of training data from the target time series.]
Ciarlini in view of Flunkert fails to disclose:
iteratively updating the target model parameters by minimizing a second loss determined based on a discrepancy between target time series measurements predicted using time series data from the target time series and labels of the time series data from the target time series.
However, Bazrafkan explicitly discloses:
iteratively updating the target model parameters by minimizing a second loss determined based on a discrepancy between target time series measurements predicted using time series data from the target time series and labels of the time series data from the target time series. (Bazrafkan, Abstract: “A second loss function is determined for the target neural network by comparing outputs of instances of the target neural network to one or more targets for the neural network. The parameters for the target neural network are updated using the second loss function”, and [0054], lines 28-34: “A training epoch may comprise a number of batches . . . X(T-1 ), X(T), X(T + 1) ... being processed in sequence with each batch being used to train network A using the loss functions from networks A and B, and to train network B using the augmented images generated while training network A and also original images, which together generate the loss for network B.”) [The authors disclose using the second loss function to update the target parameters, and describe how the loss function for the target neural network is calculated by comparing its outputs to the reference targets. This is a way of quantifying the discrepancy, or error, in the network’s predictions compared to the labeled data. The examiner interprets determining a second loss function as minimizing that loss function, because minimization is the well-known purpose of determining a loss function in machine learning. The authors also disclose that a training epoch may comprise multiple batches processed in sequence, with each batch used to generate the loss function for network B (iteratively updating the target model using the second loss function).]
The combination of Ciarlini, Flunkert and Bazrafkan is analogous art because all three references are in the same field of training neural networks for time series classification. Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention, having the teachings of Ciarlini, Flunkert and Bazrafkan before them, to modify the teachings of Ciarlini and Flunkert to include the teachings of Bazrafkan, wherein continuously refining the target model’s parameters involves minimizing a secondary loss calculated based on the difference between target time series measurements predicted using the time series data from the target time series and the labels of that time series data, in order to improve the accuracy of the target model and make its predictions more precise (Bazrafkan, [0010]: “The use of augmentation in deep learning is ubiquitous, and when dealing with images, this can include the application of rotation, translation, blurring and other modifications to existing labelled images in order to improve the training of a target network”).
Claim(s) 7, 14 is/are rejected under 35 U.S.C. 103 as being unpatentable over Ciarlini in view of Flunkert and Bazrafkan, and further in view of Huang et al. (“Adaptive Sampling Towards Fast Graph Representation Learning”, Publication Date: 11/19/2018) (hereafter referred to as “Huang”).
Regarding claim 7, the combination of Ciarlini, Flunkert and Bazrafkan discloses all the limitations of claim 3 (as shown in the rejections above).
Ciarlini in view of Flunkert and Bazrafkan fails to disclose:
The method of claim 3, wherein the first loss includes a graph based portion related to enrichment of hidden representations associated with the base model.
However, Huang explicitly discloses:
The method of claim 3, wherein the first loss includes a graph based portion related to enrichment of hidden representations associated with the base model. (Huang, Page 4:
[equation reproduced as grayscale image: media_image5.png]
and Huang Page 5:
[equation reproduced as grayscale image: media_image6.png]
) [Examiner note: first loss i.e., classification loss LC; graph-based portion i.e., node vi. The process of computing hidden features during bottom-up propagation effectively enriches the representations, and this hidden feature is associated with the base model because vi is defined as the parent node.]
The combination of Ciarlini, Flunkert, Bazrafkan and Huang is analogous art because all four references are in the same field of hierarchical learning. Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention, having the teachings of Ciarlini, Flunkert, Bazrafkan and Huang before them, to modify the teachings of Ciarlini, Flunkert and Bazrafkan to include the teachings of Huang, wherein the initial loss comprises a component based on the graph structure aimed at enhancing the hidden representations associated with the foundational model, in order to minimize the variance in the model training (Huang, Page 5: “To fulfill variance reduction, we add the variance to the loss function and explicitly minimize the variance by model training.”).
Regarding claim 14, the combination of Ciarlini, Flunkert and Bazrafkan discloses all the limitations of Claim 10 (as shown in the rejections above).
Ciarlini in view of Flunkert and Bazrafkan fails to disclose:
The medium of claim 10, wherein the first loss includes a graph based portion related to enrichment of hidden representations associated with the base model.
However, Huang explicitly discloses:
The medium of claim 10, wherein the first loss includes a graph based portion related to enrichment of hidden representations associated with the base model. (Huang, Page 4:
[equation reproduced as grayscale image: media_image5.png]
and Huang Page 5:
[equation reproduced as grayscale image: media_image6.png]
) [Examiner note: first loss i.e., classification loss LC; graph-based portion i.e., node vi. The process of computing hidden features during bottom-up propagation effectively enriches the representations, and this hidden feature is associated with the base model because vi is defined as the parent node.]
The combination of Ciarlini, Flunkert, Bazrafkan and Huang is analogous art because all four references are in the same field of hierarchical learning. Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention, having the teachings of Ciarlini, Flunkert, Bazrafkan and Huang before them, to modify the teachings of Ciarlini, Flunkert and Bazrafkan to include the teachings of Huang, wherein the initial loss comprises a component based on the graph structure aimed at enhancing the hidden representations associated with the foundational model, in order to minimize the variance in the model training (Huang, Page 5: “To fulfill variance reduction, we add the variance to the loss function and explicitly minimize the variance by model training.”).
Conclusion
Any inquiry concerning this communication or earlier communications from the examiner should be directed to AMY TRAN whose telephone number is (571)270-0693. The examiner can normally be reached Monday - Friday 7:30 am - 5:00 pm EST.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, David Yi can be reached at (571)270-7519. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.
/AMY TRAN/Examiner, Art Unit 2126
/VAN C MANG/Primary Examiner, Art Unit 2126