DETAILED ACTION
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
The examiner first recognizes that the applicant has made significant amendments by bringing the limitations from dependent claims 3-5 and 7-11 into the independent claims. The examiner notes that claims 3-5 and 7-11 are CANCELLED as a result of the amendments.
Based on the substantially amended claims 1 and 19, the examiner interprets that the scope of the invention may have been narrowed to: "For a dynamic system involving a controllable device, use Bayesian inference with an a-posteriori distribution for the dynamic system, provide an optimization of the system, and extract data points in latent representations that include uncertainties of the observed data points." The examiner has applied a single prior art reference, "WANG," that practically teaches the entire theme of the invention. Dynamic systems and controllable devices are fundamental to control theory and robotics. Bayesian inference and posterior distribution methods are common for parameter estimation and uncertainty quantification in dynamic systems, and for further optimization of the system. Using latent variables to represent underlying structures or compress data is a key aspect of machine learning and statistics, often seen in techniques such as autoencoders. The examiner submits that any novelty may lie in the integration and application, such as demonstrating a unique combination of elements applied to a previously unsolved problem or an innovative methodological integration, for example, a new architecture for the Bayesian latent representation model or a specific way of handling a particularly complex uncertainty. The examiner interprets that numerous research efforts are underway today in the related area and that the single prior art reference "WANG," pertaining to broad research in this area, teaches all the elements as presented in the claims.
Continued Examination Under 37 CFR 1.114
A request for continued examination under 37 CFR 1.114, including the fee set forth in 37 CFR 1.17(e), was filed in this application after final rejection. Since this application is eligible for continued examination under 37 CFR 1.114, and the fee set forth in 37 CFR 1.17(e) has been timely paid, the finality of the previous Office action has been withdrawn pursuant to 37 CFR 1.114. Applicant's submission filed on 9/22/2025 has been entered.
Response to Amendment
(Submitted 9/22/2025)
Applicant’s arguments with respect to claim 1 and claim 19 have been considered but are moot because
the new ground of rejection does not rely on any reference applied in the prior rejection of record for
any teaching or matter specifically challenged in the argument.
In regard to 101 rejections
- On Pages 8 and 9, the applicant argues that there is an "efficiency improvement" in the invention. Further, on Page 9, the applicant argues that the Bayesian aggregation is modulated by uncertainty and is not a mental step.
- On Page 10, the applicant argues specifically with respect to claims 4 and 5 of the previous Office Action (which are now incorporated into claim 1 (claim 19)) that the Office Action merely asserts, without a reasonable analysis, why the rejection was given.
Examiner’s Response:
The examiner submits that there was no "hindsight" by the examiner. The examiner disagrees with the argument. The examiner submits that the specification [0022]-[0024] shows that this process is a series of iterative calculations, which are mental processes that can be performed using pen and paper. The examiner further submits that "the act of aggregation of a posterior distribution with variance as a measure of uncertainty," as a set of steps that can be entirely performed by a human mentally or with pen and paper, is likely to be considered an unpatentable abstract idea or mental process. As a result of moving the limitations of claims 4 and 5 into claim 1, the examiner's arguments with respect to the specification [0022]-[0024] hold.
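As an illustrative sketch of the point above, aggregating factored Gaussian observations, with each variance serving as a measure of uncertainty, reduces to elementary precision-weighted arithmetic. The function and values below are hypothetical and are not drawn from the claims or the WANG reference; they merely show that each step is simple arithmetic.

```python
# Illustrative only: aggregating independent Gaussian observations into a
# posterior by precision weighting. Each observation's variance is its
# uncertainty; every step is simple arithmetic that could, in principle,
# be carried out with pen and paper.

def aggregate_gaussians(means, variances, prior_mean=0.0, prior_var=1.0):
    """Combine a Gaussian prior with independent Gaussian observations.

    Lower variance (higher precision) pulls the posterior mean harder.
    """
    precision = 1.0 / prior_var
    weighted_sum = prior_mean / prior_var
    for m, v in zip(means, variances):
        precision += 1.0 / v
        weighted_sum += m / v
    post_var = 1.0 / precision
    post_mean = weighted_sum * post_var
    return post_mean, post_var

# Hypothetical latent observations with their uncertainties (variances).
mean, var = aggregate_gaussians([2.0, 4.0], [1.0, 1.0])
```

With a standard normal prior and two unit-variance observations at 2.0 and 4.0, the posterior mean is their precision-weighted average (2.0) and the posterior variance shrinks to 1/3.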
The examiner further submits that the asserted improvement is directed to the method that includes training, and as such it is an improvement applied to the abstract idea itself. It is known in the art that the performance of a machine learning model and a performance improvement from applying neural networks are not the same. An improvement to overall computer performance may be represented by a metric such as accuracy, reduced inference time, power consumption, throughput, etc. The same argument with regard to claim 1 also applies to claim 19.
In CONCLUSION, the examiner rejects claims 1, 6, and 12-19 under 35 U.S.C. 101.
In regard to 101 rejections as related “software per se” on claim 19
The examiner had rejected claim 19 under 35 U.S.C. 101 as "software per se" because the claimed invention is directed to non-statutory subject matter. The claim does not limit the system to hardware, and therefore the system can be interpreted as "software per se." The examiner notes that the applicant has given no explanation regarding this rejection. The examiner notes hardware support in the specification. However, per the guidance, the examiner generally should not remove a "software per se" rejection under 35 U.S.C. § 101 without an amendment to the claims themselves, even if the specification provides hardware support. This is consistent with MPEP 2111 and specifically 2111.01 (plain meaning of the words of the claim, read in light of the specification). As a result, the examiner still treats claim 19 as "software per se."
In CONCLUSION, the examiner MAINTAINS the "software per se" rejection of claim 19 under 101.
In regard to 103 rejections
- On Page 16, as a first basis, the applicant argues that none of the references discloses a dual-network encoder for observation and uncertainty.
Examiner’s Response
In view of the substantial amendments, the examiner has found a single reference that teaches the amendments, and as a result of the amendments, the applicant's argument is MOOT. The new reference "WANG" teaches two neural networks in the context of teaching Bayesian Neural Networks (BNNs), which inherently provide a plurality (distribution) of possible neural networks because they place probability distributions over the weights, rather than single point estimates, effectively representing an ensemble of models to capture uncertainty.
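The BNN characterization above, distributions over weights implicitly representing an ensemble of models, can be sketched as follows. This is a minimal, hypothetical illustration with a single Gaussian-distributed weight; it is not the architecture of the WANG reference.

```python
import random

# Minimal sketch: a Bayesian "network" with a single weight whose
# posterior is Gaussian. Repeatedly sampling the weight yields an
# ensemble of models; the spread of their predictions quantifies
# uncertainty. All values are hypothetical.

random.seed(0)

w_mean, w_std = 1.5, 0.2           # Gaussian posterior over the weight
x = 3.0                            # an input point

# Each sampled weight defines one member of the implicit ensemble.
predictions = [(w_mean + w_std * random.gauss(0, 1)) * x for _ in range(1000)]

pred_mean = sum(predictions) / len(predictions)
pred_var = sum((p - pred_mean) ** 2 for p in predictions) / len(predictions)
```

The ensemble's mean prediction concentrates near 1.5 * 3.0 = 4.5, while its variance (roughly (0.2 * 3.0)^2 = 0.36) captures the uncertainty that a single point-estimate weight would discard.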
- On Page 17, as a second basis, the applicant argues that none of the references discloses "Bayesian aggregation."
Examiner’s Response
It is perhaps well known to the POSITA that the posterior distribution is the central outcome and defining feature of a Bayesian model, representing the updated belief about parameters after combining prior knowledge with observed data using Bayes' Theorem. Not conceding further, the examiner submits that, in view of the new reference, this argument is MOOT.
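As a worked illustration of the point above, Bayes' Theorem on a discrete parameter shows how prior belief and observed likelihoods combine into the posterior. The numbers are hypothetical, and the discrete form is used here only for clarity; the claims concern continuous distributions.

```python
# Bayes' Theorem on a discrete parameter: posterior is proportional to
# prior times likelihood. The posterior is the "updated belief" after
# observing data.

def posterior(priors, likelihoods):
    unnorm = [p * l for p, l in zip(priors, likelihoods)]
    z = sum(unnorm)                 # evidence (normalizing constant)
    return [u / z for u in unnorm]

# Hypothetical: two candidate parameter values with equal prior belief,
# but the observed data is three times more likely under the second.
post = posterior([0.5, 0.5], [0.1, 0.3])
```

Starting from equal priors, the data shifts belief to 25% versus 75% in favor of the second value.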
- On Page 17, as a third basis, the applicant argues that none of the references discloses a parameter-based decoder.
Examiner’s Response
It is perhaps well known to the POSITA that a VAE's decoder can be used in a parameter-based manner by training the decoder to output parameters (such as a mean and variance) that describe a distribution, allowing it to generate specific data instances or model complex systems from latent codes, though care is needed to prevent decoder bias during training for reliable results. Not conceding further, the examiner submits that, in view of the new reference, this argument is MOOT.
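A minimal sketch of a parameter-based decoder in the sense described above: a decoder that maps a latent code to distribution parameters (mean and variance) rather than directly to a data point. The linear maps below are hypothetical stand-ins for trained networks, not the decoder of any cited reference.

```python
import math

# Hypothetical "decoder": maps a latent code z to the parameters
# (mean, variance) of a Gaussian over the data, instead of to a single
# data point. A real VAE decoder would be a trained neural network.

def decode(z):
    mean = 2.0 * z + 0.5     # stand-in for a learned mapping
    log_var = -z             # predicting log-variance keeps variance > 0
    return mean, math.exp(log_var)

mean, var = decode(1.0)      # distribution parameters for latent code z = 1.0
```

Sampling from the resulting Gaussian (rather than emitting the mean alone) is what lets such a decoder generate distinct data instances from the same latent code.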
- On Page 18, as a fourth basis, the applicant argues that none of the references discloses "non-stochastic."
Examiner’s Response
The examiner submits that the phrase "non-stochastic loss function" denotes a function that is inherently deterministic, meaning that using the exact same input data and model state always produces the identical loss value, as "non-stochastic" implies the absence of inherent randomness and a focus on fixed calculations rather than probabilistic outcomes. While the overall model training can be non-deterministic (due to random initializations and data shuffling), the loss function itself (e.g., Mean Squared Error, Cross-Entropy) is a fixed mathematical formula that determines the cost (loss) based purely on its inputs. The prior art previously applied teaches a cost function and was well supported. Not conceding further, the examiner submits that, in view of the new reference "WANG," this argument is MOOT. The reference "WANG" teaches MSE as a reconstruction error in an autoencoder, which is perhaps well known to the POSITA.
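The determinism described above is straightforward to verify: for fixed inputs, a loss such as MSE returns the identical value on every evaluation. The sketch below uses hypothetical values for illustration only.

```python
# Mean Squared Error is a fixed mathematical formula: identical inputs
# always produce the identical loss value, i.e., it is non-stochastic.

def mse(predictions, targets):
    return sum((p - t) ** 2 for p, t in zip(predictions, targets)) / len(targets)

preds = [1.0, 2.0, 3.5]
targets = [1.0, 2.5, 3.0]

loss_a = mse(preds, targets)
loss_b = mse(preds, targets)   # same inputs -> identical result
```

Any randomness in training comes from outside the loss (initialization, shuffling, sampling); the loss evaluation itself has none.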
- On Page 18, the applicant argues, in regard to claim 1, that specification [0018] recites an improvement to the entire method, generating an efficient computer-implemented machine learning system, and that the computational cost may be reduced.
Examiner’s Response
The examiner submits that the improvement cited by the applicant in specification [0018] is the result of the generation of the distribution. Therefore, the improvement is directed to the abstract idea itself (See MPEP 2106.05(a)(1)).
The same argument holds for claim 19 with regard to the improvement being directed to the abstract idea itself.
In CONCLUSION, the examiner rejects claims 1, 6, and 12-19 under 103 in a NON-FINAL rejection under the RCE.
Claim Rejections - 35 USC § 101
35 U.S.C. 101 reads as follows:
Whoever invents or discovers any new and useful process, machine, manufacture, or composition of matter, or any new and useful improvement thereof, may obtain a patent therefor, subject to the conditions and requirements of this title.
Claims 1, 6, and 12-19 are rejected under 35 U.S.C. 101 because the claimed invention is directed to an abstract idea without significantly more.
In regard to claim 1: (Currently Amended)
Step 1:
According to the first part of the analysis, claim 1 is a method claim, and thus claim 1 falls into the process category among the statutory categories of invention (process, machine, manufacture, or composition of matter).
In regard to claim 1 (Currently Amended)
Step 2A Prong 1:
"computing an aggregation of at least one latent variable" (except "the machine learning system being configured for determining output data for monitoring and/or controlling a technical device from input sensor data using Bayesian inference, and in view of the training data set, an information item contained in the training data set being transferred directly into a statistical description of the plurality of latent variables") is a mental step of data accumulation.
“ and generating an a-posteriori predictive distribution for predicting the dynamic response of the device, using the calculated aggregation, and under a condition that the training data set has set in” is a mental step of data accumulation, distribution and comparison.
“ mapping each pair of the first plurality of data points and of the second plurality of data points from the training data set onto a corresponding latent observation, using a first neural network, and onto an uncertainty of the corresponding latent observation, using a second neural network” is a mental step of data mapping.
“ aggregating a Bayesian a-posteriori distribution for the plurality of latent variables under a
condition that the plurality of latent observations has set in, the aggregating being carried out,
using Bayesian inference, through which information contained in the training data set is
transferred directly into the statistical description of the plurality of latent variables” is a mental step of data accumulation.
“ and calculating a plurality of latent observations and a plurality of their uncertainties” is a mental step of data calculation.
“generating a second approximate a-posteriori distribution for the plurality of latent variables
under a condition that the training data set has set in, the second approximate a-posteriori
distribution being further described by a set of parameters, which is parameterized over a
parameter common to the training data set” is a mental step of data accumulation.
"iteratively calculating the set of parameters based on the calculated plurality of latent observations and the calculated plurality of their uncertainties" is a mental step of data calculation.
"calculating the fourth plurality of data points, using the given subset of functions from the general, given family of functions, the given subset of functions is calculated on the third plurality of data points" is a mental step of data calculation.
"calculating the integral includes approximating the integral with regard to the plurality of latent variables, using a non-stochastic loss function, which is based on the set of parameters of the second approximate a-posteriori distribution" is a mental step of data calculation.
The additional elements
Step 2A Prong 2:
“ A computer-implemented method for generating a computer- implemented machine learning
system, the machine learning system being configured “recited in the preamble do not integrate the judicial exception into a practical application. These additional elements are merely directed to using a computer as a tool to perform an abstract idea. See MPEP 2106.05(h).
"the machine learning system being configured for determining output data for monitoring and/or controlling a technical device from input sensor data using Bayesian inference, and in view of the training data set, an information item contained in the training data set being transferred directly into a statistical description of the plurality of latent variables" do not integrate the judicial exception into a practical application. These additional elements are merely directed to using a computer as a tool to perform an abstract idea. See MPEP 2106.05(h).
“ receiving a training data set, which reflects a dynamic response of a device;” do not integrate the judicial exception into a practical application. These additional elements are merely directed to using a computer as a tool to perform an abstract idea. See MPEP 2106.05(h).
“ [[and]] generating an a-posteriori predictive distribution for predicting the dynamic response
of the device, using the calculated aggregation, and under a condition that the training data set
has set in, wherein the training data set includes a first plurality of data points and a second
plurality of data points, and the method includes calculating the second plurality of data points,
using a given subset of functions from a general, given family of functions, the given subset of
functions is calculated on the first plurality of data points” do not integrate the judicial exception into a practical application. These additional elements are merely directed to using a computer as a tool to perform an abstract idea. See MPEP 2106.05(h).
“ wherein aggregating the Bayesian a-posteriori distribution includes implementing a plurality of
factored Gaussian distributions, wherein each uncertainty is a variance of a corresponding
Gaussian distribution, wherein generating the a-posteriori predictive distribution includes the
following further steps:” do not integrate the judicial exception into a practical application. These additional elements are merely directed to using a computer as a tool to perform an abstract idea. See MPEP 2106.05(h).
“receiving another training data set, which includes a third plurality of data points and a fourth
plurality of data points” do not integrate the judicial exception into a practical application. These additional elements are merely directed to using a computer as a tool to perform an abstract idea. See MPEP 2106.05(h).
“and generating the a-posteriori predictive distribution further includes generating a third
distribution, using a third and fourth neural network, wherein the third distribution is a function
of the plurality of latent variables, the set of parameters, task-independent variables, and the
other training data set” do not integrate the judicial exception into a practical application. These additional elements are merely directed to using a computer as a tool to perform an abstract idea. See MPEP 2106.05(h).
“ wherein: generating the a-posteriori predictive distribution includes optimizing a
likelihood distribution with regard to the task-independent variables and the common
parameter;” do not integrate the judicial exception into a practical application. These additional elements are merely directed to using a computer as a tool to perform an abstract idea. See MPEP 2106.05(h).
“optimizing the likelihood distribution includes maximizing the likelihood distribution with
regard to the task-independent variables and the maximizing is based on the second
approximate a-posteriori distribution generated and on the third distribution generated” do not integrate the judicial exception into a practical application. These additional elements are merely directed to using a computer as a tool to perform an abstract idea. See MPEP 2106.05(h).
“ maximizing the likelihood distribution includes calculating an integral over a function of latent
variables, which contains respective products of the second approximate a-posteriori
distribution and of the third distribution” do not integrate the judicial exception into a practical application. These additional elements are merely directed to using a computer as a tool to perform an abstract idea. See MPEP 2106.05(h).
Step 2B:
“ A computer-implemented method for generating a computer- implemented machine learning
system, the machine learning system being configured “recited in the preamble does not amount to significantly more than the judicial exception in the claim. These additional elements are merely directed to using a computer as a tool to perform an abstract idea. See MPEP 2106.05(h).
“ the machine learning system being configured for determining output data for monitoring and/or controlling a technical device from input sensor data using Bayesian inference, and in view of the training data set, an information item contained in the training data set being transferred directly into a statistical description of the plurality of latent variables” does not amount to significantly more than the judicial exception in the claim. These additional elements are merely directed to using a computer as a tool to perform an abstract idea. See MPEP 2106.05(h).
“ [[and]] generating an a-posteriori predictive distribution for predicting the dynamic response
of the device, using the calculated aggregation, and under a condition that the training data set
has set in, wherein the training data set includes a first plurality of data points and a second
plurality of data points, and the method includes calculating the second plurality of data points,
using a given subset of functions from a general, given family of functions, the given subset of
functions is calculated on the first plurality of data points” does not amount to significantly more than the judicial exception in the claim. These additional elements are merely directed to using a computer as a tool to perform an abstract idea. See MPEP 2106.05(h).
“ wherein aggregating the Bayesian a-posteriori distribution includes implementing a plurality of
factored Gaussian distributions, wherein each uncertainty is a variance of a corresponding
Gaussian distribution, wherein generating the a-posteriori predictive distribution includes the
following further steps:” does not amount to significantly more than the judicial exception in the claim. These additional elements are merely directed to using a computer as a tool to perform an abstract idea. See MPEP 2106.05(h).
“receiving another training data set, which includes a third plurality of data points and a fourth
plurality of data points” does not amount to significantly more than the judicial exception in the claim. These additional elements are merely directed to using a computer as a tool to perform an abstract idea. See MPEP 2106.05(h).
“and generating the a-posteriori predictive distribution further includes generating a third
distribution, using a third and fourth neural network, wherein the third distribution is a function
of the plurality of latent variables, the set of parameters, task-independent variables, and the
other training data set” does not amount to significantly more than the judicial exception in the claim. These additional elements are merely directed to using a computer as a tool to perform an abstract idea. See MPEP 2106.05(h).
“ wherein: generating the a-posteriori predictive distribution includes optimizing a
likelihood distribution with regard to the task-independent variables and the common
parameter;” does not amount to significantly more than the judicial exception in the claim. These additional elements are merely directed to using a computer as a tool to perform an abstract idea. See MPEP 2106.05(h).
“optimizing the likelihood distribution includes maximizing the likelihood distribution with
regard to the task-independent variables and the maximizing is based on the second
approximate a-posteriori distribution generated and on the third distribution generated” does not amount to significantly more than the judicial exception in the claim. These additional elements are merely directed to using a computer as a tool to perform an abstract idea. See MPEP 2106.05(h).
“ maximizing the likelihood distribution includes calculating an integral over a function of latent
variables, which contains respective products of the second approximate a-posteriori
distribution and of the third distribution” does not amount to significantly more than the judicial exception in the claim. These additional elements are merely directed to using a computer as a tool to perform an abstract idea. See MPEP 2106.05(h).
In regard to claim 6: (Currently Amended)
Step 2A Prong 1:
"wherein iteratively calculating the set of parameters includes implementing another plurality of factored Gaussian distributions with regard to the latent variables, and the set of parameters corresponds to a plurality of means and variances of the Gaussian distributions" is a mental step of data comparison and mathematical calculation.
Step 2A Prong 2: no additional elements
Step 2B: no additional elements
In regard to claim 12: (Currently Amended)
Step 2A Prong 2:
“ substituting the task-independent variables derived by the optimization, and the common parameter, in the likelihood distribution, in order to generate the a-posteriori predictive distribution” do not integrate the judicial exception into a practical application. These additional elements are merely directed to using a computer as a tool to perform an abstract idea. See MPEP 2106.05(h).
Step 2B:
“ substituting the task-independent variables derived by the optimization, and the common parameter, in the likelihood distribution, in order to generate the a-posteriori predictive distribution” does not amount to significantly more than the judicial exception in the claim.
These additional elements are merely directed to using a computer as a tool to perform an abstract idea. See MPEP 2106.05(h).
In regard to claim 13: (Original)
Step 2A Prong 2:
“ mapping an input vector of a dimension to an output vector of a second dimension”
“ the input vector represents elements of a time series for at least one measured input state variable of the device” do not integrate the judicial exception into a practical application. These additional elements are merely directed to using a computer as a tool to perform an abstract idea. See MPEP 2106.05(h).
“ and the output vector represents at least one estimated output state variable of the device, which is predicted using the a-posteriori predictive distribution generated” do not integrate the judicial exception into a practical application. These additional elements are merely directed to using a computer as a tool to perform an abstract idea. See MPEP 2106.05(h).
Step 2B:
“ mapping an input vector of a dimension to an output vector of a second dimension”
“ the input vector represents elements of a time series for at least one measured input state variable of the device” does not amount to significantly more than the judicial exception in the claim. These additional elements are merely directed to using a computer as a tool to perform an abstract idea. See MPEP 2106.05(h).
“ and the output vector represents at least one estimated output state variable of the device, which is predicted using the a-posteriori predictive distribution generated” does not amount to significantly more than the judicial exception in the claim. These additional elements are merely directed to using a computer as a tool to perform an abstract idea. See MPEP 2106.05(h).
In regard to claim 14: (Original)
Step 2A Prong 2:
“ wherein the device is a machine” do not integrate the judicial exception into a practical application. These additional elements are merely directed to using a computer as a tool to perform an abstract idea. See MPEP 2106.05(h).
Step 2B:
“ wherein the device is a machine” does not amount to significantly more than the judicial exception in the claim. These additional elements are merely directed to using a computer as a tool to perform an abstract idea. See MPEP 2106.05(h).
In regard to claim 15: (Original)
Step 2A Prong 2:
“wherein the device is an engine” do not integrate the judicial exception into a practical application. These additional elements are merely directed to using a computer as a tool to perform an abstract idea. See MPEP 2106.05(h).
Step 2B:
“wherein the device is an engine” does not amount to significantly more than the judicial exception in the claim. These additional elements are merely directed to using a computer as a tool to perform an abstract idea. See MPEP 2106.05(h).
In regard to claim 16: (Original)
Step 2A Prong 2:
“the computer-implemented machine learning system is configured for modeling parameterization of a characteristics map of the device” do not integrate the judicial exception into a practical application. These additional elements are merely directed to using a computer as a tool to perform an abstract idea. See MPEP 2106.05(h).
Step 2B:
“the computer-implemented machine learning system is configured for modeling parameterization of a characteristics map of the device” does not amount to significantly more than the judicial exception in the claim. These additional elements are merely directed to using a computer as a tool to perform an abstract idea. See MPEP 2106.05(h).
In regard to claim 17: (Previously Presented)
Step 2A Prong 2:
“ parameterizing the characteristics map of the device, using the computer-implemented machine learning system generated” do not integrate the judicial exception into a practical application. These additional elements are merely directed to using a computer as a tool to perform an abstract idea. See MPEP 2106.05(h).
Step 2B:
“ parameterizing the characteristics map of the device, using the computer-implemented machine learning system generated” does not amount to significantly more than the judicial exception in the claim. These additional elements are merely directed to using a computer as a tool to perform an abstract idea. See MPEP 2106.05(h).
In regard to claim 18: (Original)
Step 2A Prong 2:
“ training data sets includes input variables measured on the device and/or calculated for the device, the at least one input variable of the device includes at least one of a rotational speed, or a temperature, or a mass flow rate, and the at least one estimated output state variable of the device includes at least one of a torque, or an efficiency, or a compression ratio” do not integrate the judicial exception into a practical application. These additional elements are merely directed to using a computer as a tool to perform an abstract idea. See MPEP 2106.05(h).
Step 2B:
“ training data sets includes input variables measured on the device and/or calculated for the device, the at least one input variable of the device includes at least one of a rotational speed, or a temperature, or a mass flow rate, and the at least one estimated output state variable of the device includes at least one of a torque, or an efficiency, or a compression ratio” does not amount to significantly more than the judicial exception in the claim. These additional elements are merely directed to using a computer as a tool to perform an abstract idea. See MPEP 2106.05(h).
In regard to claim 19: (Currently Amended)
Step 2A Prong 1:
"computing an aggregation of at least one latent variable" (except "the machine learning system being configured for determining output data for monitoring and/or controlling a technical device from input sensor data using Bayesian inference, and in view of the training data set, an information item contained in the training data set being transferred directly into a statistical description of the plurality of latent variables") is a mental step of data accumulation, distribution, and comparison.
“ mapping each pair of the first plurality of data points and of the second plurality of data points from the training data set onto a corresponding latent observation, using a first neural network, and onto an uncertainty of the corresponding latent observation, using a second neural network” is a mental step of data mapping.
“ aggregating a Bayesian a-posteriori distribution for the plurality of latent variables under a
condition that the plurality of latent observations has set in, the aggregating being carried out,
using Bayesian inference, through which information contained in the training data set is
transferred directly into the statistical description of the plurality of latent variables” is a mental step of data accumulation.
“ and calculating a plurality of latent observations and a plurality of their uncertainties” is a mental step of data calculation.
“generating a second approximate a-posteriori distribution for the plurality of latent variables
under a condition that the training data set has set in, the second approximate a-posteriori
distribution being further described by a set of parameters, which is parameterized over a
parameter common to the training data set” is a mental step of data accumulation.
"iteratively calculating the set of parameters based on the calculated plurality of latent observations and the calculated plurality of their uncertainties" is a mental step of data calculation.
"calculating the fourth plurality of data points, using the given subset of functions from the general, given family of functions, the given subset of functions is calculated on the third plurality of data points" is a mental step of data calculation.
"calculating the integral includes approximating the integral with regard to the plurality of latent variables, using a non-stochastic loss function, which is based on the set of parameters of the second approximate a-posteriori distribution" is a mental step of data calculation.
The additional elements
Step 2A Prong 2:
“ A computer-implemented system for generating and/or using a computer-implemented machine learning system, the machine learning system being configured for determining output data for monitoring and/or controlling a technical device from input sensor data, the computer-implemented machine learning system being trained by” recited in the preamble do not integrate the judicial exception into a practical application. These additional elements are merely directed to using a computer as a tool to perform an abstract idea. See MPEP 2106.05(h).
"the machine learning system being configured for determining output data for monitoring and/or controlling a technical device from input sensor data using Bayesian inference, and in view of the training data set, an information item contained in the training data set being transferred directly into a statistical description of the plurality of latent variables" do not integrate the judicial exception into a practical application. These additional elements are merely directed to using a computer as a tool to perform an abstract idea. See MPEP 2106.05(h).
“receiving a training data set, which reflects a dynamic response of a device” do not integrate the judicial exception into a practical application. These additional elements are merely directed to using a computer as a tool to perform an abstract idea. See MPEP 2106.05(h).
“ for determining output data for monitoring and/or controlling a technical device from input
sensor data,” do not integrate the judicial exception into a practical application. These additional elements are merely directed to using a computer as a tool to perform an abstract idea. See MPEP 2106.05(h).
“ receiving a training data set, which reflects a dynamic response of a device;” do not integrate the judicial exception into a practical application. These additional elements are merely directed to using a computer as a tool to perform an abstract idea. See MPEP 2106.05(h).
“ [[and]] generating an a-posteriori predictive distribution for predicting the dynamic response
of the device, using the calculated aggregation, and under a condition that the training data set
has set in, wherein the training data set includes a first plurality of data points and a second
plurality of data points, and the method includes calculating the second plurality of data points,
using a given subset of functions from a general, given family of functions, the given subset of
functions is calculated on the first plurality of data points” do not integrate the judicial exception into a practical application. These additional elements are merely directed to using a computer as a tool to perform an abstract idea. See MPEP 2106.05(h).
“ wherein aggregating the Bayesian a-posteriori distribution includes implementing a plurality of
factored Gaussian distributions, wherein each uncertainty is a variance of a corresponding
Gaussian distribution, wherein generating the a-posteriori predictive distribution includes the
following further steps:” do not integrate the judicial exception into a practical application. These additional elements are merely directed to using a computer as a tool to perform an abstract idea. See MPEP 2106.05(h).
“receiving another training data set, which includes a third plurality of data points and a fourth
plurality of data points” do not integrate the judicial exception into a practical application. These additional elements are merely directed to using a computer as a tool to perform an abstract idea. See MPEP 2106.05(h).
“and generating the a-posteriori predictive distribution further includes generating a third
distribution, using a third and fourth neural network, wherein the third distribution is a function
of the plurality of latent variables, the set of parameters, task-independent variables, and the
other training data set” do not integrate the judicial exception into a practical application. These additional elements are merely directed to using a computer as a tool to perform an abstract idea. See MPEP 2106.05(h).
“ wherein: generating the a-posteriori predictive distribution includes optimizing a
likelihood distribution with regard to the task-independent variables and the common
parameter;” do not integrate the judicial exception into a practical application. These additional elements are merely directed to using a computer as a tool to perform an abstract idea. See MPEP 2106.05(h).
“optimizing the likelihood distribution includes maximizing the likelihood distribution with
regard to the task-independent variables and the maximizing is based on the second
approximate a-posteriori distribution generated and on the third distribution generated” do not integrate the judicial exception into a practical application. These additional elements are merely directed to using a computer as a tool to perform an abstract idea. See MPEP 2106.05(h).
“ maximizing the likelihood distribution includes calculating an integral over a function of latent
variables, which contains respective products of the second approximate a-posteriori
distribution and of the third distribution” do not integrate the judicial exception into a practical application. These additional elements are merely directed to using a computer as a tool to perform an abstract idea. See MPEP 2106.05(h).
Step 2B:
“ A computer-implemented system for generating and/or using a computer-implemented
machine learning system, the machine learning system being configured for determining output
data for monitoring and/or controlling a technical device from input sensor data, the computer-
implemented machine learning system being trained by” recited in the preamble does not
amount to significantly more than the judicial exception in the claim. These additional elements
are merely directed to using a computer as a tool to perform an abstract idea. See MPEP
2106.05(h).
“of the machine learning system, the machine learning system being configured for determining
output data for monitoring and/or controlling a technical device from input sensor data using
Bayesian inference, and in view of the training data set, an information item contained in the
training data set being transferred directly into a statistical description of the plurality of latent
variables” does not amount to significantly more than the judicial exception in the claim.
These additional elements are merely directed to using a computer as a tool to perform an abstract idea. See MPEP 2106.05(h).
“receiving a training data set, which reflects a dynamic response of a device” does not
amount to significantly more than the judicial exception in the claim. These additional elements are merely directed to using a computer as a tool to perform an abstract idea. See MPEP 2106.05(h).
“ for determining output data for monitoring and/or controlling a technical device from input
sensor data, the method includes the following steps:” does not amount to significantly more than the judicial exception in the claim. These additional elements are merely directed to using a computer as a tool to perform an abstract idea. See MPEP 2106.05(h).
“ receiving a training data set, which reflects a dynamic response of a device;” does not amount to significantly more than the judicial exception in the claim. These additional elements are merely directed to using a computer as a tool to perform an abstract idea. See MPEP 2106.05(h).
“ [[and]] generating an a-posteriori predictive distribution for predicting the dynamic response
of the device, using the calculated aggregation, and under a condition that the training data set
has set in, wherein the training data set includes a first plurality of data points and a second
plurality of data points, and the method includes calculating the second plurality of data points,
using a given subset of functions from a general, given family of functions, the given subset of
functions is calculated on the first plurality of data points” does not amount to significantly more than the judicial exception in the claim. These additional elements are merely directed to using a computer as a tool to perform an abstract idea. See MPEP 2106.05(h).
“ wherein aggregating the Bayesian a-posteriori distribution includes implementing a plurality of
factored Gaussian distributions, wherein each uncertainty is a variance of a corresponding
Gaussian distribution, wherein generating the a-posteriori predictive distribution includes the
following further steps:” does not amount to significantly more than the judicial exception in the claim. These additional elements are merely directed to using a computer as a tool to perform an abstract idea. See MPEP 2106.05(h).
“receiving another training data set, which includes a third plurality of data points and a fourth
plurality of data points” does not amount to significantly more than the judicial exception in the claim. These additional elements are merely directed to using a computer as a tool to perform an abstract idea. See MPEP 2106.05(h).
“and generating the a-posteriori predictive distribution further includes generating a third
distribution, using a third and fourth neural network, wherein the third distribution is a function
of the plurality of latent variables, the set of parameters, task-independent variables, and the
other training data set” does not amount to significantly more than the judicial exception in the claim. These additional elements are merely directed to using a computer as a tool to perform an abstract idea. See MPEP 2106.05(h).
“ wherein: generating the a-posteriori predictive distribution includes optimizing a
likelihood distribution with regard to the task-independent variables and the common
parameter;” does not amount to significantly more than the judicial exception in the claim. These additional elements are merely directed to using a computer as a tool to perform an abstract idea. See MPEP 2106.05(h).
“optimizing the likelihood distribution includes maximizing the likelihood distribution with
regard to the task-independent variables and the maximizing is based on the second
approximate a-posteriori distribution generated and on the third distribution generated” does not amount to significantly more than the judicial exception in the claim. These additional elements are merely directed to using a computer as a tool to perform an abstract idea. See MPEP 2106.05(h).
“ maximizing the likelihood distribution includes calculating an integral over a function of latent
variables, which contains respective products of the second approximate a-posteriori
distribution and of the third distribution” does not amount to significantly more than the judicial exception in the claim. These additional elements are merely directed to using a computer as a tool to perform an abstract idea. See MPEP 2106.05(h).
Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.
Claims 1, 6, 12, 14-17, and 19 are rejected under 35 U.S.C. 103 as being unpatentable over
HAO WANG et al. (hereinafter WANG), A Survey on Bayesian Deep Learning, ACM Computing Surveys (CSUR), Volume 53, Issue 5, published 28 September 2020,
in view of Zhang et al. (hereinafter Zhang), US 11715004 B2.
In regard to claim 1: (Currently Amended)
WANG discloses:
- for determining output data for monitoring and/or controlling a technical device from input sensor data, the method includes the following steps:
In [5.3, Page 108:28]:
Bayesian deep learning can also be applied to the control of nonlinear dynamical systems from raw images. Consider controlling a complex dynamical system according to the live video stream received from a camera. One way of solving this control problem is by iteration between two tasks, perception from raw images and control based on dynamic models.
In [1, Page 108:2]:
consider controlling a complex dynamical system according to the live video stream received from a camera. This problem can be transformed into iteratively performing two tasks, perception from raw images and control based on dynamic models. The perception task of processing raw images can be handled by deep learning while the control task usually needs more sophisticated models such as hidden Markov models and Kalman filters [35, 74]
(BRI: a camera is a sensor. The control system can integrate input from a camera (as a sensor) to control a technical device, and this system can also incorporate data from other input sensors)
- receiving a training data set, which reflects a dynamic response of a device;
In [5.3, Page 108:28]:
Consider controlling a complex dynamical system according to the live video stream received from a camera. One way of solving this control problem is by iteration between two tasks, perception from raw images and control based on dynamic models.
In [5.3, Page 108:28]:
BDL can be applied in the supervised and unsupervised learning settings, respectively.
In [5.3, Page 108:28]:
BDL can help representation learning in general, using control as an example application
In [5.3, Page 108:28]:
Reference [125] posed this task as a representation learning problem and proposed a model called Embed to Control to take into account the feedback loop mentioned above during representation learning
(BRI: using the interaction between two control tasks can provide a training dataset that reflects the dynamic response of a device, primarily through mechanisms like transfer of learning and adaptive control feedback loops.)
- computing an aggregation of at least one latent variable of the machine learning system, using a Bayesian inference, and in view of the training data set being transferred directly into the statistical description of the plurality of latent variables
In [4.1, Page 108:10]:
BDL is Bayesian neural networks (BNN) or Bayesian treatments of neural networks. Similar to any Bayesian treatment, BNN imposes a prior on the neural network’s parameters and aims to learn a posterior distribution of these parameters. During the inference phrase, such a distribution is then marginalized out to produce final predictions. In general such a process is called Bayesian model averaging [5] and can be seen as learning an infinite number of (or a distribution over) neural networks and then aggregating the results through ensembling
In [3.2, Page 108:9]:
the process of finding the parameters (e.g., α and β in Figure 4) is called learning and the process of finding the latent variables (e.g., θ and z in Figure 4) given the parameters is called inference.
In [2.4.2, Page 108:8]:
[Equations reproduced as images media_image1.png and media_image2.png]
In [3.1, Page 108:8]:
Models There are essentially two types of PGM, directed PGM (also known as Bayesian networks) and undirected PGM (also known as Markov random fields) [5]. In this survey, we mainly focus on directed PGM.4 For details on undirected PGM, readers are referred to Reference [5]. A classic example of PGM would be latent Dirichlet allocation (LDA)
In [3.2, Page 108:9]:
various learning and inference algorithms are available for each PGM. Among them, the most cost-effective one is probably maximum a posteriori (MAP), which amounts to maximizing the posterior probability of the latent variable
- [[and]] generating an a-posteriori predictive distribution for predicting the dynamic response of the device, using the calculated aggregation, and under a condition that the training data set has set in, wherein the training data set includes a first plurality of data points and a second plurality of data points, and the method includes calculating the second plurality of data points, using a given subset of functions from a general, given family of functions, the given subset of functions is calculated on the first plurality of data points
In [5.16, Page 108:23]:
Recommender systems are a typical use case for BDL in that they often require both thorough understanding of high-dimensional signals (e.g., text and images) and principled reasoning on the conditional dependencies among users/items/ratings. In this regard, CDL, as an instantiation of BDL, is the first hierarchical Bayesian model
In [1, Page 108:2]:
consider controlling a complex dynamical system according to the live video stream received from a camera. This problem can be transformed into iteratively performing two tasks, perception from raw images and control based on dynamic models. The perception task of processing raw images can be handled by deep learning while the control task usually needs more sophisticated models such as hidden Markov models and Kalman filters [35, 74]. The feedback loop is then completed by the fact that actions chosen by the control model can affect the received video stream in turn.
In [5.1.6, Page 108:23]:
Note that BDL-based models above use typical static Bayesian networks as their task-specific components. Although these are often sufficient for most use cases, it is possible for the task specific components to take the form of deep Bayesian networks
In [5.1.6, Page 108:23]:
One can also use stochastic processes (or dynamic Bayesian networks in general) to explicitly model users purchase or clicking behaviors.
In [4.1, Page 108:10]:
BNN imposes a prior on the neural network’s parameters and aims to learn a posterior distribution of these parameters. During the inference phrase, such a distribution is then marginalized out to produce final predictions. In general such a process is called Bayesian model averaging [5] and can be seen as learning an infinite number of (or a distribution over) neural networks and then aggregating the results through ensembling.
(BRI: a Bayesian Neural Network (BNN) that imposes a prior distribution on its weights, which acts as a form of regularization or constraint on the possible functions the network can learn and influencing the final posterior distribution after training. While it's a prior before data, it indirectly conditions the model's behavior and output, helping with uncertainty estimation )
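For context, the Bayesian model averaging that WANG describes (marginalizing a posterior over network parameters by ensembling sampled models) can be sketched in a few lines. The one-parameter linear "network," the Gaussian posterior, and all numeric values below are hypothetical illustrations, not drawn from WANG or from the claims:

```python
import random
import statistics

# Hedged sketch of Bayesian model averaging for a one-parameter
# "network" f_w(x) = w * x, assuming (hypothetically) that the
# learned posterior over w is N(mean=2.0, std=0.1).
random.seed(0)

POSTERIOR_MEAN, POSTERIOR_STD = 2.0, 0.1

def sample_posterior(n):
    """Draw n weight samples from the assumed Gaussian posterior."""
    return [random.gauss(POSTERIOR_MEAN, POSTERIOR_STD) for _ in range(n)]

def predict(x, n_samples=10_000):
    """Bayesian model averaging: approximate the marginalization over
    the posterior by averaging the predictions of many sampled models
    (ensembling), reporting predictive mean and spread."""
    preds = [w * x for w in sample_posterior(n_samples)]
    return statistics.mean(preds), statistics.stdev(preds)

mean_pred, pred_std = predict(3.0)
```

The predictive spread (here roughly the posterior standard deviation scaled by the input) is what distinguishes this averaging from a single point-estimate network.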
In [4.3.4, Page 108:14]:
Natural-parameter Networks. Different from vanilla NN, which usually takes deterministic input, NPN [119] is a probabilistic NN taking distributions as input. The input distributions go through layers of linear and nonlinear transformation to produce output distributions.
In [4.3.4, Page 108:15]:
As a simple example, in a vanilla linear NN, f_w(x) = wx takes a scalar x as input and computes the output based on a scalar parameter w; a corresponding Gaussian NPN would assume w is drawn from a Gaussian distribution N(w_m, w_s) and that x is drawn from N(x_m, x_s) (x_s is set to 0 when the input is deterministic).
(BRI: functions that are “vanilla linear with scalar input” represent a subset of the functions that can be represented by a neural network with non-linear activation functions (which is likely what “NPN” refers to in this context).)
In [4.3.4, Page 108:15]:
With θ = (w_m, w_s) as a learnable parameter pair, NPN will then compute the mean and variance of the output Gaussian distribution, μ_θ(x_m, x_s) and s_θ(x_m, x_s), in closed form (bias terms are ignored for clarity) as [equation reproduced as image media_image3.png]. Hence, the output of this Gaussian NPN is a tuple (μ_θ(x_m, x_s), s_θ(x_m, x_s)) representing a Gaussian distribution instead of a single value. Input variance x_s to NPN can be set to 0 if not available. Note that since s_θ(x_m, 0) = x_m² w_s, w_s can still be learned even if x_s = 0 for all data points.
(BRI: Gaussian Natural-Parameter Network (NPN) for learning operates by modeling all weights, neurons, and inputs as distributions rather than single, deterministic values. In this framework, the model inherently represents the properties and relationships of a plurality (collection) of data points by capturing their collective mean and variance (and thus uncertainty), not just individual instances)
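The closed-form moment propagation quoted from WANG for a Gaussian NPN linear layer can be illustrated with the standard variance identity for a product of independent Gaussians. The function name and the numeric values below are illustrative assumptions, not taken from WANG or the claims:

```python
def npn_linear(x_m, x_s, w_m, w_s):
    """Gaussian NPN linear layer (bias ignored, as in WANG [4.3.4]):
    propagates the (mean, variance) pair of input x through weight w,
    both modeled as independent Gaussians N(x_m, x_s) and N(w_m, w_s).
    Returns the mean and variance of the output distribution."""
    mean = w_m * x_m
    # Variance of the product of two independent Gaussians:
    var = x_s * w_s + x_s * w_m ** 2 + x_m ** 2 * w_s
    return mean, var

# Deterministic input (x_s = 0): the output variance reduces to
# x_m**2 * w_s, which still depends on w_s, so w_s remains learnable.
out_mean, out_var = npn_linear(x_m=2.0, x_s=0.0, w_m=3.0, w_s=0.5)
```

This matches the observation in the quoted passage that s_θ(x_m, 0) = x_m² w_s, i.e., the weight variance is identifiable even when every input is deterministic.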
- mapping each pair of the first plurality of data points and of the second plurality of data points from the training data set onto a corresponding latent observation, using a first neural network, and onto an uncertainty of the corresponding latent observation, using a second neural network;
In [3.1, Page 108:9]:
Due to its Bayesian nature, PGM such as LDA is easy to extend to incorporate other information or to perform other tasks. For example, following LDA, different variants of topic models have been proposed.
In [3.2, Page 108:9]:
MAP, as efficient as it is, gives us only point estimates of latent variables (and parameters). To take the uncertainty into account and harness the full power of Bayesian models, one would have to resort to Bayesian treatments such as variational inference and Markov chain Monte Carlo (MCMC). For example, the original LDA uses variational inference to approximate the true posterior with factorized variational distributions
In [5.3.1, Page 108:29]:
Stochastic Optimal Control. We consider the stochastic optimal control of an unknown dynamical system as follows: [equation reproduced as image media_image4.png], where t indexes the time steps, z_t ∈ R^{n_z} is the latent state, u_t ∈ R^{n_u} is the applied control at time t, and ξ denotes the system noise. Equivalently, the equation above can be written as [equation reproduced as image media_image5.png]. Hence, we need a mapping function to map the corresponding raw image x_t (observed input) into the latent space: [equation reproduced as image media_image6.png], where ω is the corresponding system noise.
(BRI: different variants of a model, trained on diverse datasets, can provide a plurality of data. This effectively provides multiple distributions of results (see [3.1, Page 108:9]).)
- aggregating a Bayesian a-posteriori distribution for the plurality of latent variables under a condition that the plurality of latent observations has set in, the aggregating being carried out, using Bayesian inference, through which information contained in the training data set is transferred directly into the statistical description of the plurality of latent variables
In [4.1, Page 108:10]:
BNN imposes a prior on the neural network’s parameters and aims to learn a posterior distribution of these parameters. During the inference phrase, such a distribution is then marginalized out to produce final predictions. In general such a process is called Bayesian model averaging [5] and can be seen as learning an infinite number of (or a distribution over) neural networks and then aggregating the results through ensembling.
In [5.4.5, Page 108:31]:
Time Series Forecasting. Time series forecasting is a long-standing core problem in economics, statistics, and machine learning
In [1, Page 108:3]:
BDL model consists of two components, a perception component that is a Bayesian formulation of a certain type of neural networks and a task-specific component that describes the relationship among different hidden or observed variables using PGM.
In [4.4.2, Page 108:16]:
Bidirectional Inference Networks. Typical Bayesian networks assume “shallow” conditional dependencies among random variables. In the generative process, one random variable (which can be either latent or observed) is usually drawn from a conditional distribution
parameterized by the linear combination of its parent variables
In [4.4.2, Page 108:16]:
Such “shallow” and linear structures can be replaced with nonlinear or even deep nonlinear structures to form a deep Bayesian network. As an example, bidirectional inference network (BIN) [117] is a class of deep Bayesian networks that enable deep nonlinear structures in each conditional distribution
(BRI: a deep Bayesian network (DBN) with a nonlinear structure can infer observed data from unobserved (latent) variables. This process involves learning the complex, typically nonlinear, relationships between the latent and observed variables and effectively transferring the statistical description (or information) of the latent space to the observed data distribution)
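The generative direction described above (a statistical description of latent variables being carried into the observed data through a nonlinear structure) can be sketched as ancestral sampling from a two-level model. The specific latent prior, the tanh "network" g, and the noise scale below are hypothetical stand-ins, not details from WANG or the claims:

```python
import math
import random

random.seed(1)

def g(z):
    """Hypothetical nonlinear conditional mean: a tiny nonlinear
    transformation standing in for a deep network layer."""
    return math.tanh(2.0 * z)

def sample_observed(n, sigma=0.1):
    """Ancestral sampling: draw latent z ~ N(0, 1), then observed
    x ~ N(g(z), sigma^2). The latent distribution's structure is
    transferred into the observed data through g."""
    samples = []
    for _ in range(n):
        z = random.gauss(0.0, 1.0)
        x = random.gauss(g(z), sigma)
        samples.append((z, x))
    return samples

pairs = sample_observed(1000)
```

Because g is monotone here, latent and observed values co-vary, which is the sense in which information about the latent variables is recoverable from the observed distribution.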
- and calculating a plurality of latent observations and a plurality of their uncertainties, wherein aggregating the Bayesian a-posteriori distribution includes implementing a plurality of factored Gaussian distributions, wherein each uncertainty is a variance of a corresponding Gaussian distribution, wherein generating the a-posteriori predictive distribution includes the following further steps:
In [4.3.4, Page 108:15]:
As a simple example, in a vanilla linear NN, f_w(x) = wx takes a scalar x as input and computes the output based on a scalar parameter w; a corresponding Gaussian NPN would assume w is drawn from a Gaussian distribution N(w_m, w_s) and that x is drawn from N(x_m, x_s) (x_s is set to 0 when the input is deterministic).
In [4.3.4, Page 108:15]:
With θ = (w_m, w_s) as a learnable parameter pair, NPN will then compute the mean and variance of the output Gaussian distribution, μ_θ(x_m, x_s) and s_θ(x_m, x_s), in closed form (bias terms are ignored for clarity) as [equation reproduced as image media_image3.png]. Hence, the output of this Gaussian NPN is a tuple (μ_θ(x_m, x_s), s_θ(x_m, x_s)) representing a Gaussian distribution instead of a single value. Input variance x_s to NPN can be set to 0 if not available. Note that since s_θ(x_m, 0) = x_m² w_s, w_s can still be learned even if x_s = 0 for all data points.
In [4.2, Page 108:12]:
As mentioned in Section 1, BDL is a principled probabilistic framework with two seamlessly integrated components: a perception component and a task-specific component.
In [4.2, Page 108:12]:
Three Variable Sets: There are three sets of variables in a BDL model: perception variables, hinge variables, and task variables. In this article, we use Ω_p to denote the set of perception variables (e.g., X0, X1, and W1 in Figure 5), which are the variables in the perception component. Usually Ω_p would include the weights and neurons in the probabilistic formulation of a deep learning model. Ω_h is used to denote the set of hinge variables (e.g., H in Figure 5). These variables directly interact with the perception component from the task-specific component. The set of task variables (e.g., A, B, and C in Figure 5), i.e., variables in the task-specific component without direct relation to the perception component, is denoted as Ω_t.
In [4.2, Page 108:12]:
Flexibility of Variance for Ω_h: one of BDL’s motivations is to model the uncertainty of exchanging information between the perception component and the task-specific component, which boils down to modeling the uncertainty related to Ω_h.
In [4.2, Page 108:12]:
Hyper-Variance: Hyper-Variance (HV) assumes that uncertainty during the information exchange is defined through hyperparameters. In the example, HV means that σ_p² is a manually tuned hyperparameter.
In [4.2, Page 108:12]:
Learnable Variance: Learnable Variance (LV) uses learnable parameters to represent uncertainty during the information exchange. In the example, σ_p² is the learnable parameter.
- generating a second approximate a-posteriori distribution for the plurality of latent variables under a condition that the training data set has set in, the second approximate a-posteriori distribution being further described by a set of parameters, which is parameterized over a parameter common to the training data set; iteratively calculating the set of parameters based on the calculated plurality of latent observations and the calculated plurality of their uncertainties;
In [1, Page 108:1]:
consider controlling a complex dynamical system according to the live video stream received from a camera. This problem can be transformed into iteratively performing two tasks, perception from raw images and control based on dynamic models
In [5.2, Page 108:26]:
Learning and Inference: Reference [118] provides an EM-style algorithm for MAP estimation.
In [5.2, Page 108:26]:
For the E step, the challenge lies in the inference of the relational latent matrix S. We first fix all rows of S except the kth one, S_k*, and then update S_k*. Specifically, we take the gradient of L with respect to S_k*, set it to 0, and get the following linear system: [equation reproduced as image media_image7.png]
In [5.2, Page 108:26]:
A naive approach is to solve the linear system by setting [equation reproduced as image media_image8.png]. Unfortunately, the complexity is O(J³) for one single update. Similar to Reference [67], the steepest descent method [101] is used to iteratively update S_k*.
(BRI: iteratively updating elements of a latent matrix often represents iteratively recalculating parameters for the underlying latent observations.)
In [1, Page 108:3]:
BDL’s major advantages as a principled way of unifying deep learning and PGM: information exchange between the perception task and the inference task, conditional dependencies on high-dimensional data, and effective modeling of uncertainty. In terms of uncertainty, it is worth noting that when BDL is applied to complex tasks, there are three kinds of parameter uncertainty that need to be taken into account: (1) Uncertainty on the neural network parameters. (2) Uncertainty on the task-specific parameters. (3) Uncertainty of exchanging information between the perception component and the task-specific component.
- iteratively calculating the set of parameters based on the calculated plurality of latent observations and the calculated plurality of their uncertainties
In [1, Page 108:1]:
consider controlling a complex dynamical system according to the live video stream received from a camera. This problem can be transformed into iteratively performing two tasks, perception from raw images and control based on dynamic models
In [1, Page 108:3]:
BDL’s major advantages as a principled way of unifying deep learning and PGM: information exchange between the perception task and the inference task, conditional dependencies on high-dimensional data, and effective modeling of uncertainty. In terms of uncertainty, it is worth noting that when BDL is applied to complex tasks, there are three kinds of parameter uncertainty that need to be taken into account: (1) Uncertainty on the neural network parameters. (2) Uncertainty on the task-specific parameters. (3) Uncertainty of exchanging information between the perception component and the task-specific component.
In [5.2, Page 108:26]:
A naive approach is to solve the linear system by setting [equation reproduced as image media_image8.png]. Unfortunately, the complexity is O(J³) for one single update. Similar to Reference [67], the steepest descent method [101] is used to iteratively update S_k*.
(BRI: iteratively updating elements of a latent matrix often represents iteratively recalculating parameters for the underlying latent observations.)
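The steepest descent update invoked in WANG (used to avoid the O(J³) cost of a direct solve) can be sketched for a generic symmetric positive-definite linear system. The 2×2 system and iteration count below are hypothetical stand-ins for the latent-matrix update, not values from WANG:

```python
def matvec(A, v):
    """Multiply a matrix (given as a list of rows) by a vector."""
    return [sum(a * x for a, x in zip(row, v)) for row in A]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def steepest_descent(A, b, iters=200):
    """Iteratively solve A x = b for symmetric positive-definite A.
    Each step moves along the residual r = b - A x with the exact
    line-search step size alpha = (r . r) / (r . A r), so each update
    costs only a few matrix-vector products instead of a full solve."""
    x = [0.0] * len(b)
    for _ in range(iters):
        r = [bi - ri for bi, ri in zip(b, matvec(A, x))]
        rr = dot(r, r)
        if rr == 0.0:
            break
        alpha = rr / dot(r, matvec(A, r))
        x = [xi + alpha * ri for xi, ri in zip(x, r)]
    return x

# Hypothetical 2x2 SPD system; the exact solution is (1/11, 7/11).
sol = steepest_descent([[4.0, 1.0], [1.0, 3.0]], [1.0, 2.0])
```

Each iteration is O(J²) for a dense J×J system (and cheaper if A is sparse), which is the trade-off against the O(J³) direct solve noted in the quoted passage.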
- receiving another training data set, which includes a third plurality of data points and a fourth plurality of data points;
In [5.1.1, Page 108:17]:
the recommendation task considered in CDL takes implicit feedback [50] as the training and test data.
In [4.3, Page 108:13]:
More recently, generative adversarial networks (GAN) [30] prevail as a new training scheme for training neural networks and have shown promise in generating photo-realistic images. Later on, Bayesian formulations (as well as related theoretical results) for GAN have also been proposed [30,
In [5.1.1, Page 108:19]:
Seeing from the view of neural networks (NN), when λ_s approaches positive infinity, training of the probabilistic graphical model of CDL in Figure 7 (left) would degenerate to simultaneously training two neural networks overlaid together with a common input layer (the corrupted input) but different output layers, as shown in Figure 8 (left).
In [3.1, Page 108:9]:
Due to its Bayesian nature, PGM such as LDA is easy to extend to incorporate other information or to perform other tasks. For example, following LDA, different variants of topic models have been proposed.
In [5.1.1, Page 108:18]:
[Figure reproduced as image media_image9.png]
In [5.1.1, Page 108:19]:
[Figure reproduced as image media_image10.png]
(BRI: training two neural networks simultaneously often involves using different datasets or specialized training methods like GANs (Generative Adversarial Networks) where one network (Generator) creates data that the other (Discriminator) evaluates against real data, effectively learning from separate data streams to improve each other, but they don't automatically get a new dataset; it's about strategic design, like in Transfer Learning or Ensemble Methods)
(BRI: different variants of a model, trained on diverse datasets, can provide a plurality of data. This effectively provides multiple distributions of results (see [3.1, Page 108:9]))
- calculating the fourth plurality of data points, using the given subset of functions from the general, given family of functions, the given subset of functions is calculated on the third plurality of data points;
In [4.3.4 , Page 108:15]:
As a simple example, a vanilla linear NN f_w(x) = wx takes a scalar x as input and computes the output based on a scalar parameter w; a corresponding Gaussian NPN would assume w is drawn from a Gaussian distribution N(w_m, w_s) and that x is drawn from N(x_m, x_s) (x_s is set to 0 when the input is deterministic).
(BRI: functions that are "vanilla linear with scalar input" represent a subset of the functions that can be represented by a neural network with non-linear activation functions; "NPN" here refers to the Natural-parameter Network of the reference (see [4.3.4, Page 108:14]))
In [5.3.1, Page 108:29]:
Stochastic Optimal Control.
we consider the stochastic optimal control of an unknown dynamical system as follows:
[Image: media_image4.png (Greyscale)]
where t indexes the time steps, z_t ∈ R^{n_z} is the latent state, u_t ∈ R^{n_u} is the applied control at time t, and ξ denotes the system noise. Equivalently, the equation above can be written as
[Image: media_image11.png (Greyscale)]
Hence, we need a mapping function to map the corresponding raw image x_t (observed input) into the latent space.
(BRI: Mapping an observed input into a latent space over several time steps can indeed provide a set of data points within that dataset)
(BRI: sequence of data points indexed by time steps, where you model changes over time, is the definition of a time series)
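The cited mapping can be sketched as follows (illustration only, not part of the cited record; the linear encoder and all dimensions are hypothetical): mapping each observed input x_t into the latent space at successive time steps yields a time series of latent states z_t.

```python
import random

random.seed(0)

# Hypothetical linear "encoder" standing in for the mapping function that
# takes a raw observation x_t (e.g., a flattened image) to a latent z_t.
OBS_DIM, LATENT_DIM = 8, 2
W_enc = [[random.gauss(0, 1) for _ in range(OBS_DIM)] for _ in range(LATENT_DIM)]

def encode(x_t):
    """Map one observed input x_t into the latent space: z_t = W_enc @ x_t."""
    return [sum(w * x for w, x in zip(row, x_t)) for row in W_enc]

# A sequence of observations indexed by time steps t = 0..T-1; mapping each
# one produces a time series of latent states z_0, ..., z_{T-1}.
T = 5
observations = [[random.gauss(0, 1) for _ in range(OBS_DIM)] for _ in range(T)]
latent_series = [encode(x_t) for x_t in observations]

assert len(latent_series) == T and len(latent_series[0]) == LATENT_DIM
```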
- and generating the a-posteriori predictive distribution further includes generating a third distribution, using a third and fourth neural network, wherein the third distribution is a function of the plurality of latent variables, the set of parameters, task-independent variables, and the other training data set,
In [5.3.4, Page 108:30]:
Note that the BDL-based control model discussed above uses a different information exchange mechanism
In [5.3.4, Page 108:30]:
it follows the VAE mechanism and uses neural networks to separately parameterize the mean and covariance of hinge variables (e.g., in the encoding model, the hinge variable
[Image: media_image12.png (Greyscale)]
where μ_t and σ_t are perception variables parameterized as in Equation (19)), which is more flexible (with more free parameters) than models like CDL and CDR in Section 5.1, where Gaussian distributions with fixed variance are also used. Note that this BDL-based control model is an LV model as shown in Table 1, and since the covariance is assumed to be diagonal, the model still meets the independence requirement in Section 4.
In [4, Page 108:10]:
[Image: media_image13.png (Greyscale)]
In [3.1, Page 108:9]:
Due to its Bayesian nature, PGM such as LDA is easy to extend to incorporate other information or to perform other tasks. For example, following LDA, different variants of topic models have been proposed.
(BRI: The core nature of Probabilistic Graphical Models (PGMs) like Latent Dirichlet Allocation (LDA) lies in their ability to model data generation as a process involving multiple probability distributions, which can be modified or extended in various ways as a result of using variants of topic models)
- wherein: generating the a-posteriori predictive distribution includes optimizing a likelihood distribution with regard to the task-independent variables and the common parameter;
In [3.2, Page 108:9]:
Strictly speaking, the process of finding the parameters (e.g., α and β in Figure 4) is called learning and the process of finding the latent variables (e.g., θ and z in Figure 4) given the parameters is called inference. However, given only the observed variables (e.g., w in Figure 4), learning and inference are often intertwined. Usually the learning and inference of LDA would alternate between the updates of latent variables (which correspond to inference) and the updates of the parameters (which correspond to learning). Once the learning and inference of LDA is completed, one could obtain the learned parameters α and β.
MAP, as efficient as it is, gives us only point estimates of latent variables (and parameters). To take the uncertainty into account and harness the full power of Bayesian models, one would have to resort to Bayesian treatments such as variational inference and Markov chain Monte Carlo (MCMC). For example, the original LDA uses variational inference to approximate the true posterior with factorized variational distributions [8]. Learning of the latent variables and parameters then boils down to minimizing the KL-divergence between the variational distributions and the true posterior distributions,
In [5.1.6, Page 108:23]:
CDL-based models use CF as a more complex target in a probabilistic framework
(BRI: minimizing the KL-divergence between a variational distribution and the true posterior distribution is a core optimization strategy in variational inference: it finds the best simple (variational) distribution to approximate a complex target distribution, turning inference into an optimization problem solvable with gradient descent methods)
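The KL-minimization mechanism cited above can be illustrated with a minimal sketch (the one-dimensional Gaussians, initial value, and step size are hypothetical choices, not from the record): gradient descent on the closed-form KL-divergence drives the variational mean toward the target posterior mean.

```python
import math

def kl_gauss(m, s, m0, s0):
    """KL( N(m, s^2) || N(m0, s0^2) ) in closed form."""
    return math.log(s0 / s) + (s**2 + (m - m0)**2) / (2 * s0**2) - 0.5

# "True posterior" (assumed known here only for illustration).
m0, s0 = 2.0, 1.0
# Variational distribution: fixed scale, learn the mean by gradient descent.
m, s, lr = -3.0, 1.0, 0.5
for _ in range(100):
    grad_m = (m - m0) / s0**2      # d KL / d m
    m -= lr * grad_m

# The variational mean converges to the target mean, and the KL goes to 0.
assert abs(m - m0) < 1e-6
assert kl_gauss(m, s, m0, s0) < 1e-10
```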
- optimizing the likelihood distribution includes maximizing the likelihood distribution with regard to the task-independent variables and the maximizing is based on the second approximate a-posteriori distribution generated and on the third distribution generated;
In [3.2, Page 108:9]:
various learning and inference algorithms are available for each PGM. Among them, the most cost-effective one is probably maximum a posteriori (MAP), which amounts to maximizing the posterior probability of the latent variable.
In [3.1, Page 108:9]:
Due to its Bayesian nature, PGM such as LDA is easy to extend to incorporate other information or to perform other tasks. For example, following LDA, different variants of topic models have been proposed,
In [ 2.2, Page 108:5]:
we introduce a kind of multilayer denoising AE, known as stacked denoising autoencoders (SDAE), both as an example of AE variants and as background for its applications on BDL-based recommender systems,
In [5.1.1, Page 108:18]:
The output of layer l of the SDAE is denoted by X_l, which is a J-by-K_l matrix. Similar to X_c, row j of X_l is denoted by X_{l,j*}. W_l and b_l are the weight matrix and bias vector, respectively, of layer l, W_{l,*n} denotes column n of W_l, and L is the number of layers. For convenience, we use W+ to denote the collection of all layers of weight matrices and biases.
In [5.1.1, Page 108:19]:
maximizing the posterior probability is equivalent to maximizing the joint log-likelihood of
[Image: media_image14.png (Greyscale)]
(BRI: the second and third distribution are within the context of different variants of the model. Maximizing the joint log-likelihood is mathematically equivalent to maximizing the joint likelihood because the logarithm is a monotonically increasing function and preserves the order of values, so the parameter setting that yields the highest likelihood also yields the highest log-likelihood)
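The monotonicity point in the BRI note can be checked numerically (the toy Bernoulli data and the parameter grid are hypothetical, used only for illustration): the parameter that maximizes the likelihood also maximizes the log-likelihood.

```python
import math

# Toy Bernoulli data and a grid of candidate parameters p.
data = [1, 1, 0, 1, 0, 1, 1, 1]
grid = [i / 100 for i in range(1, 100)]

def likelihood(p):
    out = 1.0
    for x in data:
        out *= p if x == 1 else (1 - p)
    return out

def log_likelihood(p):
    return sum(math.log(p) if x == 1 else math.log(1 - p) for x in data)

best_lik = max(grid, key=likelihood)
best_log = max(grid, key=log_likelihood)

# log is monotonically increasing, so both criteria pick the same p;
# for Bernoulli data the maximizer is the sample mean (6/8 = 0.75 here).
assert best_lik == best_log
assert abs(best_lik - sum(data) / len(data)) < 0.01
```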
- maximizing the likelihood distribution includes calculating an integral over a function of latent variables, which contains respective products of the second approximate a-posteriori distribution and of the third distribution;
In [3.1, Page 108:9]:
Due to its Bayesian nature, PGM such as LDA is easy to extend to incorporate other information or to perform other tasks. For example, following LDA, different variants of topic models have been proposed,
(BRI: the second and third distribution in the context of different variants of the models)
In [5.3.1, Page 108:29]:
Stochastic Optimal Control.
we consider the stochastic optimal control of an unknown dynamical system as follows:
[Image: media_image4.png (Greyscale)]
where t indexes the time steps, z_t ∈ R^{n_z} is the latent state, u_t ∈ R^{n_u} is the applied control at time t, and ξ denotes the system noise.
(BRI: sequence of data points indexed by time steps, where you model changes over time, is the definition of a time series, characterized by its temporal order and dependency between consecutive points, used for analysis and forecasting trends, seasonality, and future values)
In [5.1.6, Page 108:23]:
Recommender systems are a typical use case for BDL in that they often require both thorough understanding of high-dimensional signals (e.g., text and images) and principled reasoning on the conditional dependencies among users/items/ratings. In this regard, CDL, as an instantiation of BDL, is the first hierarchical Bayesian model to bridge the gap between state-of-the-art deep learning models and recommender systems. By performing deep learning collaboratively, CDL and its variants can simultaneously extract an effective deep feature representation from high-dimensional content and capture the similarity and implicit relationship between items (and users).
In [3.2, Page 108:9]:
MAP, as efficient as it is, gives us only point estimates of latent variables (and parameters). To take the uncertainty into account and harness the full power of Bayesian models, one would have to resort to Bayesian treatments such as variational inference and Markov chain Monte Carlo (MCMC). For example, the original LDA uses variational inference to approximate the true posterior with factorized variational distributions [8]. Learning of the latent variables and parameters then boils down to minimizing the KL-divergence between the variational distributions and the true posterior distributions.
In [5.3.2, Page 108:30]:
the posterior distribution P_θ(X|Z) reconstructs the raw images x_t from the latent states z_t.
In [5.3.3, Page 108:30]:
Learning Using Stochastic Gradient Variational Bayes. With D = {(X_1, U_1, X_2), ..., (X_{T-1}, U_{T-1}, X_T)} as the training set, the loss function is as follows:
[Image: media_image15.png (Greyscale)]
where the first term is the variational bound on the marginalized log-likelihood for each data point:
in [4.4.2, Page 108:16]:
Compared to vanilla (shallow) Bayesian networks, deep Bayesian networks such as BIN make it possible to handle deep and nonlinear conditional dependencies effectively and efficiently. Besides, with BNN as building blocks, task-specific components based on deep Bayesian networks can better work with the perception component, which is usually a BNN as well. Figure 6 (right) shows a more complicated case with both observed (shaded nodes) and unobserved (transparent nodes) variables
[Image: media_image16.png (Greyscale)]
In [5.2, Page 108:24]:
novel probabilistic model that seamlessly integrates a hierarchy of latent factors and the relational information available.
(BRI: integration provides a means for calculating an integral over a function of latent variables with the support of the MCMC method (see [3.2, Page 108:9]): MCMC generates samples from the posterior distribution to approximate the integral.)
(BRI: the marginalized log-likelihood for each data point involves calculating an integral (or sum for discrete variables) over the latent variables, essentially it averages the likelihood across all possible latent states weighted by their probabilities to get the evidence for that data point, which is crucial for Bayesian inference. This process integrates out the unobserved latent variables for determining the probability of the observed data)
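The marginalized likelihood described above can be sketched as a Monte Carlo approximation of the integral p(x) = ∫ p(x|z) p(z) dz, averaging the likelihood over samples from the prior (the toy Gaussian model below is hypothetical, chosen because its marginal is exactly N(0, 2) for comparison):

```python
import math, random

random.seed(0)

def normal_pdf(x, mu, var):
    return math.exp(-(x - mu)**2 / (2 * var)) / math.sqrt(2 * math.pi * var)

# Toy model: z ~ N(0, 1), x | z ~ N(z, 1)  =>  marginal x ~ N(0, 2) exactly.
x_obs = 0.7

# Evidence p(x) = ∫ p(x|z) p(z) dz, approximated by averaging the
# likelihood p(x|z) over samples drawn from the prior p(z); this
# "integrates out" the unobserved latent variable.
samples = [random.gauss(0.0, 1.0) for _ in range(200_000)]
p_x_mc = sum(normal_pdf(x_obs, z, 1.0) for z in samples) / len(samples)

p_x_exact = normal_pdf(x_obs, 0.0, 2.0)
assert abs(p_x_mc - p_x_exact) < 0.005
```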
- calculating the integral includes approximating the integral with regard to the plurality of latent variables, using a non-stochastic loss function, which is based on the set of parameters of the second approximate a-posteriori distribution.
In [3.1, Page 108:9]:
Due to its Bayesian nature, PGM such as LDA is easy to extend to incorporate other information or to perform other tasks. For example, following LDA, different variants of topic models have been proposed,
(BRI: the second approximate posterior distribution is within the context of different variants of the models that represents plurality of distributions)
In [4.4, Page 108:15]:
we introduce different forms of task-specific components. The purpose of a task-specific component is to incorporate probabilistic prior knowledge into the BDL model. Such knowledge can be naturally represented using PGM.
a task-specific component can take various forms. For example, it can be a typical Bayesian network (directed PGM) such as LDA, a deep Bayesian network [117], or a stochastic process [51, 94], all of which can be represented in the form of PGM
in [5.1.1, Page 108:19]:
Learning: Based on the CDL model above, all parameters could be treated as random variables so that fully Bayesian methods such as Markov chain Monte Carlo (MCMC) or variational inference [55] may be applied. However, such treatment typically incurs high computational cost. Therefore, CDL uses an EM-style algorithm to obtain the MAP estimates
(BRI: The conventional Expectation-Maximization (EM) algorithm is deterministic in its approach because the E-step and M-step calculations produce a single, predictable sequence of parameters for a given initial value)
in [5.1.1, Page 108:19]:
From the perspective of optimization, the third term in the objective function, i.e., Equation (9), above is equivalent to a multi-layer perceptron using the latent item vectors vj as the target while the fourth term is equivalent to an SDAE minimizing the reconstruction error.
(BRI: Probabilistic Graphical Models (PGMs) use non-stochastic (deterministic) loss functions, such as log-likelihood, cross-entropy, and mean squared error (MSE), for both training and inference)
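The deterministic character of an EM-style point estimate noted above can be sketched with a toy example (the two-component mixture with known means is hypothetical, not the CDL model): the closed-form E and M steps yield an identical parameter sequence for identical initializations.

```python
import math

def normal_pdf(x, mu):
    return math.exp(-(x - mu)**2 / 2) / math.sqrt(2 * math.pi)

data = [-2.1, -1.9, -2.3, 1.8, 2.2, 2.0, 1.7]

def run_em(pi0, iters=50):
    """EM for the mixing weight of a two-component mixture with known
    components N(-2, 1) and N(2, 1); both steps are closed-form, so the
    parameter sequence is fully deterministic (no sampling involved)."""
    pi = pi0
    for _ in range(iters):
        # E-step: responsibility of component 1 for each point
        r = [pi * normal_pdf(x, -2.0) /
             (pi * normal_pdf(x, -2.0) + (1 - pi) * normal_pdf(x, 2.0))
             for x in data]
        # M-step: update mixing weight as the average responsibility
        pi = sum(r) / len(r)
    return pi

# Same initialization -> exactly the same point estimate, every run.
assert run_em(0.5) == run_em(0.5)
# 3 of the 7 points sit near -2, so the weight settles near 3/7.
assert abs(run_em(0.5) - 3 / 7) < 0.01
```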
WANG does not explicitly disclose:
- A computer-implemented method for generating a computer-implemented machine learning system, the machine learning system being configured
However, Zhang discloses:
- A computer-implemented method for generating a computer-implemented machine learning system, the machine learning system being configured
In [Col 3, lines 38-39]:
there is provided a computer-implemented method of machine learning.
In [Col 1, lines 29-30]:
FIG. 1(a) gives a simplified representation of an example neural network 108.
In [Col 1, lines 32-36]:
In practice, there may be many nodes in each layer, but for simplicity only a few are illustrated. Each node is configured to generate an output by carrying out a function on the values input to that node.
It would have been obvious to one of ordinary skill in the art before the effective filing date of the present application to combine WANG and Zhang.
WANG teaches generating an a-posteriori predictive distribution for predicting the
dynamic response of the device, aggregation, posterior distribution, and maximizing
the likelihood distribution for prediction.
Zhang teaches a machine learning system being configured for monitoring and/or controlling a device from input sensor data.
One of ordinary skill would have motivation to combine WANG and Zhang to provide an
improved robustness against unanticipated manipulations (Zhang [Col 4, lines 42-44])
In regard to claim 6: (Currently Amended)
WANG discloses:
- wherein iteratively calculating the set of parameters includes implementing another plurality of factored Gaussian distributions with regard to the latent variables, and the set of parameters corresponds to a plurality of means and variances of the Gaussian distributions.
In [1, Page 108:1]:
consider controlling a complex dynamical system according to the live video stream received from a camera. This problem can be transformed into iteratively performing two tasks, perception from raw images and control based on dynamic models
In [1, Page 108:3]:
BDL’s major advantages as a principled way of unifying deep learning and PGM: information exchange between the perception task and the inference task, conditional dependencies on high-dimensional data, and effective modeling of uncertainty. In terms of uncertainty, it is worth noting that when BDL is applied to complex tasks, there are three kinds of parameter uncertainty that need to be taken into account: (1) Uncertainty on the neural network parameters. (2) Uncertainty on the task-specific parameters. (3) Uncertainty of exchanging information between the perception component and the task-specific component.
In [5.2, Page 108:26]:
A naive approach is to solve the linear system by setting
[Image: media_image8.png (Greyscale)]
Unfortunately, the complexity is O(J^3) for one single update. Similar to Reference [67], the steepest descent method [101] is used to iteratively update S_{k*}.
In [5.4.4, Page 108:31]:
Gaussian mixture VAE [49] uses a Gaussian mixture model as the task-specific component to achieve controllable speech synthesis from text. In terms of speech recognition, recurrent Poisson process units (RPPU) [51] instead adopt a different form of task-specific component;
In [1, Page 108:3]:
the task-specific component. Ideally, both the first-order and second-order information (e.g., the mean and the variance) should be able to flow back and forth between the two components.
In [3.1, Page 108:9]:
Due to its Bayesian nature, PGM such as LDA is easy to extend to incorporate other information or to perform other tasks. For example, following LDA, different variants of topic models have been proposed.
In regard to claim 12: (Currently Amended)
WANG discloses:
- further comprising substituting the task-independent variables derived by the optimization, and the common parameter, in the likelihood distribution, in order to generate the a-posteriori predictive distribution.
In [5.1.1, Page 108:20]:
Let D be the observed test data. Similar to Reference [112], CDL uses the point estimates of u_i, W+, and ϵ_j to calculate the predicted rating:
(BRI: predictive ratings inherently relate to a predictive distribution, as they represent a model's probabilistic forecast of future outcomes, often displayed as a score especially in Bayesian methods)
In [4.1, Page 108:10]:
BNN imposes a prior on the neural network’s parameters and aims to learn a posterior distribution of these parameters. During the inference phrase, such a distribution is then marginalized out to produce final predictions.
(BRI: Bayesian inference learns a posterior distribution for model parameters during training, and to make predictions for new data, this posterior is marginalized (integrated out) to provide posterior predictive distribution)
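The marginalization in this BRI note can be sketched as follows (the Gaussian posterior over a single weight and the linear model are hypothetical placeholders for a BNN's parameter posterior): predictions are averaged over posterior samples to form the posterior predictive mean, i.e., Bayesian model averaging.

```python
import random

random.seed(1)

# Hypothetical posterior over a single linear-model weight w, standing
# in for a BNN parameter posterior: w ~ N(2.0, 0.1^2).
def sample_w():
    return random.gauss(2.0, 0.1)

def predict(w, x):
    return w * x

x_new = 3.0
# Marginalizing out the parameter posterior = averaging predictions over
# posterior samples (the posterior predictive mean).
draws = [predict(sample_w(), x_new) for _ in range(50_000)]
posterior_predictive_mean = sum(draws) / len(draws)

# For this linear model the predictive mean is E[w] * x = 2.0 * 3.0 = 6.0.
assert abs(posterior_predictive_mean - 6.0) < 0.05
```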
In regard to claim 14: (Original)
WANG does not explicitly disclose:
- wherein the device is a machine
However, Zhang discloses:
- wherein the device is a machine
In [ Col 6, lines 9-16]:
FIG. 2 illustrates an example computing apparatus 200 for implementing an artificial intelligence (AI) algorithm including a machine-learning model in accordance with embodiments described herein. The computing apparatus 200 may take the form of a user terminal such as a desktop computer, laptop computer, tablet, smartphone, wearable smart device such as a smart watch, or an on-board computer of a vehicle such as car, etc
(BRI: the computing apparatus, which is a device, can also be a machine)
It would have been obvious to one of ordinary skill in the art before the effective filing date of the present application to combine WANG and Zhang.
WANG teaches generating an a-posteriori predictive distribution for predicting the
dynamic response of the device, aggregation, posterior distribution, and maximizing
the likelihood distribution for prediction.
Zhang teaches machine learning system being configured for monitoring and/or controlling
a device from input sensor data.
One of ordinary skill would have motivation to combine WANG and Zhang to provide an
improved robustness against unanticipated manipulations (Zhang [Col 4, lines 42-44])
In regard to claim 15: (Original)
WANG does not explicitly disclose:
- wherein the device is an engine
However, Zhang discloses:
- wherein the device is an engine
In [ Col 6, lines 9-16]:
FIG. 2 illustrates an example computing apparatus 200 for implementing an artificial intelligence (AI) algorithm including a machine-learning model in accordance with embodiments described herein. The computing apparatus 200 may take the form of a user terminal such as a desktop computer, laptop computer, tablet, smartphone, wearable smart device such as a smart watch, or an on-board computer of a vehicle such as car, etc
(BRI: the computing apparatus, which is a device, can also be an electronic engine)
In regard to claim 16: (Original)
Zhang discloses:
- wherein the computer-implemented machine learning system is configured for modeling parameterization of a characteristics map of the device.
In [0163]:
there is provided computer-implemented method of machine learning, the method comprising: receiving a plurality of observed data points
in [0063]:
each observed data point represents a respective observation of a ground truth as observed in the form of the respective values of the feature vector; and learning parameters of a machine-learning model based on the observed data points, wherein the machine-learning model comprises one or more statistical models arranged to model a causal relationship between the feature vector and a latent vector
in [0018]:
In embodiments, the training data comprises at least two groups of data points: a first group which does not include the manipulation(s), and a second group which does. E.g. the first group may be used in an initial training phase and the second group may be collected during a testing phase or during actual deployment of the model “in-the-field”.
(BRI: Learning the parameters of a machine learning (ML) model from observed data points can be seen as a form of modeling parameterization used to capture and represent the characteristics or behavior of a device, system, or process in which observed data points are the inputs and outputs (or just observations) collected from the physical device)
It would have been obvious to one of ordinary skill in the art before the effective filing date of the present application to combine WANG and Zhang.
WANG teaches generating an a-posteriori predictive distribution for predicting the
dynamic response of the device, aggregation, posterior distribution, and maximizing
the likelihood distribution for prediction.
Zhang teaches machine learning system being configured for monitoring and/or controlling
a device from input sensor data.
One of ordinary skill would have motivation to combine WANG and Zhang to provide an
improved robustness against unanticipated manipulations (Zhang [Col 4, lines 42-44])
In regard to claim 17: (Previously Presented)
WANG does not explicitly disclose:
- further comprising parameterizing the characteristics map of the device, using the computer-implemented machine learning system generated.
However, Zhang discloses:
- further comprising parameterizing the characteristics map of the device, using the computer-implemented machine learning system generated.
in [0163]:
there is provided computer-implemented method of machine learning, the method comprising: receiving a plurality of observed data points
in [0063]:
each observed data point represents a respective observation of a ground truth as observed in the form of the respective values of the feature vector; and learning parameters of a machine-learning model based on the observed data points, wherein the machine-learning model comprises one or more statistical models arranged to model a causal relationship between the feature vector and a latent vector
in [0018]:
In embodiments, the training data comprises at least two groups of data points: a first group which does not include the manipulation(s), and a second group which does. E.g. the first group may be used in an initial training phase and the second group may be collected during a testing phase or during actual deployment of the model “in-the-field”.
(BRI: Learning the parameters of a machine learning (ML) model from observed data points can be seen as a form of modeling parameterization used to capture and represent the characteristics or behavior of a device, system, or process, in which the observed data points are the inputs and outputs (or just observations) collected from the physical device)
In regard to claim 19: (Currently Amended)
WANG discloses:
- for determining output data for monitoring and/or controlling a technical device from input sensor data,
In [5.3, Page 108:28]:
Bayesian deep learning can also be applied to the control of nonlinear dynamical systems from raw images. Consider controlling a complex dynamical system according to the live video stream received from a camera. One way of solving this control problem is by iteration between two tasks, perception from raw images and control based on dynamic models.
In [1, Page 108:2]:
consider controlling a complex dynamical system according to the live video stream received from a camera. This problem can be transformed into iteratively performing two tasks, perception from raw images and control based on dynamic models. The perception task of processing raw images can be handled by deep learning while the control task usually needs more sophisticated models such as hidden Markov models and Kalman filters [35, 74]
(BRI: a camera is a sensor. The control system can integrate input from a camera (as a sensor) to control a technical device, and this system can also incorporate data from other input sensors)
- receiving a training data set, which reflects a dynamic response of a device;
In [5.3, Page 108:28]:
Consider controlling a complex dynamical system according to the live video stream received from a camera. One way of solving this control problem is by iteration between two tasks, perception from raw images and control based on dynamic models.
In [5.3, Page 108:28]:
BDL can be applied in the supervised and unsupervised learning settings, respectively.
In [5.3, Page 108:28]:
BDL can help representation learning in general, using control as an example application
In [5.3, Page 108:28]:
Reference [125] posed this task as a representation learning problem and proposed a model called Embed to Control to take into account the feedback loop mentioned above during representation learning
(BRI: using the interaction between two control tasks can provide a training dataset that reflects the dynamic response of a device, primarily through mechanisms like transfer of learning and adaptive control feedback loops)
- computing an aggregation of at least one latent variable of the machine learning system, using a Bayesian inference, and in view of the training data set being transferred directly into the statistical description of the plurality of latent variables
In [4.1, Page 108:10]:
BDL is Bayesian neural networks (BNN) or Bayesian treatments of neural networks. Similar to any Bayesian treatment, BNN imposes a prior on the neural network’s parameters and aims to learn a posterior distribution of these parameters. During the inference phrase, such a distribution is then marginalized out to produce final predictions. In general such a process is called Bayesian model averaging [5] and can be seen as learning an infinite number of (or a distribution over) neural networks and then aggregating the results through ensembling
In [3.2, Page 108:9]:
the process of finding the parameters (e.g., α and β in Figure 4) is called learning and the process of finding the latent variables (e.g., θ and z in Figure 4) given the parameters is called inference.
In [2.4.2, Page 108:8]:
[Image: media_image1.png (Greyscale)]
[Image: media_image2.png (Greyscale)]
In [3.1, Page 108:8]:
Models There are essentially two types of PGM, directed PGM (also known as Bayesian networks) and undirected PGM (also known as Markov random fields) [5]. In this survey, we mainly focus on directed PGM.4 For details on undirected PGM, readers are referred to Reference [5]. A classic example of PGM would be latent Dirichlet allocation (LDA)
In [3.2, Page 108:9]:
various learning and inference algorithms are available for each PGM. Among them, the most cost-effective one is probably maximum a posteriori (MAP), which amounts to maximizing the posterior probability of the latent variable
- generating an a-posteriori predictive distribution for predicting the dynamic response of the device, using the calculated aggregation, and under a condition that the training data set has set in, wherein the training data set includes a first plurality of data points and a second plurality of data points, and the method includes calculating the second plurality of data points, using a given subset of functions from a general, given family of functions, the given subset of functions is calculated on the first plurality of data points
In [5.1.6, Page 108:23]:
Recommender systems are a typical use case for BDL in that they often require both thorough understanding of high-dimensional signals (e.g., text and images) and principled reasoning on the conditional dependencies among users/items/ratings. In this regard, CDL, as an instantiation of BDL, is the first hierarchical Bayesian model
In [1, Page 108:2]:
consider controlling a complex dynamical system according to the live video stream received from a camera. This problem can be transformed into iteratively performing two tasks, perception from raw images and control based on dynamic models. The perception task of processing raw images can be handled by deep learning while the control task usually needs more sophisticated models such as hidden Markov models and Kalman filters [35, 74]. The feedback loop is then completed by the fact that actions chosen by the control model can affect the received video stream in turn.
In [5.1.6, Page 108:23]:
Note that BDL-based models above use typical static Bayesian networks as their task-specific components. Although these are often sufficient for most use cases, it is possible for the task specific components to take the form of deep Bayesian networks
In [5.1.6, Page 108:23]:
One can also use stochastic processes (or dynamic Bayesian networks in general) to explicitly model users purchase or clicking behaviors.
In [4.1, Page 108:10]:
BNN imposes a prior on the neural network’s parameters and aims to learn a posterior distribution of these parameters. During the inference phrase, such a distribution is then marginalized out to produce final predictions. In general such a process is called Bayesian model averaging [5] and can be seen as learning an infinite number of (or a distribution over) neural networks and then aggregating the results through ensembling.
(BRI: a Bayesian Neural Network (BNN) that imposes a prior distribution on its weights, which acts as a form of regularization or constraint on the possible functions the network can learn and influencing the final posterior distribution after training. While it's a prior before data, it indirectly conditions the model's behavior and output, helping with uncertainty estimation )
In [4.3.4 , Page 108:14]:
Natural-parameter Networks. Different from vanilla NN, which usually takes deterministic input, NPN [119] is a probabilistic NN taking distributions as input. The input distributions go through layers of linear and nonlinear transformation to produce output distributions.
In [4.3.4 , Page 108:15]:
As a simple example, a vanilla linear NN f_w(x) = wx takes a scalar x as input and computes the output based on a scalar parameter w; a corresponding Gaussian NPN would assume w is drawn from a Gaussian distribution N(w_m, w_s) and that x is drawn from N(x_m, x_s) (x_s is set to 0 when the input is deterministic).
(BRI: functions that are "vanilla linear with scalar input" represent a subset of the functions that can be represented by a neural network with non-linear activation functions; "NPN" here refers to the Natural-parameter Network of the reference (see [4.3.4, Page 108:14]))
In [4.3.4 , Page 108:15]:
With θ = (w_m, w_s) as a learnable parameter pair, NPN will then compute the mean and variance of the output Gaussian distribution, μ_θ(x_m, x_s) and s_θ(x_m, x_s), in closed form (bias terms are ignored for clarity) as
[Equation image: media_image3.png]
Hence, the output of this Gaussian NPN is a tuple (μ_θ(x_m, x_s), s_θ(x_m, x_s)) representing a Gaussian distribution instead of a single value. Input variance x_s to NPN can be set to 0 if not available. Note that since s_θ(x_m, 0) = x_m^2 w_s, w_s can still be learned even if x_s = 0 for all data points.
(BRI: Gaussian Natural-Parameter Network (NPN) for learning operates by modeling all weights, neurons, and inputs as distributions rather than single, deterministic values. In this framework, the model inherently represents the properties and relationships of a plurality (collection) of data points by capturing their collective mean and variance (and thus uncertainty), not just individual instances)
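The closed-form propagation in the quoted passage can be sketched for the scalar case (a sketch under the assumption of independent Gaussian w and x; the function name is illustrative, not from WANG):

```python
# Sketch of a scalar Gaussian NPN linear layer: for y = w * x with
# w ~ N(w_m, w_s) and x ~ N(x_m, x_s) independent, the output mean and
# variance are available in closed form.
def npn_linear(x_m, x_s, w_m, w_s):
    o_m = x_m * w_m
    o_s = x_s * w_s + x_s * w_m ** 2 + x_m ** 2 * w_s
    return o_m, o_s

# With deterministic input (x_s = 0) the variance reduces to x_m^2 * w_s,
# so w_s still influences the output and remains learnable.
o_m, o_s = npn_linear(x_m=2.0, x_s=0.0, w_m=1.5, w_s=0.25)
```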
- mapping each pair of the first plurality of data points and of the second plurality of data points from the training data set onto a corresponding latent observation, using a first neural network, and onto an uncertainty of the corresponding latent observation, using a second neural network;
In [3.1, Page 108:9]:
Due to its Bayesian nature, PGM such as LDA is easy to extend to incorporate other information or to perform other tasks. For example, following LDA, different variants of topic models have been proposed.
In [3.2, Page 108:9]:
MAP, as efficient as it is, gives us only point estimates of latent variables (and parameters). To take the uncertainty into account and harness the full power of Bayesian models, one would have to resort to Bayesian treatments such as variational inference and Markov chain Monte Carlo (MCMC). For example, the original LDA uses variational inference to approximate the true posterior with factorized variational distributions
In [5.3.1, Page 108:29]:
Stochastic Optimal Control.
we consider the stochastic optimal control of an unknown dynamical system as follows:
[Equation image: media_image4.png]
where t indexes the time steps, z_t ∈ R^{n_z} is the latent state, u_t ∈ R^{n_u} is the applied control at time t, and ξ denotes the system noise. Equivalently, the equation above can be written as
[Equation image: media_image5.png]
Hence, we need a mapping function to map the corresponding raw image x_t (observed input) into the latent space
[Equation image: media_image6.png]
where ω is the corresponding system noise.
(BRI: different variants of a model, trained on diverse datasets, can provide a plurality of data. This effectively provides multiple distributions of results (see [3.1, Page 108:9]))
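The mapping of an observed input into a latent space together with an uncertainty estimate can be sketched as follows (a minimal illustration, not WANG's model: each "network" here is a single linear map with made-up weights, and the softplus squashing is an assumed choice to keep the predicted variance positive):

```python
import math

# Hypothetical sketch: one network produces the latent observation (mean),
# a second produces its uncertainty (variance). Real models would use deep
# encoders; these single linear maps only illustrate the two-network split.
def latent_mean(x, w=(0.5, -0.2), b=0.1):
    return w[0] * x[0] + w[1] * x[1] + b

def latent_var(x, w=(0.3, 0.3), b=0.0):
    s = w[0] * x[0] + w[1] * x[1] + b
    # softplus keeps the predicted variance strictly positive
    return math.log1p(math.exp(s))

x_t = (1.0, 2.0)          # observed input
z_mean = latent_mean(x_t)  # latent observation
z_var = latent_var(x_t)    # its uncertainty
```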
- aggregating a Bayesian a-posteriori distribution for the plurality of latent variables under a condition that the plurality of latent observations has set in, the aggregating being carried out, using Bayesian inference, through which information contained in the training data set is transferred directly into the statistical description of the plurality of latent variables
In [4.1, Page 108:10]:
BNN imposes a prior on the neural network’s parameters and aims to learn a posterior distribution of these parameters. During the inference phase, such a distribution is then marginalized out to produce final predictions. In general such a process is called Bayesian model averaging [5] and can be seen as learning an infinite number of (or a distribution over) neural networks and then aggregating the results through ensembling.
In [5.4.5, Page 108:31]:
Time Series Forecasting. Time series forecasting is a long-standing core problem in economics, statistics, and machine learning
In [1, Page 108:3]:
BDL model consists of two components, a perception component that is a Bayesian formulation of a certain type of neural networks and a task-specific component that describes the relationship among different hidden or observed variables using PGM.
In [4.4.2, Page 108:16]:
Bidirectional Inference Networks. Typical Bayesian networks assume “shallow” conditional dependencies among random variables. In the generative process, one random variable (which can be either latent or observed) is usually drawn from a conditional distribution
parameterized by the linear combination of its parent variables
In [4.4.2, Page 108:16]:
Such “shallow” and linear structures can be replaced with nonlinear or even deep nonlinear structures to form a deep Bayesian network. As an example, bidirectional inference network (BIN) [117] is a class of deep Bayesian networks that enable deep nonlinear structures in each conditional distribution
(BRI: a deep Bayesian network (DBN) with a nonlinear structure can infer observed data from unobserved (latent) variables. This process involves learning the complex, typically nonlinear, relationships between the latent and observed variables and effectively transferring the statistical description (or information) of the latent space to the observed data distribution)
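Aggregating factored Gaussian latent observations into a Bayesian a-posteriori distribution can be sketched with precision weighting (a standard conjugate-Gaussian identity, not a quotation from WANG; the flat-prior default is an assumption for illustration):

```python
# Sketch: aggregating independent Gaussian latent observations into a
# Gaussian a-posteriori distribution by Bayesian inference. Each observation
# contributes its mean weighted by its precision (inverse variance), so the
# uncertainty of each observation directly shapes the posterior.
def aggregate_gaussians(means, variances, prior_mean=0.0, prior_var=1e6):
    precision = 1.0 / prior_var
    weighted = prior_mean / prior_var
    for m, v in zip(means, variances):
        precision += 1.0 / v
        weighted += m / v
    post_var = 1.0 / precision
    return weighted * post_var, post_var

# Two equally uncertain observations average out; the posterior variance
# shrinks below either observation's variance.
post_mean, post_var = aggregate_gaussians([1.0, 3.0], [1.0, 1.0])
```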
- and calculating a plurality of latent observations and a plurality of their uncertainties, wherein aggregating the Bayesian a-posteriori distribution includes implementing a plurality of factored Gaussian distributions, wherein each uncertainty is a variance of a corresponding Gaussian distribution, wherein generating the a-posteriori predictive distribution includes the following further steps:
In [4.3.4 , Page 108:15]:
As a simple example, a vanilla linear NN f_w(x) = wx takes a scalar x as input and computes the output based on a scalar parameter w; a corresponding Gaussian NPN would assume w is drawn from a Gaussian distribution N(w_m, w_s) and that x is drawn from N(x_m, x_s) (x_s is set to 0 when the input is deterministic).
In [4.3.4 , Page 108:15]:
With θ = (w_m, w_s) as a learnable parameter pair, NPN will then compute the mean and variance of the output Gaussian distribution, μ_θ(x_m, x_s) and s_θ(x_m, x_s), in closed form (bias terms are ignored for clarity) as
[Equation image: media_image3.png]
Hence, the output of this Gaussian NPN is a tuple (μ_θ(x_m, x_s), s_θ(x_m, x_s)) representing a Gaussian distribution instead of a single value. Input variance x_s to NPN can be set to 0 if not available. Note that since s_θ(x_m, 0) = x_m^2 w_s, w_s can still be learned even if x_s = 0 for all data points.
In [4.2, Page 108:12]:
As mentioned in Section 1, BDL is a principled probabilistic framework with two seamlessly integrated components: a perception component and a task-specific component
In [4.2, Page 108:12]:
Three Variable Sets: There are three sets of variables in a BDL model: perception variables, hinge variables, and task variables. In this article, we use Ω_p to denote the set of perception variables (e.g., X0, X1, and W1 in Figure 5), which are the variables in the perception component. Usually Ω_p would include the weights and neurons in the probabilistic formulation of a deep learning model. Ω_h is used to denote the set of hinge variables (e.g., H in Figure 5). These variables directly interact with the perception component from the task-specific component. The set of task variables (e.g., A, B, and C in Figure 5), i.e., variables in the task-specific component without direct relation to the perception component, is denoted as Ω_t
In [4.2, Page 108:12]:
Flexibility of Variance for Ω_h:
In [4.2, Page 108:12]:
one of BDL’s motivations is to model the uncertainty of exchanging information between the perception component and the task-specific component, which boils down to modeling the uncertainty related to Ω_h.
In [4.2, Page 108:12]:
Hyper-Variance: Hyper-Variance (HV) assumes that uncertainty during the information exchange is defined through hyperparameters. In the example, HV means that
σ_p^2 is a manually tuned hyperparameter
In [4.2, Page 108:12]:
Learnable Variance: Learnable Variance (LV) uses learnable parameters to represent uncertainty during the information exchange. In the example,
σ_p^2 is the learnable parameter.
- generating a second approximate a-posteriori distribution for the plurality of latent variables under a condition that the training data set has set in, the second approximate a-posteriori distribution being further described by a set of parameters, which is parameterized over a parameter common to the training data set iteratively calculating the set of parameters based on the calculated plurality of latent observations and the calculated plurality of their uncertainties;
In [1, Page 108:1]:
consider controlling a complex dynamical system according to the live video stream received from a camera. This problem can be transformed into iteratively performing two tasks, perception from raw images and control based on dynamic models
In [5.2, Page 108:26]:
Learning and Inference: Reference [118] provides an EM-style algorithm for MAP estimation.
In [5.2, Page 108:26]:
For the E step, the challenge lies in the inference of the relational latent matrix S. We first fix all rows of S except the kth one, S_{k*}, and then update S_{k*}. Specifically, we take the gradient of L with respect to S_{k*}, set it to 0, and get the following linear system:
[Equation image: media_image7.png]
In [5.2, Page 108:26]:
A naive approach is to solve the linear system by setting
[Equation image: media_image8.png]
Unfortunately, the complexity is O(J^3) for one single update. Similar to Reference [67], the steepest descent method [101] is used to iteratively update S_{k*}
(BRI: iteratively updating elements of a latent matrix often represents iteratively recalculating parameters for the underlying latent observations.)
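The steepest-descent alternative to the O(J^3) direct solve can be sketched as follows (a generic textbook iteration, assuming a symmetric positive-definite system; the toy 2x2 matrix is an illustrative assumption, not data from the reference):

```python
# Sketch of the steepest-descent update mentioned in [5.2]: instead of
# directly solving the linear system A s = b (cubic cost), iterate
# s <- s + a * r with residual r = b - A s and step a = (r.r) / (r.A r).
def steepest_descent(A, b, steps=100):
    n = len(b)
    s = [0.0] * n
    for _ in range(steps):
        As = [sum(A[i][j] * s[j] for j in range(n)) for i in range(n)]
        r = [b[i] - As[i] for i in range(n)]           # residual
        Ar = [sum(A[i][j] * r[j] for j in range(n)) for i in range(n)]
        rr = sum(x * x for x in r)
        rAr = sum(r[i] * Ar[i] for i in range(n))
        if rAr == 0.0:                                 # converged exactly
            break
        a = rr / rAr                                   # optimal step size
        s = [s[i] + a * r[i] for i in range(n)]
    return s

A = [[4.0, 1.0], [1.0, 3.0]]   # toy symmetric positive-definite matrix
b = [1.0, 2.0]
s = steepest_descent(A, b)     # converges to the exact solution (1/11, 7/11)
```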
In [1, Page 108:3]:
BDL’s major advantages as a principled way of unifying deep learning and PGM: information exchange between the perception task and the inference task, conditional dependencies on high-dimensional data, and effective modeling of uncertainty. In terms of uncertainty, it is worth noting that when BDL is applied to complex tasks, there are three kinds of parameter uncertainty that need to be taken into account: (1) Uncertainty on the neural network parameters. (2) Uncertainty on the task-specific parameters. (3) Uncertainty of exchanging information between the perception component and the task-specific component.
- iteratively calculating the set of parameters based on the calculated plurality of latent observations and the calculated plurality of their uncertainties
In [1, Page 108:1]:
consider controlling a complex dynamical system according to the live video stream received from a camera. This problem can be transformed into iteratively performing two tasks, perception from raw images and control based on dynamic models
In [1, Page 108:3]:
BDL’s major advantages as a principled way of unifying deep learning and PGM: information exchange between the perception task and the inference task, conditional dependencies on high-dimensional data, and effective modeling of uncertainty. In terms of uncertainty, it is worth noting that when BDL is applied to complex tasks, there are three kinds of parameter uncertainty that need to be taken into account: (1) Uncertainty on the neural network parameters. (2) Uncertainty on the task-specific parameters. (3) Uncertainty of exchanging information between the perception component and the task-specific component.
In [5.2, Page 108:26]:
A naive approach is to solve the linear system by setting
PNG
media_image8.png
37
257
media_image8.png
Greyscale
Unfortunately, the complexity is O(J 3) for one single update. Similar to Reference [67], the steepest descent method [101] is used to iteratively update
S
k
*
(BRI: iteratively updating elements of a latent matrix often represents iteratively recalculating parameters for the underlying latent observations.)
- receiving another training data set, which includes a third plurality of data points and a fourth plurality of data points;
In [5.1.1, Page 108:17]:
the recommendation task considered in CDL takes implicit feedback [50] as the training and test data.
In [4.3, Page 108:13]:
More recently, generative adversarial networks (GAN) [30] prevail as a new training scheme for training neural networks and have shown promise in generating photo-realistic images. Later on, Bayesian formulations (as well as related theoretical results) for GAN have also been proposed [30,
In [5.1.1, Page 108:19]:
Seeing from the view of neural networks (NN), when λ_s approaches positive infinity, training of the probabilistic graphical model of CDL in Figure 7 (left) would degenerate to simultaneously training two neural networks overlaid together with a common input layer (the corrupted input) but different output layers, as shown in Figure 8 (left).
In [3.1, Page 108:9]:
Due to its Bayesian nature, PGM such as LDA is easy to extend to incorporate other information or to perform other tasks. For example, following LDA, different variants of topic models have been proposed.
In [5.1.1, Page 108:18]:
[Figure image: media_image9.png]
In [5.1.1, Page 108:19]:
[Figure image: media_image10.png]
(BRI: training two neural networks simultaneously often involves using different datasets or specialized training methods like GANs (Generative Adversarial Networks) where one network (Generator) creates data that the other (Discriminator) evaluates against real data, effectively learning from separate data streams to improve each other, but they don't automatically get a new dataset; it's about strategic design, like in Transfer Learning or Ensemble Methods)
(BRI: different variants of a model, trained on diverse datasets, can provide a plurality of data. This effectively provides multiple distributions of results (see [3.1, Page 108:9]))
- calculating the fourth plurality of data points, using the given subset of functions from the general, given family of functions, the given subset of functions is calculated on the third plurality of data points;
In [4.3.4 , Page 108:15]:
As a simple example, a vanilla linear NN f_w(x) = wx takes a scalar x as input and computes the output based on a scalar parameter w; a corresponding Gaussian NPN would assume w is drawn from a Gaussian distribution N(w_m, w_s) and that x is drawn from N(x_m, x_s) (x_s is set to 0 when the input is deterministic).
(BRI: functions that are "vanilla linear with scalar input" represent a subset of the functions that can be represented by a neural network with non-linear activation functions (which is likely what "NPN" refers to in this context))
In [5.3.1, Page 108:29]:
Stochastic Optimal Control.
we consider the stochastic optimal control of an unknown dynamical system as follows:
[Equation image: media_image4.png]
where t indexes the time steps, z_t ∈ R^{n_z} is the latent state, u_t ∈ R^{n_u} is the applied control at time t, and ξ denotes the system noise. Equivalently, the equation above can be written as
[Equation image: media_image11.png]
Hence, we need a mapping function to map the corresponding raw image x_t (observed input) into the latent space.
(BRI: Mapping an observed input into a latent space over several time steps can indeed provide a set of data points within that dataset)
(BRI: sequence of data points indexed by time steps, where you model changes over time, is the definition of a time series)
- and generating the a-posteriori predictive distribution further includes generating a third distribution, using a third and fourth neural network, wherein the third distribution is a function of the plurality of latent variables, the set of parameters, task-independent variables, and the other training data set,
In [5.3.4, Page 108:30]:
Note that the BDL-based control model discussed above uses a different information exchange mechanism
In [5.3.4, Page 108:30]:
it follows the VAE mechanism and uses neural networks to separately parameterize the mean and covariance of hinge variables (e.g., in the encoding model, the hinge variable
[Equation image: media_image12.png]
where μ_t and σ_t are perception variables parameterized as in Equation (19)), which is more flexible (with more free parameters) than models like CDL and CDR in Section 5.1, where Gaussian distributions with fixed variance are also used. Note that this BDL-based control model is an LV model as shown in Table 1, and since the covariance is assumed to be diagonal, the model still meets the independence requirement in Section 4.
In [4, Page 108:10]:
[Figure image: media_image13.png]
In [3.1, Page 108:9]:
Due to its Bayesian nature, PGM such as LDA is easy to extend to incorporate other information or to perform other tasks. For example, following LDA, different variants of topic models have been proposed.
(BRI: The core nature of Probabilistic Graphical Models (PGMs) like Latent Dirichlet Allocation (LDA) lies in their ability to model data generation as a process involving multiple probability distributions, which can be modified or extended in various ways as a result of using variants of topic models)
- optimizing the likelihood distribution includes maximizing the likelihood distribution with regard to the task-independent variables and the maximizing is based on the second approximate a-posteriori distribution generated and on the third distribution generated;
In [3.2, Page 108:9]:
various learning and inference algorithms are available for each PGM. Among them, the most cost-effective one is probably maximum a posteriori (MAP), which amounts to maximizing the posterior probability of the latent variable.
In [3.1, Page 108:9]:
Due to its Bayesian nature, PGM such as LDA is easy to extend to incorporate other information or to perform other tasks. For example, following LDA, different variants of topic models have been proposed,
In [ 2.2, Page 108:5]:
we introduce a kind of multilayer denoising AE, known as stacked denoising autoencoders (SDAE), both as an example of AE variants and as background for its applications on BDL-based recommender systems,
In [5.1.1, Page 108:18]:
The output of layer l of the SDAE is denoted by X_l, which is a J-by-K_l matrix. Similar to X_c, row j of X_l is denoted by X_{l,j*}. W_l and b_l are the weight matrix and bias vector, respectively, of layer l, W_{l,*n} denotes column n of W_l, and L is the number of layers. For convenience, we use W+ to denote the collection of all layers of weight matrices and biases.
In [5.1.1, Page 108:19]:
maximizing the posterior probability is equivalent to maximizing the joint log-likelihood of
[Equation image: media_image14.png]
(BRI: the second and third distributions arise within the context of different variants of the model. Maximizing the joint log-likelihood is mathematically equivalent to maximizing the joint likelihood because the logarithm is a monotonically increasing function and preserves the order of values, so the parameter setting that yields the highest likelihood also yields the highest log-likelihood)
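The monotonicity point in the BRI note above can be checked numerically (an illustrative i.i.d. unit-variance Gaussian likelihood; the data values and grid are made up for the check):

```python
import math

# Because log is monotonically increasing, the parameter that maximizes the
# likelihood also maximizes the log-likelihood. Check on a toy Gaussian model.
data = [0.9, 1.1, 1.0]

def likelihood(mu):
    # i.i.d. unit-variance Gaussian likelihood (illustrative choice)
    return math.prod(math.exp(-0.5 * (x - mu) ** 2) / math.sqrt(2 * math.pi)
                     for x in data)

def log_likelihood(mu):
    return sum(-0.5 * (x - mu) ** 2 - 0.5 * math.log(2 * math.pi)
               for x in data)

grid = [i / 100 for i in range(50, 151)]     # candidate values of mu
best_lik = max(grid, key=likelihood)
best_log = max(grid, key=log_likelihood)     # same maximizer: the sample mean
```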
- maximizing the likelihood distribution includes calculating an integral over a function of latent variables, which contains respective products of the second approximate a-posteriori distribution and of the third distribution;
In [3.1, Page 108:9]:
Due to its Bayesian nature, PGM such as LDA is easy to extend to incorporate other information or to perform other tasks. For example, following LDA, different variants of topic models have been proposed,
(BRI: the second and third distribution in the context of different variants of the models)
In [5.3.1, Page 108:29]:
Stochastic Optimal Control.
we consider the stochastic optimal control of an unknown dynamical system as follows:
[Equation image: media_image4.png]
where t indexes the time steps, z_t ∈ R^{n_z} is the latent state, u_t ∈ R^{n_u} is the applied control at time t, and ξ denotes the system noise.
(BRI: sequence of data points indexed by time steps, where you model changes over time, is the definition of a time series, characterized by its temporal order and dependency between consecutive points, used for analysis and forecasting trends, seasonality, and future values)
In [5.1.6, Page 108:23]:
Recommender systems are a typical use case for BDL in that they often require both thorough understanding of high-dimensional signals (e.g., text and images) and principled reasoning on the conditional dependencies among users/items/ratings. In this regard, CDL, as an instantiation of BDL, is the first hierarchical Bayesian model to bridge the gap between state-of-the-art deep learning models and recommender systems. By performing deep learning collaboratively, CDL and its variants can simultaneously extract an effective deep feature representation from high-dimensional content and capture the similarity and implicit relationship between items (and users).
In [3.2, Page 108:9]:
MAP, as efficient as it is, gives us only point estimates of latent variables (and parameters). To take the uncertainty into account and harness the full power of Bayesian models, one would have to resort to Bayesian treatments such as variational inference and Markov chain Monte Carlo (MCMC). For example, the original LDA uses variational inference to approximate the true posterior with factorized variational distributions [8]. Learning of the latent variables and parameters then boils down to minimizing the KL-divergence between the variational distributions and the true posterior distributions.
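The KL minimization described in the quoted passage has a convenient closed form in the Gaussian case (a standard identity used here only as illustration; the variable names are mine, not WANG's notation):

```python
import math

# KL divergence between univariate Gaussians q = N(m_q, s_q^2) and
# p = N(m_p, s_p^2). Variational inference minimizes this over (m_q, s_q)
# to drive the variational distribution q toward the target posterior p.
def kl_gauss(m_q, s_q, m_p, s_p):
    return (math.log(s_p / s_q)
            + (s_q ** 2 + (m_q - m_p) ** 2) / (2 * s_p ** 2)
            - 0.5)

# KL is zero exactly when q and p coincide, and grows as they diverge.
```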
In [5.3.2, Page 108:30]:
the posterior distribution P_θ(X | Z) reconstructs the raw images x_t from the latent states z_t
In [5.3.3, Page 108:30]:
Learning Using Stochastic Gradient Variational Bayes. With D = {(X_1, U_1, X_2), ..., (X_{T-1}, U_{T-1}, X_T)} as the training set, the loss function is as follows:
[Equation image: media_image15.png]
where the first term is the variational bound on the marginalized log-likelihood for each data point:
in [4.4.2, Page 108:16]:
Compared to vanilla (shallow) Bayesian networks, deep Bayesian networks such as BIN make it possible to handle deep and nonlinear conditional dependencies effectively and efficiently. Besides, with BNN as building blocks, task-specific components based on deep Bayesian networks can better work with the perception component, which is usually a BNN as well. Figure 6 (right) shows a more complicated case with both observed (shaded nodes) and unobserved (transparent nodes) variables
[Figure image: media_image16.png]
In [5.2, Page 108:24]:
novel probabilistic model that seamlessly integrates a hierarchy of latent factors and the relational information available.
(BRI: the MCMC method (see [3.2, Page 108:9]) provides a means for calculating an integral over a function of latent variables: it generates samples from the posterior distribution to approximate the integral.)
(BRI: the marginalized log-likelihood for each data point involves calculating an integral (or sum for discrete variables) over the latent variables, essentially it averages the likelihood across all possible latent states weighted by their probabilities to get the evidence for that data point, which is crucial for Bayesian inference. This process integrates out the unobserved latent variables for determining the probability of the observed data)
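The marginalization described in the BRI note above can be sketched with a Monte Carlo estimate (the generative model z ~ N(0, 1), x | z ~ N(z, 1) is an assumption chosen so the exact answer is known; function names are illustrative):

```python
import math
import random

# Sketch: the marginal likelihood p(x) = ∫ p(x|z) p(z) dz is approximated
# by averaging p(x|z) over samples z ~ p(z), integrating out the latent z.
def mc_marginal_likelihood(x, n_samples=200_000, seed=0):
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_samples):
        z = rng.gauss(0.0, 1.0)                       # z ~ p(z) = N(0, 1)
        total += math.exp(-0.5 * (x - z) ** 2) / math.sqrt(2 * math.pi)
    return total / n_samples

# Marginalizing z analytically gives x ~ N(0, 2), so the exact density at
# x = 0 is 1 / sqrt(4 * pi); the Monte Carlo average should be close.
approx = mc_marginal_likelihood(0.0)
exact = 1.0 / math.sqrt(4 * math.pi)
```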
- calculating the integral includes approximating the integral with regard to the plurality of latent variables, using a non-stochastic loss function, which is based on the set of parameters of the second approximate a-posteriori distribution.
In [3.1, Page 108:9]:
Due to its Bayesian nature, PGM such as LDA is easy to extend to incorporate other information or to perform other tasks. For example, following LDA, different variants of topic models have been proposed,
(BRI: the second approximate posterior distribution is within the context of different variants of the models that represents plurality of distributions)
In [4.4, Page 108:15]:
we introduce different forms of task-specific components. The purpose of a task-specific component is to incorporate probabilistic prior knowledge into the BDL model. Such knowledge can be naturally represented using PGM.
a task-specific component can take various forms. For example, it can be a typical Bayesian network (directed PGM) such as LDA, a deep Bayesian network [117], or a stochastic process [51, 94], all of which can be represented in the form of PGM
(BRI: Probabilistic Graphical Models (PGMs) use non-stochastic (deterministic) loss functions, such as log-likelihood, cross-entropy, and mean squared error (MSE), for both training and inference)
WANG does not explicitly disclose:
- A computer-implemented system for generating and/or using a computer-implemented machine learning system,
However, Zhang discloses:
- A computer-implemented system for generating and/or using a computer-implemented machine learning system, the machine learning system being configured
In [Col 3, lines 38-39]:
there is provided a computer-implemented method of machine learning.
In [Col 1, lines 29-30]:
FIG. 1(a) gives a simplified representation of an example neural network 108.
In [Col 1, lines 32-36]:
In practice, there may be many nodes in each layer, but for simplicity only a few are illustrated. Each node is configured to generate an output by carrying out a function on the values input to that node.
- the computer-implemented learning system being trained by:
In [Col 1, lines 54-56]:
The network learns by operating on data input at the input layer, and, based on the input data, adjusting the weights applied by some or all of the nodes in the network
In [Col 2, lines 8-13]:
FIG. 1(c) shows a simple arrangement in which a neural network is arranged to predict a classification based on an input feature vector. During a training phase, experience data comprising a large number of input data points is supplied to the neural network, each data point comprising an example set of values for the feature vector
It would have been obvious to one of ordinary skill in the art before the effective filing date of the present application to combine WANG and Zhang.
WANG teaches generating an a-posteriori predictive distribution for predicting the
dynamic response of the device, aggregation, posterior distribution, and maximizing
the likelihood distribution for prediction.
Zhang teaches a machine learning system being configured.
One of ordinary skill would have motivation to combine WANG and Zhang to provide improved robustness against unanticipated manipulations (Zhang [Col 4, lines 42-44])
Claims 13 and 18 are rejected under 35 U.S.C. 103 as being unpatentable over HAO WANG et al. (hereinafter WANG), A Survey on Bayesian Deep Learning, ACM Computing Surveys (CSUR), Volume 53, Issue 5, Published 28 September 2020,
in view of Zhang et al. (hereinafter Zhang), US 11715004 B2,
further in view of Zhou et al. (hereinafter Zhou), US 9540928 B2.
In regard to claim 13: (Original)
WANG discloses:
- generating the computer-implemented machine learning system includes mapping an input vector of a dimension to an output vector of a second dimension, the input vector represents elements of a time series for at least one measured input state variable of the device, and the output vector represents at least one estimated output state variable of the device, which is predicted using the a-posteriori predictive distribution generated.
In [2.1, Page 108:4]:
Essentially, a multilayer perceptron is a sequence of parametric nonlinear transformations. Suppose we want to train a multilayer perceptron to perform a regression task that maps a vector of M dimensions to a vector of D dimensions. We denote the input as a matrix X_0 (0 means it is the 0th layer of the perceptron). The jth row of X_0, denoted as X_{0,j*}, is an M-dimensional vector representing one data point. The target (the output we want to fit) is denoted as Y. Similarly, Y_{j*} denotes a D-dimensional row vector
(BRI: a task that maps a vector of M dimensions to a vector of D dimensions represents mapping an input vector of a first dimension to an output vector of a second dimension)
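The quoted M-to-D mapping can be sketched with a minimal multilayer perceptron (the weights below are fixed toy values, not learned parameters, and the tanh nonlinearity is an illustrative choice):

```python
import math

# Minimal sketch of a multilayer perceptron as a sequence of parametric
# nonlinear transformations: an M-dimensional input vector is mapped through
# a nonlinear hidden layer to a D-dimensional output vector.
def mlp(x, W1, W2):
    h = [math.tanh(sum(w * xi for w, xi in zip(row, x))) for row in W1]
    return [sum(w * hi for w, hi in zip(row, h)) for row in W2]

x = [1.0, 0.5, -0.5]                      # M = 3 input dimensions
W1 = [[0.2, -0.1, 0.4], [0.0, 0.3, 0.1]]  # hidden layer of size 2
W2 = [[1.0, -1.0], [0.5, 0.5]]            # D = 2 output dimensions
y = mlp(x, W1, W2)                        # a D-dimensional output vector
```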
In [5.3.1, Page 108:29]:
Stochastic Optimal Control.
we consider the stochastic optimal control of an unknown dynamical system as follows:
[Equation image: media_image4.png]
where t indexes the time steps, z_t ∈ R^{n_z} is the latent state, u_t ∈ R^{n_u} is the applied control at time t, and ξ denotes the system noise.
SGNHT, SGLD, and SGHMC all belong to a larger class of sampling algorithms called hybrid Monte Carlo (HMC) [5]. The idea is to leverage an analogy with physical systems to guide transitions of system states. Compared to the Metropolis algorithm, HMC can make much larger changes to system states while keeping a small rejection probability.
(BRI: physical system states inherently represent a device's control state)
(BRI: changes to system states while keeping a small rejection where proposed changes to the system state are evaluated and accepted or rejected with a high likelihood of acceptance (small rejection probability). Sequence of data points indexed by time steps, where you model changes over time, is the definition of a time series, characterized by its temporal order and dependency between consecutive points, used for analysis and forecasting trends, seasonality, and future values)
In [4.3.2, Page 108:14]:
Maximization of the posterior probability is equivalent to minimization of the reconstruction error with weight decay taken into consideration
(BRI: maximizing the posterior probability is a key method for making predictions, known as Maximum A Posteriori (MAP) estimation)
In [4.4.3, Page 108:16]:
Stochastic Processes. Besides vanilla Bayesian networks and deep Bayesian networks, a task-specific component can also take the form of a stochastic process [94]. For example, a Wiener process can naturally describe a continuous-time Brownian motion model
[Equation image: media_image17.png]
where x_{t+u} and x_t are the states at time t + u and time t, respectively.
In [5.3, Page 108:28]:
To enable an effective iterative process between the perception task and the control task, we need two-way information exchange between them. The perception component would be the basis on which the control component estimates its states and, however, the control component with a dynamic model built in would be able to predict the future trajectory (images) by reversing the perception process
(BRI: a control system that uses an a-posteriori predictive distribution to estimate future states of a device is inherently using this distribution to predict at least one estimated output state variable. A physical system's trajectory is fundamentally a representation of its state evolving over time)
In regard to claim 18: (Original)
WANG and Zhang do not explicitly disclose:
- wherein the training data sets includes input variables measured on the device and/or calculated for the device, the at least one input variable of the device includes at least one of a rotational speed, or a temperature, or a mass flow rate, and the at least one estimated output state variable of the device includes at least one of a torque, or an efficiency, or a compression ratio.
However, Zhou discloses:
- wherein the training data sets includes input variables measured on the device and/or calculated for the device, the at least one input variable of the device includes at least one of a rotational speed, or a temperature, or a mass flow rate, and the at least one estimated output state variable of the device includes at least one of a torque, or an efficiency, or a compression ratio.
in [Col 8, lines 52-57]:
1) From the machine learning point of view, the approach described herein successfully combines supervised learning (GP regression) with unsupervised learning (unsupervised clustering) to learn a complex model on rock types from the MWD data, where there is not a direct connection between the input (MWD data) and the output (rock types),
in at least [Col 8, lines 61-65] 2) Looking from the application side, the methods described herein propose a continuous 3D rock type distribution model across the bench by applying Gaussian Process regression on the corrected drilling measurements (“Adjusted Penetration Rate”) from multiple holes,
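As an illustrative aside (not part of the cited record), the Gaussian Process regression that Zhou applies to the corrected drilling measurements can be sketched as plain GP regression with an RBF kernel. All function and parameter names below are hypothetical; this is a generic sketch of the technique, not Zhou's implementation.

```python
import numpy as np

def gp_predict(X_train, y_train, X_test, length=5.0, sf=1.0, noise=0.1):
    """Generic 1-D Gaussian Process regression with an RBF kernel:
    returns the posterior predictive mean and variance at X_test."""
    def k(A, B):
        d2 = (A[:, None] - B[None, :]) ** 2
        return sf**2 * np.exp(-0.5 * d2 / length**2)

    # Gram matrix with observation noise on the diagonal.
    K = k(X_train, X_train) + noise**2 * np.eye(len(X_train))
    Ks = k(X_test, X_train)
    # Posterior mean: Ks K^{-1} y ; posterior variance: sf^2 - diag(Ks K^{-1} Ks^T)
    mean = Ks @ np.linalg.solve(K, y_train)
    var = sf**2 - np.einsum('ij,ji->i', Ks, np.linalg.solve(K, Ks.T))
    return mean, var
```

In Zhou's setting, X_train would correspond to hole positions on the bench and y_train to the "Adjusted Penetration Rate," yielding a continuous interpolated model between the discretely distributed holes.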
in at least [Col 8, line 67] , [Col 9, lines 1-2]:
2) rock types are identified solely on discretely distributed individual holes from the corresponding drill performance data,
In [Col 6, lines 53-54]:
The computing system 101 can be in the form of an onboard computing system located in the drilling vehicle 10
in [Col 6, lines 62-64]:
downloading sensor data from the drill rig vehicle 10 and inputting it to the computing system 101.
[BRI: the sensor data represents an “input state”]
in [Col 8, lines 9-11]:
Rock recognition relates the MWD data, which is a reflection of the drill performance, in a meaningful way to the physical properties of the rocks being drilled.
in [Col 8, lines 9-11]:
The type of MWD measurements used for rock recognition in this work include: 1. Rotation Speed (RS) 2. Penetration Rate (PR) 3. Rotation Pressure (RP) 4. Pull-down Pressure (PP) 5. Bit Air Pressure (BAP)
in [Col 10, lines 16-20]:
Penetration rate can be used as a key measurement on rock hardness, and that pull down pressure as well as rotation pressure are the major applied forces on changing the penetration rate. Therefore, it is assumed that under the same pull down pressure and rotation pressure, the penetration rate reflects the rock hardness.
[BRI: It is known in the art that "rotational force" refers to the force that causes an object to rotate around an axis, also known as "torque," while "rotational pressure" is not a standard physics term but could be interpreted as the pressure exerted by a rotating fluid or gas, which is related to the forces acting on it due to its rotational motion]
It would have been obvious to one of ordinary skill in the art before the effective filing date of the present application to combine WANG, Zhang, and Zhou.
WANG teaches generating an a-posteriori predictive distribution for predicting the dynamic response of the device, aggregation, the posterior distribution, and maximizing the likelihood distribution for prediction.
Zhang teaches the machine learning system being configured as claimed.
Zhou teaches the input variables for the device.
One of ordinary skill would have had motivation to combine WANG, Zhang, and Zhou to provide an improved distinction between different rock types (Zhou [Col 8, lines 37-39]).
Conclusion
Any inquiry concerning this communication or earlier communications from the
examiner should be directed to TIRUMALE KRISHNASWAMY RAMESH whose telephone number is (571)272-4605. The examiner can normally be reached by phone.
Examiner interviews are available via telephone, in-person, and video conferencing
using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at
http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Li B Zhen can be reached on phone (571-272-3768). The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be
obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit:
https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for
information about filing in DOCX format. For additional questions, contact the Electronic
Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO
Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.
/TIRUMALE K RAMESH/Examiner, Art Unit 2121
/Li B. Zhen/Supervisory Patent Examiner, Art Unit 2121