DETAILED ACTION
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA. This action is in response to the application filed on March 30, 2023. Claims 1-20 are pending in the current application.
Claim Rejections - 35 USC § 103
In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis (i.e., changing from AIA to pre-AIA) for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
The following is a quotation of 35 U.S.C. 103, which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.
Claim(s) 1-6, 9-16, 19, and 20 are rejected under 35 U.S.C. 103 as being unpatentable over Dimakopoulou et al., “Balanced Linear Contextual Bandits” (cited in the IDS) (hereinafter “Dimakopoulou”), in view of Dirac et al., U.S. Patent Application Publication No. US 2015/0379424 A1 (hereinafter “Dirac”), in further view of Rashidinejad et al., “Bridging Offline Reinforcement Learning and Imitation Learning: A Tale of Pessimism” (hereinafter “Rashidinejad”), and in further view of Liu et al., “Robust Linear Regression Against Training Data Poisoning” (cited in the IDS) (hereinafter “Liu”).
Regarding claim 1, Dimakopoulou teaches executing a first initialization of machine learning training logic based on a determination of propensity scores for each output, of a plurality of predetermined outputs, of a machine learning computer model, wherein the propensity scores are determined from historical data. (“We focus on the method of inverse propensity weighting (Imbens and Rubin 2015). The idea is that at every time t, the linear contextual bandit weighs each observation (xτ , aτ , rτ (aτ )), τ = 1, . . . , t in the history up to time t by the inverse probability of context xτ being assigned to arm aτ . This probability is called propensity score and is denoted as paτ (xτ ).”, pg. 3, right column, under “Linear Contextual Bandits with Balancing”) (The history of weights for the linear contextual bandit at each observation corresponds to historical data.)
Dimakopoulou also teaches executing the initialized machine learning training logic on the machine learning computer model to train the machine learning computer model to generate a trained machine learning computer model. (“…training the reward model for the first time, the estimated reward of arm a = 2 (blue) is the highest, the one of arm a = 1 (yellow) is the second highest and the one of arm a = 0 (red) is the lowest across the context space.”, pg. 6, left column, first paragraph)
However, Dimakopoulou does not explicitly teach that the first initialization is done in an offline learning phase of operation; nor executing a second initialization, during the offline learning phase of operation, of the machine learning training logic by performing a trimmed optimization of the machine learning training logic, based on the historical data, to estimate initial parameters of the machine learning computer model; nor deploying the trained machine learning computer model to a hosting computing system for online phase operation.
Dirac teaches deploying the trained machine learning computer model to a hosting computing system for online phase operation. (“Networks set up by an entity such as a company or a public sector organization to provide one or more services (such as various types of multi-tenant and/or single-tenant cloud-based computing or storage services) accessible via the Internet and/or other networks to a distributed set of clients may be termed provider networks herein.”, Paragraph 37)
Therefore, it would have been considered obvious to one of ordinary skill in the art, prior to the current application’s filing date, to combine the contextual bandit of Dimakopoulou with the distribution system of Dirac. One would be motivated to combine the teachings, prior to the filing date of the current application, as Dirac’s teachings allow for model distribution for the benefit of larger audiences, as disclosed in Dirac. (“Such clean separation of roles and capabilities with respect to model development and use may allow larger audiences within a business organization to benefit from machine learning models than simply those skilled enough to develop the models.”, Paragraph 30)
Rashidinejad teaches initialization done during the offline learning phase of operation. (“It turns out that such a simple algorithm—fully agnostic to the data composition—achieves almost optimal performance in multi-armed bandits and MDPs, and optimally solves the offline learning problem in contextual bandits.”, pg. 3, under “Question 2”) (The method of offline reinforcement learning as disclosed by Rashidinejad is data agnostic, and so in combination with the contextual bandit of Dimakopoulou, teaches initializing steps done during the offline learning phase of operation.)
Therefore, it would have been considered obvious to one of ordinary skill in the art, prior to the current application’s filing date, to combine the contextual bandit of Dimakopoulou with the offline learning of Rashidinejad. One would be motivated to combine the two teachings, prior to the filing date of the current application, as offline reinforcement learning allows for an optimal policy from a fixed dataset without active data collection, as disclosed in Rashidinejad. (“Offline (or batch) reinforcement learning (RL) algorithms seek to learn an optimal policy from a fixed dataset without active data collection”, pg. 1, Abstract)
However, the combination does not teach executing a second initialization of the machine learning training logic by performing a trimmed optimization of the machine learning training logic, based on the historical data, to estimate initial parameters of the machine learning computer model, wherein a result of the combination of the first initialization and second initialization is initialized machine learning training logic.
Liu teaches executing a second initialization of the machine learning training logic by performing a trimmed optimization of the machine learning training logic, based on the historical data, to estimate initial parameters of the machine learning computer model, wherein a result of the combination of the first initialization and second initialization is initialized machine learning training logic. (“first, we develop a novel robust matrix factorization algorithm which correctly recovers the subspace whenever this is possible, and second, a trimmed principle component regression, which uses the recovered basis and trimmed optimization to estimate linear model parameters… Our theoretical results demonstrate that the combined approach is an (f ,l, δ)-tolerant learning algorithm”, pg. 3, left column, first paragraph; pg. 4, left column, bottom paragraph; see also Algorithms 3 and 4 on pgs. 5 and 6.)
Therefore, it would have been considered obvious to one of ordinary skill in the art, prior to the current application’s filing date, to combine the bandit of Dimakopoulou, modify it with the teaching of Rashidinejad, and further modify it with the “TrimmedOptimization” algorithm of Liu. One would be motivated to combine the teachings of Dimakopoulou and Rashidinejad, as combined above, with the teachings of Liu, prior to the filing date of the current application, as Liu’s algorithms “significantly outperform state-of-the-art robust regression both in running time and prediction error”, and “Trimmed Optimization” yields near-optimal solutions, as disclosed in Liu. (“…our methods significantly outperform state-of-the-art robust regression both in running time and prediction error… While this algorithm is not guaranteed to converge to a global optimal, in our evaluation, we observe that a random start of τ typically yields near-optimal solutions.”, pg. 1, Abstract; pg. 6, right column, fourth paragraph)
Regarding claim 11, Dimakopoulou teaches to execute a first initialization of machine learning training logic based on a determination of propensity scores for each output, of a plurality of predetermined outputs, of a machine learning computer model, wherein the propensity scores are determined from historical data. (“We focus on the method of inverse propensity weighting (Imbens and Rubin 2015). The idea is that at every time t, the linear contextual bandit weighs each observation (xτ , aτ , rτ (aτ )), τ = 1, . . . , t in the history up to time t by the inverse probability of context xτ being assigned to arm aτ . This probability is called propensity score and is denoted as paτ (xτ ).”, pg. 3, right column, under “Linear Contextual Bandits with Balancing”) (The history of weights for the linear contextual bandit at each observation corresponds to historical data.)
Dimakopoulou also teaches to execute the initialized machine learning training logic on the machine learning computer model to train the machine learning computer model to generate a trained machine learning computer model. (“…training the reward model for the first time, the estimated reward of arm a = 2 (blue) is the highest, the one of arm a = 1 (yellow) is the second highest and the one of arm a = 0 (red) is the lowest across the context space.”, pg. 6, left column, first paragraph)
However, Dimakopoulou does not explicitly teach a computer program product comprising a computer readable storage medium having a computer readable program stored therein; nor to execute a second initialization, during the offline learning phase of operation, of the machine learning training logic by performing a trimmed optimization of the machine learning training logic, based on the historical data, to estimate initial parameters of the machine learning computer model, wherein a result of the combination of the first initialization and second initialization is initialized machine learning training logic; nor to deploy the trained machine learning computer model to a hosting computing system for online phase operation; nor that the first initialization is done in an offline learning phase of operation.
Dirac teaches a computer program product comprising a computer readable storage medium having a computer readable program stored therein, (“In some embodiments, system memory 9020 may be one embodiment of a computer-accessible medium configured to store program instructions and data”, Paragraph 108) and deploy the trained machine learning computer model to a hosting computing system for online phase operation. (“Networks set up by an entity such as a company or a public sector organization to provide one or more services (such as various types of multi-tenant and/or single-tenant cloud-based computing or storage services) accessible via the Internet and/or other networks to a distributed set of clients may be termed provider networks herein.”, Paragraph 37)
Therefore, it would have been considered obvious to one of ordinary skill in the art, prior to the current application’s filing date, to combine the contextual bandit of Dimakopoulou with the distribution system of Dirac. One would be motivated to combine the teachings, prior to the filing date of the current application, as Dirac’s teachings allow for model distribution for the benefit of larger audiences, as disclosed in Dirac. (“Such clean separation of roles and capabilities with respect to model development and use may allow larger audiences within a business organization to benefit from machine learning models than simply those skilled enough to develop the models.”, Paragraph 30)
However, Dimakopoulou, as modified by Dirac, does not explicitly teach to execute a second initialization, during the offline learning phase of operation, of the machine learning training logic by performing a trimmed optimization of the machine learning training logic, based on the historical data, to estimate initial parameters of the machine learning computer model, wherein a result of the combination of the first initialization and second initialization is initialized machine learning training logic; nor that the first initialization is done in an offline learning phase of operation.
Rashidinejad teaches initialization done during the offline learning phase of operation. (“It turns out that such a simple algorithm—fully agnostic to the data composition—achieves almost optimal performance in multi-armed bandits and MDPs, and optimally solves the offline learning problem in contextual bandits.”, pg. 3, under “Question 2”) (The method of offline reinforcement learning as disclosed by Rashidinejad is data agnostic, and so in combination with the contextual bandit of Dimakopoulou, teaches initializing steps done during the offline learning phase of operation.)
Therefore, it would have been considered obvious to one of ordinary skill in the art, prior to the current application’s filing date, to combine the contextual bandit of Dimakopoulou, as modified by Dirac, with the offline learning of Rashidinejad. One would be motivated to combine the two teachings, prior to the filing date of the current application, as offline reinforcement learning allows for an optimal policy from a fixed dataset without active data collection as disclosed in Rashidinejad. (“Offline (or batch) reinforcement learning (RL) algorithms seek to learn an optimal policy from a fixed dataset without active data collection”, pg. 1, Abstract)
However, the combination does not explicitly teach to execute a second initialization, during the offline learning phase of operation, of the machine learning training logic by performing a trimmed optimization of the machine learning training logic, based on the historical data, to estimate initial parameters of the machine learning computer model, wherein a result of the combination of the first initialization and second initialization is initialized machine learning training logic.
Liu teaches to execute a second initialization, during the offline learning phase of operation, of the machine learning training logic by performing a trimmed optimization of the machine learning training logic, based on the historical data, to estimate initial parameters of the machine learning computer model, wherein a result of the combination of the first initialization and second initialization is initialized machine learning training logic. (“first, we develop a novel robust matrix factorization algorithm which correctly recovers the subspace whenever this is possible, and second, a trimmed principle component regression, which uses the recovered basis and trimmed optimization to estimate linear model parameters… Our theoretical results demonstrate that the combined approach is an (f ,l, δ)-tolerant learning algorithm”, pg. 3, left column, first paragraph; pg. 4, left column, bottom paragraph; see also Algorithms 3 and 4 on pgs. 5 and 6.)
Therefore, it would have been considered obvious to one of ordinary skill in the art, prior to the current application’s filing date, to combine the bandit of Dimakopoulou, modify it with the teachings of Dirac and Rashidinejad, and further modify it with the “TrimmedOptimization” algorithm of Liu. One would be motivated to combine the teachings of Dimakopoulou, Dirac, and Rashidinejad, as combined above, with the teachings of Liu, prior to the filing date of the current application, as Liu’s algorithms “significantly outperform state-of-the-art robust regression both in running time and prediction error”, and “Trimmed Optimization” yields near-optimal solutions, as disclosed in Liu. (“…our methods significantly outperform state-of-the-art robust regression both in running time and prediction error… While this algorithm is not guaranteed to converge to a global optimal, in our evaluation, we observe that a random start of τ typically yields near-optimal solutions.”, pg. 1, Abstract; pg. 6, right column, fourth paragraph)
Regarding claim 20, Dimakopoulou teaches an apparatus comprising: at least one processor; and at least one memory coupled to the at least one processor, (While these components are never explicitly taught, one would implicitly need them to run the algorithms of Dimakopoulou.) wherein the at least one memory comprises instructions which, when executed by the at least one processor, cause the at least one processor to: execute a first initialization of machine learning training logic based on a determination of propensity scores for each output, of a plurality of predetermined outputs, of a machine learning computer model, wherein the propensity scores are determined from historical data. (“We focus on the method of inverse propensity weighting (Imbens and Rubin 2015). The idea is that at every time t, the linear contextual bandit weighs each observation (xτ , aτ , rτ (aτ )), τ = 1, . . . , t in the history up to time t by the inverse probability of context xτ being assigned to arm aτ . This probability is called propensity score and is denoted as paτ (xτ ).”, pg. 3, right column, under “Linear Contextual Bandits with Balancing”) (The history of weights for the linear contextual bandit at each observation corresponds to historical data.)
Dimakopoulou also teaches to execute the initialized machine learning training logic on the machine learning computer model to train the machine learning computer model to generate a trained machine learning computer model. (“…training the reward model for the first time, the estimated reward of arm a = 2 (blue) is the highest, the one of arm a = 1 (yellow) is the second highest and the one of arm a = 0 (red) is the lowest across the context space.”, pg. 6, left column, first paragraph)
However, Dimakopoulou does not explicitly teach to execute a second initialization, during the offline learning phase of operation, of the machine learning training logic by performing a trimmed optimization of the machine learning training logic, based on the historical data, to estimate initial parameters of the machine learning computer model, wherein a result of the combination of the first initialization and second initialization is initialized machine learning training logic; nor to deploy the trained machine learning computer model to a hosting computing system for online phase operation nor that the first initialization is done in an offline learning phase of operation.
Dirac teaches to deploy the trained machine learning computer model to a hosting computing system for online phase operation. (“Networks set up by an entity such as a company or a public sector organization to provide one or more services (such as various types of multi-tenant and/or single-tenant cloud-based computing or storage services) accessible via the Internet and/or other networks to a distributed set of clients may be termed provider networks herein.”, Paragraph 37)
Therefore, it would have been considered obvious to one of ordinary skill in the art, prior to the current application’s filing date, to combine the contextual bandit of Dimakopoulou, with the distribution system of Dirac. One would be motivated to combine the teachings, prior to the filing date of the current application, as Dirac’s teaching allow for model distribution for the benefit of larger audiences, as disclosed in Dirac. (“Such clean separation of roles and capabilities with respect to model development and use may allow larger audiences within a business organization to benefit from machine learning models than simply those skilled enough to develop the models.” Paragraph 30)
Rashidinejad teaches initialization done during the offline learning phase of operation. (“It turns out that such a simple algorithm—fully agnostic to the data composition—achieves almost optimal performance in multi-armed bandits and MDPs, and optimally solves the offline learning problem in contextual bandits.”, pg. 3, under “Question 2”) (The method of offline reinforcement learning as disclosed by Rashidinejad is data agnostic, and so in combination with the contextual bandit of Dimakopoulou, teaches initializing steps done during the offline learning phase of operation.)
Therefore, it would have been considered obvious to one of ordinary skill in the art, prior to the current application’s filing date, to combine the contextual bandit of Dimakopoulou, as modified by Dirac, with the offline learning of Rashidinejad. One would be motivated to combine the two teachings, prior to the filing date of the current application, as offline reinforcement learning allows for an optimal policy from a fixed dataset without active data collection, as disclosed in Rashidinejad. (“Offline (or batch) reinforcement learning (RL) algorithms seek to learn an optimal policy from a fixed dataset without active data collection”, pg. 1, Abstract)
However, the combination does not explicitly teach to execute a second initialization, during the offline learning phase of operation, of the machine learning training logic by performing a trimmed optimization of the machine learning training logic, based on the historical data, to estimate initial parameters of the machine learning computer model, wherein a result of the combination of the first initialization and second initialization is initialized machine learning training logic.
Liu teaches to execute a second initialization, during the offline learning phase of operation, of the machine learning training logic by performing a trimmed optimization of the machine learning training logic, based on the historical data, to estimate initial parameters of the machine learning computer model, wherein a result of the combination of the first initialization and second initialization is initialized machine learning training logic. (“first, we develop a novel robust matrix factorization algorithm which correctly recovers the subspace whenever this is possible, and second, a trimmed principle component regression, which uses the recovered basis and trimmed optimization to estimate linear model parameters… Our theoretical results demonstrate that the combined approach is an (f ,l, δ)-tolerant learning algorithm”, pg. 3, left column, first paragraph; pg. 4, left column, bottom paragraph; see also Algorithms 3 and 4 on pgs. 5 and 6.)
Therefore, it would have been considered obvious to one of ordinary skill in the art, prior to the current application’s filing date, to combine the bandit of Dimakopoulou, modify it with the teachings of Dirac and Rashidinejad, and further modify it with the “TrimmedOptimization” algorithm of Liu. One would be motivated to combine the teachings of Dimakopoulou, Dirac, and Rashidinejad, as combined above, with the teachings of Liu, prior to the filing date of the current application, as Liu’s algorithms “significantly outperform state-of-the-art robust regression both in running time and prediction error”, and “Trimmed Optimization” yields near-optimal solutions, as disclosed in Liu. (“…our methods significantly outperform state-of-the-art robust regression both in running time and prediction error… While this algorithm is not guaranteed to converge to a global optimal, in our evaluation, we observe that a random start of τ typically yields near-optimal solutions.”, pg. 1, Abstract; pg. 6, right column, fourth paragraph)
Regarding claims 2 and 12, Dimakopoulou, as modified by Dirac, Rashidinejad and Liu, teaches the method and product of claims 1 and 11 respectively, wherein the first initialization of the machine learning training logic is performed via a weighted Ridge regression, where the weights are inversely proportional to the propensity scores. (“At time t, LinTS and LinUCB apply ridge regression with regularization parameter λ to the history of observations (Xa, ra) for each arm a ∈ A, in order to obtain an estimate ˆθa and its variance Va( ˆθa)… We focus on the method of inverse propensity weighting (Imbens and Rubin 2015). The idea is that at every time t, the linear contextual bandit weighs each observation (xτ , aτ , rτ (aτ )), τ = 1, . . . , t in the history up to time t by the inverse probability of context xτ being assigned to arm aτ . This probability is called propensity score and is denoted as paτ (xτ ).”, pg. 3, left column, bottom paragraph; pg. 3, right column, bottom paragraph (Dimakopoulou))
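By way of illustration only, the inverse-propensity-weighted ridge regression described in the passage cited above can be sketched as follows; the function name, variable names, and regularization parameter are the editor's illustrative assumptions and are not drawn from Dimakopoulou:

```python
import numpy as np

def ipw_ridge(X, r, propensities, lam=1.0):
    """Ridge regression with weights inversely proportional to propensity scores.

    Each observation (x_t, r_t) is weighted by 1 / p_t, where p_t is the
    propensity score of the context being assigned to the chosen arm.
    Solves (X^T W X + lam * I) theta = X^T W r for theta.
    """
    W = np.diag(1.0 / np.asarray(propensities, dtype=float))
    A = X.T @ W @ X + lam * np.eye(X.shape[1])
    b = X.T @ W @ r
    return np.linalg.solve(A, b)
```

Observations with low propensity (contexts rarely assigned to the arm) receive large weights, which is the balancing effect described in the cited passage.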
Regarding claims 3 and 13, Dimakopoulou, as modified by Dirac, Rashidinejad and Liu, teaches the method and product of claims 1 and 11 respectively, wherein the machine learning training logic implements a multi-arm bandit training algorithm in which the machine learning computer model estimates a reward for each of a plurality of possible arms and selects an arm from the plurality of possible arms based on the estimated rewards, wherein each arm is a possible classification or prediction of the machine learning computer model. (“Therefore, when training the reward model for the first time, the estimated reward of arm a = 2 (blue) is the highest, the one of arm a = 1 (yellow) is the second highest and the one of arm a = 0 (red) is the lowest across the context space… The goal is to find a classifier π : X → {1, 2, . . . , K} that minimizes the classification error E(x,c)∼D 1{π(x) ≠ c}. The classifier can be seen as an arm-selection policy and the classification error is the policy’s expected regret. Further, if only the loss associated with the policy’s chosen arm is revealed, this becomes a contextual bandit setting. So, at time t, context xt is sampled from the dataset, the contextual bandit selects arm at ∈ {1, 2, . . . , K} and observes reward rt(at) = 1{at = ct}, where ct is the unknown, true class of xt.”, pg. 6, left column, first paragraph; pg. 7, right column, first paragraph (Dimakopoulou))
Regarding claims 4 and 14, Dimakopoulou, as modified by Dirac, Rashidinejad and Liu, teaches the method and product of claims 3 and 13 respectively, wherein the multi-arm bandit training algorithm is a contextual multi-arm bandit training algorithm with linear payoff and frequentist uncertainty. (“We develop algorithms for contextual bandits with linear payoffs… We use the standard (frequentist) regret criterion and standard assumptions on the regularity of the distributions.”, pg. 1, Abstract; pg. 4, left column, bottom two lines (Dimakopoulou)) (The frequentist regret criterion corresponds to frequentist uncertainty.)
Regarding claims 5 and 15, Dimakopoulou, as modified by Dirac, Rashidinejad and Liu, teaches the method and product of claims 3 and 13 respectively, wherein the machine learning computer model receives, at each iteration of machine learning training by the machine learning training logic, additional contextual information and estimates a reward for each possible arm as a linear function of the context and an unknown parameter vector specific to each arm. (“In the stochastic contextual bandit setting, there is a finite set of arms, a ∈ A, with cardinality K. At every time t, the environment produces (xt, rt) ∼ D, where xt is a d-dimensional context vector xt and rt = (rt(1), . . . , rt(K)) is the reward associated with each arm in A.”, pg. 2, right column, under “Contextual Bandit Setting”; See also Algorithm 2 for additional parameters (Dimakopoulou))
Regarding claims 6 and 16, Dimakopoulou, as modified by Dirac, Rashidinejad and Liu, teaches the method and product of claims 1 and 11 respectively, wherein the first initialization is performed by an offline balancing (OB) Historical Linear Upper Confidence Bound (HLinUCB) engine of a machine learning computer model service executing on a remote computing device from the hosting computing system. (Dimakopoulou teaches a Balanced Linear Upper Confidence Bound Algorithm which uses historical data, as shown in Algorithm 2 on pg. 3. The “offline” component comes from the combination of Dimakopoulou, which teaches the HLinUCB algorithm, and Rashidinejad, which teaches offline reinforcement learning. To run these algorithms and perform offline learning, one would implicitly need an engine to run them. Any generic computer able to run the algorithms, use the engine, and receive data from a hosting computing system could correspond to a remote computing device.)
Regarding claims 9 and 19, Dimakopoulou, as modified by Dirac, Rashidinejad and Liu, teaches the method and product of claims 8 and 18 respectively, wherein the output selected by the trained machine learning computer model is an output that maximizes an upper confidence bound. (“LinUCB uses the estimate ˆθa and its variance to compute upper confidence bounds for the expected reward µa(xt) of context xt associated with each arm a ∈ A and assigns the context to the arm with the highest upper confidence bound”, pg. 3, left column, bottom paragraph (Dimakopoulou))
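By way of illustration only, the upper-confidence-bound arm selection quoted above can be sketched as follows; the function name, variable names, and exploration parameter alpha are the editor's illustrative assumptions, not the notation of the cited reference:

```python
import numpy as np

def select_arm_ucb(x, thetas, Vs, alpha=1.0):
    """Select the arm that maximizes the upper confidence bound.

    For each arm a, the index is the estimated reward x^T theta_a plus an
    exploration bonus alpha * sqrt(x^T V_a^{-1} x) derived from the variance
    of the estimate; the context is assigned to the arm with the highest index.
    """
    ucbs = [x @ theta + alpha * np.sqrt(x @ np.linalg.solve(V, x))
            for theta, V in zip(thetas, Vs)]
    return int(np.argmax(ucbs))
```

Arms whose parameter estimates are more uncertain (larger variance term) receive a larger bonus, so the selection balances exploitation of high estimated reward against exploration of poorly estimated arms.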
Claims 7, 8, 10, 17, and 18 are rejected under 35 U.S.C. 103 as being unpatentable over Dimakopoulou in view of Dirac, in further view of Rashidinejad, in further view of Liu, and in further view of Dimakopoulou et al., “Online Multi-Armed Bandits with Adaptive Inference” (hereinafter “Dimakopoulou 2”).
Regarding claims 7 and 17, Dimakopoulou, as modified by Dirac, Rashidinejad and Liu, teaches the method and product of claims 1 and 11 respectively, but does not explicitly teach, during execution of the trained machine learning computer model in the online phase operation, collecting online data representing an operation of the trained machine learning computer model for a new context; and updating, during an online learning phase of operation, the training of the trained machine learning computer model based on the collected online data.
Dimakopoulou 2 teaches collecting online data representing an operation of the trained machine learning computer model for a new context during execution of the trained machine learning computer model in the online phase operation. (“At each time step t ∈ [T], where T is the learning horizon, DATS forms a reward sampling distribution for each arm a ∈ A based on the history of observations Ht−1 collected so far.”, pg. 5, under “3.2 Algorithm”) (Data is collected online at time t.) Dimakopoulou 2 also teaches updating, during an online learning phase of operation, the training of the trained machine learning computer model based on the collected online data. (See Algorithm 1 on pg. 6.) (Sample counts, averages, and sampling distributions are updated, which updates the training of the model.)
Therefore, it would have been considered obvious to one of ordinary skill in the art, prior to the current application’s filing date, to combine the contextual bandit of Dimakopoulou, as modified by Dirac, Rashidinejad and Liu, with the adaptive learning of Dimakopoulou 2. One would be motivated to combine the teachings, prior to the filing date of the current application, as online learning using adaptive inference leads to better performance, as disclosed by Dimakopoulou 2. (“Our goal in this paper is to leverage recent advances in the causal inference literature that tackle the problem of unbiased and asymptotically normal inference from previously collected, offline bandit data and obtain practically better online performance by harnessing the strengths of these adaptive inference estimators (originally designed for offline, post-experiment analyses of adaptive experiments) and the effective exploration-exploitation balance provided by TS, thereby designing a more effective online sequential decision making algorithm.”, pg. 2, third paragraph)
Regarding claims 8 and 18, Dimakopoulou, as modified by Dirac, Rashidinejad, Liu, and Dimakopoulou 2, teaches the method and product of claims 7 and 17 respectively, wherein the online data comprises a selection, by the trained machine learning computer model, of an output from the plurality of predetermined outputs given the new context, and a reward received from an environment (“UCB algorithms compute confidence bounds of the estimated mean, construct the index for each arm by adding the confidence bound to the mean (as the best statistically plausible mean reward) and select the arm with the highest index… the data is adaptively collected and consequently… In stochastic MABs, there is a finite set A of arms a ∈ A with |A| = K. At every time t, the environment generates–in an iid manner–a reward rt(a) for every arm.”, pgs. 1-2; pg. 3, under “2 Problem Formulation” (Dimakopoulou 2)) (The new context corresponds to the adaptively collected data, with the reward coming from the environment at each time step.) and wherein updating the training of the machine learning computer model during the online learning phase of operation comprises modifying at least one parameter of the trained machine learning computer model based on the selection and the reward (See Algorithm 1 on pg. 5 of Dimakopoulou 2) (The “updating” steps of the algorithm show the modification of parameters based on the selected arm and observed reward, teaching the limitation.)
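For clarity of record, the UCB index-selection and reward-driven parameter modification described in the quoted passage may be sketched as follows. This is an illustrative sketch of the standard UCB1 index, not the specific estimator of Dimakopoulou 2; the class and method names are the examiner's own illustration.

```python
import math

class UCB1:
    """Illustrative sketch: construct an index for each arm by adding a
    confidence bound to the estimated mean, select the arm with the
    highest index, then modify that arm's parameters (count and mean)
    from the (selection, reward) pair."""

    def __init__(self, n_arms):
        self.counts = [0] * n_arms
        self.means = [0.0] * n_arms
        self.t = 0

    def select_arm(self):
        self.t += 1
        # Play each arm once before the index is well defined.
        for a, c in enumerate(self.counts):
            if c == 0:
                return a
        # Index = estimated mean + confidence bound.
        ucb = [m + math.sqrt(2 * math.log(self.t) / c)
               for m, c in zip(self.means, self.counts)]
        return max(range(len(ucb)), key=ucb.__getitem__)

    def update(self, arm, reward):
        # Modify the selected arm's parameters based on the reward.
        self.counts[arm] += 1
        self.means[arm] += (reward - self.means[arm]) / self.counts[arm]
```

Here the (selection, reward) pair collected online is exactly the data used to modify the model's parameters, consistent with the mapping of claims 8 and 18 above.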
Regarding claim 10, Dimakopoulou, as modified by Dirac, Rashidinejad, Liu, and Dimakopoulou 2, teaches the method of claim 7, wherein updating the training of the machine learning computer model during the online learning phase of operation is performed by a robust (R)-HLinUCB engine of a machine learning computer model service (The “robust” teaching comes from Liu, which teaches a robust subspace recovery that accounts for noise and adversarial data. To run the combined algorithms of Dimakopoulou and Liu in an online setting, one would implicitly need an engine to run them.) and executing a computer model service on a remote computing device from the hosting computing system. (“In real-time mode, a network endpoint (e.g., an IP address) may be assigned as a destination to which input data records for a specified model are to be submitted, and model predictions may be generated on groups of streaming data records as the records are received. In local mode, clients may receive executable representations of a specified model that has been trained and validated at the MLS, and the clients may run the models on computing devices of their choice (e.g., at devices located in client networks rather than in the provider network where the MLS is implemented).”, Paragraph 65 (Dirac)) (The local mode has a client device, remote from the provider network, which can run the model, teaching the limitation.)
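For clarity of record, the two execution modes quoted from Dirac may be sketched as follows. This is a hypothetical sketch, not Dirac's disclosed implementation or API: the `endpoint_submit` callable stands in for submission of records to the assigned network endpoint (real-time mode), and `local_model` stands in for the executable model representation run on the client's own device (local mode).

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence

@dataclass
class ModelService:
    """Hypothetical dispatch between Dirac's two modes: "real-time"
    submits records to a service-hosted endpoint; "local" runs an
    executable model representation on the client device."""
    mode: str
    endpoint_submit: Optional[Callable[[Sequence[float]], float]] = None
    local_model: Optional[Callable[[Sequence[float]], float]] = None

    def predict(self, record):
        if self.mode == "real-time":
            # Service-side inference at the assigned network endpoint.
            return self.endpoint_submit(record)
        # Client-side inference on a device of the client's choice.
        return self.local_model(record)
```

In local mode, the model runs on a computing device remote from the hosting (provider) system, which is the mapping relied upon for the claim 10 limitation.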
Conclusion
Any inquiry concerning this communication or earlier communications from the examiner should be directed to Tyler E Iles whose telephone number is (571)272-5442. The examiner can normally be reached from 9:00am to 5:00pm.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Kakali Chaki can be reached at (571) 272-3719. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.
/T.E.I./ Patent Examiner, Art Unit 2122
/KAKALI CHAKI/ Supervisory Patent Examiner, Art Unit 2122