Prosecution Insights
Last updated: April 19, 2026
Application No. 18/306,449

WARM STARTING AN ONLINE BANDIT LEARNER MODEL UTILIZING RELEVANT OFFLINE MODELS

Non-Final OA §103
Filed
Apr 25, 2023
Examiner
ZHAO, DON GORDON
Art Unit
2493
Tech Center
2400 — Computer Networks
Assignee
Adobe Inc.
OA Round
1 (Non-Final)
Grant Probability: 87% (Favorable)
Expected OA Rounds: 1-2
Time to Grant: 2y 5m
With Interview: 99%

Examiner Intelligence

Career Allow Rate: 87% (above average); 674 granted / 774 resolved; +29.1% vs TC avg
Interview Lift: +16.9% (strong), among resolved cases with an interview
Typical Timeline: 2y 5m average prosecution; 21 currently pending
Career History: 795 total applications across all art units

Statute-Specific Performance

§101: 11.0% (-29.0% vs TC avg)
§103: 41.0% (+1.0% vs TC avg)
§102: 4.5% (-35.5% vs TC avg)
§112: 27.8% (-12.2% vs TC avg)
Tech Center averages are estimates, based on career data from 774 resolved cases.
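The figures above are internally consistent, which a quick arithmetic check confirms. This is only a sanity-check sketch; the variable names are illustrative, not from any real API:

```python
# Sanity-check the examiner statistics shown above.
granted, resolved = 674, 774
allow_rate = granted / resolved          # career allow rate
print(f"{allow_rate:.1%}")               # → 87.1%, matching the 87% shown

# Statute-specific rejection rates and their deltas vs the Tech Center
# average. The implied TC baseline is simply rate minus delta.
statutes = {"101": (11.0, -29.0), "103": (41.0, +1.0),
            "102": (4.5, -35.5), "112": (27.8, -12.2)}
for s, (rate, delta) in statutes.items():
    print(f"§{s}: implied TC average = {rate - delta:.1f}%")
# Every statute implies the same 40.0% Tech Center baseline estimate.
```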

Office Action

§103
DETAILED ACTION

Claims 1-20, presented on 04/25/2023, are examined on the merits. Claims 1, 8, and 15 are independent base claims.

Notice of Pre-AIA or AIA Status

The present application, filed on or after March 16, 2013, is being examined under the first-inventor-to-file provisions of the AIA.

Examiner's Instructions for Filing a Response to this Office Action

When the Applicant submits claim amendments in response to this Office Action, the Examiner would appreciate a clean copy of the claims to facilitate prosecution, which otherwise requires extra time to edit the marked-up claims from OCR. Please submit two sets of claims: Set #1, as in a typical filing, which includes claim status indicators and all marked amendments to the claims; and Set #2, as an appendix to the Arguments/Remarks, a clean version of the claims with all markups removed for entry by the Examiner.

Examiner's Note

The instant application claims a method of determining, based on identified reward estimates, entropy reductions for a set of computer-implemented tasks, and selecting a computer-implemented task to perform using an offline model as opposed to an online model. The claimed subject matter relates to warm-starting online bandit learner models, a framework used in machine learning and reinforcement learning to make decisions under uncertainty. It is noted that entropy in bandit learner models (and machine learning generally) is directly related to network security, particularly real-time anomaly detection. As such, the claims are patent eligible under 35 U.S.C. § 101.

Claim Rejections - 35 U.S.C. § 103

The following is a quotation of 35 U.S.C. 103, which forms the basis for all obviousness rejections set forth in this Office action:

A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

The factual inquiries set forth in Graham v. John Deere Co., 383 U.S. 1, 148 USPQ 459 (1966), applied to establish a background for determining obviousness under 35 U.S.C. 103, are summarized as follows:
1. Determining the scope and contents of the prior art.
2. Ascertaining the differences between the prior art and the claims at issue.
3. Resolving the level of ordinary skill in the pertinent art.
4. Considering objective evidence present in the application indicating obviousness or nonobviousness.

In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.

Claims 1-5, 8-12, and 15-19 are rejected under 35 U.S.C. 103 as being unpatentable over Chari (US 20150082377 A1) in view of Pascanu (US 20200090048 A1; hereinafter "Pasca"; note that Provisional Application US 62508991, dated 2017-05-19, is relied upon for the reference date).

As per claim 1, Chari teaches a method comprising: determining an initial entropy of an environment based on an observation history for the environment (Chari, par.
0023-0024 and 0059: logs are monitored for the attribute and policy specifications, which can be noisy and contain errors … in the environments where tasks are performed by remote processing devices; see par. 0068-0069; par. 0032: calculates entropy of the obtained user identifiers [as] an initial entropy); selecting, based on the entropy reductions for the set of computer-implemented tasks, a computer-implemented task to perform from the set of computer-implemented tasks using the offline model (Chari, par. 0033: selecting or identifying attributes that are relevant for determining authorizations; it should be noted that there is a sub-goal to produce ABAC policies that are simple, containing few rules or short rules; par. 0034-0039: drop any attribute with high entropy (entropy proportional to the user identifiers); par. 0047: for example, selecting an attribute with the highest entropy).

While Chari discloses calculating the entropy reduction for an attribute X (par. 0041-0042), Chari does not explicitly disclose determining reward estimates associated with performing user or environment tasks. This aspect of the claim is identified as a further difference.

In a related art, Pasca teaches: identifying, using an offline model, reward estimates associated with performing a set of computer-implemented tasks corresponding to the environment (Pasca, par. 0056-0059: determining rewards for maximizing expected returns; par. 0062-0064: locally maximizing expected returns means using an offline model; see par. 0055-0056 for the log π.sub.0(a.sub.t|s.sub.t) term [treated] as a reward); and determining, based on the reward estimates, entropy reductions for the set of computer-implemented tasks (Pasca, par. 0055-0056: the entropy term is calculated by −log π.sub.i(a.sub.t|s.sub.t); par. 0059: an entropy-regularized expected return with a redefined (regularized) reward; par. 0064-0065: reduces to an entropy-regularized expected return with an entropy regularization factor; par. 0068-0071: the reward term … causes the task network 50i to diverge from the policy network 50 to increase the expected reward). As such, Pasca discloses determining, based on the reward estimates, entropy reductions (the multitask policy and all the task policies can converge to the one that solves the easy task; par. 0065).

Chari and Pasca are analogous art to the claimed invention, being in the same field of endeavor as the claimed invention or reasonably pertinent to the problem faced by the inventor, which may be in a different field. Thus, it would have been obvious to one of ordinary skill in the art, before the effective filing date of the claimed invention, to modify Chari's system with Pasca's teachings of determining entropy reductions based on the reward estimates. For this combination, the motivation would have been to improve the task policies with reduced entropy.

As per claim 2, the references as combined above teach the method as recited in claim 1, wherein determining the entropy reductions comprises: determining, for a given computer-implemented task of the set of computer-implemented tasks, a new entropy of the environment based on a reward estimate associated with performing the computer-implemented task (Pasca, par. 0020-0021: determining … a first entropy term measuring a difference between a distribution of the task policy and a distribution of the multitask policy; in a reinforcement learning system the task policy distribution and the multitask policy distribution may comprise state-action distributions); and determining an entropy reduction for the computer-implemented task of the set of computer-implemented tasks by comparing the new entropy to the initial entropy of the environment (Pasca, par. 0063-0065: [reducing] an entropy-regularized expected return with entropy regularization; consider the simple scenario of only n=1 task.
Then (5) is maximized when the multitask policy π.sub.0 and the task policy π.sub.i are equal, and the KL regularization term is 0; thus the objective function reduces to an unregularized expected return).

As per claim 3, the references as combined above teach the method as recited in claim 2, and Chari also teaches: wherein selecting the computer-implemented task to perform comprises determining that the entropy reduction for the computer-implemented task has a highest entropy reduction in the set of computer-implemented tasks (Chari, par. 0042: max entropy; select one attribute from each group based on some criterion, such as max entropy; par. 0047: selecting an attribute with the highest entropy; see also par. 0039-0041 for dropping any attribute with high entropy).

As per claim 4, the references as combined above teach the method as recited in claim 1, further comprising updating the observation history for the environment by adding an observation of a reward associated with performing the selected computer-implemented task to the observation history (Chari, par. 0057-0059: adding new attributes for reducing the entropy of the leaf nodes, wherein the logs can be monitored and policy changes suggested; Chari discloses that the process is scalable and parallelizable, which can assist with past and future provisioning, and can identify errors, i.e., updating the observation history for the environment with a reward or entropy reduction).

As per claim 5, the references as combined above teach the method as recited in claim 1, further comprising setting, for an identified computer-implemented task of the set of computer-implemented tasks, an entropy reduction at a given time as an exploration weight on the identified computer-implemented task and a reward estimate as an exploitation weight on the identified computer-implemented task (Pasca, par. 0065-0068: consider a scenario where one of the tasks is easier and is solved first, while the other tasks are harder with much sparser rewards; an entropy is calculated to adjust the expected rewards; par. 0068: Pasca discloses the entropy regularization factor β′=β/(1−α)=1/c.sub.Ent for a reward estimate as an exploitation weight on the identified task).

Regarding claims 8-12, they are similar to claims 1-5, respectively, in terms of recited features; thus, claims 8-12 are rejected for the same reasons as above.

Regarding claims 15-19, they are similar to claims 1-5, respectively, in terms of recited features; thus, claims 15-19 are rejected for the same reasons as above.

Claims 6-7, 13-14, and 20 are rejected under 35 U.S.C. 103 as being unpatentable over Chari and Pasca, as applied to claim 1, and further in view of Shtein (US 11151467 B1).

As per claim 6, the references of Chari and Pasca as combined above teach the method as recited in claim 1. While Pasca discloses neural networks as machine learning models for generating the multitask policy (par. 0053-0055), the references do not explicitly disclose that an online bandit learner model is used for generating online reward estimates of the task environment, or determining the relevancy of the offline model to the online bandit learner model based on the online reward estimates and the offline reward estimates. These aspects of the claim are identified as a further difference.

In a related art, Shtein teaches: generating online reward estimates of the environment for the set of computer-implemented tasks using an online bandit learner model (Shtein, col. 2, lines 64-67 and col. 3, lines 1-21: generating online feedback, which is a reward estimate; automatically arbitrating between the two approaches, by maximizing pre-defined key performance indicators (KPIs), using multi-arm bandit techniques); generating offline reward estimates for the set of computer-implemented tasks across a plurality of offline models (Shtein, col. 5, lines 3-10: logging the request along with the selected option to the training observations and to the multi-arm bandit statistics and [generating] at least one valid model whose offline KPIs are good enough); and determining that the offline model is relevant to the online bandit learner model based on the online reward estimates and the offline reward estimates (Shtein, col. 5, lines 65-67 and col. 6, lines 1-14: to incorporate personalization into the multi-arm bandit technique, the system 400 splits (clusters/segments) the whole population into K sub-segments (clusters) and associates a specific multi-arm bandit technique with each such sub-population; in real time, when the system 400 is requested to generate a prediction for a new customer, the system first determines which segment (cluster) the subscriber is closest to. Shtein teaches finding the relevancy of the offline KPIs to the online multi-arm bandit model based on the determination that the segment (cluster) of the subscriber is the closest).

Shtein is analogous art to the claimed invention, being in the same field of endeavor as the claimed invention or reasonably pertinent to the problem faced by the inventor, which may be in a different field. Thus, it would have been obvious to one of ordinary skill in the art, before the effective filing date of the claimed invention, to modify the Chari-Pasca system with Shtein's teachings of the online bandit learner model, wherein the initial multi-arm bandit technique competes with machine learning models for an arbitration between the original multi-arm bandit technique and the machine learning models.
For this combination, the motivation would have been to improve the selection of the online bandit model with intelligent adaptive decisions.

As per claim 7, the references of Chari, Pasca, and Shtein as combined above teach the method as recited in claim 6, wherein determining that the offline model is relevant to the online bandit learner model comprises: comparing the online reward estimates to the offline reward estimates (Shtein, col. 4, lines 50-67: an automatic model trainer that trains machine learning models based on observations of actual selections, which involves comparing the online reward estimates to the offline reward estimates); and determining that the offline model is relevant to the online bandit learner model based on a difference of the offline reward estimates across the set of computer-implemented tasks relative to the online reward estimates of the online bandit learner model (Shtein, col. 5, lines 3-10: once the offline KPIs are good enough, the system starts to use the multi-arm-bandit-based arbitrator to select the model that will produce the best results). Shtein is combined with Chari and Pasca herein for reasons and rationale similar to claim 6.

Regarding claims 13-14, they are similar to claims 6-7, respectively, in terms of recited features; thus, claims 13-14 are rejected for the same reasons.

Regarding claim 20, the claim is similar to claim 6 and is therefore rejected using a similar rationale.

Conclusion

The prior art made of record and not relied upon is considered pertinent to applicant's disclosure, as it additionally discloses certain parts of the claim features (see "PTO-892 Notice of References Cited").

Any inquiry concerning this communication or earlier communications from the examiner should be directed to DON ZHAO, whose telephone number is (571) 272-9953. The examiner can normally be reached Monday to Friday, 7:30 AM to 5:00 PM EST. Examiner interviews are available via telephone, in person, and video conferencing using a USPTO-supplied web-based collaboration tool. If attempts to reach the examiner by telephone are unsuccessful, the examiner's supervisor, Carl G. Colin, can be reached at (571) 272-3862. The fax number for the organization where this application or proceeding is assigned is (571) 273-8300.

Information regarding the status of an application may be obtained from the Patent Application Information Retrieval (PAIR) system. Status information for published applications may be obtained from either Private PAIR or Public PAIR; status information for unpublished applications is available through Private PAIR only. For more information about the PAIR system, see http://pair-direct.uspto.gov. Should you have questions on access to the Private PAIR system, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative or access to the automated information system, call 800-786-9199 (in USA or Canada) or 571-272-1000.

/Don G Zhao/
Primary Examiner, Art Unit 2493
03/03/2026
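The claim elements at issue (an initial entropy computed from the observation history, tasks scored by an offline reward estimate as the exploitation weight plus an entropy reduction as the exploration weight, and selection of the highest-scoring task) can be illustrated with a small sketch. This is not the applicant's or any cited reference's actual implementation; the reward discretization, the beta weighting, and all names are assumptions made for illustration:

```python
import math
from collections import Counter

def entropy(history):
    """Shannon entropy (in nats) of a discrete observation history."""
    counts = Counter(history)
    n = len(history)
    return -sum(c / n * math.log(c / n) for c in counts.values())

def select_task(history, offline_rewards, beta=1.0):
    """Score each task as exploitation (offline reward estimate) plus
    exploration (predicted entropy reduction), and pick the argmax."""
    h0 = entropy(history)  # initial entropy of the environment
    def score(task):
        # Hypothetical prediction step: assume performing the task yields
        # its (discretized) estimated reward, then measure how the
        # entropy of the augmented history compares to the initial one.
        predicted = history + [round(offline_rewards[task])]
        return offline_rewards[task] + beta * (h0 - entropy(predicted))
    return max(offline_rewards, key=score)

# Warm start: offline models supply reward estimates before any online
# feedback exists; the chosen task's observed reward would then be
# appended to the history to update the online bandit learner.
task = select_task([0, 0, 1], {"serve_ad": 0.9, "hold_out": 0.1})
print(task)  # → serve_ad
```

In this sketch the entropy-reduction term can go negative when an outcome would make the history less predictable, which naturally penalizes tasks that add uncertainty; the offline reward estimate dominates until online observations accumulate.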

Prosecution Timeline

Apr 25, 2023
Application Filed
Mar 03, 2026
Non-Final Rejection — §103 (current)

Precedent Cases

Applications granted by this same examiner with similar technology

Patent 12603879: PROGRESSIVELY INCREASING A LOGIN INFORMATION LENGTH (granted Apr 14, 2026; 2y 5m to grant)
Patent 12598209: DEVICE VULNERABILITY RISK ASSESSMENT SYSTEM (granted Apr 07, 2026; 2y 5m to grant)
Patent 12596801: METHOD AND APPARATUS FOR DETECTING COMMAND CONTROL SERVER OF MALICIOUS APPLICATION (granted Apr 07, 2026; 2y 5m to grant)
Patent 12596802: MALWARE DETECTION TECHNIQUES (granted Apr 07, 2026; 2y 5m to grant)
Patent 12585735: SYSTEMS AND METHODS FOR GENERATING AND DISTRIBUTING NFTs BASED ON USER INTERACTION (granted Mar 24, 2026; 2y 5m to grant)
Study what changed to get past this examiner. Based on 5 most recent grants.


Prosecution Projections

Expected OA Rounds: 1-2
Grant Probability: 87%
With Interview (+16.9%): 99%
Median Time to Grant: 2y 5m
PTA Risk: Low
Based on 774 resolved cases by this examiner. Grant probability derived from career allow rate.
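The projection figures follow from simple arithmetic on the examiner statistics above. The 99% cap below is an assumption about how the displayed probability is limited, not something the page documents:

```python
base = 0.87    # grant probability, taken from the career allow rate
lift = 0.169   # interview lift, in probability points
# Hypothetical: the displayed with-interview figure appears capped at 99%.
with_interview = min(base + lift, 0.99)
print(f"{with_interview:.0%}")  # → 99%
```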
