Prosecution Insights
Last updated: April 19, 2026
Application No. 18/306,449

WARM STARTING AN ONLINE BANDIT LEARNER MODEL UTILIZING RELEVANT OFFLINE MODELS

Non-Final OA §103
Filed
Apr 25, 2023
Examiner
ZHAO, DON GORDON
Art Unit
2493
Tech Center
2400 — Computer Networks
Assignee
Adobe Inc.
OA Round
1 (Non-Final)
Grant Probability: 87% (Favorable)
Expected OA Rounds: 1-2
Time to Grant: 2y 5m
With Interview: 99%

Examiner Intelligence

Career Allow Rate: 87% (above average); 674 granted / 774 resolved; +29.1% vs TC avg
Interview Lift: +16.9% (strong), among resolved cases with an interview
Typical Timeline: 2y 5m average prosecution; 21 currently pending
Career History: 795 total applications across all art units

Statute-Specific Performance

§101: 11.0% (-29.0% vs TC avg)
§103: 41.0% (+1.0% vs TC avg)
§102: 4.5% (-35.5% vs TC avg)
§112: 27.8% (-12.2% vs TC avg)
Tech Center averages are estimates, based on career data from 774 resolved cases.
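The figures above are internally consistent, which a quick arithmetic check confirms. This is only a sanity-check sketch; the variable names are illustrative, not from any real API:

```python
# Sanity-check the examiner statistics shown above.
granted, resolved = 674, 774
allow_rate = granted / resolved          # career allow rate
print(f"{allow_rate:.1%}")               # → 87.1%, matching the 87% shown

# Statute-specific rejection rates and their deltas vs the Tech Center
# average. The implied TC baseline is simply rate minus delta.
statutes = {"101": (11.0, -29.0), "103": (41.0, +1.0),
            "102": (4.5, -35.5), "112": (27.8, -12.2)}
for s, (rate, delta) in statutes.items():
    print(f"§{s}: implied TC average = {rate - delta:.1f}%")
# Every statute implies the same 40.0% Tech Center baseline estimate.
```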

Office Action

§103
DETAILED ACTION

Claims 1-20, presented on 04/25/2023, are examined on the merits. Claims 1, 8, and 15 are independent base claims.

Notice of Pre-AIA or AIA Status

The present application, filed on or after March 16, 2013, is being examined under the first-inventor-to-file provisions of the AIA.

Examiner's Instructions for Filing a Response to this Office Action

When the Applicant submits claim amendments in response to this Office Action, the Examiner would appreciate a clean copy of the claims to facilitate prosecution, which otherwise requires extra time to edit the marked-up claims from OCR. Please submit two sets of claims: Set #1, as in a typical filing, which includes claim status indicators and all marked amendments to the claims; and Set #2, as an appendix to the Arguments/Remarks, a clean version of the claims with all markups removed for entry by the Examiner.

Examiner's Note

The instant application claims a method of determining, based on identified reward estimates, entropy reductions for a set of computer-implemented tasks, and selecting a computer-implemented task to perform using an offline model as opposed to an online model. The claimed subject matter relates to warm-starting online bandit learner models, a framework used in machine learning and reinforcement learning to make decisions under uncertainty. It is noted that entropy in bandit learner models (and machine learning generally) is directly related to network security, particularly real-time anomaly detection. As such, the claims are patent eligible under 35 U.S.C. § 101.

Claim Rejections - 35 U.S.C. § 103

The following is a quotation of 35 U.S.C. 103, which forms the basis for all obviousness rejections set forth in this Office action:

A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

The factual inquiries set forth in Graham v. John Deere Co., 383 U.S. 1, 148 USPQ 459 (1966), applied to establish a background for determining obviousness under 35 U.S.C. 103, are summarized as follows:
1. Determining the scope and contents of the prior art.
2. Ascertaining the differences between the prior art and the claims at issue.
3. Resolving the level of ordinary skill in the pertinent art.
4. Considering objective evidence present in the application indicating obviousness or nonobviousness.

In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.

Claims 1-5, 8-12, and 15-19 are rejected under 35 U.S.C. 103 as being unpatentable over Chari (US 20150082377 A1) in view of Pascanu (US 20200090048 A1; hereinafter "Pasca"; note that Provisional Application US 62508991, dated 2017-05-19, is relied upon for the reference date).

As per claim 1, Chari teaches a method comprising: determining an initial entropy of an environment based on an observation history for the environment (Chari, par.
0023-0024 and 0059: logs are monitored for the attribute and policy specifications, which can be noisy and contain errors … in the environments where tasks are performed by remote processing devices; see par. 0068-0069; par. 0032: calculates entropy of the obtained user identifiers [as] an initial entropy); selecting, based on the entropy reductions for the set of computer-implemented tasks, a computer-implemented task to perform from the set of computer-implemented tasks using the offline model (Chari, par. 0033: selecting or identifying attributes that are relevant for determining authorizations; it should be noted that there is a sub-goal to produce ABAC policies that are simple, containing few rules or short rules; par. 0034-0039: drop any attribute with high entropy (entropy proportional to the user identifiers); par. 0047: for example, selecting an attribute with the highest entropy).

While Chari discloses calculating the entropy reduction for an attribute X (par. 0041-0042), Chari does not explicitly disclose determining reward estimates associated with performing user or environment tasks. This aspect of the claim is identified as a further difference.

In a related art, Pasca teaches: identifying, using an offline model, reward estimates associated with performing a set of computer-implemented tasks corresponding to the environment (Pasca, par. 0056-0059: determining rewards for maximizing expected returns; par. 0062-0064: locally maximizing expected returns means using an offline model; see par. 0055-0056 for the log π.sub.0(a.sub.t|s.sub.t) term [treated] as a reward); and determining, based on the reward estimates, entropy reductions for the set of computer-implemented tasks (Pasca, par. 0055-0056: the entropy term is calculated by −log π.sub.i(a.sub.t|s.sub.t); par. 0059: an entropy-regularized expected return with a redefined (regularized) reward; par. 0064-0065: reduces to an entropy-regularized expected return with an entropy regularization factor; par. 0068-0071: the reward term … causes the task network 50i to diverge from the policy network 50 to increase the expected reward). As such, Pasca discloses determining, based on the reward estimates, entropy reductions (the multitask policy and all the task policies can converge to the one that solves the easy task; par. 0065).

Chari and Pasca are analogous art to the claimed invention, being in the same field of endeavor as the claimed invention or reasonably pertinent to the problem faced by the inventor, which may be in a different field. Thus, it would have been obvious to one of ordinary skill in the art, before the effective filing date of the claimed invention, to modify Chari's system with Pasca's teachings of determining entropy reductions based on the reward estimates. For this combination, the motivation would have been to improve the task policies with reduced entropy.

As per claim 2, the references as combined above teach the method as recited in claim 1, wherein determining the entropy reductions comprises: determining, for a given computer-implemented task of the set of computer-implemented tasks, a new entropy of the environment based on a reward estimate associated with performing the computer-implemented task (Pasca, par. 0020-0021: determining … a first entropy term measuring a difference between a distribution of the task policy and a distribution of the multitask policy; in a reinforcement learning system the task policy distribution and the multitask policy distribution may comprise state-action distributions); and determining an entropy reduction for the computer-implemented task of the set of computer-implemented tasks by comparing the new entropy to the initial entropy of the environment (Pasca, par. 0063-0065: [reducing] an entropy-regularized expected return with entropy regularization; consider the simple scenario of only n=1 task.
Then (5) is maximized when the multitask policy π.sub.0 and the task policy π.sub.i are equal, and the KL regularization term is 0; thus the objective function reduces to an unregularized expected return).

As per claim 3, the references as combined above teach the method as recited in claim 2, and Chari also teaches: wherein selecting the computer-implemented task to perform comprises determining that the entropy reduction for the computer-implemented task has a highest entropy reduction in the set of computer-implemented tasks (Chari, par. 0042: max entropy; select one attribute from each group based on some criterion, such as max entropy; par. 0047: selecting an attribute with the highest entropy; see also par. 0039-0041 for dropping any attribute with high entropy).

As per claim 4, the references as combined above teach the method as recited in claim 1, further comprising updating the observation history for the environment by adding an observation of a reward associated with performing the selected computer-implemented task to the observation history (Chari, par. 0057-0059: adding new attributes for reducing the entropy of the leaf nodes, wherein the logs can be monitored and policy changes suggested; Chari discloses that the process is scalable and parallelizable, which can assist with past and future provisioning, and can identify errors, i.e., updating the observation history for the environment with a reward or entropy reduction).

As per claim 5, the references as combined above teach the method as recited in claim 1, further comprising setting, for an identified computer-implemented task of the set of computer-implemented tasks, an entropy reduction at a given time as an exploration weight on the identified computer-implemented task and a reward estimate as an exploitation weight on the identified computer-implemented task (Pasca, par. 0065-0068: consider a scenario where one of the tasks is easier and is solved first, while the other tasks are harder with much sparser rewards; an entropy is calculated to adjust the expected rewards; par. 0068: Pasca discloses the entropy regularization factor β′=β/(1−α)=1/c.sub.Ent for a reward estimate as an exploitation weight on the identified task).

Regarding claims 8-12, they are similar to claims 1-5, respectively, in terms of recited features; thus, claims 8-12 are rejected for the same reasons as above.

Regarding claims 15-19, they are similar to claims 1-5, respectively, in terms of recited features; thus, claims 15-19 are rejected for the same reasons as above.

Claims 6-7, 13-14, and 20 are rejected under 35 U.S.C. 103 as being unpatentable over Chari and Pasca, as applied to claim 1, and further in view of Shtein (US 11151467 B1).

As per claim 6, the references of Chari and Pasca as combined above teach the method as recited in claim 1. While Pasca discloses neural networks as machine learning models for generating the multitask policy (par. 0053-0055), the references do not explicitly disclose that an online bandit learner model is used for generating online reward estimates of the task environment, or determining the relevancy of the offline model to the online bandit learner model based on the online reward estimates and the offline reward estimates. These aspects of the claim are identified as a further difference.

In a related art, Shtein teaches: generating online reward estimates of the environment for the set of computer-implemented tasks using an online bandit learner model (Shtein, col. 2, lines 64-67 and col. 3, lines 1-21: generating online feedback, which is a reward estimate; automatically arbitrating between the two approaches, by maximizing pre-defined key performance indicators (KPIs), using multi-arm bandit techniques); generating offline reward estimates for the set of computer-implemented tasks across a plurality of offline models (Shtein, col. 5, lines 3-10: logging the request along with the selected option to the training observations and to the multi-arm bandit statistics and [generating] at least one valid model whose offline KPIs are good enough); and determining that the offline model is relevant to the online bandit learner model based on the online reward estimates and the offline reward estimates (Shtein, col. 5, lines 65-67 and col. 6, lines 1-14: to incorporate personalization into the multi-arm bandit technique, the system 400 splits (clusters/segments) the whole population into K sub-segments (clusters) and associates a specific multi-arm bandit technique with each such sub-population; in real time, when the system 400 is requested to generate a prediction for a new customer, the system first determines which segment (cluster) the subscriber is closest to. Shtein teaches finding the relevancy of the offline KPIs to the online multi-arm bandit model based on the determination that the segment (cluster) of the subscriber is the closest).

Shtein is analogous art to the claimed invention, being in the same field of endeavor as the claimed invention or reasonably pertinent to the problem faced by the inventor, which may be in a different field. Thus, it would have been obvious to one of ordinary skill in the art, before the effective filing date of the claimed invention, to modify the Chari-Pasca system with Shtein's teachings of the online bandit learner model, wherein the initial multi-arm bandit technique competes with machine learning models for an arbitration between the original multi-arm bandit technique and the machine learning models.
For this combination, the motivation would have been to improve the selection of the online bandit model with intelligent adaptive decisions.

As per claim 7, the references of Chari, Pasca, and Shtein as combined above teach the method as recited in claim 6, wherein determining that the offline model is relevant to the online bandit learner model comprises: comparing the online reward estimates to the offline reward estimates (Shtein, col. 4, lines 50-67: an automatic model trainer that trains machine learning models based on observations of actual selections, which involves comparing the online reward estimates to the offline reward estimates); and determining that the offline model is relevant to the online bandit learner model based on a difference of the offline reward estimates across the set of computer-implemented tasks relative to the online reward estimates of the online bandit learner model (Shtein, col. 5, lines 3-10: once the offline KPIs are good enough, the system starts to use the multi-arm-bandit-based arbitrator to select the model that will produce the best results). Shtein is combined with Chari and Pasca herein for reasons and rationale similar to claim 6.

Regarding claims 13-14, they are similar to claims 6-7, respectively, in terms of recited features; thus, claims 13-14 are rejected for the same reasons.

Regarding claim 20, the claim is similar to claim 6 and is therefore rejected using a similar rationale.

Conclusion

The prior art made of record and not relied upon is considered pertinent to applicant's disclosure, as it additionally discloses certain parts of the claim features (see "PTO-892 Notice of References Cited").

Any inquiry concerning this communication or earlier communications from the examiner should be directed to DON ZHAO, whose telephone number is (571) 272-9953. The examiner can normally be reached Monday to Friday, 7:30 AM to 5:00 PM EST. Examiner interviews are available via telephone, in person, and video conferencing using a USPTO-supplied web-based collaboration tool. If attempts to reach the examiner by telephone are unsuccessful, the examiner's supervisor, Carl G. Colin, can be reached at (571) 272-3862. The fax number for the organization where this application or proceeding is assigned is (571) 273-8300.

Information regarding the status of an application may be obtained from the Patent Application Information Retrieval (PAIR) system. Status information for published applications may be obtained from either Private PAIR or Public PAIR; status information for unpublished applications is available through Private PAIR only. For more information about the PAIR system, see http://pair-direct.uspto.gov. Should you have questions on access to the Private PAIR system, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative or access to the automated information system, call 800-786-9199 (in USA or Canada) or 571-272-1000.

/Don G Zhao/
Primary Examiner, Art Unit 2493
03/03/2026
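The claim elements at issue (an initial entropy computed from the observation history, tasks scored by an offline reward estimate as the exploitation weight plus an entropy reduction as the exploration weight, and selection of the highest-scoring task) can be illustrated with a small sketch. This is not the applicant's or any cited reference's actual implementation; the reward discretization, the beta weighting, and all names are assumptions made for illustration:

```python
import math
from collections import Counter

def entropy(history):
    """Shannon entropy (in nats) of a discrete observation history."""
    counts = Counter(history)
    n = len(history)
    return -sum(c / n * math.log(c / n) for c in counts.values())

def select_task(history, offline_rewards, beta=1.0):
    """Score each task as exploitation (offline reward estimate) plus
    exploration (predicted entropy reduction), and pick the argmax."""
    h0 = entropy(history)  # initial entropy of the environment
    def score(task):
        # Hypothetical prediction step: assume performing the task yields
        # its (discretized) estimated reward, then measure how the
        # entropy of the augmented history compares to the initial one.
        predicted = history + [round(offline_rewards[task])]
        return offline_rewards[task] + beta * (h0 - entropy(predicted))
    return max(offline_rewards, key=score)

# Warm start: offline models supply reward estimates before any online
# feedback exists; the chosen task's observed reward would then be
# appended to the history to update the online bandit learner.
task = select_task([0, 0, 1], {"serve_ad": 0.9, "hold_out": 0.1})
print(task)  # → serve_ad
```

In this sketch the entropy-reduction term can go negative when an outcome would make the history less predictable, which naturally penalizes tasks that add uncertainty; the offline reward estimate dominates until online observations accumulate.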

Prosecution Timeline

Apr 25, 2023
Application Filed
Mar 03, 2026
Non-Final Rejection — §103 (current)

Precedent Cases

Applications granted by this same examiner with similar technology

Patent 12603879: PROGRESSIVELY INCREASING A LOGIN INFORMATION LENGTH (granted Apr 14, 2026; 2y 5m to grant)
Patent 12598209: DEVICE VULNERABILITY RISK ASSESSMENT SYSTEM (granted Apr 07, 2026; 2y 5m to grant)
Patent 12596801: METHOD AND APPARATUS FOR DETECTING COMMAND CONTROL SERVER OF MALICIOUS APPLICATION (granted Apr 07, 2026; 2y 5m to grant)
Patent 12596802: MALWARE DETECTION TECHNIQUES (granted Apr 07, 2026; 2y 5m to grant)
Patent 12585735: SYSTEMS AND METHODS FOR GENERATING AND DISTRIBUTING NFTs BASED ON USER INTERACTION (granted Mar 24, 2026; 2y 5m to grant)
Study what changed to get past this examiner. Based on 5 most recent grants.


Prosecution Projections

Expected OA Rounds: 1-2
Grant Probability: 87%
With Interview (+16.9%): 99%
Median Time to Grant: 2y 5m
PTA Risk: Low
Based on 774 resolved cases by this examiner. Grant probability derived from career allow rate.
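The projection figures follow from simple arithmetic on the examiner statistics above. The 99% cap below is an assumption about how the displayed probability is limited, not something the page documents:

```python
base = 0.87    # grant probability, taken from the career allow rate
lift = 0.169   # interview lift, in probability points
# Hypothetical: the displayed with-interview figure appears capped at 99%.
with_interview = min(base + lift, 0.99)
print(f"{with_interview:.0%}")  # → 99%
```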
