Prosecution Insights
Last updated: April 19, 2026
Application No. 18/275,511

NEURAL NETWORK REINFORCEMENT LEARNING WITH DIVERSE POLICIES

Non-Final OA (§101, §103)
Filed: Aug 02, 2023
Examiner: WENG, PEI YONG
Art Unit: 2141
Tech Center: 2100 — Computer Architecture & Software
Assignee: DeepMind Technologies Limited
OA Round: 1 (Non-Final)
Grant Probability: 79% (Favorable)
Expected OA Rounds: 1-2
Time to Grant: 3y 3m
With Interview: 99%

Examiner Intelligence

Career Allow Rate: 79% (506 granted / 637 resolved; +24.4% vs TC avg; above average)
Interview Lift: +23.1% allow rate among resolved cases with an interview (strong)
Typical Timeline: 3y 3m average prosecution; 18 applications currently pending
Career History: 655 total applications across all art units
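
The headline rate follows directly from the career counts above; as a worked check:

```latex
\text{Career Allow Rate} = \frac{506\ \text{granted}}{637\ \text{resolved}} \approx 79.4\% \approx 79\%,
\qquad
637\ \text{resolved} + 18\ \text{pending} = 655\ \text{total applications}.
```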

Statute-Specific Performance

§101: 12.4% (-27.6% vs TC avg)
§103: 49.3% (+9.3% vs TC avg)
§102: 19.2% (-20.8% vs TC avg)
§112: 8.8% (-31.2% vs TC avg)
Tech Center averages are estimates • Based on career data from 637 resolved cases

Office Action

Rejections: §101, §103
Notice of Pre-AIA or AIA Status

The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

DETAILED ACTION

This action is responsive to the following communication: Preliminary Amendment filed Aug. 2, 2023. Claims 1-20 are pending in the case. Claims 1, 19 and 20 are independent claims.

Claim Rejections - 35 USC § 101

35 U.S.C. 101 reads as follows: Whoever invents or discovers any new and useful process, machine, manufacture, or composition of matter, or any new and useful improvement thereof, may obtain a patent therefor, subject to the conditions and requirements of this title.

Claims 1-20 are rejected under 35 U.S.C. § 101 because the claimed invention is directed to non-statutory subject matter.

Regarding Claim 1: Claim 1 is rejected under 35 U.S.C. 101 because the claimed invention is directed to an abstract idea without significantly more.

Step 1 Analysis: Claim 1 is directed to a method, which is a process, one of the statutory categories.

Step 2A Prong One Analysis: Claim 1, under its broadest reasonable interpretation, recites a series of mental processes. For example, but for the generic computer components language, the limitations in the context of this claim encompass machine learning processing, including the following: obtaining a policy set comprising one or more policies for satisfying the objective; determining a new policy based on the one or more policies, wherein the determining includes one or more optimization steps that aim to maximize a diversity of the new policy relative to the policy set under the condition that the new policy satisfies a minimum performance criterion based on an expected return that would be obtained by following the new policy (the recited features include observation, evaluation, and judgment). Therefore, claim 1 recites an abstract idea, which is a judicial exception.

Step 2A Prong Two Analysis: Claim 1 recites the additional elements "training a neural network system by reinforcement learning, the neural network system being configured to receive an input observation characterizing a state of an environment interacted with by an agent and to select and output an action in accordance with a policy aiming to satisfy an objective". However, these additional features are recited at a high level of generality, such that they amount to no more than mere instructions to apply the judicial exception. An additional element that merely recites the words "apply it" (or an equivalent) with the judicial exception, or merely includes instructions to implement an abstract idea on a computer, or merely uses a computer as a tool to perform an abstract idea, does not integrate the judicial exception into a practical application (see MPEP 2106.05(f)). Claim 1 also recites the additional element "receiving, by one or more processors, a plurality of training datasets corresponding to a plurality of predictive categories", which amounts to gathering data, an insignificant extra-solution activity (see MPEP 2106.05(g)). Therefore, claim 1 is directed to a judicial exception.

Step 2B Analysis: Claim 1 does not include additional elements that are sufficient to amount to significantly more than the judicial exception.
As discussed above with respect to the lack of integration of the abstract idea into a practical application, the additional elements recited in claim 1 amount to no more than mere instructions to apply the judicial exception using a generic computer component and insignificant extra-solution activity. The gathering and outputting of data is considered well-understood, routine, and conventional in the art (see MPEP 2106.05(d)(II)(i)). For the reasons above, claim 1 is rejected as being directed to non-patentable subject matter under § 101. This rejection applies equally to independent claims 19 and 20, which recite a system and storage media, respectively, as well as to dependent claims 2-18. When considering the elements separately and in combination, they do not add significantly more to the inventive concept. Accordingly, claims 1-20 are rejected under 35 U.S.C. § 101.

Claim Rejections - 35 USC § 103

The following is a quotation of 35 U.S.C. 103, which forms the basis for all obviousness rejections set forth in this Office action: A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

Claims 1-4, 8, 11, 14-20 are rejected under 35 U.S.C. 103 as being unpatentable over "Multiple Plans are Better than One: Diverse Stochastic Planning," Ghasemi et al., 2020 (hereinafter Ghasemi), in view of Mavrin et al. (hereinafter Mavrin), U.S. Patent Publication No. 2021/0387330.

With respect to independent claim 1, Ghasemi teaches a method for training a system by reinforcement learning, the system being configured to receive an input observation characterizing a state of an environment interacted with by an agent and to select and output an action in accordance with a policy aiming to satisfy an objective (see e.g., Abstract: "[...] we formulate a problem, called diverse stochastic planning, that aims to generate a set of representative - small and diverse - behaviors that are near-optimal with respect to the known objective. In particular, the problem aims to compute a set of diverse and near-optimal policies for systems modeled by a Markov decision process."; page 2, col. 2, paragraph 4: "We consider systems whose behavior is modeled by a Markov decision process (MDP). An MDP is a tuple M = (S, A, P, R), where S is a finite set of states, A is a finite set of actions, P : S × A × S → [0, 1] is a probabilistic transition function such that for all s ∈ S and a ∈ A, Σ_{s′ ∈ S} P(s′ | s, a) = 1, and R : S × A × S → ℝ is a reward function. A stationary stochastic policy π on an MDP is a mapping from the state space to a probability distribution over the actions, formally defined as π : S × A → [0, 1]."; page 4, col. 1, paragraph 3: "The first optimization method that we consider is projected gradient ascent (PGA) [...]. PGA iteratively applies a gradient update followed by a projection step."), the method comprising: obtaining a policy set comprising one or more policies for satisfying the objective (see e.g., page 3, col. 2, paragraph 3: "We aim to design an algorithm that can provide a representative set of policies over an MDP that are near-optimal with respect to a known reward function. In particular, given the stated definitions, the objective is to construct k policies that, cumulatively, have high reward and diversity.");
determining a new policy based on the one or more policies, wherein the determining includes one or more optimization steps that aim to maximize a diversity of the new policy relative to the policy set (see e.g., page 3, col. 2, paragraph 3, quoted above); under the condition that the new policy satisfies a minimum performance criterion based on an expected return that would be obtained by following the new policy (see e.g., equation 10, in particular "subject to [...] ρ_i(s, a) ≥ 0 for all i ∈ [k], s ∈ S, a ∈ A", wherein page 2, col. 2, last paragraph through page 3, col. 1, first paragraph discloses: "The occupancy measure of a policy, ρ_π(·, ·), is defined as the distribution induced by the execution of that policy over the state-action pairs, asymptotically [...] The long run behavior of a stationary stochastic policy can be represented using its corresponding occupancy measure."; equation 2).

Ghasemi does not expressly show that the system is a neural network system. However, Mavrin teaches this feature (see e.g., para. [0010] and claims 10 and 19: "the robot includes an RL agent that is configured to learn a first policy to maximize a cumulative reward of a first task. The first policy is learned using a RL algorithm and transition tuples collected in the environment, each transition tuple including state, action, reward of the first policy after taking the action, and next state. In some embodiments, the first policy that is learned using a RL algorithm is modelled as a neural network (referred to hereinafter as policy network) that maps state to action in an action space. In some embodiments, the first policy that is learned using a RL algorithm is the action that maximizes a value function in a given state, where the value function is modelled as a neural network (referred to hereinafter as a value network) that maps state and action to value. A feature neural network (hereinafter referred to as feature network) is extracted from the policy network or the value network. The feature network generates features for a given state. The features are then used as pseudo-rewards, called feature rewards. Each feature reward corresponds to the reward of an option policy that the RL agent learns to maximize using the transition tuples collected in the environment, where each transition tuple includes state, action, feature reward and next state. The RL agent is configured to learn a second policy to maximize a future cumulative reward for a second (different) task. The second policy chooses an option policy from the learned option policies. The chosen option policy generates an action for the given state. The second policy is learned using a RL algorithm and transition tuples collected in the environment, each transition tuple including state, option policy, reward of the second policy after taking the action generated by the option policy, and next state.
The second policy is modeled as a neural network (referred to hereinafter as second policy network) that maps state to option policy.").

Both Ghasemi and Mavrin are directed to robot agent training algorithms. Accordingly, it would have been obvious to the skilled artisan before the effective filing date of the claimed invention, having Ghasemi and Mavrin in front of them, to modify the system of Ghasemi to include the above feature. The motivation to combine Ghasemi and Mavrin comes from Mavrin, which discloses the motivation to incorporate a neural network into the learning process so that learning can be more efficient (paras. [0010]-[0013]). This motivation for combination also applies to the remaining claims which depend on this combination.

With respect to dependent claim 2, the modified Ghasemi teaches that the diversity is measured based on an expected state distribution for each of the new policy and the one or more policies in the policy set (see e.g., page 2, col. 2 to page 3, col. 2).

With respect to dependent claim 3, the modified Ghasemi teaches that determining the new policy comprises defining a diversity reward function that provides a diversity reward for a given state, the diversity reward providing a measure of the diversity of the new policy relative to the policy set; the one or more optimization steps aim to maximize an expected diversity return based on the diversity reward function under the condition that the new policy satisfies the minimum performance criterion (see e.g., page 3, col. 2 to page 4, col. 1).

With respect to dependent claim 4, the modified Ghasemi teaches that the one or more optimization steps aim to minimize a correlation between successor features of the new policy and successor features of the policy set under the condition that the new policy satisfies the minimum performance criterion (see e.g., page 1, lines 24-27; page 3, col. 2, lines 33-35; and equation 10).

With respect to dependent claim 8, the modified Ghasemi teaches that each of the one or more optimization steps comprises: obtaining a sequence of observations of states from the implementation of the new policy; and updating parameters of the new policy to maximize a linear product between the sequence of observations and the diversity reward under the condition that the minimum performance criterion is satisfied (see e.g., page 4, col. 1 and equation 11).

With respect to dependent claim 11, the modified Ghasemi teaches that the expected return that would be obtained by following the new policy is determined based on extrinsic rewards received from implementing the new policy (see e.g., page 2, col. 2, equation 1).

With respect to dependent claim 12, the modified Ghasemi teaches that the minimum performance criterion requires the expected return that would be obtained by following the new policy to be greater than or equal to a threshold (see e.g., equation 10).

With respect to dependent claim 14, the modified Ghasemi teaches that obtaining a policy set comprises obtaining a first policy through one or more update steps that update the first policy in order to maximize the expected return of the first policy (see e.g., pages 4-5, Frank-Wolfe algorithm, and equation 1).
With respect to dependent claim 15, the modified Ghasemi teaches adding the determined new policy to the policy set; and determining a further new policy based on the policy set, wherein the determining includes one or more optimization steps that aim to maximize a diversity of the further new policy relative to the policy set under the condition that the further new policy satisfies a minimum performance criterion based on an expected return that would be obtained by following the further new policy (see e.g., pages 4-5, Frank-Wolfe algorithm, and equation 1).

With respect to dependent claim 16, the modified Ghasemi teaches implementing the policy set based on a probability distribution over the policy set, wherein the neural network system is configured to select a policy from the policy set according to the probability distribution and implement the selected policy (see e.g., page 2, col. 2, paragraph 5).

With respect to dependent claim 17, the modified Ghasemi teaches that the new policy is determined by solving a constrained Markov decision process (see e.g., page 2, col. 2, paragraph 4).

With respect to dependent claim 18, the modified Ghasemi teaches that the agent is a mechanical agent, the environment is a real-world environment, and the actions are actions taken by the mechanical agent in the real-world environment to satisfy the objective (see e.g., Mavrin, Fig. 1, paras. [0009]-[0014]).

Claim 19 is rejected for similar reasons as discussed above with respect to claim 1. Claim 20 is rejected for similar reasons as discussed above with respect to claim 1.

Conclusion

Any inquiry concerning this communication or earlier communications from the examiner should be directed to PEI YONG WENG, whose telephone number is (571) 270-1660. The examiner can normally be reached Mon.-Fri., 8 am to 5 pm. If attempts to reach the examiner by telephone are unsuccessful, the examiner's supervisor, Matthew Ell, can be reached at (571) 270-3264. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.

Information regarding the status of an application may be obtained from the Patent Application Information Retrieval (PAIR) system. Status information for published applications may be obtained from either Private PAIR or Public PAIR. Status information for unpublished applications is available through Private PAIR only. For more information about the PAIR system, see http://portal.uspto.gov/external/portal. Should you have questions on access to the Private PAIR system, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free).

/PEI YONG WENG/
Primary Examiner, Art Unit 2141
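
For reference, the definitions quoted from Ghasemi above can be restated compactly, together with a general shape for the constrained program that the cited equation 10 constraints point to. The objective below is an assumed illustration of that shape, not Ghasemi's verbatim equation:

```latex
% Definitions as quoted from Ghasemi (restated):
M = (S, A, P, R), \qquad
P : S \times A \times S \to [0,1], \qquad
\sum_{s' \in S} P(s' \mid s, a) = 1 \;\; \forall s \in S,\ a \in A,
\qquad
R : S \times A \times S \to \mathbb{R}, \qquad
\pi : S \times A \to [0,1].

% Assumed general shape of the diverse-planning program over occupancy
% measures \rho_1, \dots, \rho_k (illustrative only, not Ghasemi's eq. 10):
\max_{\rho_1, \dots, \rho_k} \;
\sum_{i=1}^{k} \sum_{s, a} \rho_i(s, a)\, R(s, a)
\; + \; \lambda\, D(\rho_1, \dots, \rho_k)
\quad \text{s.t.} \quad
\rho_i(s, a) \ge 0 \;\; \forall i \in [k],\ s \in S,\ a \in A.
```

Here D measures diversity across the k policies and λ trades it off against cumulative reward.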
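The claim 1 method at the center of both rejections (grow a policy set; learn each new policy to maximize diversity relative to the set, subject to a minimum expected return) can be sketched concretely. Below is a minimal, runnable Python toy on a one-state MDP that uses a Lagrangian relaxation of the performance constraint. The diversity measure, the dual update, and all names are assumptions for illustration; they are not the applicant's or Ghasemi's actual algorithm, which works with occupancy measures and projected gradient ascent / Frank-Wolfe.

```python
# Toy sketch of claim 1's loop: each new policy maximizes diversity relative
# to the existing set, subject to a minimum expected return. Illustrative only.
import numpy as np

rng = np.random.default_rng(0)
R = np.array([1.0, 0.9, 0.2])   # per-action expected rewards (one-state MDP)
MIN_RETURN = 0.7                # the minimum performance criterion

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def expected_return(policy):
    return float(policy @ R)

def diversity(policy, policy_set):
    # Assumed diversity measure: mean L2 distance to policies already in the set.
    return float(np.mean([np.linalg.norm(policy - p) for p in policy_set]))

def learn_new_policy(policy_set, steps=2000, lr=0.1):
    z = rng.normal(size=R.size)  # policy logits
    lam = 1.0                    # Lagrange multiplier on the return constraint
    for _ in range(steps):
        # Lagrangian of: maximize diversity s.t. expected_return >= MIN_RETURN.
        def objective(logits):
            p = softmax(logits)
            return diversity(p, policy_set) + lam * (expected_return(p) - MIN_RETURN)
        # Finite-difference gradient ascent keeps the toy dependency-free.
        g = np.array([(objective(z + 1e-4 * e) - objective(z - 1e-4 * e)) / 2e-4
                      for e in np.eye(z.size)])
        z += lr * g
        # Dual update: tighten lam when the performance constraint is violated.
        lam = max(0.0, lam - lr * (expected_return(softmax(z)) - MIN_RETURN))
    return softmax(z)

policy_set = [softmax(R * 10)]   # first policy: near-greedy on pure return
for _ in range(2):
    policy_set.append(learn_new_policy(policy_set))
for p in policy_set:
    print(np.round(p, 2), "return =", round(expected_return(p), 2))
```

As the set grows, each learned policy shifts probability mass toward actions the existing set neglects, while the multiplier lam keeps the expected return at or above MIN_RETURN.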
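The quoted Mavrin passage describes the mapping roles of the networks (policy network: state to action distribution; value network: state and action to value) rather than their internals. A minimal PyTorch sketch of what such networks conventionally look like, where the layer widths and depths are assumptions and not Mavrin's disclosure:

```python
# Minimal sketches of the network roles described in Mavrin (assumed internals).
import torch
import torch.nn as nn

class PolicyNetwork(nn.Module):
    """Maps a state to a probability distribution over discrete actions."""
    def __init__(self, state_dim: int, n_actions: int, hidden: int = 64):
        super().__init__()
        self.body = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, n_actions),
        )

    def forward(self, state: torch.Tensor) -> torch.Tensor:
        # Softmax turns logits into an action distribution.
        return torch.softmax(self.body(state), dim=-1)

class ValueNetwork(nn.Module):
    """Maps a (state, action) pair to a scalar value estimate."""
    def __init__(self, state_dim: int, action_dim: int, hidden: int = 64):
        super().__init__()
        self.body = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, state: torch.Tensor, action: torch.Tensor) -> torch.Tensor:
        return self.body(torch.cat([state, action], dim=-1))

# Example: a 4-dimensional state, 3 discrete actions.
probs = PolicyNetwork(4, 3)(torch.randn(1, 4))  # shape (1, 3), rows sum to 1
```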

Prosecution Timeline

Aug 02, 2023
Application Filed
Feb 26, 2026
Non-Final Rejection — §101, §103 (current)

Precedent Cases

Applications involving similar technology that were granted by this same examiner

Patent 12602594
DIRECTED TRAJECTORIES THROUGH COMMUNICATION DECISION TREE USING ITERATIVE ARTIFICIAL INTELLIGENCE
Granted Apr 14, 2026 • 2y 5m to grant

Patent 12579468
TRAINING DATA SCREENING DEVICE, ROBOT SYSTEM, AND TRAINING DATA SCREENING METHOD
Granted Mar 17, 2026 • 2y 5m to grant

Patent 12572845
INTELLIGENT MACHINE-LEARNING MODEL CATALOG
Granted Mar 10, 2026 • 2y 5m to grant

Patent 12561608
APPARATUS AND METHODS FOR PREDICTING SLIPPING EVENTS FOR MICROMOBILITY VEHICLES
Granted Feb 24, 2026 • 2y 5m to grant

Patent 12555665
HOME EXERCISE PLAN PREDICTION
Granted Feb 17, 2026 • 2y 5m to grant
Study what changed to get past this examiner. Based on 5 most recent grants.


Prosecution Projections

Expected OA Rounds: 1-2
Grant Probability: 79%
With Interview: 99% (+23.1%)
Median Time to Grant: 3y 3m
PTA Risk: Low
Based on 637 resolved cases by this examiner. Grant probability derived from career allow rate.

Free tier: 3 strategy analyses per month