DETAILED ACTION

This action is in response to the claims filed 29 July 2024 for application 18/364,601, filed 03 August 2023. Claims 1-20 are currently pending.

Notice of Pre-AIA or AIA Status

The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Claim Rejections - 35 USC § 102

In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis (i.e., changing from AIA to pre-AIA) for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.

The following is a quotation of the appropriate paragraphs of 35 U.S.C. 102 that form the basis for the rejections under this section made in this Office action:

A person shall be entitled to a patent unless -

(a)(1) the claimed invention was patented, described in a printed publication, or in public use, on sale, or otherwise available to the public before the effective filing date of the claimed invention.

Claims 1, 3-12, and 14-20 are rejected under 35 U.S.C. 102(a)(1) as being anticipated by Yu et al. (US 2019/0228309 A1).

Regarding claims 1 and 12, Yu discloses:

An apparatus for training a parametric policy in dependence on a proposal distribution, comprising one or more processors configured to repeatedly perform training operations comprising:

forming, in dependence on the proposal distribution, a proposal ("Let {p1, p2, ..., pr} be a set of r≥2 target distributions and {q1, q2, ..., qm} a set of m≥2 proposal distributions (which correspond to candidate policies, πp, and behavior policies, πq, in the RL setting, respectively). Note this problem setting is different from traditional single or multiple importance sampling because multiple target distributions (r≥2) are considered. All target and proposal distributions are assumed distinct. For 1≤j≤r, 1≤t≤m, (EQ 7) is the importance sampling estimator for the jth target distribution using the ith sample generated by the tth proposal distribution." [0413]);

inputting the proposal to the parametric policy so as to form an output state from the parametric policy responsive to the proposal ("where k* is an optimal way to distribute n samples over m proposal distributions such that the variances of the estimates are most similar (i.e., the average distance between var(μj,k) and their mean be minimized). If the set of target distributions and the set of proposal distributions are both known in advance, computing k* can be solved analytically. However, in the FSI context, the set of promising candidate target distributions to be estimated and evaluated by a safety test are unknown before the collection of a total of n samples from the set of available proposal distributions which are already confirmed by the safety test in the past policy improvement iteration. Under such uncertainty, it is infeasible to make an optimal decision on the sample size for each available proposal distribution according to the objective function in Equation (7). Given the objective is convex, the quality of a solution vector k depends on its distance to an unknown optimal vector k*. The closer the distance, the better uniformity of variances it produces. Lemma 1 below provides a tight upper bound on the distance from a given vector k to any possible solution to the objective in Equation (7)." [0421]; see also [0451], the policy is parametric);

estimating a loss between the output state and a preferred state responsive to the proposal ([0421], difference between vectors);

forming, by means of an adaptation algorithm and in dependence on the loss, a policy adaptation ("We address the challenge of effective exploration while maintaining good performance in policy gradient methods. As a solution, we propose diverse exploration (DE) via conjugate policies. DE learns and deploys a set of conjugate policies which can be conveniently generated as a byproduct of conjugate gradient descent. We provide both theoretical and empirical results showing the effectiveness of DE at achieving exploration, improving policy performance, and the advantage of DE over exploration by random policy perturbations." [0480]);

applying the policy adaptation to the parametric policy to form an adapted policy ([0480], quoted above);

forming, by means of the adapted policy, an estimate of variance in the policy adaptation ("Increasing the KL divergence between perturbed policies reduces the variance of the perturbed gradient estimate. Conjugate vectors maximize pairwise KL divergence among a constrained number of perturbations." [0482]); and

adapting the proposal distribution in dependence on the estimate of variance so as to reduce the variance of policy adaptations formed on subsequent iterations of the training operations ([0482], quoted above).
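For context on the multiple importance sampling framework quoted from [0413] (Yu's Equation (7) is not reproduced in this record and is not reconstructed here), the following is a minimal Python sketch of a generic importance sampling estimator of the kind described, assuming one-dimensional Gaussian target and proposal distributions. The function names, distributions, and values are illustrative assumptions and are not taken from Yu.

    import numpy as np
    from scipy.stats import norm

    def is_estimate(f, target, proposal, n, rng):
        # Draw n samples from the proposal q_t and reweight by p_j/q_t,
        # giving an unbiased estimate of E_{p_j}[f(x)].
        x = proposal.rvs(size=n, random_state=rng)
        w = target.pdf(x) / proposal.pdf(x)  # importance weights
        return np.mean(w * f(x))

    rng = np.random.default_rng(0)
    target = norm(loc=1.0, scale=1.0)    # stands in for one target p_j
    proposal = norm(loc=0.0, scale=2.0)  # stands in for one proposal q_t (broad support)
    print(is_estimate(lambda x: x, target, proposal, 100_000, rng))  # approx. 1.0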
Regarding claims 3 and 14, Yu discloses: The apparatus according to claim 1, wherein the proposal distribution is a parametric proposal distribution ("In the context of PG methods, a policy π is a distribution over the action space conditioned by the current state and parameterized by a vector θ. That is, an action a is drawn from the distribution a ~ π(⋅|s, θ), given state s and parameters θ." [0451]).

Regarding claims 4 and 15, Yu discloses: The apparatus according to claim 3, wherein the step of adapting the proposal distribution comprises adapting one or more parameters of the proposal distribution ([0495] discloses perturbations based on parameters).

Regarding claims 5 and 16, Yu discloses: The apparatus according to claim 1, wherein the training operations further comprise: making a first estimation of noise in the policy adaptation; making a second estimation of the extent to which that noise is dependent on the proposal; and adapting the proposal distribution in dependence on the second estimation ([0021]-[0025] disclose a noisy space for exploration of policies).
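For context on the parametric policy a ~ π(⋅|s, θ) quoted from [0451], the following is a minimal Python sketch of one such policy, assuming a linear-Gaussian form; Yu does not limit the policy to this form, and all names and values here are illustrative assumptions.

    import numpy as np

    class GaussianPolicy:
        # Illustrative parametric policy: actions are drawn from
        # a ~ N(theta . s, sigma^2), i.e., a distribution over the action
        # space conditioned on the state s and parameterized by theta.
        def __init__(self, theta, sigma=0.5):
            self.theta = np.asarray(theta, dtype=float)
            self.sigma = float(sigma)

        def sample(self, s, rng):
            # Stochastically sample an action given the current state.
            return rng.normal(self.theta @ np.asarray(s, dtype=float), self.sigma)

    rng = np.random.default_rng(1)
    pi = GaussianPolicy(theta=[0.3, -0.2])
    print(pi.sample(s=[1.0, 2.0], rng=rng))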
Regarding claims 6 and 17, Yu discloses: The apparatus according to claim 1, wherein the proposal distribution is adapted by a gradient variance estimator taking an estimate of variance in the policy adaptation as input ([0480], quoted above).

Regarding claims 7 and 18, Yu discloses: The apparatus according to claim 6, wherein the gradient variance estimator is a stochastic estimator ("A method of learning and deploying a set of behavior policies for an artificial agent, selected from a set of behavior policies, each having a statistically expected return no worse than a lower bound of policy performance which excludes a portion of the set of behavior policies, comprising iteratively improving a behavior policy for each iteration of policy improvement, employing a diverse exploration strategy which strives for behavior diversity in a space of stochastic policies by deploying a diverse set comprising a plurality of behavior policies which are ensured as being safe during each iteration of policy improvement and assessing performance of the artificial agent." Yu, claim 1).

Regarding claims 8 and 19, Yu discloses: The apparatus according to claim 1, wherein the proposal is formed by stochastically sampling the proposal distribution (Yu, claim 1).

Regarding claims 9 and 20, Yu discloses: The apparatus according to claim 1, wherein the adaptation algorithm is such as to sample a trajectory in a manner such as to inhibit variance of the adaptation over successive iterations ([0439], equation reproduced as an image in the original document; see also [0482]).

Regarding claim 10, Yu discloses: The apparatus according to claim 1, wherein the adaptation algorithm is such as to form policy gradients and to form the adaptation by stochastic optimisation of the policy gradients (Yu, claim 1; see also [0499]).

Regarding claim 11, Yu discloses: The apparatus according to claim 1, wherein the parametric policy comprises a neural network model ("The method according to claim 1, each respective behavior policy represents a trained first artificial neural network, and each respective behavior policy controls an agent comprising a second artificial neural network." Yu, claim 10).
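For context on the stochastic gradient variance estimation at issue in claims 6, 7, and 10, the following is a minimal REINFORCE-style Python sketch that forms per-sample policy gradients and a sample-based estimate of their variance. The linear-Gaussian policy and toy return signal are illustrative assumptions, not Yu's method.

    import numpy as np

    rng = np.random.default_rng(2)
    theta, sigma = np.array([0.5]), 1.0

    def grad_logp(s, a):
        # For a Gaussian policy N(theta . s, sigma^2):
        # grad_theta log pi(a|s) = (a - theta . s) * s / sigma^2
        s = np.asarray(s, dtype=float)
        return (a - theta @ s) * s / sigma ** 2

    # Simulated one-step trajectories: each yields one REINFORCE
    # gradient sample g_k = grad log pi(a|s) * R.
    grads = []
    for _ in range(5000):
        s = rng.normal(size=1)
        a = rng.normal(theta @ s, sigma)  # stochastically sampled action
        R = -(a - 1.0) ** 2               # toy return signal
        grads.append(grad_logp(s, a) * R)
    grads = np.array(grads)

    print("policy-gradient estimate:", grads.mean(axis=0))
    print("variance of the estimator:", grads.var(axis=0, ddof=1))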
Claim Rejections - 35 USC § 103

In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis (i.e., changing from AIA to pre-AIA) for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.

The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:

A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

The factual inquiries for establishing a background for determining obviousness under 35 U.S.C. 103 are summarized as follows:
1. Determining the scope and contents of the prior art.
2. Ascertaining the differences between the prior art and the claims at issue.
3. Resolving the level of ordinary skill in the pertinent art.
4. Considering objective evidence present in the application indicating obviousness or nonobviousness.

This application currently names joint inventors. In considering patentability of the claims, the examiner presumes that the subject matter of the various claims was commonly owned as of the effective filing date of the claimed invention(s) absent any evidence to the contrary. Applicant is advised of the obligation under 37 CFR 1.56 to point out the inventor and effective filing dates of each claim that was not commonly owned as of the effective filing date of the later invention in order for the examiner to consider the applicability of 35 U.S.C. 102(b)(2)(C) for any potential 35 U.S.C. 102(a)(2) prior art against the later invention.

Claims 2 and 13 are rejected under 35 U.S.C. 103 as being unpatentable over Yu in view of Rückstieß et al. (State-Dependent Exploration for Policy Gradient Methods).

Regarding claims 2 and 13, Yu discloses: The apparatus according to claim 1, wherein the proposal is a sequence of … (see [0413]; "Theorem 4 makes clear that randomly generated perturbations will be sub-optimal with high probability with respect to the Equation (17) because the optimal solution is uniquely the top k conjugate vectors. Identifying the top k conjugate vectors in each iteration of policy improvement will require significant computation when the FIM is large. Fortunately, there exist computationally efficient methods of generating sequences of conjugate vectors such as conjugate gradient descent (Wright and Nocedal 1999) (to be discussed), although they may not provide the top k. From Theorem 2, it is observed that when all conjugate vectors have the same F-norm, then any set of k conjugate vectors maximize the Equation (17). If the perturbation radius (the maximum KL divergence a perturbation may have from the main policy) is bounded as in (Plappert et al. 2018), DE achieves a computationally efficient, optimal solution to the Equation (17)." [0508]).

Yu arguably discloses a random sequence; however, Rückstieß explicitly teaches: a sequence of pseudo-random numbers (p. 239, §3, ¶2; passage reproduced as an image in the original document).

Yu and Rückstieß are in the same field of endeavor of reinforcement learning and are analogous art. Yu discloses a policy adaptation method. Rückstieß teaches the use of pseudo-random sampling to create a sequence of numbers. It would have been obvious to one of ordinary skill in the art before the effective filing date to modify the proposal with randomness as disclosed by Yu to utilize the known pseudo-random sequence generation as taught by Rückstieß, yielding the predictable result of exploratory randomness.
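For context on the pseudo-random sequence teaching attributed to Rückstieß, the following is a minimal Python sketch in the spirit of state-dependent exploration: a pseudo-random number generator supplies one parameter perturbation per episode, so exploratory actions vary deterministically with state within that episode. The linear policy form and all values are illustrative assumptions, not the reference's exact method.

    import numpy as np

    rng = np.random.default_rng(3)   # pseudo-random number generator
    theta = np.array([0.3, -0.2])

    def episode_actions(states, sigma=0.1):
        # Draw one pseudo-random parameter perturbation per episode, so
        # the exploratory action a = (theta + eps) . s is a deterministic
        # function of the state within that episode.
        eps = rng.normal(0.0, sigma, size=theta.shape)
        return [(theta + eps) @ np.asarray(s, dtype=float) for s in states]

    print(episode_actions([[1.0, 0.0], [0.5, 0.5], [0.0, 1.0]]))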
Conclusion

Any inquiry concerning this communication or earlier communications from the examiner should be directed to ERIC NILSSON, whose telephone number is (571) 272-5246. The examiner can normally be reached M-F: 7-3. Examiner interviews are available via telephone, in person, and video conferencing using a USPTO-supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.

If attempts to reach the examiner by telephone are unsuccessful, the examiner's supervisor, James Trujillo, can be reached at (571) 272-3677. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.

Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.

/ERIC NILSSON/
Primary Examiner, Art Unit 2151