Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
This Office Action is in response to the amendment filed on November 12, 2025.
Claims 1 and 11 have been amended.
Claims 4 and 14 have been cancelled. Claims 24 and 25 have been added.
The rejections from the prior correspondence that are not restated herein are withdrawn.
Continued Examination Under 37 CFR 1.114
A request for continued examination under 37 CFR 1.114, including the fee set forth in 37 CFR 1.17(e), was filed in this application after final rejection. Since this application is eligible for continued examination under 37 CFR 1.114, and the fee set forth in 37 CFR 1.17(e) has been timely paid, the finality of the previous Office action has been withdrawn pursuant to 37 CFR 1.114. Applicant's submission filed on 11/12/2025 has been entered.
Response to Arguments
With respect to the rejections under 35 U.S.C. § 103:
Applicant’s arguments with respect to claims 1, 2, 5 – 12, 15 – 20, 24 and 25 have been considered but are moot because the new ground of rejection does not rely on any reference applied in the prior rejection of record for any teaching or matter specifically challenged in the argument.
Claim Rejections - 35 USC § 103
In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis (i.e., changing from AIA to pre-AIA) for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.
The factual inquiries for establishing a background for determining obviousness under 35 U.S.C. 103 are summarized as follows:
1. Determining the scope and contents of the prior art.
2. Ascertaining the differences between the prior art and the claims at issue.
3. Resolving the level of ordinary skill in the pertinent art.
4. Considering objective evidence present in the application indicating obviousness or nonobviousness.
This application currently names joint inventors. In considering patentability of the claims the examiner presumes that the subject matter of the various claims was commonly owned as of the effective filing date of the claimed invention(s) absent any evidence to the contrary. Applicant is advised of the obligation under 37 CFR 1.56 to point out the inventor and effective filing dates of each claim that was not commonly owned as of the effective filing date of the later invention in order for the examiner to consider the applicability of 35 U.S.C. 102(b)(2)(C) for any potential 35 U.S.C. 102(a)(2) prior art against the later invention.
Claims 1, 2, 7, 10 – 12, 17, 20, 24 and 25 are rejected under 35 U.S.C. 103 as being unpatentable over Hasenclever (US 20200104685 A1) in view of Wang (NPL: Reinforcement learning approach to learning human experience in tuning cavity filters) further in view of Baughman (US 20200342307 A1).
Regarding claim 1, Hasenclever teaches gathering state-action pair data from an expert policy (See e.g. [0010], The state-action Jacobian (e.g. the Jacobian matrix of all first order partial derivatives for the vector-valued function defined by the expert policy)) (See e.g. [0016], Given an expert policy, the mean action of the expert in state s may be written as p.sub.E(s). The nominal trajectory of a policy refers to the sequence of nominal state action pairs {s*.sub.t, a*.sub.t}.sub.1 . . . T obtained by executing μ.sub.E(s) (the mean action of the expert in state s) recursively from an initial point s*.sub.0.)
applying imitation learning to yield a cloned policy based on the gathered state-action pair data from the expert policy; (See e.g. [0019], A policy known herein as linear-feedback policy cloning (LFPC) [applying imitation learning to yield a cloned policy]) (See e.g. [0016], Given an expert policy, the mean action of the expert in state s may be written as p.sub.E(s). The nominal trajectory of a policy refers to the sequence of nominal state action pairs {s*.sub.t, a*.sub.t}.sub.1 . . . T obtained by executing μ.sub.E(s) (the mean action of the expert in state s) recursively from an initial point s*.sub.0.) (See e.g. [0007], The linear-feedback-stabilized policy may be based on the state-action Jacobian. [based on the gathered state-action pair data from the expert policy])
initializing a reinforcement learning (RL) agent using the cloned policy; (See e.g. [0083], Linear Feedback Policy Cloning (LFPC). Advantageously, LFPC is able to perform as well as behavioral cloning methods while using considerably fewer expert rollouts.) (See e.g. [0131], Any reinforcement learning system may be used including, for example: a policy-based system)
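(Examiner's note: for clarity of the record, the following is a minimal, hypothetical sketch of behavioral cloning from gathered state-action pairs followed by initializing a policy from the cloned parameters. The array shapes, the linear least-squares fit, and the names states, expert_actions, and cloned_policy are the examiner's illustrative assumptions and are not taken from Hasenclever's LFPC method.)

```python
# Illustrative sketch only (hypothetical names); a supervised fit of a policy to
# expert state-action pairs, not the LFPC method of Hasenclever.
import numpy as np

rng = np.random.default_rng(1)
states = rng.normal(size=(200, 16))                   # gathered expert states
expert_actions = states @ rng.normal(size=(16, 7))    # gathered expert actions (stand-in)

# Least-squares "cloned policy": action = state @ W, fit to the expert pairs.
W, *_ = np.linalg.lstsq(states, expert_actions, rcond=None)

def cloned_policy(state):
    return state @ W

# An RL agent would then be initialized from this cloned policy's parameters.
print(np.allclose(cloned_policy(states[0]), expert_actions[0], atol=1e-6))
```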
Hasenclever does not teach A method for tuning a cavity filter comprising a set of N adjustable screws, where N is an integer greater than 1 and the set of N screws includes at least a first screw and a second screw, the method comprising: for each one of the N screws, receiving from the RL agent information indicating a screw adjustment for the screw,
Wang teaches A method for tuning a cavity filter comprising a set of N adjustable screws, where N is an integer greater than 1 and the set of N screws includes at least a first screw and a second screw, the method comprising: (See e.g. [P2148:S4A:C2], The filter example we used was a combiner produced by a Chinese communication equipment manufacturer. The combiner had four channels interacted with each other and we only used one channel which was separated from the others and had seven tuning screws [N is an integer greater than 1 and the set of N screws includes at least a first screw and a second screw] (four resonators and three couplings).)
for each one of the N screws, receiving from the RL agent information indicating a screw adjustment for the screw, thereby obtaining at least i) first screw adjustment information indicating a screw adjustment for the first screw and ii) second screw adjustment information indicating a screw adjustment for the second screw; (See e.g. [P2147:S3A:C2], The essence of RL is the interaction with the environment. We need the feedback information returned from the environment to determine the next action. For Atari game experiments, one can simulate the training process in the ALE and the scores are easily returned from the environment. But for our tuning task, it is a totally different story. It is the best to develop an advanced robotic tuning system which not only handles the tuning actions smoothly and accurately, but also automatically returns all the information during the tuning process. [information indicating a screw adjustment for the screw]) (See e.g. [P2149:S4A:C1], We used the simulated environment described in Section III to produce the possible state for each action. [for each one of the N screws]) (Examiner’s notes: Wang’s prior art teaches seven tuning screws that all have their own information and adjustment so this would incorporate the first and second screws in the applicant’s limitation)
Accordingly, it would have been obvious to a person having ordinary skill in the art before the effective filing date of the claimed invention, having the teachings of Hasenclever and Wang before them, to include Wang's multiple adjustable tuning screws for selection in Hasenclever's ML-based cloned policy. One would have been motivated to make such a combination in order to increase the effectiveness in finding optimal screw positions, as suggested by Wang (NPL: Reinforcement learning approach to learning human experience in tuning cavity filters) (P2149:S4C:C2).
Hasenclever and Wang do not teach obtaining state data; obtaining N reward values each one of the N reward values [being associated with one of the N screws], wherein obtaining the N reward values comprises: a first feeding step that comprises feeding into a neural network the state data and the first [screw adjustment] information [indicating the screw adjustment for the first screw]; after feeding the state data and the first [screw adjustment] information [indicating the screw adjustment for the first screw] into the neural network, obtaining a first output from the neural network, wherein the first output comprises the reward value for the first [screw]; a second feeding step that comprises feeding into the neural network the state data and the second [screw adjustment] information [indicating the screw adjustment for the second screw;] after feeding the state data and the second [screw adjustment] information indicating [the screw adjustment for the second screw] into the neural network, obtaining a second output from the neural network, wherein the second output comprises the reward value for the second [screw]; selecting one of the N [screws] based on the obtained N reward values, wherein the selecting comprises selecting the first [screw] as a result of determining that the reward value for the first [screw] is greater than or equal to the other N-1 reward values;
Baughman teaches obtaining state data; (See e.g. [0052], Deep reinforcement learning uses a series of artificial neural network architectures to map a state to actions with reward probabilities. In both of these scenarios, captured multimedia data of agent actions is not used to generate fair, non-biased agent actions. All agent actions within and to an environment should be fair, non-biased, and ethical.) (See e.g. [0066], multimedia analysis process 400 inputs multimedia data 402 [obtaining state data], which corresponds to agent 1 404 performing a set of one or more actions to accomplish a task in an environment containing a set of one or more items, into artificial neural network 406 for agent bias analysis.)
obtaining N reward values each one of the N reward values [being associated with one of the N screws], wherein obtaining the N reward values comprises: (See e.g. [0035], Reward 236 [reward values] may be positive reward, neutral reward, or negative reward for agent 222 performing action) (See e.g. [0067], multimedia analysis process 400 inputs multimedia data 414, which corresponds to agent N 416 performing a set of actions to accomplish a same or similar task as agent 1 404 in the same or a different environment, into artificial neural network 418 for agent bias analysis.)
a first feeding step that comprises feeding into a neural network the state data and the first [screw adjustment] information [indicating the screw adjustment for the first screw]; (See e.g. [0066], multimedia analysis process 400 inputs multimedia data 402, which corresponds to agent 1 404 performing a set of one or more actions to accomplish a task [the first information] in an environment containing a set of one or more items, into artificial neural network 406 for agent bias analysis.)
after feeding the state data and the first [screw adjustment] information [indicating the screw adjustment for the first screw] into the neural network, obtaining a first output from the neural network, wherein the first output comprises the reward value for the first [screw]; (See e.g. [0066], multimedia analysis process 400 inputs multimedia data 402, which corresponds to agent 1 404 performing a set of one or more actions to accomplish a task [the first information] in an environment containing a set of one or more items, into artificial neural network 406 for agent bias analysis.) (See e.g. [0058], The output of the artificial neural network is a reward [wherein the first output comprises the reward value] probability that maps to an agent action.)
a second feeding step that comprises feeding into the neural network the state data and the second [screw adjustment] information [indicating the screw adjustment for the second screw;] (See e.g. [0067], multimedia analysis process 400 inputs multimedia data 408, which corresponds to agent 2 410 performing a set of actions to accomplish a same or similar task [the second information] as agent 1 404 in the same or a different environment, into artificial neural network 412 for agent bias analysis.)
after feeding the state data and the second [screw adjustment] information indicating [the screw adjustment for the second screw] into the neural network, obtaining a second output from the neural network, wherein the second output comprises the reward value for the second [screw]; (See e.g. [0067], multimedia analysis process 400 inputs multimedia data 408, which corresponds to agent 2 410 performing a set of actions to accomplish a same or similar task [the second information] as agent 1 404 in the same or a different environment, into artificial neural network 412 for agent bias analysis.) (See e.g. [0058], The output of the artificial neural network is a reward [wherein the first output comprises the reward value] probability that maps to an agent action.)
selecting one of the N [screws] based on the obtained N reward values, wherein the selecting comprises selecting the first [screw] as a result of determining that the reward value for the first [screw] is greater than or equal to the other N-1 reward values; (See e.g. [0057], illustrative embodiments use backpropagation to determine what is the best action to take by an agent in an environment to maximize reward by decreasing bias.)
Accordingly, it would have been obvious to a person having ordinary skill in the art before the effective filing date of the claimed invention, having the teachings of Hasenclever, Wang and Baughman before them, to include Baughman's reward-based agent action selection in Hasenclever and Wang's model, so that better actions are selected based on the reward values received. One would have been motivated to make such a combination in order to improve the performance on tasks over time, as suggested by Baughman (US 20200342307 A1) (0003).
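(Examiner's note: the following is a minimal, hypothetical sketch of the claimed flow as mapped above — feeding the state data together with each screw's adjustment information into a single neural network, obtaining one reward value per screw, and selecting the screw whose reward value is greater than or equal to the other N-1 reward values. The network weights, dimensions, and the names reward_net, state, and adjustments are the examiner's illustrative assumptions and are not taken from any cited reference.)

```python
# Illustrative sketch only (hypothetical names); assumes a generic feed-forward
# reward network. Not taken from Hasenclever, Wang, or Baughman.
import numpy as np

rng = np.random.default_rng(0)

N = 7                      # number of adjustable tuning screws (e.g., Wang's example)
STATE_DIM = 16             # dimensionality of the measured filter state (assumed)

# Hypothetical reward network: one hidden layer mapping (state, screw index, adjustment) -> scalar reward.
W1 = rng.normal(size=(STATE_DIM + 2, 32))
W2 = rng.normal(size=32)

def reward_net(state, screw_index, adjustment):
    """Feed the state data and one screw's adjustment information into the network."""
    x = np.concatenate([state, [screw_index, adjustment]])
    h = np.tanh(x @ W1)
    return float(h @ W2)

state = rng.normal(size=STATE_DIM)     # obtained state data
adjustments = rng.normal(size=N)       # per-screw adjustment information from the RL agent

# One feeding step per screw yields N reward values.
rewards = [reward_net(state, i, adjustments[i]) for i in range(N)]

# Select the screw whose reward is greater than or equal to the other N-1 reward values.
selected = int(np.argmax(rewards))
print(f"selected screw {selected} with reward {rewards[selected]:.3f}")
```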
Regarding claim 2, Hasenclever, Wang and Baughman teach the method of claim 1. Hasenclever further teaches wherein the imitation learning comprises a behavioral cloning technique. (See e.g. [0019], A policy known herein as linear-feedback policy cloning (LFPC), described below, may be used to ensure that the student retains expert robustness properties. Behavioural cloning [behavioral cloning technique] may refer to the optimization)
Regarding claim 7, Hasenclever, Wang and Baughman teach the method of claim 1. Hasenclever further teaches wherein the RL agent comprises the Deep Deterministic Policy Gradient (DDPG) technique. (See e.g. [0131], a continuous control reinforcement learning system such as DDPG)
Regarding claim 10, Hasenclever, Wang and Baughman teach the method of claim 1. Hasenclever further teaches performing the one or more actions of the output of the RL agent. (See e.g. [0110], Thus a reinforcement learning system 12 e.g. an action selection neural network, may learn an action selection policy in which an output of a reinforcement learning action selection neural network is used [further comprising performing the one or more actions of the output of the RL agent.] to select a (multidimensional) latent variable at time t.)
Regarding claim 11, Hasenclever teaches a data storage system; (See e.g. [0138], Alternatively or in addition, the program instructions can be encoded on an artificially generated propagated signal, e.g., a machine-generated electrical, optical, or electromagnetic signal, that is generated to encode information for transmission to suitable receiver apparatus for execution by a data processing apparatus. The computer storage medium can be a machine-readable storage device, a machine-readable storage substrate, a random or serial access memory device, or a combination of one or more of them. [a data storage system] The computer storage medium is not, however, a propagated signal.)
a data processing apparatus comprising a processor, wherein the data processing apparatus is coupled to the data storage system, and the node is configured to perform a method comprising: (See e.g. [0138], Alternatively or in addition, the program instructions can be encoded on an artificially generated propagated signal, e.g., a machine-generated electrical, optical, or electromagnetic signal, that is generated to encode information for transmission to suitable receiver apparatus for execution by a data processing apparatus. [data processing apparatus is coupled to the data storage system] The computer storage medium can be a machine-readable storage device, a machine-readable storage substrate, a random or serial access memory device, or a combination of one or more of them. The computer storage medium is not, however, a propagated signal.)
applying imitation learning to yield a cloned policy based on the gathered state-action pair data from the expert policy; (See e.g. [0019], A policy known herein as linear-feedback policy cloning (LFPC) [applying imitation learning to yield a cloned policy]) (See e.g. [0016], Given an expert policy, the mean action of the expert in state s may be written as p.sub.E(s). The nominal trajectory of a policy refers to the sequence of nominal state action pairs {s*.sub.t, a*.sub.t}.sub.1 . . . T obtained by executing μ.sub.E(s) (the mean action of the expert in state s) recursively from an initial point s*.sub.0.) (See e.g. [0007], The linear-feedback-stabilized policy may be based on the state-action Jacobian. [based on the gathered state-action pair data from the expert policy])
initializing a reinforcement learning (RL) agent using the cloned policy; (See e.g. [0083], Linear Feedback Policy Cloning (LFPC). Advantageously, LFPC is able to perform as well as behavioral cloning methods while using considerably fewer expert rollouts.) (See e.g. [0131], Any reinforcement learning system may be used including, for example: a policy-based system)
Hasenclever does not teach A node for tuning a cavity filter comprising a set of N adjustable screws, where N is an integer greater than 1 and the set of N screws includes at least a first screw and a second screw, the node comprising: for each one of the N screws, receiving from the RL agent information indicating a screw adjustment for the screw,
Wang teaches A node for tuning a cavity filter comprising a set of N adjustable screws, where N is an integer greater than 1 and the set of N screws includes at least a first screw and a second screw, the node comprising: (See e.g. [P2148:S4A:C2], The filter example we used was a combiner produced by a Chinese communication equipment manufacturer. The combiner had four channels interacted with each other and we only used one channel which was separated from the others and had seven tuning screws [N is an integer greater than 1 and the set of N screws includes at least a first screw and a second screw] (four resonators and three couplings).)
for each one of the N screws, receiving from the RL agent information indicating a screw adjustment for the screw, thereby obtaining at least i) first screw adjustment information indicating a screw adjustment for the first screw and ii) second screw adjustment information indicating a screw adjustment for the second screw; (See e.g. [P2147:S3A:C2], The essence of RL is the interaction with the environment. We need the feedback information returned from the environment to determine the next action. For Atari game experiments, one can simulate the training process in the ALE and the scores are easily returned from the environment. But for our tuning task, it is a totally different story. It is the best to develop an advanced robotic tuning system which not only handles the tuning actions smoothly and accurately, but also automatically returns all the information during the tuning process. [information indicating a screw adjustment for the screw]) (See e.g. [P2149:S4A:C1], We used the simulated environment described in Section III to produce the possible state for each action. [for each one of the N screws]) (Examiner’s notes: Wang’s prior art teaches seven tuning screws that all have their own information and adjustment so this would incorporate the first and second screws in the applicant’s limitation)
Accordingly, it would have been obvious to a person having ordinary skill in the art before the effective filing date of the claimed invention, having the teachings of Hasenclever and Wang before them, to include Wang's multiple adjustable tuning screws for selection in Hasenclever's ML-based cloned policy. One would have been motivated to make such a combination in order to increase the effectiveness in finding optimal screw positions, as suggested by Wang (NPL: Reinforcement learning approach to learning human experience in tuning cavity filters) (P2149:S4C:C2).
Hasenclever and Wang do not teach obtaining state data; obtaining N reward values each one of the N reward values [being associated with one of the N screws], wherein obtaining the N reward values comprises: a first feeding step that comprises feeding into a neural network the state data and the first [screw adjustment] information [indicating the screw adjustment for the first screw]; after feeding the state data and the first [screw adjustment] information [indicating the screw adjustment for the first screw] into the neural network, obtaining a first output from the neural network, wherein the first output comprises the reward value for the first [screw]; a second feeding step that comprises feeding into the neural network the state data and the second [screw adjustment] information [indicating the screw adjustment for the second screw;] after feeding the state data and the second [screw adjustment] information indicating [the screw adjustment for the second screw] into the neural network, obtaining a second output from the neural network, wherein the second output comprises the reward value for the second [screw]; selecting one of the N [screws] based on the obtained N reward values, wherein the selecting comprises selecting the first [screw] as a result of determining that the reward value for the first [screw] is greater than or equal to the other N-1 reward values;
Baughman teaches obtaining state data; (See e.g. [0052], Deep reinforcement learning uses a series of artificial neural network architectures to map a state to actions with reward probabilities. In both of these scenarios, captured multimedia data of agent actions is not used to generate fair, non-biased agent actions. All agent actions within and to an environment should be fair, non-biased, and ethical.) (See e.g. [0066], multimedia analysis process 400 inputs multimedia data 402 [obtaining state data], which corresponds to agent 1 404 performing a set of one or more actions to accomplish a task in an environment containing a set of one or more items, into artificial neural network 406 for agent bias analysis.)
obtaining N reward values each one of the N reward values [being associated with one of the N screws], wherein obtaining the N reward values comprises: (See e.g. [0035], Reward 236 [reward values] may be positive reward, neutral reward, or negative reward for agent 222 performing action) (See e.g. [0067], multimedia analysis process 400 inputs multimedia data 414, which corresponds to agent N 416 performing a set of actions to accomplish a same or similar task as agent 1 404 in the same or a different environment, into artificial neural network 418 for agent bias analysis.)
a first feeding step that comprises feeding into a neural network the state data and the first [screw adjustment] information [indicating the screw adjustment for the first screw]; (See e.g. [0066], multimedia analysis process 400 inputs multimedia data 402, which corresponds to agent 1 404 performing a set of one or more actions to accomplish a task [the first information] in an environment containing a set of one or more items, into artificial neural network 406 for agent bias analysis.)
after feeding the state data and the first [screw adjustment] information [indicating the screw adjustment for the first screw] into the neural network, obtaining a first output from the neural network, wherein the first output comprises the reward value for the first [screw]; (See e.g. [0066], multimedia analysis process 400 inputs multimedia data 402, which corresponds to agent 1 404 performing a set of one or more actions to accomplish a task [the first information] in an environment containing a set of one or more items, into artificial neural network 406 for agent bias analysis.) (See e.g. [0058], The output of the artificial neural network is a reward [wherein the first output comprises the reward value] probability that maps to an agent action.)
a second feeding step that comprises feeding into the neural network the state data and the second [screw adjustment] information [indicating the screw adjustment for the second screw;] (See e.g. [0067], multimedia analysis process 400 inputs multimedia data 408, which corresponds to agent 2 410 performing a set of actions to accomplish a same or similar task [the second information] as agent 1 404 in the same or a different environment, into artificial neural network 412 for agent bias analysis.)
after feeding the state data and the second [screw adjustment] information indicating [the screw adjustment for the second screw] into the neural network, obtaining a second output from the neural network, wherein the second output comprises the reward value for the second [screw]; (See e.g. [0067], multimedia analysis process 400 inputs multimedia data 408, which corresponds to agent 2 410 performing a set of actions to accomplish a same or similar task [the second information] as agent 1 404 in the same or a different environment, into artificial neural network 412 for agent bias analysis.) (See e.g. [0058], The output of the artificial neural network is a reward [wherein the first output comprises the reward value] probability that maps to an agent action.)
selecting one of the N [screws] based on the obtained N reward values, wherein the selecting comprises selecting the first [screw] as a result of determining that the reward value for the first [screw] is greater than or equal to the other N-1 reward values; (See e.g. [0057], illustrative embodiments use backpropagation to determine what is the best action to take by an agent in an environment to maximize reward by decreasing bias.)
Accordingly, it would have been obvious to a person having ordinary skill in the art before the effective filing date of the claimed invention, having the teachings of Hasenclever, Wang and Baughman before them, to include Baughman's reward-based agent action selection in Hasenclever and Wang's model, so that better actions are selected based on the reward values received. One would have been motivated to make such a combination in order to improve the performance on tasks over time, as suggested by Baughman (US 20200342307 A1) (0003).
Regarding claim 12, Hasenclever, Wang and Baughman teach the node of claim 11. Hasenclever further teaches wherein the imitation learning comprises a behavioral cloning technique. (See e.g. [0019], A policy known herein as linear-feedback policy cloning (LFPC), described below, may be used to ensure that the student retains expert robustness properties. Behavioural cloning [behavioral cloning technique] may refer to the optimization)
Regarding claim 17, Hasenclever, Wang and Baughman teach the node of claim 11. Hasenclever further teaches wherein the RL agent comprises the Deep Deterministic Policy Gradient (DDPG) technique. (See e.g. [0131], a continuous control reinforcement learning system such as DDPG)
Regarding claim 20, Hasenclever, Wang and Baughman teach the node of claim 11. Hasenclever further teaches performing the one or more actions of the output of the RL agent. (See e.g. [0110], Thus a reinforcement learning system 12 e.g. an action selection neural network, may learn an action selection policy in which an output of a reinforcement learning action selection neural network is used [further comprising performing the one or more actions of the output of the RL agent.] to select a (multidimensional) latent variable at time t.)
Regarding claim 24, Hasenclever, Wang and Baughman teach the node of claim 11. Hasenclever further teaches wherein the neural network is a Deep Q Network (DQN). (See e.g. [0131], a Q-learning system, such as a Deep Q-learning Network (DQN) system or Double-DQN system, in which the output approximates an action-value function, and optionally a value of a state, for determining an action)
Regarding claim 25, Hasenclever, Wang and Baughman teach the method of claim 1. Hasenclever further teaches wherein the neural network is a Deep Q Network (DQN). (See e.g. [0131], a Q-learning system, such as a Deep Q-learning Network (DQN) system or Double-DQN system, in which the output approximates an action-value function, and optionally a value of a state, for determining an action)
Claims 5 and 15 are rejected under 35 U.S.C. 103 as being unpatentable over Hasenclever (US 20200104685 A1) in view of Wang (NPL: Reinforcement learning approach to learning human experience in tuning cavity filters) further in view of Baughman (US 20200342307 A1) further in view of Tkadlec (US 20170141446 A1).
Regarding claim 5, Hasenclever, Wang and Baughman teach the method of claim 1. Hasenclever further teaches the expert policy (See e.g. [0010], The state-action Jacobian (e.g. the Jacobian matrix of all first order partial derivatives for the vector-valued function defined by the expert policy) [expert policy] can be used to construct a linear feedback controller which gives target actions in nearby perturbed states during training.)
Hasenclever, Wang and Baughman do not teach Tuning Guide Program (TGP).
Tkadlec teaches wherein the expert policy is based on Tuning Guide Program (TGP). (See e.g. [0132], The tuning screws 1100 can readily be threaded further into and further out of the threaded bushings 1114, and hence into and out of the cavity of the filter, and therefore may facilitate very precise tuning of the filter. The tuning screws 1100 may be adjusted many times without any degradation in performance. As the tuning screws 1100 are inserted into the threaded bushings, there are no openings to permit leakage of electromagnetic radiation. Additionally, the tuning screws 1100 are amenable to automatic tuning. Automatic tuning refers to a process where equipment is used to displace tuning elements on a filter and to measure the response of the filter during or after such displacement. The tuning screws 1100 are readily adaptable to automatic tuning as automated equipment is readily available that can be used to tighten and loosen screws.)
Accordingly, it would have been obvious to a person having ordinary skill in the art before the effective filing date of the claimed invention, having the teachings of Hasenclever, Wang, Baughman and Tkadlec before them, to include Tkadlec's cavity filter tuning feature that can be automated in Hasenclever, Wang and Baughman's system, which provides that the agent may be a robot interacting with the environment to move an object. One would have been motivated to make such a combination in order to automate the tuning of the tuning screws, because doing so produces a robotics policy imitating the adaptive behavior of a complex physical body, as suggested by Tkadlec (US 20170141446 A1) (0008).
Regarding claim 15, Hasenclever, Wang and Baughman teach the node of claim 11. Hasenclever further teaches the expert policy (See e.g. [0010], The state-action Jacobian (e.g. the Jacobian matrix of all first order partial derivatives for the vector-valued function defined by the expert policy) [expert policy] can be used to construct a linear feedback controller which gives target actions in nearby perturbed states during training.)
Hasenclever, Wang and Baughman do not teach Tuning Guide Program (TGP).
Tkadlec teaches wherein the expert policy is based on Tuning Guide Program (TGP). (See e.g. [0132], The tuning screws 1100 can readily be threaded further into and further out of the threaded bushings 1114, and hence into and out of the cavity of the filter, and therefore may facilitate very precise tuning of the filter. The tuning screws 1100 may be adjusted many times without any degradation in performance. As the tuning screws 1100 are inserted into the threaded bushings, there are no openings to permit leakage of electromagnetic radiation. Additionally, the tuning screws 1100 are amenable to automatic tuning. Automatic tuning refers to a process where equipment is used to displace tuning elements on a filter and to measure the response of the filter during or after such displacement. The tuning screws 1100 are readily adaptable to automatic tuning as automated equipment is readily available that can be used to tighten and loosen screws.)
Accordingly, it would have been obvious to a person having ordinary skill in the art before the effective filing date of the claimed invention, having the teachings of Hasenclever, Wang, Baughman and Tkadlec before them, to include Tkadlec's cavity filter tuning feature that can be automated in Hasenclever, Wang and Baughman's system, which provides that the agent may be a robot interacting with the environment to move an object. One would have been motivated to make such a combination in order to automate the tuning of the tuning screws, because doing so produces a robotics policy imitating the adaptive behavior of a complex physical body, as suggested by Tkadlec (US 20170141446 A1) (0008).
Claims 6 and 16 are rejected under 35 U.S.C. 103 as being unpatentable over Hasenclever (US 20200104685 A1) in view of Wang (NPL: Reinforcement learning approach to learning human experience in tuning cavity filters) further in view of Baughman (US 20200342307 A1) further in view of MEYERSON (US 20190130257 A1).
Regarding claim 6, Hasenclever, Wang and Baughman teach the method of claim 1. Hasenclever further teaches a hidden layer (See e.g. [0002], Some neural networks include one or more hidden layers in addition to an output layer.)
Hasenclever, Wang and Baughman do not teach wherein the cloned policy is in the form of a neural network, wherein the deepest [hidden layer] is convolutional in one dimension.
MEYERSON teaches wherein the cloned policy is in the form of a neural network, wherein the deepest [hidden layer] is convolutional in one dimension. (See e.g. [0110], Yet other examples of a module include individual components of a convolutional neural network, such as a one-dimensional (1D) convolution module [wherein the deepest [hidden layer] is convolutional in one dimension]) (See e.g. [0179], a neural network-based system, constructing clones of the set of processing submodules, and arranging the clones in the encoder in a clone sequence starting from a lowest depth [wherein the deepest] and continuing to a highest depth. [wherein the cloned policy is in the form of a neural network] The clones in the encoder are shared by a plurality of classification tasks. In some implementations, the clones have the same hyperparameters.)
Accordingly, it would have been obvious to a person having ordinary skill in the art before the effective filing date of the claimed invention, having the teachings of Hasenclever, Wang, Baughman and MEYERSON before them, to include MEYERSON's cloned policy in the form of a neural network whose deepest layer is convolutional in one dimension in Hasenclever, Wang and Baughman's system, which provides a neural network that encompasses a hidden layer. One would have been motivated to make such a combination in order to assign a classification task to each one of the processing submodules in a first clone at the lowest depth in the clone sequence, as suggested by MEYERSON (US 20190130257 A1) (0180).
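(Examiner's note: the following is a minimal, hypothetical PyTorch sketch of a cloned policy in the form of a neural network whose deepest hidden layer is convolutional in one dimension. The layer sizes and the class name ClonedPolicy are the examiner's illustrative assumptions and are not taken from MEYERSON or any other cited reference.)

```python
# Illustrative sketch only; layer sizes and names are hypothetical assumptions,
# not drawn from any cited reference.
import torch
import torch.nn as nn

class ClonedPolicy(nn.Module):
    def __init__(self, state_dim=16, n_actions=7):
        super().__init__()
        self.fc = nn.Linear(state_dim, 64)
        # Deepest hidden layer: convolutional in one dimension (Conv1d).
        self.deep_conv1d = nn.Conv1d(in_channels=1, out_channels=8, kernel_size=3, padding=1)
        self.out = nn.Linear(8 * 64, n_actions)

    def forward(self, state):
        h = torch.relu(self.fc(state))            # (batch, 64)
        h = h.unsqueeze(1)                        # (batch, 1, 64) for 1D convolution
        h = torch.relu(self.deep_conv1d(h))       # (batch, 8, 64)
        return self.out(h.flatten(start_dim=1))   # per-screw adjustment outputs

policy = ClonedPolicy()
print(policy(torch.zeros(2, 16)).shape)           # torch.Size([2, 7])
```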
Regarding claim 16, Hasenclever, Wang and Baughman teach the node of claim 11. Hasenclever further teaches a hidden layer (See e.g. [0002], Some neural networks include one or more hidden layers in addition to an output layer.)
Hasenclever, Wang and Baughman do not teach wherein the cloned policy is in the form of a neural network, wherein the deepest [hidden layer] is convolutional in one dimension.
MEYERSON teaches wherein the cloned policy is in the form of a neural network, wherein the deepest [hidden layer] is convolutional in one dimension. (See e.g. [0110], Yet other examples of a module include individual components of a convolutional neural network, such as a one-dimensional (1D) convolution module [wherein the deepest [hidden layer] is convolutional in one dimension]) (See e.g. [0179], a neural network-based system, constructing clones of the set of processing submodules, and arranging the clones in the encoder in a clone sequence starting from a lowest depth [wherein the deepest] and continuing to a highest depth. [wherein the cloned policy is in the form of a neural network] The clones in the encoder are shared by a plurality of classification tasks. In some implementations, the clones have the same hyperparameters.)
Accordingly, it would have been obvious to a person having ordinary skill in the art before the effective filing date of the claimed invention, having the teachings of Hasenclever, Wang, Baughman and MEYERSON before them, to include MEYERSON's cloned policy in the form of a neural network whose deepest layer is convolutional in one dimension in Hasenclever, Wang and Baughman's system, which provides a neural network that encompasses a hidden layer. One would have been motivated to make such a combination in order to assign a classification task to each one of the processing submodules in a first clone at the lowest depth in the clone sequence, as suggested by MEYERSON (US 20190130257 A1) (0180).
Claims 8 and 18 are rejected under 35 U.S.C. 103 as being unpatentable over Hasenclever (US 20200104685 A1) in view of Wang (NPL: Reinforcement learning approach to learning human experience in tuning cavity filters) further in view of Baughman (US 20200342307 A1) further in view of Etrin (US 20030204368 A1).
Regarding claim 8, Hasenclever, Wang and Baughman teach the method of claim 1. Hasenclever further teaches wherein an output of the RL agent [is forced via a multiplied tanh function.] (See e.g. [0131], The task policy may be trained in any suitable way. In an implementation, the task policy is trained using a reinforcement learning system…a Q-learning system, such as a Deep Q-learning Network (DQN) system or Double-DQN system, in which the output approximates an action-value function [wherein an output of the RL agent], and optionally a value of a state, for determining an action)
Hasenclever, Wang and Baughman do not teach a multiplied tanh function.
Etrin teaches a multiplied tanh function. (See e.g. [0073], single hidden layer network of ten neurons with `tanh` activation functions [a multiplied tanh function], and was trained using the cross-entropy minimization method on the samples obtained from the reinforcement learning process)
Accordingly, it would have been obvious to a person having ordinary skill in the art before the effective filing date of the claimed invention, having the teachings of Hasenclever, Wang, Baughman and Etrin before them, to include Etrin's tanh function in Hasenclever, Wang and Baughman's system, which provides a reinforcement learning technique. One would have been motivated to make such a combination in order to achieve faster convergence, as suggested by Etrin (US 20030204368 A1) (0064).
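(Examiner's note: the following is a minimal, hypothetical sketch of forcing an RL agent's output through a multiplied tanh function so that the resulting screw adjustment is bounded. The constant MAX_ADJUSTMENT and the function name bounded_action are the examiner's illustrative assumptions and are not taken from Etrin or any other cited reference.)

```python
# Illustrative sketch only: bounding the RL agent's raw output by a multiplied tanh,
# i.e. MAX_ADJUSTMENT * tanh(raw_output). Values and names are hypothetical assumptions.
import numpy as np

MAX_ADJUSTMENT = 0.5   # assumed maximum screw adjustment (e.g., half a turn)

def bounded_action(raw_output):
    """Force the agent's raw output into [-MAX_ADJUSTMENT, MAX_ADJUSTMENT]."""
    return MAX_ADJUSTMENT * np.tanh(raw_output)

print(bounded_action(np.array([-10.0, 0.0, 10.0])))   # approximately [-0.5, 0.0, 0.5]
```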
Regarding claim 18, Hasenclever, Wang and Baughman teach the node of claim 11. Hasenclever further teaches wherein an output of the RL agent [is forced via a multiplied tanh function.] (See e.g. [0131], The task policy may be trained in any suitable way. In an implementation, the task policy is trained using a reinforcement learning system…a Q-learning system, such as a Deep Q-learning Network (DQN) system or Double-DQN system, in which the output approximates an action-value function [wherein an output of the RL agent], and optionally a value of a state, for determining an action)
Hasenclever, Wang and Baughman do not teach a multiplied tanh function.
Etrin teaches a multiplied tanh function. (See e.g. [0073], single hidden layer network of ten neurons with `tanh` activation functions [a multiplied tanh function], and was trained using the cross-entropy minimization method on the samples obtained from the reinforcement learning process)
Accordingly, it would have been obvious to a person having ordinary skill in the art before the effective filing date of the claimed invention, having the teachings of Hasenclever, Wang, Baughman and Etrin before them, to include Etrin's tanh function in Hasenclever, Wang and Baughman's system, which provides a reinforcement learning technique. One would have been motivated to make such a combination in order to achieve faster convergence, as suggested by Etrin (US 20030204368 A1) (0064).
Claims 9 and 19 are rejected under 35 U.S.C. 103 as being unpatentable over Hasenclever (US 20200104685 A1) in view of Wang (NPL: Reinforcement learning approach to learning human experience in tuning cavity filters) further in view of Baughman (US 20200342307 A1) further in view of Lillicrap (US 20170024643 A1).
Regarding claim 9, Hasenclever, Wang and Baughman teach the method of claim 1.
Hasenclever, Wang and Baughman do not teach wherein the method further comprises allowing the RL agent to run for Ncritic iterations where only a critic network is trained, with no change to an actor network or a target network, and after the Ncritic iterations, allowing the technique to run to convergence.
Lillicrap teaches wherein the method further comprises allowing the RL agent to run for Ncritic iterations where only a critic network is trained, with no change to an actor network or a target network, and after the Ncritic iterations, allowing the technique to run to convergence. (See e.g. [0050], That is, the system can determine an update to the current values of the parameters that reduces the error using conventional machine learning training techniques, e.g., by performing an iteration of gradient descent with backpropagation. [wherein the method further comprises allowing the RL agent to run for Ncritic iterations] As will be clear from the description of FIG. 4, by updating the current values of the parameters in this manner, the system trains the critic neural network [Ncritic iterations where only a critic network is trained, with no change to an actor network or a target network, and after the Ncritic iterations, allowing the technique to run to convergence.] to generate neural network outputs that represent time-discounted total future rewards that will be received in response the agent performing a given action in response to a given observation.)
Accordingly, it would have been obvious to a person having ordinary skill in the art before the effective filing date of the claimed invention, having the teachings of Hasenclever, Wang, Baughman and Lillicrap before them, to include Lillicrap's critic network in Hasenclever, Wang and Baughman's system, which provides a reinforcement learning technique. One would have been motivated to make such a combination in order to isolate reinforcement learning updates to a critic network, reducing errors arising from common learning techniques, as suggested by Lillicrap (US 20170024643 A1) (0001).
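(Examiner's note: the following is a minimal, hypothetical sketch of the claimed training schedule — the critic network is trained on every iteration, while the actor and target networks are only updated after the Ncritic warm-up iterations, after which training continues toward convergence. The constants and the function name training_schedule are the examiner's illustrative assumptions and are not taken from Lillicrap or any other cited reference.)

```python
# Illustrative training-schedule sketch only (hypothetical names); it shows the
# critic-only warm-up followed by full actor/target updates, not any reference's method.
N_CRITIC = 5            # assumed number of critic-only warm-up iterations
TOTAL_ITERS = 12        # assumed cap standing in for "run to convergence"

def training_schedule(total_iters=TOTAL_ITERS, n_critic=N_CRITIC):
    log = []
    for it in range(total_iters):
        updates = ["critic"]                       # critic network is always trained
        if it >= n_critic:                         # after the Ncritic iterations...
            updates += ["actor", "target"]         # ...actor and target networks train too
        log.append((it, updates))
    return log

for it, updates in training_schedule():
    print(it, updates)
```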
Regarding claim 19, Hasenclever, Wang and Baughman teach the node of claim 11.
Hasenclever, Wang and Baughman do not teach wherein the method further comprises allowing the RL agent to run for Ncritic iterations where only a critic network is trained, with no change to an actor network or a target network, and after the Ncritic iterations, allowing the technique to run to convergence.
Lillicrap teaches wherein the method further comprises allowing the RL agent to run for Ncritic iterations where only a critic network is trained, with no change to an actor network or a target network, and after the Ncritic iterations, allowing the technique to run to convergence. (See e.g. [0050], That is, the system can determine an update to the current values of the parameters that reduces the error using conventional machine learning training techniques, e.g., by performing an iteration of gradient descent with backpropagation. [wherein the method further comprises allowing the RL agent to run for Ncritic iterations] As will be clear from the description of FIG. 4, by updating the current values of the parameters in this manner, the system trains the critic neural network [Ncritic iterations where only a critic network is trained, with no change to an actor network or a target network, and after the Ncritic iterations, allowing the technique to run to convergence.] to generate neural network outputs that represent time-discounted total future rewards that will be received in response the agent performing a given action in response to a given observation.)
Accordingly, it would have been obvious to a person having ordinary skill in the art before the effective filing date of the claimed invention, having the teachings of Hasenclever, Wang, Baughman and Lillicrap before them, to include Lillicrap's critic network in Hasenclever, Wang and Baughman's system, which provides a reinforcement learning technique. One would have been motivated to make such a combination in order to isolate reinforcement learning updates to a critic network, reducing errors arising from common learning techniques, as suggested by Lillicrap (US 20170024643 A1) (0001).
Conclusion
Any inquiry concerning this communication or earlier communications from the examiner should be directed to KYLE ALLMAN THOMPSON whose telephone number is (571)272-3671. The examiner can normally be reached Monday - Thursday, 6 a.m. - 3 p.m. ET.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Kamran Afshar can be reached at (571) 272-7796. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.
/K.A.T./Examiner, Art Unit 2125
/KAMRAN AFSHAR/Supervisory Patent Examiner, Art Unit 2125