Last updated: May 29, 2026
Application No. 18/268,655
METHODS AND APPARATUSES OF DETERMINING FOR CONTROLLING A MULTI-AGENT REINFORCEMENT LEARNING ENVIRONMENT

Non-Final OA (§101, §102, §103, §112)
Filed: Jun 21, 2023
Priority: Dec 22, 2020 (nonprovisional of PCTIN2020051039)
Examiner: JABLON, ASHER H.
Art Unit: 2127
Tech Center: 2100 (Computer Architecture & Software)
Assignee: Telefonaktiebolaget Lm Ericsson (Publ)
OA Round: 1 (Non-Final)
This examiner grants 43% of cases after interview (+44.0% interview lift). A telephonic interview to clarify the technical implementation could significantly improve the outcome. Based on 93 resolved cases, 2023–2026.
Examiner Intelligence

JABLON, ASHER H.
Career Allowance Rate: grants 43% of resolved cases (40 granted / 93 resolved), -12.0% vs TC avg
Interview Lift: strong, +44.0% (resolved cases with interview vs. without)
Avg Prosecution (typical timeline): 4y 4m
Currently pending: 11
Total Applications (career history): 117, across all art units
Statute-Specific Performance

§101: 16.3% (-23.7% vs TC avg)
§103: 65.8% (+25.8% vs TC avg)
§102: 8.7% (-31.3% vs TC avg)
§112: 6.0% (-34.0% vs TC avg)
Based on career data from 93 resolved cases.
Office Action

Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Status of the Claims
Claims 7, 9-14, and 20 have been amended. Claims 21-24 have been canceled. Claims 1-20 are currently pending and have been considered by the Examiner.

Specification
The disclosure is objected to because page 3, line 14 of the specification contains an embedded hyperlink and/or other form of browser-executable code. Applicant is required to delete the embedded hyperlink and/or other form of browser-executable code; references to websites should be limited to the top-level domain name without any prefix such as http:// or other browser-executable code. See MPEP § 608.01.

Claim Objections
Claims 2-6, 14 and 20 are objected to because of the following informalities:
Claim 2, line 3 and claim 20, line 3 are each missing a comma after “s1,t”.
Claim 3, line 3 is missing a comma after “s2,t+1”.
Claim 4, line 3 is missing a comma after “s1,t+1”.
In claim 5, Examiner suggests deleting the term “and” at the end of line 7.
In claim 6, Examiner suggests moving the limitation “wherein the step of determining the regularization component comprises” in lines 5-6 to a new line. Examiner suggests adding “and” to the end of line 9. 
Claim 14 is missing a period at the end of line 5. Appropriate correction is required.

Claim Rejections - 35 USC § 112
The following is a quotation of 35 U.S.C. 112(b):
(b)  CONCLUSION.—The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the inventor or a joint inventor regards as the invention.


The following is a quotation of 35 U.S.C. 112 (pre-AIA), second paragraph:
The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the applicant regards as his invention.


Claims 3-9, 13, and 15-18 are rejected under 35 U.S.C. 112(b) or 35 U.S.C. 112 (pre-AIA), second paragraph, as being indefinite for failing to particularly point out and distinctly claim the subject matter which the inventor or a joint inventor (or for applications subject to pre-AIA 35 U.S.C. 112, the applicant), regards as the invention.
	Claim 3 is rendered indefinite. The notation s2,t+1 represents both “a second state” in line 3 and “a second next state” in line 4. It is unclear if the limitation “a second state s2,t+1” in line 3 should recite “a second state s2,t” which would align with the notation in claim 2. Examiner treats “a second state s2,t+1” in line 3 as “a second state s2,t”.
	Claims 4-9 are rejected for failing to cure the deficiencies of parent claim 3.
Claim 4 recites the limitation "the value of regularization component" in lines 2-3.  There is insufficient antecedent basis for this limitation in the claim. Examiner treats this limitation as “a value of the regularization component”.
Claim 5 recites the limitation "the combination" in line 4.  There is insufficient antecedent basis for this limitation in the claim. Examiner treats this limitation as “a combination”.
Claim 13 recites the limitation "the radio access network" in line 3.  There is insufficient antecedent basis for this limitation in the claim. Claim 13 recites “radio access network counters” in line 2, but this does not provide sufficient antecedent basis. In line 5, it is unclear if “the network” means “the radio access network”. Examiner treats “the radio access network” in line 3 as “a radio access network” and Examiner treats “the network” in line 5 as “the radio access network”.
In claim 15, the limitation “the agent” in line 4 renders the claim indefinite because it is unclear if this limitation means “the local agent”. Examiner treats “the agent” as “the local agent”. Claims 16-18 are rejected for failing to cure the deficiencies of claim 15.
Claim 17 recites the same indefinite limitations as claim 13.

Claim Rejections - 35 USC § 101
35 U.S.C. 101 reads as follows:
Whoever invents or discovers any new and useful process, machine, manufacture, or composition of matter, or any new and useful improvement thereof, may obtain a patent therefor, subject to the conditions and requirements of this title.


Claims 1-20 are rejected under 35 U.S.C. 101 because the claimed invention is directed to an abstract idea without significantly more.

Step 1: Claims 1-18 recite a method and claims 19-20 recite a global agent comprising processing circuitry (a system). Each of a method and a system falls within one of the four statutory categories of patent eligible subject matter.

Claim 1 
Step 2A Prong 1: Determining a combined loss function based on the plurality of loss functions is a mathematical calculation. The specification, at page 9, line 10, discloses that the combined loss function may comprise a sum of the plurality of loss functions.
Minimizing the combined loss function with respect to the first parameters and the second parameters to determine updated values for the first parameters and updated values for the second parameters is a mathematical calculation. The specification, at page 9 in the paragraph beside line 25, discloses that this step may comprise performing gradient descent on the combined loss function. The claim recites an abstract idea.
Step 2A Prong 2: Controlling a multi-agent reinforcement learning environment amounts to mere instructions to apply the abstract ideas on a generic computer under MPEP 2106.05(f).
Obtaining a plurality of loss functions comprising: a first loss function associated with a first reinforcement learning, RL, model performed by a first local agent, wherein the first loss function is a function of one or more first parameters amounts to insignificant pre-solution activity under MPEP 2106.05(g).
Obtaining a second loss function associated with a second RL model at a second local agent, wherein the second loss function is a function of one or more second parameters amounts to insignificant pre-solution activity under MPEP 2106.05(g).
Initiating execution of a first updated action by the first local agent based on the updated values of the first parameters amounts to mere instructions to apply the abstract ideas on a generic computer under MPEP 2106.05(f).
Initiating execution of a second updated action by the second local agent based on the updated values of the second parameters amounts to mere instructions to apply the abstract ideas on a generic computer under MPEP 2106.05(f).
The additional elements as disclosed above, alone or in combination, do not integrate the abstract ideas into a practical application as they are mere insignificant pre-solution activities as disclosed in combination with generic computer functions that are implemented to perform the abstract ideas disclosed above. The claim is directed to an abstract idea.
Step 2B: Controlling a multi-agent reinforcement learning environment amounts to mere instructions to apply the abstract ideas on a generic computer under MPEP 2106.05(f).
Obtaining a plurality of loss functions comprising: a first loss function associated with a first reinforcement learning, RL, model performed by a first local agent, wherein the first loss function is a function of one or more first parameters is analogous to reading data from memory, which the courts have recognized as a well-understood, routine, conventional activity under MPEP 2106.05(d)(II).
Obtaining a second loss function associated with a second RL model at a second local agent, wherein the second loss function is a function of one or more second parameters is analogous to reading data from memory, which the courts have recognized as a well-understood, routine, conventional activity under MPEP 2106.05(d)(II).
Initiating execution of a first updated action by the first local agent based on the updated values of the first parameters amounts to mere instructions to apply the abstract ideas on a generic computer under MPEP 2106.05(f).
Initiating execution of a second updated action by the second local agent based on the updated values of the second parameters amounts to mere instructions to apply the abstract ideas on a generic computer under MPEP 2106.05(f).
The additional elements as disclosed above, in combination with the abstract ideas, are not sufficient to amount to significantly more than the abstract ideas as they are well-understood, routine and conventional activities as disclosed in combination with generic computer functions that are implemented to perform the abstract ideas disclosed above. The claim is not patent eligible.
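For illustration only, the two calculations identified under Step 2A Prong 1 (a combined loss formed as a sum of per-agent losses, minimized by gradient descent over both parameter sets) can be sketched as follows. This is a minimal sketch, not the applicant's implementation: the quadratic losses, the function names, the learning rate, and the finite-difference gradients are all assumptions chosen to keep the example self-contained.

```python
from typing import Callable, List, Sequence

LossFn = Callable[[Sequence[float]], float]


def combined_loss(losses: Sequence[LossFn], params: Sequence[Sequence[float]]) -> float:
    """Plain sum of the per-agent losses, each evaluated on its own parameters."""
    return sum(loss(theta) for loss, theta in zip(losses, params))


def minimize_combined(losses, params, lr=0.1, steps=200, eps=1e-6) -> List[List[float]]:
    """Gradient descent on the combined loss over every agent's parameters.

    Gradients are central finite differences, an illustrative stand-in for
    whatever differentiation a real RL framework would use.
    """
    current = [list(theta) for theta in params]
    for _ in range(steps):
        grads = []
        for block in current:
            block_grad = []
            for j in range(len(block)):
                saved = block[j]
                block[j] = saved + eps
                up = combined_loss(losses, current)
                block[j] = saved - eps
                down = combined_loss(losses, current)
                block[j] = saved  # restore before moving to the next parameter
                block_grad.append((up - down) / (2 * eps))
            grads.append(block_grad)
        # Update all parameter blocks together, after all gradients are known.
        current = [
            [p - lr * g for p, g in zip(block, block_grad)]
            for block, block_grad in zip(current, grads)
        ]
    return current


# Hypothetical quadratic losses standing in for two local agents' RL losses.
first_loss = lambda theta: (theta[0] - 3.0) ** 2
second_loss = lambda theta: (theta[0] + 1.0) ** 2 + (theta[1] - 2.0) ** 2

theta_1, theta_2 = minimize_combined([first_loss, second_loss], [[0.0], [0.0, 0.0]])
```

In this toy setting the updated values approach the minimizers of each quadratic (3 for the first parameter set, and -1 and 2 for the second).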

Claim 2 incorporates the rejection of claim 1. 
Step 2A Prong 1: The abstract ideas of claim 1 are incorporated. The first loss function is calculated based on a first replay experience of the first local agent, wherein the first replay experience comprises a first state, s1,t a first action, a1,t, a first reward, r1,t+1, and a first next state, s1,t+1 is a mathematical calculation. 
Step 2A Prong 2 and Step 2B: The claim does not recite any additional elements which, alone or in combination, would integrate the abstract ideas into a practical application or which, in combination with the abstract ideas, would be sufficient to amount to significantly more than the abstract ideas. The claim is not patent eligible.

Claim 3 incorporates the rejection of claim 2. 
Step 2A Prong 1: The abstract ideas of claim 2 are incorporated. The second loss function is calculated based on a second replay experience of the second local agent, wherein the second replay experience comprises a second state, s2,t+1 a second action, a2,t, a second reward, r2,t+1, and a second next state, s2,t+1 is a mathematical calculation. 
Step 2A Prong 2 and Step 2B: The claim does not recite any additional elements which, alone or in combination, would integrate the abstract ideas into a practical application or which, in combination with the abstract ideas, would be sufficient to amount to significantly more than the abstract ideas. The claim is not patent eligible.

Claim 4 incorporates the rejection of claim 3. 
Step 2A Prong 1: The abstract ideas of claim 3 are incorporated. The combined loss function further comprises a regularization component, wherein the value of regularization component increases when the first next state, s1,t+1 and the second next state, s2,t+1, are closer together is a mathematical calculation. The specification, at page 10 in the paragraph beside line 20, discloses the combined loss function plus a regularization component.
Step 2A Prong 2 and Step 2B: The claim does not recite any additional elements which, alone or in combination, would integrate the abstract ideas into a practical application or which, in combination with the abstract ideas, would be sufficient to amount to significantly more than the abstract ideas. The claim is not patent eligible.

Claim 5 incorporates the rejection of claim 4. 
Step 2A Prong 1: The abstract ideas of claim 4 are incorporated. Determining the regularization component by determining a distance, d(s1,t+1,s2,t+1), between the first next state and the second next state is a mathematical calculation. This calculation is disclosed by the specification between page 11, line 15 and page 12, line 20.
For each action in the combination of all available actions in the first state and the second state: calculating a reward difference as a magnitude of a difference between local rewards obtained from the first next state and the second next state after applying the action is a mathematical calculation.
Calculating a Kantorovich distance between the probability of transitioning into the first next state based on the first state and the first action, and the probability of transitioning into the second next state based on the second state and the second action is a mathematical calculation. 
Calculating a distance sum for the action by summing the reward difference and the Kantorovich distance is a mathematical calculation.
Setting the distance between the first next state and the second next state as the distance sum with the maximum value is a mathematical calculation.
Step 2A Prong 2 and Step 2B: The claim does not recite any additional elements which, alone or in combination, would integrate the abstract ideas into a practical application or which, in combination with the abstract ideas, would be sufficient to amount to significantly more than the abstract ideas. The claim is not patent eligible.
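The distance calculation recited in claim 5 (per-action reward difference plus Kantorovich distance, then the maximum over actions) can be illustrated with the sketch below. Several assumptions are made purely for the example: transition probabilities are given as discrete distributions over the same ordered, unit-spaced set of next states, so that the Kantorovich distance reduces to the sum of absolute differences between cumulative distributions, and the dictionary layout and function names are hypothetical.

```python
from itertools import accumulate
from typing import Dict, Sequence


def kantorovich_1d(p: Sequence[float], q: Sequence[float]) -> float:
    """Kantorovich (Wasserstein-1) distance between two discrete distributions.

    Assumes both distributions share one ordered support with unit spacing,
    in which case the distance equals the L1 distance between their CDFs.
    """
    if len(p) != len(q):
        raise ValueError("distributions must share the same support")
    return sum(abs(a - b) for a, b in zip(accumulate(p), accumulate(q)))


def state_distance(
    rewards_1: Dict[str, float],
    rewards_2: Dict[str, float],
    trans_1: Dict[str, Sequence[float]],
    trans_2: Dict[str, Sequence[float]],
) -> float:
    """Maximum over shared actions of |reward difference| + Kantorovich distance."""
    actions = rewards_1.keys() & rewards_2.keys()
    return max(
        abs(rewards_1[a] - rewards_2[a]) + kantorovich_1d(trans_1[a], trans_2[a])
        for a in actions
    )


# Hypothetical rewards and transition distributions for two states.
d = state_distance(
    rewards_1={"up": 1.0, "down": 0.0},
    rewards_2={"up": 0.5, "down": 0.0},
    trans_1={"up": [1.0, 0.0, 0.0], "down": [0.0, 1.0, 0.0]},
    trans_2={"up": [0.0, 0.0, 1.0], "down": [0.0, 1.0, 0.0]},
)
```

Here the "up" action contributes a reward difference of 0.5 and a Kantorovich distance of 2, while "down" contributes 0, so the distance is set by the "up" action.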

Claim 6 incorporates the rejection of claim 5. 
Step 2A Prong 1: The abstract ideas of claim 5 are incorporated. The plurality of loss functions comprises N loss functions, where N is an integer, each associated with a respective RL model performed by an ith agent, where i = 1, … , N wherein the ith loss function is calculated based on a ith replay experience of the ith agent, wherein the ith replay experience comprises an ith state, an ith action, an ith reward and an ith next state is a mathematical calculation.
The step of determining the regularization component comprises: calculating distances between each combination of possible pairs of states in: the first next state to the Nth next state is a mathematical calculation.
Calculating a Frobenius norm of a matrix comprising the distances is a mathematical calculation.
Setting the regularization component as a square of the Frobenius norm divided by 2 is a mathematical calculation.
Step 2A Prong 2 and Step 2B: The claim does not recite any additional elements which, alone or in combination, would integrate the abstract ideas into a practical application or which, in combination with the abstract ideas, would be sufficient to amount to significantly more than the abstract ideas. The claim is not patent eligible.
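As a sketch of the arithmetic recited in claim 6, the regularization component can be computed as half the squared Frobenius norm of the matrix of pairwise state distances. The scalar states and the absolute-difference distance used here are placeholders assumed for the example only; the claim's own distance is the one recited in claim 5.

```python
import math
from typing import Callable, List, Sequence


def pairwise_distances(
    states: Sequence[float], dist: Callable[[float, float], float]
) -> List[List[float]]:
    """Matrix whose (i, j) entry is the distance between states i and j."""
    return [[dist(a, b) for b in states] for a in states]


def frobenius_norm(matrix: Sequence[Sequence[float]]) -> float:
    """Square root of the sum of squared entries."""
    return math.sqrt(sum(x * x for row in matrix for x in row))


def regularization_component(matrix: Sequence[Sequence[float]]) -> float:
    """Square of the Frobenius norm divided by 2."""
    return frobenius_norm(matrix) ** 2 / 2


# Hypothetical scalar next states for three agents, with |a - b| as the distance.
distances = pairwise_distances([0.0, 3.0, 4.0], lambda a, b: abs(a - b))
reg = regularization_component(distances)
```

For these placeholder states the distance matrix has off-diagonal entries 3, 4, and 1 (each appearing twice), so the sum of squares is 52 and the regularization component is approximately 26.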

Claim 7 incorporates the rejection of claim 4. 
Step 2A Prong 1: The abstract ideas of claim 4 are incorporated. The combined loss function comprises a sum of each of the plurality of loss functions plus the regularization component is a mathematical calculation. The specification, at page 11 in the paragraph beside line 30, discloses the combined loss function plus a regularization component.
Step 2A Prong 2 and Step 2B: The claim does not recite any additional elements which, alone or in combination, would integrate the abstract ideas into a practical application or which, in combination with the abstract ideas, would be sufficient to amount to significantly more than the abstract ideas. The claim is not patent eligible.

Claim 8 incorporates the rejection of claim 7. 
Step 2A Prong 1: The abstract ideas of claim 7 are incorporated. The sum of each of the plurality of loss functions is a weighted sum, wherein each loss function is associated with a weighting factor, βi is a mathematical calculation. The specification discloses the weighted sum at page 9, in the paragraph beside line 15, and at page 11, in the paragraph beside line 30.
Step 2A Prong 2 and Step 2B: The claim does not recite any additional elements which, alone or in combination, would integrate the abstract ideas into a practical application or which, in combination with the abstract ideas, would be sufficient to amount to significantly more than the abstract ideas. The claim is not patent eligible.

Claim 9 incorporates the rejection of claim 7. 
Step 2A Prong 1: The abstract ideas of claim 7 are incorporated. The regularization component is multiplied by a regularization weighting factor, δ is a mathematical calculation. The specification, at page 11 in the paragraph beside line 30, discloses the regularization weighting factor.
Step 2A Prong 2 and Step 2B: The claim does not recite any additional elements which, alone or in combination, would integrate the abstract ideas into a practical application or which, in combination with the abstract ideas, would be sufficient to amount to significantly more than the abstract ideas. The claim is not patent eligible.
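The arithmetic of claims 7 through 9 can be shown together in one short sketch: a sum of per-agent losses weighted by βi, plus a regularization value scaled by δ. Note that claims 8 and 9 each depend separately from claim 7, so combining both weightings in one function is an assumption made for compactness; the function name and the numeric inputs are likewise hypothetical.

```python
from typing import Sequence


def weighted_combined_loss(
    losses: Sequence[float],
    betas: Sequence[float],
    regularizer: float,
    delta: float,
) -> float:
    """Weighted sum of loss values plus delta times the regularization value."""
    if len(losses) != len(betas):
        raise ValueError("each loss needs exactly one weighting factor")
    return sum(b * l for b, l in zip(betas, losses)) + delta * regularizer


# Hypothetical loss values for two agents and a hypothetical regularization value.
total = weighted_combined_loss(
    losses=[1.0, 2.0], betas=[0.5, 0.25], regularizer=4.0, delta=0.5
)
```

With these inputs the weighted losses contribute 0.5 each and the scaled regularizer contributes 2.0.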

Claim 10 incorporates the rejection of claim 1.
Step 2A Prong 1: The abstract ideas of claim 1 are incorporated.
Step 2A Prong 2 and Step 2B: The updated first action is determined based on the action that provides the first local agent with greatest Q-value in the first RL model given the first next state and the updated first parameters amounts to mere instructions to apply the abstract ideas on a generic computer under MPEP 2106.05(f). The claim is not patent eligible.

Claim 11 incorporates the rejection of claim 1.
Step 2A Prong 1: The abstract ideas of claim 1 are incorporated.
Step 2A Prong 2 and Step 2B: The updated second action comprises the action that provides the second local agent with the greatest Q-value in the second RL model given the second next state and the updated second parameters amounts to mere instructions to apply the abstract ideas on a generic computer under MPEP 2106.05(f). The claim is not patent eligible.
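The action selection recited in claims 10 and 11 (choosing the action with the greatest Q-value for the given next state) amounts to an argmax, sketched below. The Q-values are supplied as a plain dictionary keyed by action name, which is an assumption for illustration; in the claims they would come from the respective RL model evaluated with the updated parameters.

```python
from typing import Dict


def greedy_action(q_values: Dict[str, float]) -> str:
    """Return the action whose Q-value is greatest."""
    if not q_values:
        raise ValueError("at least one action is required")
    return max(q_values, key=q_values.get)


# Hypothetical Q-values for one agent in its next state.
chosen = greedy_action({"left": 0.1, "right": 0.7, "stay": 0.4})
```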

Claim 12 incorporates the rejection of claim 1. 
Step 2A Prong 1: The abstract ideas of claim 1 are incorporated.
Step 2A Prong 2 and Step 2B: The first local agent is configured to control a transmission power of an antenna to decrease a Signal-to-Interference-plus-Noise Ratio, SINR, of the antenna amounts to a mere field of use and technological environment under MPEP 2106.05(h). The limitation is recited at a high level of generality and does not explain how SINR is decreased.
The second local agent is configured to control a tilt of an antenna to decrease the SINR of the antenna amounts to a mere field of use and technological environment under MPEP 2106.05(h). The limitation is recited at a high level of generality and does not explain how SINR is decreased. The claim is not patent eligible.

Claim 13 incorporates the rejection of claim 1. 
Step 2A Prong 1: The abstract ideas of claim 1 are incorporated.
Step 2A Prong 2 and Step 2B: The first local agent is configured to control radio access network counters to minimize handover rate in the radio access network amounts to a mere field of use and technological environment under MPEP 2106.05(h). The limitation is recited at a high level of generality and does not explain how handover rate is minimized.
The second local agent is configured to control radio access network counters to maximize Reference Signal Received Power, RSRP, values in the network amounts to a mere field of use and technological environment under MPEP 2106.05(h). The limitation is recited at a high level of generality and does not explain how RSRP is maximized. The claim is not patent eligible.

Claim 14 incorporates the rejection of claim 1. 
Step 2A Prong 1: The abstract ideas of claim 1 are incorporated.
Step 2A Prong 2 and Step 2B: The first local agent is configured to control obtaining resources for a first network slice to meet network requirements of the first network slice amounts to a mere field of use and technological environment under MPEP 2106.05(h). The limitation is recited at a high level of generality and does not explain how network requirements of the first network slice are met.
The second local agent is configured to control obtaining resources for a second network slice to meet network requirements of the second network slice amounts to a mere field of use and technological environment under MPEP 2106.05(h). The limitation is recited at a high level of generality and does not explain how network requirements of the second network slice are met. The claim is not patent eligible.

Claim 15
Step 2A Prong 1: Determining an updated action based on the received updated values of the parameters is a mental process of judgment that can reasonably be performed in the human mind with the aid of pencil and paper. The claim recites an abstract idea.
Step 2A Prong 2: A local agent, wherein the local agent is configured to perform a reinforcement learning, RL, model in an environment amounts to a generic computer component for applying the abstract ideas on a generic computer under MPEP 2106.05(f). Specification page 22, line 20 discloses a local agent comprises processing circuitry, so the local agent may amount to a generic computer.
Transmitting, to a global agent, either a loss function associated with the RL model or a replay experience of the agent, wherein the replay experience comprises a state, an action, a reward and a next state, wherein the action is determined based on a maximum Q-value for the state given current values of parameters of the environment amounts to insignificant pre-solution activity under MPEP 2106.05(g).
Transmitting current values of the parameters to the global agent amounts to insignificant pre-solution activity under MPEP 2106.05(g).
Receiving updated values of the parameters from the global agent amounts to insignificant pre-solution activity under MPEP 2106.05(g).
Performing the updated action amounts to mere instructions for applying the abstract ideas on a generic computer under MPEP 2106.05(f).
	The additional elements as disclosed above, alone or in combination, do not integrate the abstract ideas into a practical application as they are mere insignificant pre-solution activities as disclosed in combination with generic computer functions that are implemented to perform the abstract ideas disclosed above. The claim is directed to an abstract idea.
Step 2B: A local agent, wherein the local agent is configured to perform a reinforcement learning, RL, model in an environment amounts to a generic computer component for applying the abstract ideas on a generic computer under MPEP 2106.05(f). Specification page 22, line 20 discloses a local agent comprises processing circuitry, so the local agent may amount to a generic computer.
Transmitting, to a global agent, either a loss function associated with the RL model or a replay experience of the agent, wherein the replay experience comprises a state, an action, a reward and a next state, wherein the action is determined based on a maximum Q-value for the state given current values of parameters of the environment amounts to transmitting data over a network, which the courts have recognized as a well-understood, routine, conventional activity under MPEP 2106.05(d)(II).
Transmitting current values of the parameters to the global agent amounts to transmitting data over a network, which the courts have recognized as a well-understood, routine, conventional activity under MPEP 2106.05(d)(II).
Receiving updated values of the parameters from the global agent amounts to receiving data over a network, which the courts have recognized as a well-understood, routine, conventional activity under MPEP 2106.05(d)(II).
Performing the updated action amounts to mere instructions for applying the abstract ideas on a generic computer under MPEP 2106.05(f).
The additional elements as disclosed above, in combination with the abstract ideas, are not sufficient to amount to significantly more than the abstract ideas as they are well-understood, routine and conventional activities as disclosed in combination with generic computer functions that are implemented to perform the abstract ideas disclosed above. The claim is not patent eligible.

Claims 16-18 recite similar features as claims 12-14, respectively, and are therefore rejected for at least the same reasons.

Claim 19 recites a system which implements the same features as the method of claim 1 and is therefore rejected for at least the same reasons.
In Step 2A Prong 2 and Step 2B, a global agent comprising processing circuitry amounts to a generic computer component for applying the abstract ideas on a computer under MPEP 2106.05(f). The claim is not patent eligible.

Claim 20 recites a system which implements the same features as the method of claim 2 and is therefore rejected for at least the same reasons.

Claim Rejections - 35 USC § 102
In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis (i.e., changing from AIA to pre-AIA) for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
The following is a quotation of the appropriate paragraphs of 35 U.S.C. 102 that form the basis for the rejections under this section made in this Office action:
A person shall be entitled to a patent unless –

(a)(1) the claimed invention was patented, described in a printed publication, or in public use, on sale, or otherwise available to the public before the effective filing date of the claimed invention.


Claims 1-3, 10-11, and 19-20 are rejected under 35 U.S.C. 102(a)(2) as being anticipated by Van Seijen et al. (US 20180165603 A1).

	Regarding claim 1, Van Seijen teaches: A computer-implemented method of controlling a multi-agent reinforcement learning environment, the method comprising: (Abstract and [0231], lines 1-7)
obtaining a plurality of loss functions comprising: a first loss function associated with a first reinforcement learning, RL, model performed by a first local agent, wherein the first loss function is a function of one or more first parameters; and ([0234] discloses a single-agent loss function with parameters θi. [0241] and [0243] disclose a combined loss function for a multi-agent Hybrid Reward Architecture (HRA) model. A first local agent is indicated by n = 1, and first parameters are deep network parameters associated with the first head.)
a second loss function associated with a second RL model at a second local agent, wherein the second loss function is a function of one or more second parameters; ([0241] and [0243] disclose a combined loss function. A second local agent is indicated by n = 2, and second parameters are deep network parameters associated with the second head.)
determining a combined loss function based on the plurality of loss functions; ([0241] and [0243])
minimizing the combined loss function with respect to the first parameters and the second parameters to determine updated values for the first parameters and updated values for the second parameters;  ([0244]-[0245] discloses minimizing the loss function. This results in updating values of the first and second parameters.)
initiating execution of a first updated action by the first local agent based on the updated values of the first parameters; and ([0064], [0088], and [0231], lines 7-18 disclose that a first local agent is a horizontal agent which controls a robot’s motion in a horizontal direction. The aggregator may select actions for the horizontal agent once it has been trained.)
initiating execution of a second updated action by the second local agent based on the updated values of the second parameters. ([0064], [0088], and [0231], lines 7-18 disclose that a second local agent is a vertical agent which controls a robot’s motion in a vertical direction. The aggregator may select actions for the vertical agent once it has been trained.)

	Regarding claim 2, Van Seijen teaches: The computer-implemented method as claimed in claim 1 wherein the first loss function is calculated based on a first replay experience of the first local agent, wherein the first replay experience comprises a first state, s1,t a first action, a1,t, a first reward, r1,t+1, and a first next state, s1,t+1. ([0232], lines 1-8 and [0234]-[0235] disclose that the loss function L is based on a state, an action, a reward, and a next state. In the HRA model, the loss function for the first head/local agent is calculated based on an initial state shared by all heads and a first action, a first reward, and a first next state corresponding to the first head.)

	Regarding claim 3, Van Seijen teaches: The computer-implemented method as claimed in claim 2 wherein the second loss function is calculated based on a second replay experience of the second local agent, wherein the second replay experience comprises a second state, s2,t+1 a second action, a2,t, a second reward, r2,t+1, and a second next state, s2,t+1. ([0232], lines 1-8 and [0234]-[0235] disclose that the loss function L is based on a state, an action, a reward, and a next state. In the HRA model, the loss function for the second head/local agent is calculated based on an initial state shared by all heads and a second action, a second reward, and a second next state corresponding to the second head.)

	Regarding claim 10, Van Seijen teaches: The computer-implemented method as claimed in claim 1 wherein the updated first action is determined based on the action that provides the first local agent with greatest Q-value in the first RL model given the first next state and the updated first parameters. ([0233] discloses that model-free reinforcement learning methods take the action with the highest Q*-value in every state. [0234] discloses using a DQN to estimate the optimal Q-value function.)

	Regarding claim 11, Van Seijen teaches: The computer-implemented method as claimed in claim 1 wherein the updated second action comprises the action that provides the second local agent with the greatest Q-value in the second RL model given the second next state and the updated second parameters. ([0233] discloses that model-free reinforcement learning methods take the action with the highest Q*-value in every state. [0234] discloses using a DQN to estimate the optimal Q-value function.)

Claim 19 recites a system which implements the same features as the method of claim 1 and is therefore rejected for at least the same reasons.
Van Seijen teaches: A global agent for controlling a multi-agent reinforcement learning environment, (Abstract and [0231], lines 1-14, where a global agent is an aggregator)
the global agent comprising processing circuitry configured to: ([0031], lines 7-10)

Claim 20 recites a system which implements the same features as the method of claim 2 and is therefore rejected for at least the same reasons.

Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

The factual inquiries for establishing a background for determining obviousness under 35 U.S.C. 103 are summarized as follows:
1. Determining the scope and contents of the prior art.
2. Ascertaining the differences between the prior art and the claims at issue.
3. Resolving the level of ordinary skill in the pertinent art.
4. Considering objective evidence present in the application indicating obviousness or nonobviousness.
This application currently names joint inventors. In considering patentability of the claims the examiner presumes that the subject matter of the various claims was commonly owned as of the effective filing date of the claimed invention(s) absent any evidence to the contrary.  Applicant is advised of the obligation under 37 CFR 1.56 to point out the inventor and effective filing dates of each claim that was not commonly owned as of the effective filing date of the later invention in order for the examiner to consider the applicability of 35 U.S.C. 102(b)(2)(C) for any potential 35 U.S.C. 102(a)(2) prior art against the later invention.

Claims 4, 7, and 9 are rejected under 35 U.S.C. 103 as being unpatentable over Van Seijen et al. (US 20180165603 A1) and Wang et al. (“ROMA: Multi-Agent Reinforcement Learning with Emergent Roles”).

	Regarding claim 4, Van Seijen teaches: The computer-implemented method as claimed in claim 3 wherein the combined loss function further comprises a regularization component, ([0228], lines 13-19 and [0237] disclose that the target function can be regularized by splitting the reward function into n reward functions, weighted by wi. A regularization component includes wi*Ri(s,a) for any i.)
However, Van Seijen does not explicitly teach: wherein the value of regularization component increases when the first next state, s1,t+1 and the second next state, s2,t+1, are closer together.
But Wang teaches: wherein the value of regularization component increases when the first next state, s1,t+1 and the second next state, s2,t+1, are closer together. (Page 2, col. 1, § 2 to col. 2, line 7; and page 4, col. 1, lines 1-10, equation 6, and equation 7 disclose that each agent selects an action leading to a next state s’, forming a trajectory, and that a second regularizer ℒD seeks to maximize the difference between the trajectories of agents i and j. This has the effect of increasing the regularizer when the next states of agents i and j are closer together.)
	It would have been obvious to a person having ordinary skill in the art before the effective filing date of the claimed invention to have applied Wang’s regularizer ℒD to Van Seijen’s loss function as it has been applied to Wang’s equation 7. A motivation for the combination is to encourage sub-task specialization. (Wang, page 4, col. 1, lines 1-10)

	Regarding claim 7, the combination of Van Seijen and Wang teaches: The computer-implemented method as claimed in claim 4 
Van Seijen teaches: wherein the combined loss function comprises a sum of each of the plurality of loss functions 
	However, Van Seijen does not explicitly teach: the plurality of loss functions plus the regularization component.
	But Wang teaches: the plurality of loss functions plus the regularization component. (On page 4, § 3.3, in equation 7, ℒD represents a regularizer which is added to other loss functions ℒTD and ℒI.)
	A motivation for the combination is the same as the motivation given for claim 4.

	Regarding claim 9, the combination of Van Seijen and Wang teaches: The computer-implemented method as claimed in claim 7 
	However, Van Seijen does not explicitly teach: wherein the regularization component is multiplied by a regularization weighting factor, δ.
	But Wang teaches: wherein the regularization component is multiplied by a regularization weighting factor, δ. (On page 4, col. 2, the line below equation 7 states λD is a scalar value for the regularizer.)
	A motivation for the combination is the same as the motivation given for claim 4.
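
For illustration only, the structure recited in claims 7 and 9 (a sum of the loss functions plus a regularization component multiplied by a weighting factor δ) can be sketched as follows. The function and argument names are hypothetical; the sketch does not reproduce Wang’s equation 7 or any formula from the specification.

```python
def combined_loss(local_losses, regularization, delta=1.0):
    """Sum of the per-agent losses plus a weighted regularization component.

    local_losses:   iterable of scalar loss values, one per local agent
    regularization: scalar regularization component (assumed precomputed)
    delta:          regularization weighting factor (hypothetical default of 1.0)
    """
    return sum(local_losses) + delta * regularization


# Two hypothetical local losses and a regularization value.
total = combined_loss([0.5, 0.25], regularization=2.0, delta=0.5)
```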

Claim 5 is rejected under 35 U.S.C. 103 as being unpatentable over Van Seijen et al. (US 20180165603 A1), Wang et al. (“ROMA: Multi-Agent Reinforcement Learning with Emergent Roles”), and Song et al. (“Measuring the Distance Between Finite Markov Decision Processes”).
	
	Regarding claim 5, the combination of Van Seijen and Wang teaches: The computer-implemented method as claimed in claim 4 further comprising determining the regularization component by 
	However, Van Seijen and Wang do not explicitly teach: determining a distance, d(s1,t+1, s2,t+1), between the first next state and the second next state by: for each action in the combination of all available actions in the first state and the second state: calculating a reward difference as a magnitude of a difference between local rewards obtained from the first next state and the second next state after applying the action; and calculating a Kantorovich distance between the probability of transitioning into the first next state based on the first state and the first action, and the probability of transitioning into the second next state based on the second state and the second action; and calculating a distance sum for the action by summing the reward difference and the Kantorovich distance; and setting the distance between the first next state and the second next state as the distance sum with the maximum value.
	But Song teaches: determining a distance, 
for each action in the combination of all available actions in the first state and the second state: (Page 470, col. 1, in the paragraph starting with “Equivalent state representations” up to and including col. 2, Definition 3.3 discloses the calculating limitations below. The notation “a∈A” in equation 2  means for each action in the finite set of actions A. Equation 2 is equivalent to the equation in the specification on page 10, line 25.)
calculating a reward difference as a magnitude of a difference between local rewards obtained from the first next state and the second next state after applying the action; and (In equation 2, the term [equation image: media_image1.png] is a reward difference. [equation image: media_image2.png] is the immediate reward for taking action a from state s in S1 for the first MDP, and thus is the reward obtained from transitioning to the first next state after applying the action. [equation image: media_image3.png] is the immediate reward for taking action a from state s’ in S2 for the second MDP, and thus is the reward obtained from transitioning to the second next state after applying the action.)
calculating a Kantorovich distance between the probability of transitioning into the first next state based on the first state and the first action, and the probability of transitioning into the second next state based on the second state and the second action; and (In equation 2, [equation image: media_image4.png] is a Kantorovich distance, and c is a weight.)
calculating a distance sum for the action by summing the reward difference and the Kantorovich distance; and (The addition sign in equation 2 discloses the summing of the two terms.)
setting the distance between the first 
[equation image: media_image5.png] represents setting the distance d’ as the distance sum with the maximum value.)
Song discloses a distance metric between a first next state and a second next state at page 470, col. 2, in the paragraph starting with “The worst case running time” when the two MDPs are deterministic ones. It would have been obvious to a person having ordinary skill in the art before the effective filing date of the claimed invention to have extended Song’s distance metric between first and second states to a distance metric between first and second next states, and to have incorporated the distance metric into the regularization term in the combination of Van Seijen and Wang. A motivation for the combination is to incorporate homogenous task information. (Song, Page 470, final paragraph in col. 1)
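
For illustration only, the per-action computation mapped above (a reward difference plus a weighted Kantorovich distance, maximized over actions) can be sketched as follows. This is not Song’s equation 2 as published. The sketch assumes both transition distributions are defined over the same ordered, unit-spaced set of successor states, in which case the Kantorovich distance reduces to the sum of absolute differences between cumulative distributions; a general ground metric would require solving a transport problem, which is not shown. All names are hypothetical.

```python
from itertools import accumulate


def kantorovich_1d(p, q):
    """Kantorovich (Wasserstein-1) distance between two distributions
    on the same ordered, unit-spaced support (simplifying assumption)."""
    cdf_p = list(accumulate(p))
    cdf_q = list(accumulate(q))
    return sum(abs(a - b) for a, b in zip(cdf_p, cdf_q))


def state_distance(rewards_1, rewards_2, trans_1, trans_2, c=1.0):
    """Maximum over actions of |r1(a) - r2(a)| + c * K(P1(a), P2(a)).

    rewards_*: dict mapping action -> immediate reward
    trans_*:   dict mapping action -> list of transition probabilities
    c:         weight on the Kantorovich term (hypothetical default of 1.0)
    """
    return max(
        abs(rewards_1[a] - rewards_2[a])
        + c * kantorovich_1d(trans_1[a], trans_2[a])
        for a in rewards_1
    )


# Hypothetical two-action example over three successor states.
r1 = {"left": 1.0, "right": 0.0}
r2 = {"left": 0.5, "right": 0.0}
p1 = {"left": [1, 0, 0], "right": [0, 0, 1]}
p2 = {"left": [0, 1, 0], "right": [0, 0, 1]}
d = state_distance(r1, r2, p1, p2)
```

In this example the action "left" contributes a reward difference of 0.5 and a Kantorovich distance of 1, so it determines the maximum.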

Claim 8 is rejected under 35 U.S.C. 103 as being unpatentable over Van Seijen et al. (US 20180165603 A1), Wang et al. (“ROMA: Multi-Agent Reinforcement Learning with Emergent Roles”), and Kullberg et al. (US 20180144472 A1).

	Regarding claim 8, the combination of Van Seijen and Wang teaches: The computer-implemented method as claimed in claim 7 
	However, Van Seijen and Wang do not explicitly teach: wherein the sum of each of the plurality of loss functions is a weighted sum, wherein each loss function is associated with a weighting factor, βi.
	But Kullberg teaches: wherein the sum of each of the plurality of loss functions is a weighted sum, wherein each loss function is associated with a weighting factor, βi. ([0100]-[0101] and [0104] disclose a sum of cost (loss) functions, each associated with a weighting factor ωi.)
	It would have been obvious to a person having ordinary skill in the art before the effective filing date of the claimed invention to have applied Kullberg’s weighting factors to each head’s loss function in Van Seijen, paragraph [0243]. A motivation for the combination is to control a contribution of each head to the overall loss function. 
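
For illustration only, a weighted sum in which each loss function is associated with its own weighting factor, as recited in claim 8, can be sketched as follows. The function and argument names are hypothetical and are not taken from Kullberg or Van Seijen.

```python
def weighted_loss_sum(local_losses, betas):
    """Weighted sum of per-agent losses: sum over i of beta_i * loss_i.

    local_losses: list of scalar loss values, one per head or agent
    betas:        list of weighting factors, one per loss (hypothetical)
    """
    if len(local_losses) != len(betas):
        raise ValueError("each loss needs exactly one weighting factor")
    return sum(b * loss for b, loss in zip(betas, local_losses))


# Down-weighting the second head relative to the first.
total = weighted_loss_sum([2.0, 4.0], [0.5, 0.25])
```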

Claim 12 is rejected under 35 U.S.C. 103 as being unpatentable over Van Seijen et al. (US 20180165603 A1) and Tan et al. (US 20160205697 A1).

Regarding claim 12, Van Seijen teaches: The computer-implemented method as claimed in claim 1 wherein the first local agent is configured to control… , and the second local agent is configured to control… ([0088] discloses a first local agent is a horizontal agent which controls a robot’s motion in a horizontal direction, and a second local agent is a vertical agent which controls the robot’s motion in a vertical direction.)
However, Van Seijen does not explicitly teach: wherein the first local agent is configured to control a transmission power of an antenna to decrease a Signal-to-Interference-plus-Noise Ratio, SINR, of the antenna, and the second local agent is configured to control a tilt of an antenna to decrease the SINR of the antenna.
But Tan teaches: wherein the first local agent is configured to control a transmission power of an antenna to [increase] 
Since Tan’s actuating agents can adjust antenna tilts and transmit power levels to increase SINR, they can also adjust antenna tilts and transmit power levels to decrease SINR. It would have been obvious to a person having ordinary skill in the art before the effective filing date of the claimed invention to have applied Van Seijen’s reinforcement learning methods to Tan’s agents in a self-organizing network. A motivation for the combination is to manage and operate large, complex mobile networks. (Tan, [0003])

Claim 13 is rejected under 35 U.S.C. 103 as being unpatentable over Van Seijen et al. (US 20180165603 A1), Calzolari et al. (US 20200412417 A1) and Oroskar et al. (US 10433255 B1).

	Regarding claim 13, Van Seijen teaches: The computer-implemented method as claimed in claim 1 wherein the first local agent is configured to control…, and the second local agent is configured to control… ([0088] discloses a first local agent is a horizontal agent which controls a robot’s motion in a horizontal direction, and a second local agent is a vertical agent which controls the robot’s motion in a vertical direction.)
	However, Van Seijen does not explicitly teach: wherein the first local agent is configured to control radio access network counters to minimize handover rate in the radio access network, and the second local agent is configured to control radio access network counters to maximize Reference Signal Received Power, RSRP, values in the network.
	But Calzolari teaches: wherein the first local agent is configured to control radio access network 
The [first] 
It would have been obvious to a person having ordinary skill in the art before the effective filing date of the claimed invention to have applied Van Seijen’s first and second local agents to control antenna switching, with the first local agent penalizing lower RSRP values and the second local agent rewarding higher RSRP values. A motivation for the combination is to determine whether to switch from a currently active antenna to a different, more optimal antenna. (Calzolari, [0018])
	However, the combination of Van Seijen and Calzolari does not explicitly teach: control radio access network counters
	But Oroskar teaches: control radio access network counters (C. 9, L. 31-44)
	It would have been obvious to a person having ordinary skill in the art before the effective filing date of the claimed invention to have incorporated Oroskar’s handover counters into the combination of Van Seijen and Calzolari. A motivation for the combination is that handover counters would provide additional information for the agents to determine whether to switch antennas.

Claim 14 is rejected under 35 U.S.C. 103 as being unpatentable over Van Seijen et al. (US 20180165603 A1) and Cui et al. (US 20170257289 A1).

	Regarding claim 14, Van Seijen teaches: The computer-implemented method as claimed in claim 1 wherein the first local agent is configured to control… , and the second local agent is configured to control… ([0088] discloses a first local agent is a horizontal agent which controls a robot’s motion in a horizontal direction, and a second local agent is a vertical agent which controls the robot’s motion in a vertical direction.)
However, Van Seijen does not explicitly teach: wherein the first local agent is configured to control obtaining resources for a first network slice to meet network requirements of the first network slice, and the second local agent is configured to control obtaining resources for a second network slice to meet network requirements of the second network slice.
	But Cui teaches: wherein the first local agent is configured to control obtaining resources for a first network slice to meet network requirements of the first network slice, and ([0030], lines 5-7; [0031]-[0032] and [0034] disclose a machine-learned SDN controller 320 for slicing a network into at least a first slice.)
The [first] 
It would have been obvious to a person having ordinary skill in the art to have incorporated Van Seijen’s first and second local agents into Cui’s SDN controller 320, where each local agent controls a different slice. A motivation for the combination is to allow a more dynamic network slicing decision to best utilize the limited RAN (radio access network) resources. (Cui, [0031])

Claim 15 is rejected under 35 U.S.C. 103 as being unpatentable over Van Seijen et al. (US 20180165603 A1) and Misra et al. (US 20210326883 A1).

Regarding claim 15, Van Seijen teaches: A method in a local agent, wherein the local agent is configured to perform a reinforcement learning, RL, model in an environment, the method comprising: ([0234] discloses a single-agent loss function)
[determining, at] 
…
[determining] 
determining an updated action based on the 
performing the updated action. ([0064] discloses a robot executes an action.)
	However, Van Seijen does not explicitly teach: transmitting, to a global agent, a loss function; transmitting current values of the parameters to the global agent; receiving updated values of the parameters from the global agent;
	But Misra teaches: transmitting, to a global agent, a loss function, and transmitting current values of the parameters to the global agent; ([0031] and [0034]-[0037] disclose submitting a generated ruleset and an objective function to a server interface. A “global agent” corresponds to the server interface, “a loss function” corresponds to the objective function, and “current values of the parameters” corresponds to the ruleset.)
receiving updated values of the parameters from the global agent; ([0038]-[0039] disclose that the server interface generates rules design feedback and transmits it to the client interface. The “rules design feedback” corresponds to updated values of the parameters.)
	It would have been obvious to a person having ordinary skill in the art before the effective filing date of the claimed invention to have incorporated Misra’s evaluation of rules and objective function by a server interface into Van Seijen’s multi-agent reinforcement learning system. A motivation for the combination is to validate model parameters.

Claim 16 is rejected under 35 U.S.C. 103 as being unpatentable over Van Seijen et al. (US 20180165603 A1), Misra et al. (US 20210326883 A1), and Tan et al. (US 20160205697 A1).

Regarding claim 16, the combination of Van Seijen and Misra teaches: The method as claimed in claim 15 
Van Seijen teaches: wherein the local agent is configured to control one of: ([0064] discloses a local agent controlling a robot.)
However, Van Seijen and Misra do not explicitly teach: a transmission power of an antenna to decrease a Signal-to-Interference-plus-Noise Ratio, SINR, of the antenna; or a tilt of an antenna to decrease the SINR of the antenna.
But Tan teaches: wherein the local agent is configured to control one of: a transmission power of an antenna to decrease a Signal-to-Interference-plus-Noise Ratio, SINR, of the antenna; or a tilt of an antenna to decrease the SINR of the antenna. ([0034], lines 12-14 and [0067], line 1 to “system” in line 4 disclose agents that adjust antenna tilts and/or transmit power levels to reduce inter-cell interference. Reducing interference leads to an increased SINR. Also, [0076], lines 7-11 discloses that decision-making agents can use reinforcement learning.)
Since Tan’s actuating agents can adjust antenna tilts and transmit power levels to increase SINR, they can also adjust antenna tilts and transmit power levels to decrease SINR. It would have been obvious to a person having ordinary skill in the art before the effective filing date of the claimed invention to have applied Van Seijen’s reinforcement learning methods to Tan’s agents in a self-organizing network. A motivation for the combination is to manage and operate large, complex mobile networks. (Tan, [0003])
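
For illustration only, the relationship between interference and SINR relied on above follows from the standard definition of SINR as signal power divided by the sum of interference power and noise power. The sketch below uses hypothetical power values in linear units and is not drawn from Tan.

```python
import math


def sinr(signal_power, interference_power, noise_power):
    """Signal-to-Interference-plus-Noise Ratio as a linear ratio."""
    return signal_power / (interference_power + noise_power)


def to_db(ratio):
    """Convert a linear power ratio to decibels."""
    return 10 * math.log10(ratio)


# Hypothetical values: lowering interference from 4.0 to 1.0 raises the SINR,
# while raising interference lowers it.
before = sinr(10.0, 4.0, 1.0)
after = sinr(10.0, 1.0, 1.0)
```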

Claim 17 is rejected under 35 U.S.C. 103 as being unpatentable over Van Seijen et al. (US 20180165603 A1), Misra et al. (US 20210326883 A1), Calzolari et al. (US 20200412417 A1), and Oroskar et al. (US 10433255 B1).

	Regarding claim 17, the combination of Van Seijen and Misra teaches: The method as claimed in claim 15 
Van Seijen teaches: wherein the local agent is configured to control one of: ([0064] discloses a local agent controlling a robot.)
However, Van Seijen and Misra do not explicitly teach: radio access network counters to minimize handover rate in the radio access network; or radio access network counters to maximize Reference Signal Received Power, RSRP, values in the network.
But Calzolari teaches: wherein the local agent is configured to control one of: radio access network 
radio access network 
It would have been obvious to a person having ordinary skill in the art before the effective filing date of the claimed invention to have applied Van Seijen’s local agent to control antenna switching. A motivation for the combination is to determine whether to switch from a currently active antenna to a different, more optimal antenna. (Calzolari, [0018])
However, the combination of Van Seijen, Misra, and Calzolari does not explicitly teach: control radio access network counters
But Oroskar teaches: control radio access network counters (C. 9, L. 31-44)
	It would have been obvious to a person having ordinary skill in the art before the effective filing date of the claimed invention to have incorporated Oroskar’s handover counters into the combination of Van Seijen, Misra, and Calzolari. A motivation for the combination is that handover counters would provide additional information for the agent to determine whether to switch antennas.

Claim 18 is rejected under 35 U.S.C. 103 as being unpatentable over Van Seijen et al. (US 20180165603 A1), Misra et al. (US 20210326883 A1), and Cui et al. (US 20170257289 A1).

	Regarding claim 18, the combination of Van Seijen and Misra teaches: The method as claimed in claim 15 
	Van Seijen teaches: wherein the local agent is configured to control ([0064] discloses a local agent controlling a robot.)
However, Van Seijen and Misra do not explicitly teach: obtaining resources for a network slice to meet network requirements of the network slice.
But Cui teaches: wherein the local agent is configured to control obtaining resources for a network slice to meet network requirements of the network slice. ([0030], lines 5-7; [0031]-[0032] and [0034] disclose a machine-learned SDN controller 320 for slicing a network into at least a first slice.)
It would have been obvious to a person having ordinary skill in the art to have incorporated Van Seijen’s local agents into Cui’s SDN controller 320. A motivation for the combination is to allow a more dynamic network slicing decision to best utilize the limited RAN (radio access network) resources. (Cui, [0031])

Examiner’s Note
	No prior art rejection has been provided for pending claim 6. The features of
the plurality of loss functions comprises N loss functions, where N is an integer, each associated with a respective RL model performed by an ith agent, where i = 1, … , N wherein the ith loss function is calculated based on a ith replay experience of the ith agent, wherein the ith replay experience comprises an ith state, an ith action, an ith reward and an ith next state; and wherein the step of determining the regularization component comprises: calculating distances between each combination of possible pairs of states in: the first next state to the Nth next state; calculating a Frobenius norm of a matrix comprising the distances; setting the regularization component as a square of the Frobenius norm divided by 2, when taken in the context of the claim as a whole, were not uncovered in the prior art of record.
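
For illustration only, the arithmetic recited in claim 6 (a matrix of pairwise distances between next states, and a regularization component equal to the square of its Frobenius norm divided by 2) can be sketched as follows. The sketch assumes a full symmetric N-by-N distance matrix and uses a hypothetical absolute-difference distance on scalar states in place of the distance defined in the claims.

```python
def frobenius_regularizer(next_states, distance):
    """Return ||D||_F^2 / 2, where D[i][j] = distance(next_states[i], next_states[j]).

    The squared Frobenius norm equals the sum of the squared matrix entries,
    so no square root is needed.
    """
    n = len(next_states)
    matrix = [
        [distance(next_states[i], next_states[j]) for j in range(n)]
        for i in range(n)
    ]
    squared_norm = sum(entry * entry for row in matrix for entry in row)
    return squared_norm / 2


# Hypothetical scalar next states for N = 3 agents.
reg = frobenius_regularizer([0, 3, 7], lambda a, b: abs(a - b))
```

Here the off-diagonal distances are 3, 7, and 4, each appearing twice, so the squared norm is 148 and the regularization component is 74.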

Conclusion
The prior art made of record and not relied upon is considered pertinent to applicant's disclosure. Chandak et al. (US 20220121968 A1) teaches a policy loss function plus an entropy-regularizer in the final line of Algorithm 1, as explained in paragraph [0108].
Any inquiry concerning this communication or earlier communications from the examiner should be directed to Asher H. Jablon whose telephone number is (571)270-7648. The examiner can normally be reached Monday - Friday, 9:00 am - 6:00 pm.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Abdullah Al Kawsar, can be reached at (571)270-3169. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.





/A.H.J./Examiner, Art Unit 2127                                                                                                                                                                                                        

/ABDULLAH AL KAWSAR/Supervisory Patent Examiner, Art Unit 2127
Prosecution Timeline

Jun 21, 2023
Application Filed
Apr 30, 2026
Non-Final Rejection mailed — §101, §102, §103 (current)
Precedent Cases

Applications granted by this same examiner with similar technology

18/080,407
Patent 12626141
AUTOMATED GENERATION OF MACHINE LEARNING MODELS
3y 5m to grant Granted May 12, 2026
18/786,758
Patent 12614076
NEURAL NETWORK OPTIMIZATION DEVICE FOR EDGE DEVICE MEETING ON-DEMAND INSTRUCTION AND METHOD USING THE SAME
1y 9m to grant Granted Apr 28, 2026
17/074,878
Patent 12572794
SYSTEM AND METHOD FOR AUTOMATED OPTIMAZATION OF A NEURAL NETWORK MODEL
5y 4m to grant Granted Mar 10, 2026
17/025,418
Patent 12456047
Distilling from Ensembles to Improve Reproducibility of Neural Networks
5y 1m to grant Granted Oct 28, 2025
17/209,302
Patent 12450493
DIMENSION REDUCTION IN THE CONTEXT OF UNSUPERVISED LEARNING
4y 7m to grant Granted Oct 21, 2025
Study what changed to get past this examiner. Based on 5 most recent grants.
Prosecution Projections

1-2
Expected OA Rounds
43%
Grant Probability
87%
With Interview (+44.0%)
4y 4m (~1y 5m remaining)
Median Time to Grant
Low
PTA Risk
Based on 93 resolved cases by this examiner. Grant probability derived from career allowance rate.