Office Action Analysis: 18119755 — Coordinating Reinforcement Learning (RL) for multiple agents in a distributed system

Office Action

§103 §112
Notice of Pre-AIA  or AIA  Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA .
Claim Rejections - 35 USC § 112
The following is a quotation of 35 U.S.C. 112(b):
(b)  CONCLUSION.—The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the inventor or a joint inventor regards as the invention.


The following is a quotation of 35 U.S.C. 112 (pre-AIA ), second paragraph:
The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the applicant regards as his invention.


Claims 1-20 are rejected under 35 U.S.C. 112(b) or 35 U.S.C. 112 (pre-AIA ), second paragraph, as being indefinite for failing to particularly point out and distinctly claim the subject matter which the inventor or a joint inventor (or for applications subject to pre-AIA  35 U.S.C. 112, the applicant), regards as the invention.

The term “relatively high level of affiliation ” in independent claims 1, 12 are a relative term which renders the claim indefinite. The term “relatively high level of affiliation” is not defined by the claim, the specification does not provide a standard for ascertaining the requisite degree, and one of ordinary skill in the art would not be reasonably apprised of the scope of the invention. One of ordinary skill in the art would not know whether the claimed “relatively high level of affiliations” includes only each agent’s nearest neighbor, or if not, how many levels beyond the nearest neighbors are included.
Dependent claims 2-11, 13-20 inherit the same deficiency from their corresponding parent claim and are therefore similarly rejected.
Claim 11 similarly recites “relatively high level of affiliation” and thus is similarly additionally rejected.
Claims 2, 13 similarly recite “relatively low level of affiliation” and “little or no affiliation, and thus are similarly additionally rejected.
Claims 4, 15 similarly recite “a certain degree” and thus are similarly additionally rejected.
Claim 6 does not seem to make grammatical sense, perhaps it should read (i.e., similar to claim 17) “local improvement procedure in accordance with a predetermined sequence” instead of current language “local improvement procedure at a predetermined sequence.” 
Claims 6, 17 recite the phrase “predetermined sequence”, but that phrase does not seem to imply any sequence, and therefore the phrase seems indefinite.

Claim Objections
Claim 16 is objected to because of the following informalities: the word “aa” seems like it should be “a”.  Appropriate correction is required.


Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

This application currently names joint inventors. In considering patentability of the claims the examiner presumes that the subject matter of the various claims was commonly owned as of the effective filing date of the claimed invention(s) absent any evidence to the contrary.  Applicant is advised of the obligation under 37 CFR 1.56 to point out the inventor and effective filing dates of each claim that was not commonly owned as of the effective filing date of the later invention in order for the examiner to consider the applicability of 35 U.S.C. 102(b)(2)(C) for any potential 35 U.S.C. 102(a)(2) prior art 


Note: In order to better show what is and is not taught by the references, Examiner shows some words underlined. Words that are underlined indicate teachings of the cited reference, and may not specifically be claimed.




Claims 1-10, 12-20 are rejected under 35 U.S.C. 103 as being unpatentable over Bird et al. (US20200175419, Bird) in view Janulewicz et al. (US20190138948, Janulewicz), as best understood.

As to claim 1:
Bird shows a non-transitory computer-readable medium associated with 
an individual agent arranged within a distributed system having multiple agents and multiple links (¶ [0025]) (e.g.,  FIG. 3 depicts the basic operating mechanism. In this example scenario, a set of edge machines 300 are provided. The machines act as peer computing nodes in a multi-machine collaborative learning technique of this disclosure. To this end, each edge machine builds a local machine learning model 302 of a particular behavior of interest. The edge machines communicate these models (or portions thereof) with one another, e.g., by a gossip protocol or other group communication mechanism.), 
wherein the multiple agents and multiple links are arranged in such a way so as to create different levels of affiliations between the agents (¶ [0028]) (e.g., A preferred technique to create a mode is based on multicasting (using gossiping) wherein a node propagates its model to other nearby nodes, e.g., using a fan-out (a multicasting tree), after which one or more of such nearby nodes then further propagate the model farther afield. Meanwhile, other edge machines (and thus other groupings or subset(s) of machines) , 
the non-transitory computer-readable medium configured to store computer logic having instructions that, when executed, enable one or more processing devices to: participate in a training process involving each of the multiple agents, the training process including multiple rounds of training, each round of training allowing each agent to perform a local improvement procedure (¶ [0050]) (e.g., Edge machines herein may operate as on-line learners, which learn in real-time using streamed data, or off-line learners, which learn by training over and over on a single batch of data. Preferably, a node is a life-long learner, which is a node that becomes smarter the longer it operates in this collaborative manner.) 
during each round of training, perform the local improvement procedure using training data associated with one or more other agents having a relatively high level of affiliation of the different levels of affiliations with the individual agent and additional training data associated with the individual agent itself (¶ [0049]) (e.g., neural networks are used for the learning. Neural networks here may perform in-band learning, or out-of-band learning. In-band learning involves keeping track of pieces of interesting data (e.g., anomalies), and then gossiping this data to the nearby nodes. In out-of-band learning, the neural network comprises a set of weights (floating point numbers over which various mathematical operations are performed), and it is the set of weights that are shared to facilitate the collaboration. To this end, t receiving node would take the weights received and incorporate them in its weight matrix/vector, or the like. Another approach to training a neural network is to create a trained lightweight model (in the manner described) and then share it to a subset of the other nodes.), 
wherein the training data associated with each agent includes at least a local machine builds a local machine learning model 302 of a particular behavior of interest. The edge machines communicate these models (or portions thereof) with one another, e.g., by a gossip protocol or other group communication mechanism. Using knowledge obtained from one or more its peers, a machine 300 then adjusts its local model such that the local classification algorithm being executed by a machine is augmented or enhanced by the knowledge that was taken in by one or more of its peers. ).
Bird fails to specifically show: using Reinforcement Learning (RL) to perform a local improvement procedure; the local policy under development being a Reinforcement Learning policy.
In the same field of invention, Janulewicz teaches: RL for autonomous networks. Janulewicz further teaches: using Reinforcement Learning (RL) to perform a local improvement procedure; the local policy under development being a Reinforcement Learning policy (abstract).
Thus, it would have been obvious to one of ordinary skill in the art, having the teachings of Bird and Janulewicz before the effective filing date of the invention, to have combined the teachings of Janulewicz with the non-transitory computer-readable medium as taught by Bird.
One would have been motivated to make such combination because a way to enable autonomous, self-learning networks would have been obtained and desired, as expressly taught by Janulewicz (¶ [0004]).

As to claim 2, Bird further shows:
wherein the individual agent has little or no visibility of another set of one or more other agents having a relatively low level of affiliation of the different levels of affiliation with the individual agent (¶ [0027]) (e.g., each of a set of global host processes (running on a set of edge machines) is acting as a machine learner, which then communicates with other nearby global host processes to incorporate relevant observations (e.g., of request patterns). As used here, nearby refers to machines that are physically co-located (in a region or set of such machines), those that are not necessarily co-located physically but that are still sufficiently close to one another in a network sense (e.g., based on some latency measurement)).

As to claims 3, 14, Janulewicz further teaches:
wherein the local improvement procedure is configured to increase an RL reward of the local RL policy under development (¶ [0003]) (e.g., a software agent does not have any a priori knowledge of its operating environment and must discover which actions yield the most reward by trying them out. This leads to the trade-off between exploration and exploitation. The agent must exploit what it already knows to obtain rewards, but also needs to explore to make better actions in the future.).
One would have been motivated to make such combination because a way to enable autonomous, self-learning networks would have been obtained and desired, as expressly taught by Janulewicz (¶ [0004]).

As to claims 4, 15, Janulewicz further teaches:
The non-transitory computer-readable medium of claim 3, wherein, in each round, the local improvement procedure is configured to increase the RL reward of the local RL policy under development up to a certain degree (e.g., At each iteration of the above closed-loop, the state of the network s is determined from the telemetry data. This determines a value of the reward r(s) (also referred to as “cost”) associated with that state. Then. the RL process determines the action a that can be taken on the network in order to bring it to the next state s′, which is expected to get a better or equal reward r(s′)).
One would have been motivated to make such combination because a way to enable autonomous, self-learning networks would have been obtained and desired, as expressly taught by Janulewicz (¶ [0004]).

As to claims 5, 16:
Bird further shows:
The non-transitory computer-readable medium of claim 1, wherein, 
Janulewicz further teaches:
wherein, after each round of training is complete, the local RL policy is provided for a global reward calculation (¶ [0024], [0078]) (e.g., the optimal policies defining what actions to take for each state can be learned; The actions are specific to the identified rewards. For example, a reward of maximizing throughput of (high-priority) services and/or maximize throughput of overall network can have actions of increase/decrease bandwidth of competing services and/or re-route to less congested paths.)
One would have been motivated to make such combination because a way to enable autonomous, self-learning networks would have been obtained and desired, as expressly taught by Janulewicz (¶ [0004]).

As to claim 6, Bird further shows:
The non-transitory computer-readable medium of claim 1, wherein, in each round, the associated agent performs its local improvement procedure at a predetermined sequence (¶ [0028], [0030]) (e.g., Preferably, there is no absolute sequence or ordering of the data transfer(s); rather, data is being shared on an ad hoc basis continuously. As a result, and even where a “model” is of the same type, individual computing nodes are expected to have different variants or versions of the model in question (since the data used to build that model will be variable across subsets or groupings). While the above describes a particular order of operations performed by certain embodiments of the invention, it should be understood that such order is exemplary, as alternative embodiments may perform the operations in a different order, combine certain operations, overlap certain operations, or the like.).
One would have been motivated to make such combination because a way to enable autonomous, self-learning networks would have been obtained and desired, as expressly taught by Janulewicz (¶ [0004]).

As to claims 7, 18, Bird further shows:
wherein the distributed system is one of a real-world system, a virtual system, and a simulated system (¶ [0045] )(e.g., Internet-of-Things (IoT) devices, cloud infrastructure-based computing nodes and resources, virtualized computing nodes, virtual machines, and the like.).

As to claims 8, 19, Bird further shows:
wherein the distributed system is a communications network, each agent of the multiple agents is associated with a network node, the individual agent is associated with an individual network node, and each link is associated with a communication path between nodes (abstract).

As to claims 9, 20, Janulewicz further teaches:
wherein the training data and the additional training data includes resource availability information of respective agent related to an ability to perform network service functions (¶ [0006]) (e.g., the predetermined reward can be minimizing workload of network elements, and wherein the one or more actions can include i) re-routing one or more services to less busy network elements).
One would have been motivated to make such combination because a way to enable autonomous, self-learning networks would have been obtained and desired, as expressly taught by Janulewicz (¶ [0004]).

As to claim 10, Janulewicz further teaches:
The non-transitory computer-readable medium of claim 8, wherein the local RL policy under development is combined with the local RL policies of the other agents such that a global RL policy emerges for maximizing utilization of the network nodes to handle as many network service requests as possible (¶ [0006]) (e.g., The predetermined reward can be maximizing throughput of one or more services, high-priority services, or overall throughput of the network).
One would have been motivated to make such combination because a way to enable autonomous, self-learning networks would have been obtained and desired, as expressly taught by Janulewicz (¶ [0004]).

As to claim 12:
Bird a non-transitory computer-readable medium configured to store computer logic having instructions that, when executed, enable one or more processing devices to: 
coordinate a training process for training a distributed system having multiple agents and multiple links (¶ [0006]) (e.g., a gossip protocol, or some other equivalent communication mechanism, nodes exchange some portion of their ML models between or among each other. The portion of the local model that is exchanged with one or more other nodes encodes or encapsulates relevant knowledge (learned at the source node) for the particular behavior of interest; in this manner relevant transfer learning is enabled such that individual nodes (namely, their associated ML models) become smarter), 
wherein the multiple agents and multiple links are arranged in such a way so as to create different levels of affiliations between the agents (¶ [0024]) (e.g., individual nodes (e.g., edge machines, servers, appliances and devices) in an overlay network (e.g., a CDN) each build local models associated with a particular behavior of interest; Sets of machines that collaborate in this manner converge their models toward some steady state solution that is then used to facilitate the (i.e., whole) overlay network function or optimization); 
prompt each agent
wherein the local improvement procedure allows each individual agent to use training data associated with one or more other agents having a relatively high level of affiliation of the different levels of affiliation with the individual agent and additional training data associated with the individual agent itself (¶ [0024]) (e.g., es exchange some portion of their ML models between or among each other. The portion of the local model that is exchanged with one or more other nodes encodes or encapsulates relevant knowledge (learned at the source node) for the particular behavior of interest; in this manner, relevant transfer learning is enabled such that individual nodes (namely, their associated ML models) become smarter. Stated another way, in this scheme a number of “partial” models are built locally, and then relevant knowledge is shared between and among the machines to facilitate a collaborative, cross-validation of the relevant knowledge-base. ), and 
wherein the training data associated with each agent includes at least a local 
enable the multiple agents to repeat multiple training rounds (¶ [0050]) (e.g., training over and over on a single batch of data.).
Bird fails to specifically show: within a training round, prompt each agent to perform a local improvement procedure using Reinforcement Learning (RL)
the at least a local policy under development being an RL policy.
In the same field of invention, Janulewicz teaches: RL for autonomous networks. Janulewicz further teaches:
within a training round, prompt each agent to perform a local improvement procedure using Reinforcement Learning (RL) (abstract);
the at least a local policy under development being an RL policy (abstract).
Thus, it would have been obvious to one of ordinary skill in the art, having the teachings of Bird and Janulewicz before the effective filing date of the invention, to have combined the teachings of Janulewicz with the non-transitory computer-readable medium as taught by Bird.
One would have been motivated to make such combination because a way to enable autonomous, self-learning networks would have been obtained and desired, as expressly taught by Janulewicz (¶ [0004]).

As to claim 13, Bird further shows:
wherein each of one or more agents has little or no visibility of a set of other agents having a relatively low level of affiliation of the different levels of affiliation with the respective agent (¶ [0027]) (e.g., each of a set of global host processes (running on a set of edge machines) is acting as a machine learner, which then communicates with other nearby global host processes to incorporate relevant observations (e.g., of request patterns). As used here, nearby refers to machines that are physically co-located (in a region or set of such machines), those that are not necessarily co-located physically but that are still sufficiently close to one another in a network sense (e.g., based on some latency measurement)).

As to claim 17, Bird further shows:
wherein the instructions further enable the one or more processing devices to coordinate the agents such that, within each training round, each agent, one at a time, is allowed to perform its respective local improvement procedure in accordance with a predetermined sequence (¶ [0028], [0030]) (e.g., Preferably, there is no absolute sequence or ordering of the data transfer(s); rather, data is being shared on an ad hoc basis continuously. As a result, and even where a “model” is of the same type, individual computing nodes are expected to have different variants or versions of the model in question (since the data used to build that model will be variable across subsets or groupings). While the above describes a particular order of operations performed by certain embodiments of the invention, it should be understood that such order is exemplary, as alternative embodiments may perform the operations in a different order, combine certain operations, overlap certain operations, or the like.).
One would have been motivated to make such combination because a way to enable autonomous, self-learning networks would have been obtained and desired, as expressly taught by Janulewicz (¶ [0004]).

It is noted that any citation to specific, pages, columns, lines, or figures in the prior art references and any interpretation of the references should not be considered to be limiting in any way.  A reference is relevant for all it contains and may be relied upon for all that it would have reasonably suggested to one having ordinary skill in the art. In re Heck, 699 F.2d 1331, 1332-33,216 USPQ 1038, 1039 (Fed. Cir. 1983) (quoting In re Lemelson, 397 F.2d 1006,1009, 158 USPQ 275, 277 (CCPA 1968)).

Conclusion
The prior art made of record and not relied upon is considered pertinent to applicant's disclosure:
Barber et al. 		[U.S. 20250350540], Creating A Global Reinforcement Learning (RL) Model From Subnetwork RL Agents
Darvish Rouhani et al.	[U.S. 12547872], Machine Learning Model Processing Based On Perplexity
Wu et al. 		[U.S. 12452733], Multi-batch Reinforcement Learning Via Multi-imitation Learning

Any inquiry concerning this communication or earlier communications from the examiner should be directed to Jordany Núñez whose telephone number is (571)272-2753. The examiner can normally be reached M-F 8:30 AM - 5 PM.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Cesar Paula can be reached on 5712724128. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.


/JORDANY NUNEZ/Primary Examiner, Art Unit 2145                                                                                                                                                                                                        3/13/2026
Read full office action
Coordinating Reinforcement Learning (RL) for multiple agents in a distributed system

Interview Optional

Examiner Intelligence

Statute-Specific Performance

Office Action

Prosecution Timeline

Strategy Recommendation AI-generated — please review before filing

Prosecution Projections

Ready to respond to this office action?

Coordinating Reinforcement Learning (RL) for multiple agents in a distributed system

Interview Optional

Examiner Intelligence

Statute-Specific Performance

Office Action

Prosecution Timeline

Strategy Recommendation AI-generated — please review before filing

Prosecution Projections

Ready to respond to this office action?

Sign in with your work email