Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Claim Objections
Claim 10 is objected to because of the following informalities: claim 10 recites “The device according to claim 8, wherein the autoencoder comprises ...”, i.e., it depends on claim 8 and recites “the autoencoder”. However, claim 8 does not recite any “autoencoder” but only the user device. Claim 8 depends on claim 6, and claim 6 depends on claim 1; these claims similarly do not recite any “autoencoder”. The “autoencoder” is first introduced in claim 9. The examiner suggests amending the dependency of claim 10 from claim 8 to claim 9. For examination purposes, the examiner will consider claim 10 as dependent on claim 9.
Claim 17 is objected to because of the following informality: “offine training”. Appropriate correction is required. The examiner suggests amending the term to “offline training”. For examination purposes, the examiner will consider “offine training” as “offline training”.
Claim Rejections - 35 USC § 112
The following is a quotation of 35 U.S.C. 112(b):
(b) CONCLUSION.—The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the inventor or a joint inventor regards as the invention.
The following is a quotation of 35 U.S.C. 112 (pre-AIA), second paragraph:
The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the applicant regards as his invention.
Claim 5 is rejected under 35 U.S.C. 112(b) or 35 U.S.C. 112 (pre-AIA), second paragraph, as being indefinite for failing to particularly point out and distinctly claim the subject matter which the inventor or a joint inventor (or, for applications subject to pre-AIA 35 U.S.C. 112, the applicant) regards as the invention.
Claim 5 recites “The device according to claim 1, wherein each DRL agent is further configured to output a next inner state based on said states, the next inner state representing a network configuration of the DRL agent based on said states”. According to claim 5, the next inner state is based on “said states”; however, the claim does not clarify upon which states the next inner state is based. Claim 1, from which claim 5 depends, recites various states, such as “states representing information of the wireless network and information of the component” and “the states are augmented”. It is therefore unclear to which of these states “said states” refers. The use of the term “said states” thus renders the claim indefinite for failing to particularly point out and distinctly claim the subject matter. For examination purposes, the examiner will consider “said states” as the “states representing information of the wireless network” recited in claim 1.
Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.
Claim 1 is rejected under 35 U.S.C. 103 as being unpatentable over Bouton et al. (US 20240022950 A1) in view of Nagarathinam et al. (US 20210200163 A1).
Regarding claim 1,
Bouton teaches the limitation “at least one processor” (paragraph 26 “An electronic device stores and transmits ... code (which is composed of software instructions and which is sometimes referred to as computer program code or a computer program) and/or data using machine-readable media (also called computer-readable media) ... Thus, an electronic device (e.g., a computer) includes hardware and software, such as a set of one or more processors” Bouton discloses a method and system for optimizing radio access networks using reinforcement learning, wherein the method and system can be implemented via an electronic device that includes hardware comprising a set of one or more processors.)
Bouton teaches the limitation “at least one memory including computer program code” (paragraph 26 “An electronic device stores and transmits (internally and/or with other electronic devices over a network) code (which is composed of software instructions and which is sometimes referred to as computer program code or a computer program) and/or data using machine-readable media (also called computer-readable media), such as machine-readable storage media (e.g., magnetic disks, optical disks, solid state drives, read only memory (ROM), flash memory devices, phase change memory)” Bouton discloses an electronic device that stores computer program code within machine-readable storage media, such as memory devices, to perform the system and method.)
Bouton teaches the limitation “at least one memory and the computer program code configured, with the at least one processor, to cause the device to” (paragraph 26 “An electronic device stores and transmits (internally and/or with other electronic devices over a network) code (which is composed of software instructions and which is sometimes referred to as computer program code or a computer program) and/or data using machine-readable media (also called computer-readable media), such as machine-readable storage media (e.g., magnetic disks, optical disks, solid state drives, read only memory (ROM), flash memory devices, phase change memory) ... Thus, an electronic device (e.g., a computer) includes hardware and software, such as a set of one or more processors” Bouton discloses an electronic device comprising computer program code stored within machine-readable storage media, such as memory devices, as well as a set of one or more processors, to perform the system and method.)
Bouton teaches the limitation “configure deep reinforced learning, DRL, agents, wherein each DRL agent is configured to emulate an operation of a component of the wireless network, and each DRL agent is configured to states representing information of the wireless network and information of the component” (paragraph 43 “The embodiments address the problem of optimizing network performance by modeling the problem as a multiagent Markov decision process (MDP) where each base station is an agent.” and paragraph 44 “Each base station can observe various performance indicators from the network such as the Signal to Interference and Noise Ratio (SINR), Reference Signal Received Power (RSRP), and/or Channel Quality Indicator (CQI) of each user equipment (UE) connected to it. This information can be processed and used as a state input to the reinforcement learning process” Bouton discloses that the embodiments address the problem of optimizing network performance by modeling the problem as a multiagent reinforcement learning framework in which each base station is an agent. Under the broadest reasonable interpretation, the base station is analogous to the claimed component of the wireless network, and each base station is represented as an agent within the reinforcement learning framework, which is analogous to the claimed emulation of an operation of a component of the wireless network. Furthermore, each base station can observe various performance indicators from the network, such as the Signal to Interference and Noise Ratio (SINR) and the Channel Quality Indicator (CQI) of each user equipment (UE) connected to it; these performance indicators (e.g., SINR, CQI) correspond to the claimed information of the wireless network and information of the component, respectively.)
Bouton teaches the limitation “inter-connect the DRL agents to emulate real connections between the components in the wireless network” (paragraph 41 “The embodiments enable efficient use of inter-agent communication. The data measured from the network stay local to the base stations and are not sent to a centralized server, thereby avoiding privacy concerns. The coordination mechanism of the embodiments uses a low amount of data transfer. In addition, the distributed nature of the embodiments makes them more resilient to failures in the network” Bouton discloses that the embodiments enable efficient use of inter-agent communication, in which the multiagent reinforcement learning framework allows communication between agents to transfer information, thereby corresponding to inter-connecting the DRL agents to emulate real connections between the components in the wireless network, as claimed.)
Bouton teaches the limitation “execute the DRL agents based on the states as inputs to simulate the wireless network online” (paragraph 89 “Thus, the embodiments provide a distributed coordinated reinforcement to optimize network configuration online. The representation of the cellular networks as a coordination graph allows global optimization to be performed in a distributed way and using local information and communication. The embodiments provide the possibility to transfer knowledge through the edge of the graph either by smart initialization (simulation or expert knowledge)” Bouton discloses above that each agent represents a base station and receives state input to perform reinforcement learning, wherein the embodiments of decentralized coordinated reinforcement learning provide the possibility to transfer knowledge through the edges of the graph either by smart initialization (simulation or expert knowledge), which is analogous to the claimed process of executing the DRL agents based on the states as inputs to simulate the wireless network online.)
Bouton does not teach the receive and execute training data aspect of each agent within the limitation “wherein the DRL agents are configured to receive and execute training data so that the states are augmented and reward estimated”. However, Nagarathinam teaches this (paragraph 42 “At step 406 of the present disclosure, the one or more hardware processors 104 train a plurality of deep Reinforcement learning (RL) agents using the generated simulation model ... It is understood by person having ordinary skill in the art or person skilled in the art that RL comes under the class of ML where a RL agent interacts with the environment and learns to act using a Markov Decision Process (MDP) method. Generally, MPDs are defined by a tuple consisting the environment state, action taken, resulting reward and a transition probability. Specifically, the tuple is given by (S, A, P, r) where S is the state space, A is the action space, P is the transition probability of environment state being St+1 time t+1 as a result of choosing an action A in state St at time t and R is the immediate reward received after transiting from state St to St+1”, paragraph 53 “In the learning phase, the tuples ... are stored in experience buffer. A mini batch is sampled for the network training.” Nagarathinam discloses a system that implements multi-agent deep reinforcement learning for dynamically controlling electrical equipment. Within the disclosure, Nagarathinam discloses that the system carries out the reinforcement learning framework via a simulation of a plurality of deep Reinforcement Learning (RL) agents that represent each piece of electrical equipment and monitor one or more states affecting performance of the one or more controllable electrical equipment for continuous learning. Each agent may receive, as input, a sampled mini-batch of data for training. The training of the deep reinforcement learning is carried out within the simulation to simulate the decision made by each agent in accordance with a transition from the old state to a new state and a reward received for the state transition, wherein this training procedure is analogous to each DRL agent being configured to receive and execute training data so that the states are augmented and reward estimated, as claimed.)
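As general background only (the following notation is not quoted from either reference and is offered solely to clarify the MDP formalism cited above), the tuple and one conventional value update may be written as:

    \text{MDP} = (S, A, P, r), \qquad P(s_{t+1} \mid s_t, a_t) = \Pr(S_{t+1} = s_{t+1} \mid S_t = s_t, A_t = a_t)

    Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \big[ r_t + \gamma \max_{a'} Q(s_{t+1}, a') - Q(s_t, a_t) \big]

where \alpha is a learning rate and \gamma a discount factor. The Q-learning update shown is one common instantiation of a training loop that samples mini-batches of such tuples from an experience buffer; the references describe deep reinforcement learning generally and are not limited to this update rule.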
Before the effective filing date, it would have been obvious to one of ordinary skill in the art to combine the teaching of a method and system for optimizing radio access networks using reinforcement learning by Bouton with the teaching by Nagarathinam of a multi-agent reinforcement learning framework that models connections between various electrical equipment while executing reinforcement learning at each agent using training data. The motivation to do so is found in Nagarathinam’s disclosure (paragraph 3 “Although schedule and rule-based strategies are easy to realize in practice, they are often sub-optimal either in terms of energy or occupant comfort or both. On the other hand, model based predictive control have been shown to perform much better than simple strategies. However, developing a calibrated model of a building that is required for the Model predictive control (MPC) is complex.”, and paragraph 5 “Reinforcement learning techniques can circumvent most of the above challenges. It is a model-free approach where the learning agent directly interacts with the system (to be controlled), receives (partial) feedback in the form of rewards in response to the actions taken by the agent. The agent progressively refines its policy (control law) by trying out different actions as it balances exploiting its current knowledge versus exploring unseen actions/states. These techniques in addition to being model-free can further tackle continuous state spaces as well ... which make them a very attractive alternative for traditional optimal control approaches.”). Nagarathinam discloses the benefit of the invention, which provides a model-free approach using reinforcement learning to determine the control of each piece of equipment, each represented by a reinforcement learning agent, while tackling continuous state spaces. The teaching by Nagarathinam provides a setting similar to the teaching by Bouton, where each base station likewise includes an agent that performs cooperative multi-agent reinforcement learning. While Bouton does not explicitly disclose the local training of each agent using reinforcement learning, Bouton implies training using training data at paragraph 3: “machine learning algorithms build a model based on training data. Training data can be a set of sample or starting data with known properties such as correlation with a task outcome. The training data are input into the algorithm and model to ‘train’ the AI to perform a task”. Given the similar setup between Bouton and Nagarathinam, the disclosed benefits of the reinforcement learning method as a model-free approach that handles continuous state spaces, the implication of machine training in Bouton, and the training of multiple reinforcement learning agents by Nagarathinam, one of ordinary skill in the art would have been able to implement the training of each agent within Bouton based on reinforcement learning using training data within a simulation to obtain the best possible design to optimize radio access networks. The combination would have been a predictable design choice, as the references share a similar setting and similar reinforcement learning algorithms, and the teaching by Bouton would further obtain the advantage of configuring the machine learning as a model-free simulation.
Regarding claim 2: claim 2 depends on claim 1; thus, the rejection of claim 1 is incorporated.
Bouton teaches the limitation “The device according to claim 1, wherein each DRL agent is configured to emulate an individual component in a real wireless network, wherein the component comprises the individual component and the wireless network comprises the real wireless network implemented in a certain geographical area” (paragraph 43 “The embodiments address the problem of optimizing network performance by modeling the problem as a multiagent Markov decision process (MDP) where each base station is an agent.”, and paragraph 57 “Geographic neighbor relations use the geographic distance between antennas from which it can be determined which antennas are likely to interfere with each other. Antennas belonging to the same base station are connected to each other by edges in the coordination graph and/or physically connected at the base station, and antennas belonging to base stations geographically close are also connected to each other by edges in the coordination graph” Bouton discloses modeling the problem as a multiagent reinforcement learning framework in which each base station is an agent; the base station is analogous to the claimed component of the wireless network, and its representation as an agent within the reinforcement learning framework is analogous to the claimed emulation of an individual component. Bouton further discloses that base stations are connected via antennas to create geographic neighbor relations using geographic distance, representing the physical connections between stations, which corresponds to the claimed real wireless network implemented in a certain geographical area.)
Regarding claim 3: claim 3 depends on claim 1; thus, the rejection of claim 1 is incorporated.
Bouton teaches the limitation “The device according to claim 1, wherein the states comprise an inner state representing technical inner information of the component and wherein each DRL agent is configured to receive the inner state as an input” (paragraph 44 “Each base station can observe various performance indicators from the network such as the Signal to Interference and Noise Ratio (SINR), Reference Signal Received Power (RSRP), and/or Channel Quality Indicator (CQI) of each user equipment (UE) connected to it. This information can be processed and used as a state input to the reinforcement learning process.” Bouton discloses that each base station can observe its various performance indicators, such as the Signal to Interference and Noise Ratio or the Reference Signal Received Power, which correspond to the claimed technical inner information of the component under the broadest reasonable interpretation, and that the agent of the base station takes this information as state input.)
Regarding claim 4: claim 4 depends on claim 1; thus, the rejection of claim 1 is incorporated.
Bouton teaches the limitation “The device according to claim 1, wherein the states comprise an outer state representing a wireless network user status and states of other DRL agents, and wherein each DRL agent is configured to receive the outer state as an input” (paragraph 52 “The node and/or base station collects relevant network metrics for the network parameters being optimized (Block 207). Any number or variety of metrics can be collected including power consumption, congestion rate, throughput of users, quality of experience (QoE), SINR, RSRP, CQI, and similar metrics”, paragraph 53 “Based on received optimization and network metric data from neighbor nodes, the local configuration and optimization can be updated”, and paragraph 86 “To perform the updates, the following information needs to be shared: ... the observed states at each connected antenna ... the reward signals ... This information can already be measured by the base stations. The base stations would have to communicate this information to their neighbors” Bouton discloses a plurality of base stations, each represented by an agent, configured on a coordination graph in which each node represents a base station. Each node/base station collects its own network metric information and sends it to neighbor nodes (neighbor base stations), which collect it and perform local updates accordingly. Each base station receives a variety of metric information and the observed state information at each antenna from other base stations to perform a local update, thus implying that this metric and state information serves as input to the local update, which corresponds to the claimed wireless network user status and states of other DRL agents received by each DRL agent as an outer state input.)
Regarding claim 5: claim 5 depends on claim 1; thus, the rejection of claim 1 is incorporated.
Bouton teaches the limitation “The device according to claim 1, wherein each DRL agent is further configured to output a next inner state based on said states, the next inner state representing a network configuration of the DRL agent based on said states” (paragraph 77 “The actions of Blocks 305-311 implement a reinforcement learning loop in the example of FIG. 3. These steps comprise a reinforcement learning loop with a custom update rule that only uses local information. The first action of the reinforcement loop comprises taking an action and collecting data. Once an optimal joint action is computed through message passing, each agent can decide to take this action or to take an exploratory action ... After the agent takes this action, it receives a reward from the environment. The reward signal can include any type of performance indicator measurable by a base station ... In this interaction each agent gathers an experience tuple, (si,ai,ri,si′), where si′ is the state observed after applying configuration” Bouton discloses the conventional reinforcement learning practice of each agent at each base station, which comprises the agent taking an action in response to the current state, receiving a reward in accordance with the taken action, and gathering an experience tuple comprising the current and new state information. The new state is the state observed after applying the configuration, and the base station has a new network configuration in this new state, which corresponds to the claimed next inner state, based on said states, that represents a network configuration of the DRL agent.)
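For illustration only, the following minimal Python sketch (hypothetical names and toy environment; not the implementation of Bouton or of the claimed device) shows the conventional loop described in the cited passage: the agent takes an optimal or exploratory action, receives a reward from the environment, and gathers the experience tuple (si, ai, ri, si′):

    import random
    from collections import deque

    class ToyEnv:
        """Hypothetical stand-in for the base-station environment."""
        def __init__(self):
            self.state = 0

        def step(self, action):
            reward = 1.0 if action == self.state % 2 else 0.0  # toy performance indicator
            self.state += 1
            return reward, self.state

    class ToyAgent:
        def __init__(self):
            self.buffer = deque(maxlen=1000)  # experience tuples (s, a, r, s')

        def best_action(self, state):
            # placeholder for the optimal joint action computed via message passing
            return state % 2

    env, agent, state = ToyEnv(), ToyAgent(), 0
    for _ in range(10):
        if random.random() < 0.1:              # exploratory action
            action = random.choice([0, 1])
        else:                                  # computed optimal action
            action = agent.best_action(state)
        reward, next_state = env.step(action)  # reward from the environment
        agent.buffer.append((state, action, reward, next_state))
        state = next_state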
Regarding claim 14: claim 14 depends on claim 1; thus, the rejection of claim 1 is incorporated.
Bouton teaches the limitation “The device according to claim 1, wherein the wireless network comprises a mobile network” (paragraph 28 “The embodiments provide a process and system for decentralized coordinated reinforcement learning for optimizing radio access networks (RANs). Radio access networks are key infrastructure in mobile telecommunication networks that provide connectivity between user equipment (UE) and the RAN. The RAN is composed of a set of base stations that manage communication with UEs in a region referred to as a cell”. Bouton discloses a mobile telecommunication network that provides connectivity between user equipment (UE) and the RAN, wherein the RAN is composed of a set of base stations that manage communication with UEs. The mobile telecommunication network is analogous to the claimed mobile network.)
Regarding claim 15: claim 15 depends on claim 1; thus, the rejection of claim 1 is incorporated.
Bouton teaches the limitation “The device according to claim 1, wherein the DRL agent is configured to emulate a base station, a switch, or a data processor unit of the wireless network.” (paragraph 43 “The embodiments address the problem of optimizing network performance by modeling the problem as a multiagent Markov decision process (MDP) where each base station is an agent.” Bouton discloses that each base station is an agent, which is analogous to the claimed DRL agent configured to emulate a base station.)
Regarding claim 16: claim 16 recites limitations and processing steps similar to those of claim 1. Therefore, the claim is rejected under the same rationale as claim 1.
Regarding claim 17: claim 17 depends on claim 16; thus, the rejection of claim 16 is incorporated.
Nagarathinam teaches the limitation “The method of claim 16, further comprising offine training of the DRL agents before the state augmentation” (paragraph 34 “Embodiments of the present disclosure provide systems and methods that implement a multi-agent based Deep reinforcement learning for learning the optimal control policy. RL agent interacting with a real-world building to determine optimal policy may not be viable due to comfort constraints ... Due to this, the RL agent is trained offline using a simplistic model in the present disclosure. A simplistic model (also referred as ‘simulation model’ or an ‘abstract model’) can be constructed using the design details of the building envelope (construction materials, orientation, etc.) and electrical equipment (or controller equipment)” Nagarathinam discloses systems and methods that implement multi-agent based deep reinforcement learning for learning the optimal control policy, in which the RL agent is trained offline using a simplistic model such as a simulation model. One of ordinary skill in the art would have been able to configure the offline training of the RL agent by Nagarathinam to correspond to the training of the agent at each base station by Bouton, based on their combination above, and to further configure the system so that this training occurs before the reinforcement learning transforms the state information of each agent, as this is a predictable design choice and training the agent beforehand helps stabilize and improve the performance of the reinforcement learning system, thereby obtaining the optimization result.)
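As a hedged illustration of the offline-training concept only (the toy simulation model, reward values, and function names below are hypothetical and are not Nagarathinam’s simulation model), an agent can learn action values entirely in simulation before any online interaction:

    import random

    class SimModel:
        """Hypothetical simplistic simulation model used purely offline."""
        def step(self, action):
            return 1.0 if action == 1 else 0.2   # toy reward; action 1 is better

    def offline_pretrain(episodes=500, epsilon=0.2, lr=0.1):
        """Learn action values offline, before any online deployment."""
        model, q = SimModel(), {0: 0.0, 1: 0.0}
        for _ in range(episodes):
            # epsilon-greedy action selection against the simulation model
            a = random.choice([0, 1]) if random.random() < epsilon else max(q, key=q.get)
            r = model.step(a)
            q[a] += lr * (r - q[a])              # incremental value update
        return q

    q_values = offline_pretrain()                # trained offline, ready for online use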
Regarding claim 18: claim 18 depends on claim 16; thus, the rejection of claim 16 is incorporated. Claim 18 recites limitations and processing steps similar to those of claim 2. Therefore, the claim is rejected under the same rationale as claim 2.
Regarding claim 19: claim 19 depends on claim 16; thus, the rejection of claim 16 is incorporated.
Nagarathinam teaches the limitation “The method of claim 16, wherein the method is configured for model-free simulation” (paragraph 5 “Reinforcement learning techniques can circumvent most of the above challenges. It is a model-free approach where the learning agent directly interacts with the system (to be controlled), receives (partial) feedback in the form of rewards in response to the actions taken by the agent. The agent progressively refines its policy (control law) by trying out different actions as it balances exploiting its current knowledge versus exploring unseen actions/states. These techniques in addition to being model-free ...” Nagarathinam discloses reinforcement learning techniques implemented as a model-free approach in which the learning agent directly interacts with the system, and the agent performs reinforcement learning within a simulation as mentioned above, which is analogous to the claimed model-free simulation.)
Claims 6-8 are rejected under 35 U.S.C. 103 as being unpatentable over Bouton et al. (US 20240022950 A1) in view of Nagarathinam et al. (US 20210200163 A1), further in view of Ganapathi et al. (US 20190141549 A1).
Regarding claim 6: claim 6 depends on claim 1; thus, the rejection of claim 1 is incorporated.
The combination of Bouton and Nagarathinam does not teach the limitation “The device according to claim 1, further comprising a user agent configured to emulate operations of a user device of the wireless network, and the user agent configured to generate data traffic of the wireless network and performances of the user within the wireless network.” However, Ganapathi teaches this limitation (paragraph 92 “As illustrated in FIG. 1a, a system 100 includes a user device 102 that communicates data requests through a network 104. A proxy server 108 may receive the data requests and communicate the requests to a data center 110. The adaptive network performance optimizer 106 may gather information from the proxy server 108 and store information in the historical network data traffic store 112, in an embodiment. ... Then, over time, mobile network traffic may be assigned parameters from this space at random and performance data may be stored in the historical network data traffic store 112”, and paragraph 97 “Once a multitude of raw network traffic data associated with data requests between user devices 102 and the data centers 110 are logged in the historical network data traffic store 112, it becomes possible to simulate wireless networks using probabilistic models” Ganapathi discloses a user device that communicates data requests through a network, whereby network traffic data and performance data are obtained and stored within the historical network data traffic store and then used to simulate wireless networks. The stored network traffic and performance data correspond to the claimed data traffic and performances of the user within the wireless network.)
Before the effective filing date, it would have been obvious to one of ordinary skill in the art to combine the teaching of a method and system for optimizing radio access networks using reinforcement learning by Bouton and the teaching by Nagarathinam of a multi-agent reinforcement learning framework that models connections between various electrical equipment while executing reinforcement learning at each agent, with the teaching by Ganapathi of a user device that gathers network traffic and performance data. The motivation to do so is found in Ganapathi’s disclosure (paragraph 90 “A wireless network modeler 118 may use historical network traffic data to simulate a wireless network using probabilistic models. An application performance emulator 140 generates a virtual machine using a network simulator running on sampled network parameters used to emulate performance of an application.”, paragraph 95 “By inserting an agent 114 in the user device 102 to report the observed networking conditions back to the accelerator 116, estimates about the state of the network can be vastly improved. The main benefits of having a presence (the agent 114) on the user device 102 include the ability to perform measurements that characterize one leg of the session, e.g., measuring just the client-to-server leg latency, etc.”, and paragraph 98 “For example, probabilistic modeling on historical network traffic data may be used to simulate wireless networks such that a virtual machine running the application on the simulated networks may generate outcomes of the download, such as the throughput, download complete time, and time to first byte, during a specified time period.”). Ganapathi discloses the benefit of the invention: providing a user device, in combination with an agent, to receive and use historical network traffic and performance data to simulate a wireless network. The user device is capable of performing measurements that characterize data traffic and performance, which helps improve the estimate of the state of the network. The main benefit lies in the capability of collecting traffic and performance information to perform a simulation of the network, wherein the teaching of Bouton in view of Nagarathinam similarly describes a simulation of the network and its configurations. Therefore, one of ordinary skill in the art would have further incorporated the teaching of Ganapathi into the combination to obtain the capability to perform the simulation and collect traffic and performance data, thereby obtaining historical data to further enhance the reinforcement learning.
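Purely as a hedged sketch of the probabilistic-simulation idea in the cited passages (the throughput figures and the simple Gaussian model below are hypothetical, not Ganapathi’s disclosed models), logged traffic can be fit and resampled to simulate network conditions:

    import random
    import statistics

    # Hypothetical logged throughput values standing in for the
    # historical network data traffic store.
    historical_throughput_mbps = [12.1, 9.8, 15.3, 11.0, 13.7, 10.4]

    mu = statistics.mean(historical_throughput_mbps)
    sigma = statistics.stdev(historical_throughput_mbps)

    def simulate_throughput(n):
        """Sample simulated throughput from the fitted Gaussian model."""
        return [max(0.0, random.gauss(mu, sigma)) for _ in range(n)]

    simulated_traffic = simulate_throughput(100)   # inputs for a simulated network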
Regarding claim 7: claim 7 depends on claim 6; thus, the rejection of claim 6 is incorporated.
Ganapathi teaches the limitation “The device according to claim 6, wherein the DRL agents are configured to receive the data traffic and the performances of the user within the wireless network” (paragraph 92 “As illustrated in FIG. 1a, a system 100 includes a user device 102 that communicates data requests through a network 104. A proxy server 108 may receive the data requests and communicate the requests to a data center 110. The adaptive network performance optimizer 106 may gather information from the proxy server 108 and store information in the historical network data traffic store 112, in an embodiment. ... Then, over time, mobile network traffic may be assigned parameters from this space at random and performance data may be stored in the historical network data traffic store 112”, and paragraph 97 “Once a multitude of raw network traffic data associated with data requests between user devices 102 and the data centers 110 are logged in the historical network data traffic store 112, it becomes possible to simulate wireless networks using probabilistic models” Ganapathi discloses a user device that communicates data requests through a network, whereby network traffic data and performance data are obtained and stored within the historical network data traffic store and then used to simulate wireless networks. The stored network traffic and performance data correspond to the claimed data traffic and performances of the user within the wireless network.)
Regarding claim 8: claim 8 depends on claim 6; thus, the rejection of claim 6 is incorporated.
Bouton teaches the limitation “The device according to claim 6, wherein the user device comprises a mobile device” (paragraph 2 “The mobile devices, e.g., user equipment (UE) such as mobile phones, tablets, laptops, and similar devices, may frequently travel and shift connection points with the mobile network in a manner that maintains continuous connections for the applications of the mobile devices. Typically, the mobile devices connect to the mobile network via radio access network (RAN) base stations, which provide connectivity to any number of mobile devices for a local area or ‘cell.’”)
Claims 9-10 are rejected under 35 U.S.C. 103 as being unpatentable over Bouton et al. (US 20240022950 A1) in view of Nagarathinam et al. (US 20210200163 A1), further in view of Andersen et al. (NPL: The Dreaming Variational Autoencoder for Reinforcement Learning Environments).
Regarding claim 9: claim 9 depends on claim 1; thus, the rejection of claim 1 is incorporated.
The combination of Bouton and Nagarathinam does not teach the limitation “The device according to claim 1, wherein for augmenting, the device is further configured to use an autoencoder to augment the states”. However, Andersen teaches this limitation (page 3 section 4 “The Dreaming Variational Autoencoder (DVAE) is an end-to-end solution for generating probable future states ŝt+n from an arbitrary state-space S using state-action pairs” Andersen discloses the Dreaming Variational Autoencoder (DVAE), a neural-network-based generative modeling architecture for exploration in environments with sparse feedback. The DVAE can generate probable future states from an arbitrary state-space S using state-action pairs, which corresponds to using an autoencoder to augment the states, as claimed.)
Before the effective filing date, it would have been obvious to one of ordinary skill in the art to combine the teaching of a method and system for optimizing radio access networks using reinforcement learning by Bouton and the teaching by Nagarathinam of a multi-agent reinforcement learning framework that models connections between various electrical equipment while executing reinforcement learning at each agent, with the teaching of the Dreaming Variational Autoencoder by Andersen. The motivation to do so is found in Andersen’s disclosure (page 2 section 1 “environment modeling is a promising exploration technique where the goal is for the model to imitate the behavior of the target environment. This limits the required interaction with the target environment, enabling nearly unlimited access to exploration without the cost of exhausting the target environment. In addition to environment-modeling, a balance between exploration and exploitation must be accounted for, and it is, therefore, essential for the environment model to receive feedback from the RL agent. By combining the ideas of variational autoencoders with deep RL agents, we find that it is possible for agents to learn optimal policies using only generated training data samples.”). Andersen discloses the motivation of combining the ideas of variational autoencoders with deep RL agents, thus helping the RL agents learn optimal policies using only generated training data samples. Therefore, one of ordinary skill in the art would have been motivated to combine the teaching of Bouton/Nagarathinam with the teaching of Andersen to obtain the further improvement of using the autoencoder.
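As an architectural sketch only (untrained random weights and hypothetical dimensions; not Andersen’s trained DVAE), the generative data flow quoted above can be illustrated as follows: encode a state-action pair, sample a latent via the reparameterization trick, and decode a probable next state:

    import numpy as np

    rng = np.random.default_rng(0)
    STATE, ACTION, LATENT = 8, 2, 4

    W_enc = rng.normal(size=(STATE + ACTION, 2 * LATENT))   # encoder -> (mu, log_var)
    W_dec = rng.normal(size=(LATENT, STATE))                # decoder -> next state

    def dvae_generate(state, action):
        """Generate a probable next state from a (state, action) pair."""
        x = np.concatenate([state, action])
        h = x @ W_enc
        mu, log_var = h[:LATENT], h[LATENT:]
        z = mu + np.exp(0.5 * log_var) * rng.normal(size=LATENT)  # reparameterization
        return np.tanh(z @ W_dec)                                 # generated next state

    s_hat = dvae_generate(rng.normal(size=STATE), np.array([1.0, 0.0]))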
Regarding claim 10: claim 10 recites dependence on claim 8 but is treated as depending on claim 9 for examination purposes (see the claim objections above); thus, the rejection of claim 9 is incorporated.
Andersen teaches the limitation “The device according to claim 8, wherein the autoencoder comprises a variational autoencoder, VAE” (page 3 section 4 “The Dreaming Variational Autoencoder (DVAE) is an end-to-end solution for generating probable future states ŝt+n from an arbitrary state-space S using state-action pairs” Andersen discloses the Dreaming Variational Autoencoder (DVAE), a neural-network-based generative modeling architecture for exploration in environments with sparse feedback. The DVAE is a variational autoencoder that generates probable future states from an arbitrary state-space S using state-action pairs, which corresponds to the claimed autoencoder comprising a variational autoencoder, VAE.)
Claims 11-12 are rejected under 35 U.S.C. 103 as being unpatentable over Bouton et al. (US 20240022950 A1) in view of Nagarathinam et al. (US 20210200163 A1), further in view of Levine et al. (NPL: Nonlinear Inverse Reinforcement Learning with Gaussian Processes).
Regarding claim 11: claim 11 depends on claim 1; thus, the rejection of claim 1 is incorporated.
The combination of Bouton and Nagarathinam does not teach the limitation “The device according to claim 1, wherein for the reward estimating the device is further configured to use distributional regression”. However, Levine teaches this limitation (page 1 section 1 “Previous IRL algorithms generally learn the reward as a linear combination of features, either by finding a reward under which the expert’s policy has a higher value than all other policies ... or else by maximizing the probability of the reward under a model of near-optimal expert behavior ... GPIRL is the first method to combine probabilistic reasoning about stochastic expert behavior with the ability to learn the reward as a nonlinear function of features, allowing it to outperform prior methods on tasks with inherently nonlinear rewards and suboptimal examples”, and page 2 section 3 “GPIRL represents the reward as a nonlinear function of feature values. This function is modeled as a Gaussian process, and its structure is determined by its kernel function. The Bayesian GP framework provides a principled method for learning the hyperparameters of this kernel, thereby learning the structure of the unknown reward ... we use Equation 1 to specify a distribution over GP outputs ... In GP regression, we use noisy observations y of the true underlying outputs u. GPIRL directly learns the true outputs u, which represent the rewards” Levine discloses the Gaussian Process Inverse Reinforcement Learning (GPIRL) method, which combines probabilistic reasoning about stochastic expert behavior with the ability to learn the reward as a nonlinear function of features. Essentially, Levine discloses using Gaussian processes to learn the reward as a nonlinear function, wherein the Gaussian process comprises a GP regression equation specifying a distribution over rewards, which is analogous to the claimed use of distributional regression for reward estimating.)
Before the effective filing date, it would have been obvious to one of ordinary skill in the art to combine the teaching of a method and system for optimizing radio access networks using reinforcement learning by Bouton and the teaching by Nagarathinam of a multi-agent reinforcement learning framework that models connections between various electrical equipment while executing reinforcement learning at each agent, with the teaching of the Gaussian Process Inverse Reinforcement Learning (GPIRL) method by Levine. The motivation to do so is found in Levine’s disclosure (page 1 section 1 “Previous IRL algorithms generally learn the reward as a linear combination of features, either by finding a reward under which the expert’s policy has a higher value than all other policies ... or else by maximizing the probability of the reward under a model of near-optimal expert behavior ... GPIRL is the first method to combine probabilistic reasoning about stochastic expert behavior with the ability to learn the reward as a nonlinear function of features, allowing it to outperform prior methods on tasks with inherently nonlinear rewards and suboptimal examples”). Levine discloses that GPIRL is the first method to combine probabilistic reasoning about stochastic expert behavior with the ability to learn the reward as a nonlinear function of features, allowing it to outperform prior methods on tasks with inherently nonlinear rewards and suboptimal examples. Therefore, one of ordinary skill in the art would have been motivated to incorporate the GPIRL method to perform the learning of the reward, further improving the reward function within the reinforcement learning framework of Bouton/Nagarathinam.
Regarding claim 12: claim 12 depends on claim 11; thus, the rejection of claim 11 is incorporated.
Levine teaches the limitation “The device according to claim 11, wherein the device is configured to gaussian process regression, GPR, for the reward estimation” (page 2 section 3 “GPIRL represents the reward as a nonlinear function of feature values. This function is modeled as a Gaussian process, and its structure is determined by its kernel function. The Bayesian GP framework provides a principled method for learning the hyperparameters of this kernel, thereby learning the structure of the unknown reward ... we use Equation 1 to specify a distribution over GP outputs ... In GP regression, we use noisy observations y of the true underlying outputs u. GPIRL directly learns the true outputs u, which represent the rewards” Levine discloses using the Gaussian Process Inverse Reinforcement Learning (GPIRL) method to combine probabilistic reasoning about stochastic expert behavior with the ability to learn the reward as a nonlinear function of features. Essentially, Levine discloses using Gaussian process regression to learn the reward as a nonlinear function, which is analogous to the claimed Gaussian process regression, GPR, for the reward estimation.)
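For background only (these are the standard Gaussian process regression equations, not a quotation from Levine), given training states X with noisy reward observations y, kernel k, Gram matrix K with K_{ij} = k(x_i, x_j), and noise variance \sigma_n^2, the posterior reward estimate at a new state x_* is:

    \bar{r}(x_*) = k_*^{\top} \left( K + \sigma_n^2 I \right)^{-1} y, \qquad
    \operatorname{var}\left[ r(x_*) \right] = k(x_*, x_*) - k_*^{\top} \left( K + \sigma_n^2 I \right)^{-1} k_*

where (k_*)_i = k(x_i, x_*). Under a smooth kernel, states with similar feature values receive similar posterior reward estimates.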
Claim 13 is rejected under 35 U.S.C. 103 as being unpatentable over Bouton et al. (US 20240022950 A1) in view of Nagarathinam et al. (US 20210200163 A1), further in view of Andersen et al. (NPL: The Dreaming Variational Autoencoder for Reinforcement Learning Environments), further in view of Levine et al. (NPL: Nonlinear Inverse Reinforcement Learning with Gaussian Processes).
Regarding claim 13: claim 13 depends on claim 1; thus, the rejection of claim 1 is incorporated.
Andersen teaches a part of the limitation “The device according to claim 1, wherein the device is configured to augment the states so that a massive number of states is obtained for the DRL agent” (page 3 section 4 “The Dreaming Variational Autoencoder (DVAE) is an end-to-end solution for generating probable future states ŝt+n from an arbitrary state-space S using state-action pairs”, and page 9 section 6.1 “The DVAE algorithm must be able to generalize over many similar states to model a vast state-space.” Andersen discloses the Dreaming Variational Autoencoder (DVAE), a neural-network-based generative modeling architecture for exploration in environments with sparse feedback. The DVAE can generate probable future states from an arbitrary state-space S using state-action pairs, which corresponds to augmenting the states, as claimed. The DVAE algorithm is able to generalize over many similar states to model a vast state-space, which corresponds to the claimed massive number of states.)
The motivation to combine the teaching by Andersen with the teaching of Bouton/Nagarathinam is similar to the motivation recited for claim 9, because the claim recites similar techniques and processing steps.
Levine teaches a part of the limitation “... wherein the device is configured to reward estimate the massive number of states by distributional regression based on similarity of the states” (page 2 section 3 “GPIRL represents the reward as a nonlinear function of feature values. This function is modeled as a Gaussian process, and its structure is determined by its kernel function. The Bayesian GP framework provides a principled method for learning the hyperparameters of this kernel, thereby learning the structure of the unknown reward ... we use Equation 1 to specify a distribution over GP outputs ... In GP regression, we use noisy observations y of the true underlying outputs u. GPIRL directly learns the true outputs u, which represent the rewards”, and page 3 section 3 “States distinguished by highly-weighted features can take on different reward values, while those that have similar values for all highly-weighted features take on similar rewards” Levine discloses GPIRL as the function to calculate the unknown reward using a Gaussian process and a distribution over Gaussian process outputs, wherein rewards are generated for many states such that states with similar values for all highly-weighted features take on similar rewards, which is analogous to reward estimating the massive number of states by distributional regression based on similarity of the states, as claimed.)
The motivation to combine the teaching by Levine with the teaching of Bouton/Nagarathinam/Andersen is similar to the motivation recited for claim 11, because the claim recites similar techniques and processing steps.
Conclusion
Any inquiry concerning this communication or earlier communications from the examiner should be directed to DUY TU DIEP whose telephone number is (703)756-1738. The examiner can normally be reached M-F 8-4:30.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Alexey Shmatov can be reached at (571) 270-3428. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.
/DUY T DIEP/Examiner, Art Unit 2123
/ALEXEY SHMATOV/Supervisory Patent Examiner, Art Unit 2123